[jira] [Assigned] (SPARK-22476) Add new function dayofweek in R
[ https://issues.apache.org/jira/browse/SPARK-22476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22476: Assignee: (was: Apache Spark) > Add new function dayofweek in R > --- > > Key: SPARK-22476 > URL: https://issues.apache.org/jira/browse/SPARK-22476 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > SQL was added in SPARK-20909. > Scala and Python for {{dayofweek}} were added in SPARK-22456. > It looks like we should add it in R too. > Please refer to both JIRAs for more details. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22476) Add new function dayofweek in R
[ https://issues.apache.org/jira/browse/SPARK-22476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245305#comment-16245305 ] Apache Spark commented on SPARK-22476: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19706
[jira] [Assigned] (SPARK-22476) Add new function dayofweek in R
[ https://issues.apache.org/jira/browse/SPARK-22476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22476: Assignee: Apache Spark
[jira] [Created] (SPARK-22476) Add new function dayofweek in R
Hyukjin Kwon created SPARK-22476: Summary: Add new function dayofweek in R Key: SPARK-22476 URL: https://issues.apache.org/jira/browse/SPARK-22476 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Priority: Minor SQL was added in SPARK-20909. Scala and Python for {{dayofweek}} were added in SPARK-22456. It looks like we should add it in R too. Please refer to both JIRAs for more details.
[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite
[ https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245283#comment-16245283 ] Apache Spark commented on SPARK-22308: -- User 'nkronenfeld' has created a pull request for this issue: https://github.com/apache/spark/pull/19705 > Support unit tests of spark code using ScalaTest using suites other than > FunSuite > - > > Key: SPARK-22308 > URL: https://issues.apache.org/jira/browse/SPARK-22308 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core, SQL, Tests >Affects Versions: 2.2.0 >Reporter: Nathan Kronenfeld >Assignee: Nathan Kronenfeld >Priority: Minor > Labels: scalatest, test-suite, test_issue > > External codebases that have Spark code can test it using SharedSparkContext, > no matter how they write their ScalaTest suites - based on FunSuite, FunSpec, > FlatSpec, or WordSpec. > SharedSQLContext only supports FunSuite.
[jira] [Assigned] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22456: Assignee: Michael Styles > Add new function dayofweek > -- > > Key: SPARK-22456 > URL: https://issues.apache.org/jira/browse/SPARK-22456 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles >Assignee: Michael Styles > Fix For: 2.3.0 > > > Add new function *dayofweek* to return the day of the week of the given > argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245240#comment-16245240 ] Ruslan Dautkhanov commented on SPARK-21657: --- Thank you [~uzadude] - great investigative work. Would be great if this patch can make it to the 2.3 release. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print cached_df.count() > {code} > This script generates a number of tables, with the same total number of > records across all nested collections (see the `scaling` variable in the loops). > The `scaling` variable scales up how many nested elements are in each record, but by > the same factor scales down the number of records in the table, so the total number > of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to > explode the nested collections (\!) of 8k records. > After 1000 elements in a nested collection, time grows exponentially.
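The benchmark scheme described in the report (nested size and record count scaled inversely so the total element count stays constant) can be sketched in plain Python; the names `make_table`, `TOTAL_ELEMENTS`, and the sizes here are hypothetical stand-ins for the attached generator script, not its actual contents:

```python
# Hypothetical sketch of the scaling scheme from the report: as `scaling`
# grows, each record holds more nested elements ("amft") but the table holds
# proportionally fewer records, so the total element count is constant.
TOTAL_ELEMENTS = 100_000  # assumed total; the attached script uses its own sizes

def make_table(scaling):
    # scaling = nested elements per record; records shrink by the same factor
    n_records = TOTAL_ELEMENTS // scaling
    return [{"individ": i, "amft": list(range(scaling))} for i in range(n_records)]

for scaling in (10, 100, 1000):
    table = make_table(scaling)
    total = sum(len(r["amft"]) for r in table)
    assert total == TOTAL_ELEMENTS  # same work per table, only nesting differs
```

With the total held constant, any super-linear growth in explode time as `scaling` increases points at the per-collection cost, which is what the attached plot shows.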
[jira] [Resolved] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22456. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19672 [https://github.com/apache/spark/pull/19672]
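The semantics the ticket describes (an integer in the range 1-7, where 1 represents Sunday) can be illustrated with a minimal pure-Python sketch; this mirrors only the described contract, not Spark's actual Catalyst implementation:

```python
from datetime import date

def dayofweek(d):
    # Contract from the ticket: 1 = Sunday .. 7 = Saturday.
    # date.isoweekday() gives Monday=1 .. Sunday=7, so shift Sunday to 1.
    return d.isoweekday() % 7 + 1

print(dayofweek(date(2017, 11, 5)))   # a Sunday  -> 1
print(dayofweek(date(2017, 11, 11)))  # a Saturday -> 7
```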
[jira] [Assigned] (SPARK-22466) SPARK_CONF_DIR is not set by Spark's launch scripts with a default value
[ https://issues.apache.org/jira/browse/SPARK-22466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22466: Assignee: Kent Yao > SPARK_CONF_DIR is not set by Spark's launch scripts with a default value > - > > Key: SPARK-22466 > URL: https://issues.apache.org/jira/browse/SPARK-22466 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.1.1, 2.2.0 >Reporter: Kent Yao >Assignee: Kent Yao > Fix For: 2.3.0 > > > If we don't explicitly set `SPARK_CONF_DIR` in `spark-env.sh`, it is only set > in `sbin/spark-config.sh` with a default value, which is used for the Spark > daemons; applications launched with spark-submit will miss this environment > variable at runtime.
[jira] [Resolved] (SPARK-22466) SPARK_CONF_DIR is not set by Spark's launch scripts with a default value
[ https://issues.apache.org/jira/browse/SPARK-22466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22466. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19688 [https://github.com/apache/spark/pull/19688]
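The defaulting behavior the ticket wants (use `SPARK_CONF_DIR` if exported, otherwise fall back to `$SPARK_HOME/conf`) can be sketched as a small function; the actual fix lands in the shell launch scripts, and the function name here is purely illustrative:

```python
import os

def resolve_spark_conf_dir(env, spark_home):
    # Use the caller's SPARK_CONF_DIR if exported; otherwise default to
    # $SPARK_HOME/conf, matching what sbin/spark-config.sh does for daemons.
    return env.get("SPARK_CONF_DIR") or os.path.join(spark_home, "conf")

print(resolve_spark_conf_dir({}, "/opt/spark"))                                # /opt/spark/conf
print(resolve_spark_conf_dir({"SPARK_CONF_DIR": "/etc/spark"}, "/opt/spark"))  # /etc/spark
```

The point of the ticket is that this fallback should also apply to applications launched via spark-submit, not only to the daemons.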
[jira] [Commented] (SPARK-22417) createDataFrame from a pandas.DataFrame reads datetime64 values as longs
[ https://issues.apache.org/jira/browse/SPARK-22417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245204#comment-16245204 ] Apache Spark commented on SPARK-22417: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/19704 > createDataFrame from a pandas.DataFrame reads datetime64 values as longs > > > Key: SPARK-22417 > URL: https://issues.apache.org/jira/browse/SPARK-22417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.2.1, 2.3.0 > > > When trying to create a Spark DataFrame from an existing Pandas DataFrame > using {{createDataFrame}}, columns with datetime64 values are converted as > long values. This is only when the schema is not specified. > {code} > In [2]: import pandas as pd >...: from datetime import datetime >...: > In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}) > In [4]: df = spark.createDataFrame(pdf) > In [5]: df.show() > +---+ > | ts| > +---+ > |15094116610| > +---+ > In [6]: df.schema > Out[6]: StructType(List(StructField(ts,LongType,true))) > {code} > Spark should interpret a datetime64\[D\] value to DateType and other > datetime64 values to TImestampType. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
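The root of the behavior described above is that a datetime64 column is stored internally as an integer offset from the epoch; without a schema declaring the column a timestamp, that raw integer is what leaks through as LongType. A stdlib-only sketch of the two views of the same instant (the nanosecond resolution is assumed here, matching pandas' default datetime64[ns]):

```python
from datetime import datetime, timezone

# The instant from the report's pandas example, pinned to UTC so the
# integer view is deterministic.
dt = datetime(2017, 10, 31, 1, 1, 1, tzinfo=timezone.utc)

seconds = int(dt.timestamp())     # integer (epoch-seconds) view of the instant
nanos = seconds * 1_000_000_000   # datetime64[ns] stores this integer internally

# Interpreted back *as a timestamp*, the integer round-trips cleanly;
# left uninterpreted, it surfaces as a bare long column.
roundtrip = datetime.fromtimestamp(seconds, tz=timezone.utc)
assert roundtrip == dt
```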
[jira] [Commented] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245191#comment-16245191 ] Apache Spark commented on SPARK-22403: -- User 'wypoon' has created a pull request for this issue: https://github.com/apache/spark/pull/19703 > StructuredKafkaWordCount example fails in YARN cluster mode > --- > > Key: SPARK-22403 > URL: https://issues.apache.org/jira/browse/SPARK-22403 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Wing Yew Poon > > When I run the StructuredKafkaWordCount example in YARN client mode, it runs > fine. However, when I run it in YARN cluster mode, the application errors > during initialization, and dies after the default number of YARN application > attempts. In the AM log, I see > {noformat} > 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value > AS STRING) > 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream > metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to > /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata > org.apache.hadoop.security.AccessControlException: Permission denied: > user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826) > at > 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785) > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257) > ... > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960) > at > org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114) > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240) > at > 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282) > at > org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79) > at > org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala) > {noformat} > Looking at StreamingQueryManager#createQuery, we have > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198 > {code} > val checkpointLocation = userSpecifiedCheckpointLocation.map { ..
[jira] [Assigned] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22403: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22403: Assignee: Apache Spark
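The fallback chain quoted from StreamingQueryManager#createQuery can be sketched as follows; the function and parameter names are illustrative stand-ins, not Spark's actual code:

```python
import tempfile

def resolve_checkpoint_location(user_specified, conf_default, use_temp):
    # Sketch of the chain: user-specified checkpointLocation, else the
    # session's configured default, else a *local* temporary directory when
    # useTempCheckpointLocation is set (the case the example hits).
    if user_specified is not None:
        return user_specified
    if conf_default is not None:
        return conf_default
    if use_temp:
        # In YARN cluster mode this locally-derived path is resolved against
        # HDFS, where writing under "/" triggers the AccessControlException.
        return tempfile.mkdtemp(prefix="temporary-")
    raise ValueError("checkpointLocation must be specified")
```

The client-vs-cluster difference falls out of the last branch: the temp path is valid on the driver's local filesystem but not as an HDFS path the application user can write to.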
[jira] [Resolved] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saniya Tech resolved SPARK-22460. - Resolution: Not A Problem > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that the timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years into > the future. > Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
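The year-49819 output in the report is the signature of a unit mismatch: the writer records milliseconds since the epoch, and the reader treats the same long as seconds, inflating the offset by roughly 1000x. A stdlib sketch (the millisecond value below is an arbitrary November 2017 instant, not the exact one from the report):

```python
from datetime import datetime, timezone

millis = 1509999979419  # an instant in November 2017, in *milliseconds*

# Correct interpretation: divide by 1000 to get epoch-seconds.
correct = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
assert correct.year == 2017

# Misreading milliseconds as seconds multiplies the epoch offset by ~1000,
# landing tens of thousands of years out -- the report's year 49819.
approx_year = 1970 + millis // 31_556_952  # seconds per average Gregorian year
assert approx_year > 40_000
```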
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245163#comment-16245163 ] Saniya Tech commented on SPARK-22460: - Based on the feedback I am going to close this ticket and try to resolve the issue in the spark-avro code-base. Thanks!
[jira] [Comment Edited] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245158#comment-16245158 ] Liang-Chi Hsieh edited comment on SPARK-22460 at 11/9/17 3:33 AM: -- From the output of: {code} print(s"${rawOutput.collect().head}\n") {code} The modified field is already parsed as timestamp type. was (Author: viirya): From the output of print(s"${rawOutput.collect().head}\n"), the modified field is already parsed as timestamp type.
[jira] [Comment Edited] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245158#comment-16245158 ] Liang-Chi Hsieh edited comment on SPARK-22460 at 11/9/17 3:32 AM: -- From the output of print(s"${rawOutput.collect().head}\n"), the modified field is already parsed as timestamp type. was (Author: viirya): From the output of {{print(s"${rawOutput.collect().head}\n")}}, the modified field is already parsed as timestamp type. > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future. 
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245158#comment-16245158 ] Liang-Chi Hsieh commented on SPARK-22460: - From the output of {{print(s"${rawOutput.collect().head}\n")}}, the modified field is already parsed as timestamp type. > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future. 
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10365) Support Parquet logical type TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10365: Assignee: Apache Spark > Support Parquet logical type TIMESTAMP_MICROS > - > > Key: SPARK-10365 > URL: https://issues.apache.org/jira/browse/SPARK-10365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > Didn't assign target version for this ticket because neither the most recent > parquet-mr release (1.8.1) nor the master branch has supported > {{TIMESTAMP_MICROS}} yet. > It would be nice to map Spark SQL {{TimestampType}} to {{TIMESTAMP_MICROS}} > since Parquet {{INT96}} will probably be deprecated and is only used for > compatibility reason for now. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
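Since Spark SQL represents {{TimestampType}} internally as microseconds since the epoch, {{TIMESTAMP_MICROS}} is a lossless int64 mapping. A hedged sketch (plain Python arithmetic, with a made-up sample value) of why a millisecond encoding would not be:

```python
# Sketch: microsecond-precision epoch values survive a TIMESTAMP_MICROS
# round trip unchanged, but a millisecond encoding truncates them.
micros = 1_510_085_179_419_123  # hypothetical timestamp with sub-ms digits

as_micros = micros          # TIMESTAMP_MICROS: the int64 is stored as-is
as_millis = micros // 1000  # a millisecond encoding drops the last digits

print(as_micros == micros)         # True: lossless
print(as_millis * 1000 == micros)  # False: 123 microseconds were lost
```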
[jira] [Assigned] (SPARK-10365) Support Parquet logical type TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10365: Assignee: (was: Apache Spark) > Support Parquet logical type TIMESTAMP_MICROS > - > > Key: SPARK-10365 > URL: https://issues.apache.org/jira/browse/SPARK-10365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > > Didn't assign target version for this ticket because neither the most recent > parquet-mr release (1.8.1) nor the master branch has supported > {{TIMESTAMP_MICROS}} yet. > It would be nice to map Spark SQL {{TimestampType}} to {{TIMESTAMP_MICROS}} > since Parquet {{INT96}} will probably be deprecated and is only used for > compatibility reason for now. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10365) Support Parquet logical type TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245082#comment-16245082 ] Apache Spark commented on SPARK-10365: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19702 > Support Parquet logical type TIMESTAMP_MICROS > - > > Key: SPARK-10365 > URL: https://issues.apache.org/jira/browse/SPARK-10365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > > Didn't assign target version for this ticket because neither the most recent > parquet-mr release (1.8.1) nor the master branch has supported > {{TIMESTAMP_MICROS}} yet. > It would be nice to map Spark SQL {{TimestampType}} to {{TIMESTAMP_MICROS}} > since Parquet {{INT96}} will probably be deprecated and is only used for > compatibility reason for now. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245063#comment-16245063 ] Apache Spark commented on SPARK-22211: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/19701 > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Assignee: Henry Robinson > Fix For: 2.2.1, 2.3.0 > > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
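The unsoundness described above can be sketched outside Spark. The following plain-Python stand-in for a full outer join on matching keys (not Spark's implementation; names and data are illustrative) shows how limiting one input before the join manufactures (null, x) rows that the unoptimized plan never produces:

```python
# Sketch: a tiny full outer join on equal keys, plain Python.
def full_outer(left, right):
    rows = [(x, x) for x in left if x in right]          # matched pairs
    rows += [(x, None) for x in left if x not in right]  # left-only rows
    rows += [(None, y) for y in right if y not in left]  # right-only rows
    return rows

left, right = [1, 2, 3], [1, 2, 3]

correct = full_outer(left, right)     # every key matches: no null rows
pushed = full_outer(left[:1], right)  # LocalLimit 1 pushed below the join

print(correct)  # [(1, 1), (2, 2), (3, 3)]
print(pushed)   # [(1, 1), (None, 2), (None, 3)]
```

Taking limit(1) of `pushed` can return (None, 2), a row that does not exist in the correct join at all, which matches the (null, 148) result in the notebook output above.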
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245029#comment-16245029 ] Dongjoon Hyun commented on SPARK-22211: --- Hi, All. This breaks `branch-2.2`. - SBT: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/408/ - MAVEN: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/423 > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Assignee: Henry Robinson > Fix For: 2.2.1, 2.3.0 > > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22475) show histogram in DESC COLUMN command
Zhenhua Wang created SPARK-22475: Summary: show histogram in DESC COLUMN command Key: SPARK-22475 URL: https://issues.apache.org/jira/browse/SPARK-22475 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244887#comment-16244887 ] Nick Dimiduk commented on SPARK-17593: -- So the fix in Hadoop 2.8 is for any variant of the s3* FileSystem? Or is it only for s3a? bq. as it really needs Spark to move to listFiles(recursive) Do we still need this change to be shipped in Spark? Thanks. > list files on s3 very slow > -- > > Key: SPARK-17593 > URL: https://issues.apache.org/jira/browse/SPARK-17593 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: spark 2.0.0, hadoop 2.7.2 ( hadoop 2.7.3) >Reporter: Gaurav Shah >Priority: Minor > > lets say we have following partitioned data: > {code} > events_v3 > -- event_date=2015-01-01 > event_hour=0 > -- verb=follow > part1.parquet.gz > event_hour=1 > -- verb=click > part1.parquet.gz > -- event_date=2015-01-02 > event_hour=5 > -- verb=follow > part1.parquet.gz > event_hour=10 > -- verb=click > part1.parquet.gz > {code} > To read (or write ) parquet partitioned data via spark it makes call to > `ListingFileCatalog.listLeafFiles` . Which recursively tries to list all > files and folders. > In this case if we had 300 dates, we would have created 300 jobs each trying > to get filelist from date_directory. This process takes about 10 minutes to > finish ( with 2 executors). vs if I use a ruby script to get list of all > files recursively in the same folder it takes about 1 minute, on the same > machine with just 1 thread. > I am confused as to why it would take so much time extra for listing files. 
> spark code: > {code:scala} > val sparkSession = org.apache.spark.sql.SparkSession.builder > .config("spark.sql.hive.metastorePartitionPruning",true) > .config("spark.sql.parquet.filterPushdown", true) > .config("spark.sql.hive.verifyPartitionPath", false) > .config("spark.sql.hive.convertMetastoreParquet.mergeSchema",false) > .config("parquet.enable.summary-metadata",false) > .config("spark.sql.sources.partitionDiscovery.enabled",false) > .getOrCreate() > val df = > sparkSession.read.option("mergeSchema","false").format("parquet").load("s3n://bucket_name/events_v3") > df.createOrReplaceTempView("temp_events") > sparkSession.sql( > """ > |select verb,count(*) from temp_events where event_date = > "2016-08-05" group by verb > """.stripMargin).show() > {code} > ruby code: > {code:ruby} > gem 'aws-sdk', '~> 2' > require 'aws-sdk' > client = Aws::S3::Client.new(:region=>'us-west-1') > next_continuation_token = nil > total = 0 > loop do > a= client.list_objects_v2({ > bucket: "bucket", # required > max_keys: 1000, > prefix: "events_v3/", > continuation_token: next_continuation_token , > fetch_owner: false, > }) > puts a.contents.last.key > total += a.contents.size > next_continuation_token = a.next_continuation_token > break unless a.is_truncated > end > puts "total" > puts total > {code} > tried looking into following bug: > https://issues.apache.org/jira/browse/HADOOP-12810 > but hadoop 2.7.3 doesn't solve that problem > stackoverflow reference: > http://stackoverflow.com/questions/39525288/spark-parquet-write-gets-slow-as-partitions-grow -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
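The request-count gap between per-directory discovery and a single paged prefix listing can be sketched with a toy key set (plain Python; the partition layout mirrors the example above but the counts are made up):

```python
# Sketch: per-directory listing issues roughly one request per leaf
# directory, while a flat prefix listing pages through all keys at once.
keys = [
    f"events_v3/event_date=2015-01-{day:02d}/event_hour={hour}/part1.parquet.gz"
    for day in range(1, 31) for hour in range(24)
]

# Recursive discovery: at least one list call per directory.
directories = {k.rsplit("/", 1)[0] for k in keys}
per_dir_requests = len(directories)

# Flat prefix listing: ceil(n_keys / page_size) calls, like list_objects_v2.
PAGE_SIZE = 1000
flat_requests = -(-len(keys) // PAGE_SIZE)  # ceiling division

print(per_dir_requests, flat_requests)  # 720 vs 1
```

This is why a single-threaded recursive prefix walk (the Ruby script above) can beat many parallel per-directory listings: the bottleneck is request round trips, not bytes.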
[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244836#comment-16244836 ] Arseniy Tashoyan commented on SPARK-22471: -- I am coming up with a simple fix, backportable to 2.2. > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
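One shape such a fix could take (a hypothetical sketch, not Spark's actual SQLListener code; class and method names here are invented) is to trim the stage-metrics map eagerly whenever it crosses a cap, instead of waiting for an execution to complete:

```python
# Hypothetical sketch of the fix idea: evict the oldest stage metrics as
# soon as the map exceeds a cap, not only on execution completion.
from collections import OrderedDict

class BoundedStageMetrics:
    def __init__(self, max_stages=100):
        self.max_stages = max_stages
        self.stage_metrics = OrderedDict()  # stage_id -> metrics

    def on_stage_submitted(self, stage_id, metrics):
        self.stage_metrics[stage_id] = metrics
        # Trim on every insertion, so a single long execution with many
        # stages cannot grow the map without bound.
        while len(self.stage_metrics) > self.max_stages:
            self.stage_metrics.popitem(last=False)  # drop the oldest entry

m = BoundedStageMetrics(max_stages=3)
for sid in range(10):
    m.on_stage_submitted(sid, {"tasks": sid})
print(list(m.stage_metrics))  # [7, 8, 9]
```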
[jira] [Assigned] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22471: Assignee: (was: Apache Spark) > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244821#comment-16244821 ] Apache Spark commented on SPARK-22471: -- User 'tashoyan' has created a pull request for this issue: https://github.com/apache/spark/pull/19700 > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22471: Assignee: Apache Spark > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan >Assignee: Apache Spark > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20648) Make Jobs and Stages pages use the new app state store
[ https://issues.apache.org/jira/browse/SPARK-20648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20648: Assignee: Apache Spark > Make Jobs and Stages pages use the new app state store > -- > > Key: SPARK-20648 > URL: https://issues.apache.org/jira/browse/SPARK-20648 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making both the Jobs and Stages pages use the new app state > store. Because these two pages are very tightly coupled, it's easier to > modify both in one go. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20648) Make Jobs and Stages pages use the new app state store
[ https://issues.apache.org/jira/browse/SPARK-20648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20648: Assignee: (was: Apache Spark) > Make Jobs and Stages pages use the new app state store > -- > > Key: SPARK-20648 > URL: https://issues.apache.org/jira/browse/SPARK-20648 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making both the Jobs and Stages pages use the new app state > store. Because these two pages are very tightly coupled, it's easier to > modify both in one go. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20648) Make Jobs and Stages pages use the new app state store
[ https://issues.apache.org/jira/browse/SPARK-20648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244557#comment-16244557 ] Apache Spark commented on SPARK-20648: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19698 > Make Jobs and Stages pages use the new app state store > -- > > Key: SPARK-20648 > URL: https://issues.apache.org/jira/browse/SPARK-20648 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making both the Jobs and Stages pages use the new app state > store. Because these two pages are very tightly coupled, it's easier to > modify both in one go. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244466#comment-16244466 ] Sean Owen commented on SPARK-22472: --- This doesn't look right ... https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L596 {code} def defaultValue(jt: String): String = jt match { case JAVA_BOOLEAN => "false" case JAVA_BYTE => "(byte)-1" case JAVA_SHORT => "(short)-1" case JAVA_INT => "-1" case JAVA_LONG => "-1L" case JAVA_FLOAT => "-1.0f" case JAVA_DOUBLE => "-1.0" case _ => "null" } {code} The default for any uninitialized primitive type is 0 in the JVM. [~cloud_fan] I think you were the last to touch this, but didn't write it. Is this really -1 on purpose for another reason? > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
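The `-2` row in the report is exactly what that default produces: a null slot materialized as the codegen default `-1` for longs, then doubled. A plain-Python mimicry of the behavior (not Spark code):

```python
# Sketch: read a nullable column into a primitive slot pre-filled with the
# codegen default for longs (-1), then apply the mapped function v * 2.
CODEGEN_DEFAULT_LONG = -1

rows = [None, 1, 5]
doubled = [(CODEGEN_DEFAULT_LONG if v is None else v) * 2 for v in rows]
print(doubled)  # [-2, 2, 10] -- the -2 matches the first row above
```

With a default of 0, the JVM's actual default for uninitialized primitives, the first row would silently become 0 instead; either way the null is lost, which is the underlying bug.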
[jira] [Commented] (SPARK-22222) Fix the ARRAY_MAX in BufferHolder and add a test
[ https://issues.apache.org/jira/browse/SPARK-22222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244455#comment-16244455 ] Apache Spark commented on SPARK-22222: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19697 > Fix the ARRAY_MAX in BufferHolder and add a test > > > Key: SPARK-22222 > URL: https://issues.apache.org/jira/browse/SPARK-22222 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Feng Liu >Assignee: Feng Liu > Fix For: 2.3.0 > > > This is actually a followup for SPARK-22033, which set the `ARRAY_MAX` to > `Int.MaxValue - 8`. It is not a valid number because it will cause the > following line to fail when such a large byte array is allocated: > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java#L86 > We need to make sure the new length is a multiple of 8. > We need to add one test for the fix. Note that the test should work > independently of the heap size of the test JVM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
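The multiple-of-8 requirement above can be sketched with simple alignment arithmetic (an illustrative sketch only; the exact ceiling Spark's fix chose may differ from the one computed here):

```python
# Sketch: grown buffer lengths must stay word-aligned (a multiple of 8)
# and below the JVM's array-size ceiling, which Int.MaxValue - 8 is not.
INT_MAX = 2**31 - 1
ARRAY_MAX = INT_MAX - (INT_MAX % 8)  # largest multiple of 8 fitting an int

def rounded_size(needed):
    """Round `needed` up to a multiple of 8, capped at ARRAY_MAX."""
    aligned = (needed + 7) & ~7
    return min(aligned, ARRAY_MAX)

print(ARRAY_MAX % 8)                       # 0: the cap itself is aligned
print(rounded_size(13), rounded_size(16))  # 16 16
```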
[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244452#comment-16244452 ] Marcelo Vanzin commented on SPARK-22471: My patch for SPARK-20652 greatly reduces the memory usage of the SQL listener. But it's not backportable to 2.2. > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244438#comment-16244438 ] Kazuaki Ishizaki commented on SPARK-22472: -- From [the initial version|https://github.com/apache/spark/commit/9e66a53c9955285a85c19f55c3ef62db2e1b868a#diff-94a1f59bcc9b6758c4ca874652437634R227] of this conversion (about two years ago), this conversion returns {{-1}}... > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22133) Document Mesos reject offer duration configurations
[ https://issues.apache.org/jira/browse/SPARK-22133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-22133: - Assignee: windkithk > Document Mesos reject offer duration configurations > --- > > Key: SPARK-22133 > URL: https://issues.apache.org/jira/browse/SPARK-22133 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand >Assignee: windkithk > Fix For: 2.3.0 > > > Mesos has multiple configurable timeouts {{spark.mesos.rejectOfferDuration}}, > {{spark.mesos.rejectOfferDurationForUnmetConstraints}}, and > {{spark.mesos.rejectOfferDurationForReachedMaxCores}} that can have a large > effect on Spark performance when sharing a Mesos cluster with other > frameworks and users. These configurations aren't documented; add > documentation and information for non-Mesos experts on how these settings > should be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22133) Document Mesos reject offer duration configurations
[ https://issues.apache.org/jira/browse/SPARK-22133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22133: -- Priority: Minor (was: Major) > Document Mesos reject offer duration configurations > --- > > Key: SPARK-22133 > URL: https://issues.apache.org/jira/browse/SPARK-22133 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand >Assignee: windkithk >Priority: Minor > Fix For: 2.3.0 > > > Mesos has multiple configurable timeouts {{spark.mesos.rejectOfferDuration}}, > {{spark.mesos.rejectOfferDurationForUnmetConstraints}}, and > {{spark.mesos.rejectOfferDurationForReachedMaxCores}} that can have a large > effect on Spark performance when sharing a Mesos cluster with other > frameworks and users. These configurations aren't documented; add > documentation and information for non-Mesos experts on how these settings > should be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22133) Document Mesos reject offer duration configurations
[ https://issues.apache.org/jira/browse/SPARK-22133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22133. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19555 [https://github.com/apache/spark/pull/19555] > Document Mesos reject offer duration configurations > --- > > Key: SPARK-22133 > URL: https://issues.apache.org/jira/browse/SPARK-22133 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand > Fix For: 2.3.0 > > > Mesos has multiple configurable timeouts {{spark.mesos.rejectOfferDuration}}, > {{spark.mesos.rejectOfferDurationForUnmetConstraints}}, and > {{spark.mesos.rejectOfferDurationForReachedMaxCores}} that can have a large > effect on Spark performance when sharing a Mesos cluster with other > frameworks and users. These configurations aren't documented; add > documentation and information for non-Mesos experts on how these settings > should be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
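For readers landing here from search: the three settings named in this ticket are ordinary time-valued Spark configuration entries, settable in spark-defaults.conf or via --conf. The values below are placeholders to show the format only, not recommended settings; consult the "Running Spark on Mesos" documentation page this ticket added to:

```
# Illustrative values only, not recommendations.
spark.mesos.rejectOfferDuration                      120s
spark.mesos.rejectOfferDurationForUnmetConstraints   120s
spark.mesos.rejectOfferDurationForReachedMaxCores    120s
```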
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki edited comment on SPARK-22472 at 11/8/17 5:30 PM: --- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if an Option object is {{null}} or an Option value is {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for these cases. was (Author: kiszk): Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for {{empty}}. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki edited comment on SPARK-22472 at 11/8/17 5:22 PM: --- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for {{empty}}. was (Author: kiszk): Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{null}} or {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for {{null}} or {{empty}}. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22474) cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]
[ https://issues.apache.org/jira/browse/SPARK-22474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikael Valot updated SPARK-22474: - Affects Version/s: 2.1.2 > cannot read a parquet file containing a Seq[Map[MyCaseClass, String]] > - > > Key: SPARK-22474 > URL: https://issues.apache.org/jira/browse/SPARK-22474 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.2, 2.2.0 >Reporter: Mikael Valot > > The following code run in spark-shell throws an exception. It is working fine > in Spark 2.0.2 > {code:java} > case class MyId(v: String) > case class MyClass(infos: Seq[Map[MyId, String]]) > val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah" > seq.toDS().write.parquet("/tmp/myclass") > spark.read.parquet("/tmp/myclass").as[MyClass].collect() > Caused by: org.apache.spark.sql.AnalysisException: Map key type is expected > to be a primitive type, but found: required group key { > optional binary v (UTF8); > }; > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:201) > at scala.Option.fold(Option.scala:158) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:87) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetSchemaConverter$$convert(ParquetSchemaConverter.scala:84) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) > at scala.Option.fold(Option.scala:158) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetArrayConverter.(ParquetRowConverter.scala:483) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:298) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:183) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:180) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.(ParquetRowConverter.scala:180) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRecordMaterializer.(ParquetRecordMaterializer.scala:38) > at > org.apache.spark.sql.execution.datasources.
[jira] [Created] (SPARK-22474) cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]
Mikael Valot created SPARK-22474: Summary: cannot read a parquet file containing a Seq[Map[MyCaseClass, String]] Key: SPARK-22474 URL: https://issues.apache.org/jira/browse/SPARK-22474 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Mikael Valot The following code run in spark-shell throws an exception. It is working fine in Spark 2.0.2 {code:java} case class MyId(v: String) case class MyClass(infos: Seq[Map[MyId, String]]) val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah" seq.toDS().write.parquet("/tmp/myclass") spark.read.parquet("/tmp/myclass").as[MyClass].collect() Caused by: org.apache.spark.sql.AnalysisException: Map key type is expected to be a primitive type, but found: required group key { optional binary v (UTF8); }; at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:246) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:201) at scala.Option.fold(Option.scala:158) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:87) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetSchemaConverter$$convert(ParquetSchemaConverter.scala:84) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) at scala.Option.fold(Option.scala:158) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetArrayConverter.(ParquetRowConverter.scala:483) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:298) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:183) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:180) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.(ParquetRowConverter.scala:180) at org.apache.spark.sql.execution.datasources.parquet.ParquetRecordMaterializer.(ParquetRecordMaterializer.scala:38) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.prepareForRead(ParquetReadSupport.scala:95) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:175) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:190) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecord
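The AnalysisException above says that Parquet requires map keys to be primitive types. Until the reader supports case-class keys again, one workaround (a suggestion on the part of this note, not a fix shipped for this ticket) is to re-key the maps on the primitive field of the case class before writing. Stripped of the Spark plumbing, the re-keying step itself is simple; in plain Java it looks like:

```java
import java.util.HashMap;
import java.util.Map;

public class FlattenMapKeys {
    // Stand-in for the Scala case class MyId(v: String) from the report.
    static final class MyId {
        final String v;
        MyId(String v) { this.v = v; }
    }

    // Re-key a map from the wrapper type to its primitive String field,
    // which Parquet accepts as a map key.
    static Map<String, String> flattenKeys(Map<MyId, String> in) {
        Map<String, String> out = new HashMap<>();
        in.forEach((k, v) -> out.put(k.v, v));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(flattenKeys(Map.of(new MyId("1234"), "blah")));
    }
}
```

The wrapper must be re-applied on read, so this only helps when the key type wraps a single primitive value.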
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki edited comment on SPARK-22472 at 11/8/17 5:11 PM: --- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{null}} or {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for {{null}} or {{empty}}. was (Author: kiszk): Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{null}} or {{empty}}, {{-1}} is passed to a lambda function. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki commented on SPARK-22472: -- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {null} or {empty}, {-1} is passed to a lambda function. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki edited comment on SPARK-22472 at 11/8/17 5:10 PM: --- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{null}} or {{empty}}, {{-1}} is passed to a lambda function. was (Author: kiszk): Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {null} or {empty}, {-1} is passed to a lambda function. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244319#comment-16244319 ] Vladislav Kuzemchik commented on SPARK-22472: - I'm using Option[Long] as a workaround, but it is scary to leave things as-is and hope to catch the problem in review whenever anyone else uses Datasets. I think Spark should warn (or even fail, with some config parameter set) when converting a nullable DataFrame column into a non-optional type. Currently, if you do that with a non-primitive type, you will most likely get an NPE and have to handle that case anyway. In my opinion, the current implicit behavior causes much more harm: we are talking about bad results without any notification. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244310#comment-16244310 ] Marco Gaido edited comment on SPARK-22472 at 11/8/17 4:58 PM: -- Two things: 1 - if you use {{as\[Option\[Long\]\]}}, it works fine; 2 - actually if you collect the Dataset, the value for {{null}} is {{0}}, but with the transformation, there is this bad result. Then I am not sure about the right approach here. Because maybe the best thing would be to force using {{Option\[Long\]}} when the value is {{nullable}}, but this might be too restrictive and may break compatibility. was (Author: mgaido): Two things: 1 - if you use `as[Option[Long]]`, it works fine; 2 - actually if you collect the Dataset, the value for `null` is `0`, but with the transformation, there is this bad result. Then I am not sure about the right approach here. Because maybe the best thing would be to force using `Option[Long]` when the value is `nullable`, but this might be too restrictive and may break compatibility. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244310#comment-16244310 ] Marco Gaido commented on SPARK-22472: - Two things: 1 - if you use `as[Option[Long]]`, it works fine; 2 - actually if you collect the Dataset, the value for `null` is `0`, but with the transformation, there is this bad result. Then I am not sure about the right approach here. Because maybe the best thing would be to force using `Option[Long]` when the value is `nullable`, but this might be too restrictive and may break compatibility. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244273#comment-16244273 ] Vladislav Kuzemchik commented on SPARK-22472: - My assumption is that it is because it goes into Tungsten to do multiplication on Dataset, and probably tries to get value of null pointer. I would expect 0 or exception too. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244263#comment-16244263 ] Sean Owen commented on SPARK-22472: --- Not sure it's random, but, somehow the value is being construed as -1 before being doubled. It's odd because {{s.as[Long].show(false)}} is fine. Try {{s.as[Long].map(_.toString).show(false)}} to see it more directly. I admit I don't know much about the internals; is someone more informed able to weigh in on why null would become -1 here? I think part of the issue is that you're converting null to a primitive long, but would expect 0 if anything, or an exception. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
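The mechanism Sean describes can be made concrete with a toy model (emphatically not Spark's Tungsten row format): if a null field's primitive slot holds a sentinel such as -1, then any code path that reads the slot without first consulting the null flag sees the sentinel, and v * 2 yields -2, matching the first table in the report:

```java
// Toy model of a row with one nullable long column: a primitive slot
// plus a null flag. Illustration only; Spark's internal row differs.
public class ToyRow {
    private final long slot;
    private final boolean isNull;

    ToyRow(Long boxed) {
        this.isNull = (boxed == null);
        this.slot = isNull ? -1L : boxed; // sentinel written for nulls
    }

    boolean isNullAt() { return isNull; }

    // "Unsafe" read: returns the slot contents whether or not the value
    // is null, analogous to reading a null column as a primitive long.
    long getLong() { return slot; }

    public static void main(String[] args) {
        ToyRow nullRow = new ToyRow(null);
        // Skipping the isNullAt() check turns null into -1, so v * 2 == -2.
        System.out.println(nullRow.getLong() * 2); // prints -2
    }
}
```

In this model the SQL path corresponds to checking isNullAt() first (null in, null out), while the primitive Dataset path corresponds to calling getLong() unconditionally.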
[jira] [Assigned] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
[ https://issues.apache.org/jira/browse/SPARK-22473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22473: Assignee: Apache Spark > Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date > -- > > Key: SPARK-22473 > URL: https://issues.apache.org/jira/browse/SPARK-22473 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Trivial > > In `spark-sql` module tests there are deprecation warnings caused by the > usage of deprecated methods of `java.sql.Date` and the usage of the > deprecated `AsyncAssertions.Waiter` class. > This issue is to track their replacement with their respective non-deprecated > versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
[ https://issues.apache.org/jira/browse/SPARK-22473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22473: Assignee: (was: Apache Spark) > Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date > -- > > Key: SPARK-22473 > URL: https://issues.apache.org/jira/browse/SPARK-22473 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > > In `spark-sql` module tests there are deprecations warnings caused by the > usage of deprecated methods of `java.sql.Date` and the usage of the > deprecated `AsyncAssertions.Waiter` class. > This issue is to track their replacement with their respective non-deprecated > versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
[ https://issues.apache.org/jira/browse/SPARK-22473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244260#comment-16244260 ] Apache Spark commented on SPARK-22473: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19696 > Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date > -- > > Key: SPARK-22473 > URL: https://issues.apache.org/jira/browse/SPARK-22473 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > > In `spark-sql` module tests there are deprecations warnings caused by the > usage of deprecated methods of `java.sql.Date` and the usage of the > deprecated `AsyncAssertions.Waiter` class. > This issue is to track their replacement with their respective non-deprecated > versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
Marco Gaido created SPARK-22473: --- Summary: Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date Key: SPARK-22473 URL: https://issues.apache.org/jira/browse/SPARK-22473 Project: Spark Issue Type: Task Components: Tests Affects Versions: 2.3.0 Reporter: Marco Gaido Priority: Trivial In `spark-sql` module tests there are deprecation warnings caused by the use of deprecated `java.sql.Date` methods and of the deprecated `AsyncAssertions.Waiter` class. This issue tracks their replacement with their non-deprecated equivalents. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
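For reference, the usual replacement pattern for the deprecated `java.sql.Date` accessors is to go through `java.time`. The exact call sites in the `spark-sql` tests are not shown here, so this is a generic sketch of the migration:

```java
import java.sql.Date;
import java.time.LocalDate;

public class DateMigration {
    public static void main(String[] args) {
        Date d = Date.valueOf("2017-11-08");
        // Deprecated accessors use historical offsets:
        //   d.getYear() returns year - 1900, d.getMonth() is zero-based.
        // Non-deprecated replacement: convert once to java.time.LocalDate.
        LocalDate ld = d.toLocalDate();
        System.out.println(ld.getYear());        // 2017
        System.out.println(ld.getMonthValue());  // 11 (one-based)
    }
}
```

The same `toLocalDate()` bridge covers most of the deprecated getters, so test code can stay on `java.sql.Date` at the API boundary while doing field access through `java.time`.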
[jira] [Updated] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Kuzemchik updated SPARK-22472: Affects Version/s: 2.2.0 > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arseniy Tashoyan updated SPARK-22471: - Attachment: SQLListener_retained_size.png > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22472) Datasets generate random values for null primitive types
Vladislav Kuzemchik created SPARK-22472: --- Summary: Datasets generate random values for null primitive types Key: SPARK-22472 URL: https://issues.apache.org/jira/browse/SPARK-22472 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: Vladislav Kuzemchik Not sure if this was ever reported before. {code}
scala> val s = sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
s: org.apache.spark.sql.DataFrame = [v: bigint]

scala> s.show(false)
+----+
|v   |
+----+
|null|
|1   |
|5   |
+----+

scala> s.as[Long].map(v => v*2).show(false)
+-----+
|value|
+-----+
|-2   |
|2    |
|10   |
+-----+

scala> s.select($"v"*2).show(false)
+-------+
|(v * 2)|
+-------+
|null   |
|2      |
|10     |
+-------+
{code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arseniy Tashoyan updated SPARK-22471: - Attachment: SQLListener_stageIdToStageMetrics_retained_size.png > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
Arseniy Tashoyan created SPARK-22471: Summary: SQLListener consumes much memory causing OutOfMemoryError Key: SPARK-22471 URL: https://issues.apache.org/jira/browse/SPARK-22471 Project: Spark Issue Type: Bug Components: SQL, Web UI Affects Versions: 2.2.0 Environment: Spark 2.2.0, Linux Reporter: Arseniy Tashoyan _SQLListener_ may grow very large when Spark runs complex multi-stage requests. The listener tracks metrics for all stages in the __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to clean up this hash map regularly, but they are not enough. Specifically, the method _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not retain metrics for very old data; this method runs on each execution completion. However, while an execution with many stages is running, _SQLListener_ keeps adding new entries to __stageIdToStageMetrics_ without ever calling _trimExecutionsIfNecessary_, so the hash map can grow to an enormous size. Strictly speaking this is not a memory leak, because _trimExecutionsIfNecessary_ eventually cleans the hash map; in practice, however, the driver program is very likely to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
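One way to avoid this kind of unbounded growth is to cap the metrics map on every insertion rather than only at execution completion. The sketch below is purely illustrative (the cap of 1000 stages and the evict-oldest policy are hypothetical choices, not Spark's actual fix); it shows how a size bound enforced per `put` keeps memory flat even inside a single long-running execution:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedStageMetrics {
    public static void main(String[] args) {
        final int maxStages = 1000;  // hypothetical cap on retained stage metrics
        // Insertion-ordered map that evicts its oldest entry once over capacity,
        // i.e. trimming happens on every put, not only at execution end.
        Map<Integer, long[]> stageIdToMetrics =
            new LinkedHashMap<Integer, long[]>(16, 0.75f, false) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, long[]> eldest) {
                    return size() > maxStages;
                }
            };
        // Simulate one execution with 10,000 stages reporting metrics.
        for (int stageId = 0; stageId < 10_000; stageId++) {
            stageIdToMetrics.put(stageId, new long[]{0L, 0L});
        }
        System.out.println(stageIdToMetrics.size());  // 1000
    }
}
```

A real fix would also have to decide which stages are safe to drop (e.g. only stages of completed executions), which this toy version ignores.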
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243913#comment-16243913 ] Kazuaki Ishizaki commented on SPARK-22460: -- I ran similar code with JSON, and it decodes the value correctly. {code}
//
// De-serialize
val rawOutput = spark.read.json(path)
val output = rawOutput.as[TestRecord]
print(s"${data.head}\n")
print(s"${data.head.modified.getTime}\n")
print(s"${rawOutput.collect().head}\n")
print(s"${output.collect().head}\n")
{code} {code}
TestRecord(One,2017-11-08 04:57:15.537)
1510145835537
[2017-11-08T04:57:15.537-08:00,One]
TestRecord(One,2017-11-08 04:57:15.537)
{code} > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future.
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243834#comment-16243834 ] Liang-Chi Hsieh commented on SPARK-22460: - In Spark SQL, when a long field is cast to a timestamp field, the long value is interpreted as seconds. Although a timestamp field is also stored internally as a long, that long is interpreted as microseconds. That said, you can't directly cast a long field holding microseconds/milliseconds to a timestamp field and get a correct timestamp... Currently spark-avro doesn't support conversion from Avro's long type to Catalyst's Date/Timestamp types, even with an explicitly given schema, e.g., {{spark.read.schema(dataSchema).avro(path)}}. To get correct Date/Timestamp fields, I think this should be fixed in spark-avro by adding support for the Date/Timestamp data types, so that spark-avro can interpret loaded data as proper date/timestamp fields. > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future.
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
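The unit mismatch explains the far-future year in the report: milliseconds since the epoch, read as if they were seconds, land roughly 48,000 years ahead. A small illustration using a millisecond value close to the reporter's timestamp (the exact value and a UTC zone are assumptions for the sake of the example):

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class EpochUnits {
    public static void main(String[] args) {
        long millis = 1509998779419L;  // ~2017-11-06, milliseconds since epoch
        // Interpreted with the correct unit:
        System.out.println(
            Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).getYear());   // 2017
        // Interpreted as *seconds*, the way a plain long-to-timestamp cast does:
        System.out.println(
            Instant.ofEpochSecond(millis).atZone(ZoneOffset.UTC).getYear());  // ~49820
    }
}
```

This matches the symptom in the description, where `TestRecord(One,2017-11-06 ...)` round-trips as a date in the year 49819.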
[jira] [Commented] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
[ https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243789#comment-16243789 ] Apache Spark commented on SPARK-22377: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19695 > Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof > -- > > Key: SPARK-22377 > URL: https://issues.apache.org/jira/browse/SPARK-22377 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0, 2.2.0 >Reporter: Xin Lu > > It looks like multiple workers in the amplab jenkins cannot execute lsof. > Example log below: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > spark-build/dev/create-release/release-build.sh: line 344: lsof: command not > found > usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] > I looked at the jobs and it looks like only amp-jenkins-worker-01 works so > you are getting a successful build every week or so. Unclear if the snapshot > is actually released. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
[ https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22377: Assignee: (was: Apache Spark) > Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof > -- > > Key: SPARK-22377 > URL: https://issues.apache.org/jira/browse/SPARK-22377 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0, 2.2.0 >Reporter: Xin Lu > > It looks like multiple workers in the amplab jenkins cannot execute lsof. > Example log below: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > spark-build/dev/create-release/release-build.sh: line 344: lsof: command not > found > usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] > I looked at the jobs and it looks like only amp-jenkins-worker-01 works so > you are getting a successful build every week or so. Unclear if the snapshot > is actually released. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
[ https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22377: Assignee: Apache Spark > Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof > -- > > Key: SPARK-22377 > URL: https://issues.apache.org/jira/browse/SPARK-22377 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0, 2.2.0 >Reporter: Xin Lu >Assignee: Apache Spark > > It looks like multiple workers in the amplab jenkins cannot execute lsof. > Example log below: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > spark-build/dev/create-release/release-build.sh: line 344: lsof: command not > found > usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] > I looked at the jobs and it looks like only amp-jenkins-worker-01 works so > you are getting a successful build every week or so. Unclear if the snapshot > is actually released. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22446. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19662 [https://github.com/apache/spark/pull/19662] > Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" > exception incorrectly for filtered data. > --- > > Key: SPARK-22446 > URL: https://issues.apache.org/jira/browse/SPARK-22446 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.0, 2.2.0 > Environment: spark-shell, local mode, macOS Sierra 10.12.6 >Reporter: Greg Bellchambers > Fix For: 2.3.0 > > > In the following, the `indexer` UDF defined inside the > `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an > "Unseen label" error, despite the label not being present in the transformed > DataFrame. > Here is the definition of the indexer UDF in the transform method: > {code:java} > val indexer = udf { label: String => > if (labelToIndex.contains(label)) { > labelToIndex(label) > } else { > throw new SparkException(s"Unseen label: $label.") > } > } > {code} > We can demonstrate the error with a very simple example DataFrame. > {code:java} > scala> import org.apache.spark.ml.feature.StringIndexer > import org.apache.spark.ml.feature.StringIndexer > scala> // first we create a DataFrame with three cities > scala> val df = List( > | ("A", "London", "StrA"), > | ("B", "Bristol", null), > | ("C", "New York", "StrC") > | ).toDF("ID", "CITY", "CONTENT") > df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 
1 more > field] > scala> df.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | B| Bristol| null| > | C|New York| StrC| > +---++---+ > scala> // then we remove the row with null in CONTENT column, which removes > Bristol > scala> val dfNoBristol = finalStatic.filter($"CONTENT".isNotNull) > dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: > string, CITY: string ... 1 more field] > scala> dfNoBristol.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | C|New York| StrC| > +---++---+ > scala> // now create a StringIndexer for the CITY column and fit to > dfNoBristol > scala> val model = { > | new StringIndexer() > | .setInputCol("CITY") > | .setOutputCol("CITYIndexed") > | .fit(dfNoBristol) > | } > model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb > scala> // the StringIndexerModel has only two labels: "London" and "New York" > scala> str.labels foreach println > London > New York > scala> // transform our DataFrame to add an index column > scala> val dfWithIndex = model.transform(dfNoBristol) > dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 > more fields] > scala> dfWithIndex.show > +---++---+---+ > | ID|CITY|CONTENT|CITYIndexed| > +---++---+---+ > | A| London| StrA|0.0| > | C|New York| StrC|1.0| > +---++---+---+ > {code} > The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` > equal to 1.0 and perform an action. The `indexer` UDF in `transform` method > throws an exception reporting unseen label "Bristol". 
This is irrational > behaviour as far as the user of the API is concerned, because there is no > such value as "Bristol" when do show all rows of `dfWithIndex`: > {code:java} > scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count > 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40) > org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$5: (string) => double) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.sc
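A toy model of the indexer UDF makes the failure mode easy to see: if the engine evaluates the UDF before the user's null filter (as a pushed-down or reordered predicate can), the UDF encounters the filtered-out "Bristol" row and throws. The class below is an illustrative sketch, not Spark's StringIndexerModel:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyIndexer {
    private final Map<String, Double> labelToIndex = new HashMap<>();

    ToyIndexer(List<String> labels) {
        for (int i = 0; i < labels.size(); i++) {
            labelToIndex.put(labels.get(i), (double) i);
        }
    }

    // Mirrors the indexer UDF: unknown labels throw, like "Unseen label: ..."
    double index(String label) {
        Double v = labelToIndex.get(label);
        if (v == null) throw new IllegalStateException("Unseen label: " + label);
        return v;
    }

    public static void main(String[] args) {
        ToyIndexer indexer = new ToyIndexer(List.of("London", "New York"));
        // Rows as (CITY, CONTENT); the user intends to drop null-CONTENT rows first.
        String[][] rows = {{"London", "StrA"}, {"Bristol", null}, {"New York", "StrC"}};
        // If evaluation is reordered so the UDF runs before the null filter,
        // it still sees the "Bristol" row the user believed was gone:
        for (String[] row : rows) {
            try {
                indexer.index(row[0]);
            } catch (IllegalStateException e) {
                System.out.println(e.getMessage());  // Unseen label: Bristol
            }
        }
    }
}
```

Seen this way, the exception is not about the visible contents of `dfWithIndex` at all; it is about which rows the physical plan feeds to the UDF.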
[jira] [Assigned] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22446: --- Assignee: Liang-Chi Hsieh > Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" > exception incorrectly for filtered data. > --- > > Key: SPARK-22446 > URL: https://issues.apache.org/jira/browse/SPARK-22446 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.0, 2.2.0 > Environment: spark-shell, local mode, macOS Sierra 10.12.6 >Reporter: Greg Bellchambers >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > In the following, the `indexer` UDF defined inside the > `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an > "Unseen label" error, despite the label not being present in the transformed > DataFrame. > Here is the definition of the indexer UDF in the transform method: > {code:java} > val indexer = udf { label: String => > if (labelToIndex.contains(label)) { > labelToIndex(label) > } else { > throw new SparkException(s"Unseen label: $label.") > } > } > {code} > We can demonstrate the error with a very simple example DataFrame. > {code:java} > scala> import org.apache.spark.ml.feature.StringIndexer > import org.apache.spark.ml.feature.StringIndexer > scala> // first we create a DataFrame with three cities > scala> val df = List( > | ("A", "London", "StrA"), > | ("B", "Bristol", null), > | ("C", "New York", "StrC") > | ).toDF("ID", "CITY", "CONTENT") > df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more > field] > scala> df.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | B| Bristol| null| > | C|New York| StrC| > +---++---+ > scala> // then we remove the row with null in CONTENT column, which removes > Bristol > scala> val dfNoBristol = finalStatic.filter($"CONTENT".isNotNull) > dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: > string, CITY: string ... 
1 more field] > scala> dfNoBristol.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | C|New York| StrC| > +---++---+ > scala> // now create a StringIndexer for the CITY column and fit to > dfNoBristol > scala> val model = { > | new StringIndexer() > | .setInputCol("CITY") > | .setOutputCol("CITYIndexed") > | .fit(dfNoBristol) > | } > model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb > scala> // the StringIndexerModel has only two labels: "London" and "New York" > scala> str.labels foreach println > London > New York > scala> // transform our DataFrame to add an index column > scala> val dfWithIndex = model.transform(dfNoBristol) > dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 > more fields] > scala> dfWithIndex.show > +---++---+---+ > | ID|CITY|CONTENT|CITYIndexed| > +---++---+---+ > | A| London| StrA|0.0| > | C|New York| StrC|1.0| > +---++---+---+ > {code} > The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` > equal to 1.0 and perform an action. The `indexer` UDF in `transform` method > throws an exception reporting unseen label "Bristol". 
This is irrational > behaviour as far as the user of the API is concerned, because there is no > such value as "Bristol" when do show all rows of `dfWithIndex`: > {code:java} > scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count > 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40) > org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$5: (string) => double) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) >
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however, I can't seem to reduce it to a sample. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File 
"/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File 
"/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code:python}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/pyt
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {{x = a.subtract(b) y = b.subtract(a)}} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {{x = a.subtract(b) y = b.subtract(a)}} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
One of the errors I will get is: {{ File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'}} Another error is: {{ File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.p
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code:python}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File 
"/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {{x = a.subtract(b) y = b.subtract(a)}} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {{x = a.subtract(b) y = b.subtract(a)}} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. One of the errors I will get is: {{ File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'}} Another error is: {{ File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File 
"/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745)}} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: x = a.subtract(b) y = b.subtract(a) I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
The error I will get is: File "", line 642, in if y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add' Sometimes the error will complain about it not having a 'size' parameter. > subtract creating empty DataFrame that isn't initialised properly > -- > > Key: SPARK-22468 > URL: https://issues.apache.org/jira/browse/SPARK-22468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: James Porritt > > I have an issue whereby a subtract between two DataFrames that will correctly > end up with an empty DataFrame, seemingly has the DataFra
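The two-way subtract-then-isEmpty pattern from the report above can be sketched without Spark. This is a plain-Python analogue (not the PySpark API) of what `a.subtract(b)` computes, a set difference of rows, with rows modelled as hashable tuples:

```python
# Plain-Python sketch of the two-way subtract pattern from the report.
# Spark's DataFrame.subtract returns the rows of the left frame that do
# not appear in the right one; rows here are hashable tuples.

def subtract(a, b):
    """Rows of `a` that do not appear in `b` (set difference)."""
    b_set = set(b)
    return [row for row in a if row not in b_set]

a = [(1, "x"), (2, "y"), (3, "z")]
b = [(2, "y"), (3, "z")]

x = subtract(a, b)   # non-empty: [(1, "x")]
y = subtract(b, a)   # empty list

if not y:            # analogous to the report's y.rdd.isEmpty() check
    print("y is empty")
```

In plain Python an empty result is always safe to test; the report is that in PySpark 2.2.0 the empty side of the pair sometimes fails inside `rdd.isEmpty()` instead.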
[jira] [Assigned] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
[ https://issues.apache.org/jira/browse/SPARK-22470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22470: Assignee: Apache Spark > Doc that functions.hash is also used internally for shuffle and bucketing > - > > Key: SPARK-22470 > URL: https://issues.apache.org/jira/browse/SPARK-22470 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Assignee: Apache Spark > > https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that > appears to be the same hash function as what Spark uses internally for > shuffle and bucketing. > One of my users would like to bake this assumption into code, but is unsure > if it's a guarantee or a coincidence that they're the same function. Would > it be considered an API break if at some point the two functions were > different, or if the implementation of both changed together? > We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
[ https://issues.apache.org/jira/browse/SPARK-22470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22470: Assignee: (was: Apache Spark) > Doc that functions.hash is also used internally for shuffle and bucketing > - > > Key: SPARK-22470 > URL: https://issues.apache.org/jira/browse/SPARK-22470 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that > appears to be the same hash function as what Spark uses internally for > shuffle and bucketing. > One of my users would like to bake this assumption into code, but is unsure > if it's a guarantee or a coincidence that they're the same function. Would > it be considered an API break if at some point the two functions were > different, or if the implementation of both changed together? > We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
[ https://issues.apache.org/jira/browse/SPARK-22470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243720#comment-16243720 ] Apache Spark commented on SPARK-22470: -- User 'ash211' has created a pull request for this issue: https://github.com/apache/spark/pull/19694 > Doc that functions.hash is also used internally for shuffle and bucketing > - > > Key: SPARK-22470 > URL: https://issues.apache.org/jira/browse/SPARK-22470 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that > appears to be the same hash function as what Spark uses internally for > shuffle and bucketing. > One of my users would like to bake this assumption into code, but is unsure > if it's a guarantee or a coincidence that they're the same function. Would > it be considered an API break if at some point the two functions were > different, or if the implementation of both changed together? > We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
Andrew Ash created SPARK-22470: -- Summary: Doc that functions.hash is also used internally for shuffle and bucketing Key: SPARK-22470 URL: https://issues.apache.org/jira/browse/SPARK-22470 Project: Spark Issue Type: Documentation Components: Documentation, SQL Affects Versions: 2.2.0 Reporter: Andrew Ash https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that appears to be the same hash function as what Spark uses internally for shuffle and bucketing. One of my users would like to bake this assumption into code, but is unsure if it's a guarantee or a coincidence that they're the same function. Would it be considered an API break if at some point the two functions were different, or if the implementation of both changed together? We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
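The guarantee the reporter is asking about matters because bucketing bakes hash values into file layout. A minimal sketch of that dependency, using `zlib.crc32` as a stand-in hash (it is NOT Spark's Murmur3-based `functions.hash`):

```python
# Sketch of why a *stable* hash function matters for bucketing.
# `stable_hash` is a stand-in (zlib.crc32), not Spark's internal hash.
import zlib

def stable_hash(key: str) -> int:
    return zlib.crc32(key.encode("utf-8"))

def bucket_for(key: str, num_buckets: int) -> int:
    # Bucket assignment is hash(key) mod num_buckets. Data written under
    # one hash function can only be read back bucket-aware if lookups use
    # the exact same function later -- hence the doc request.
    return stable_hash(key) % num_buckets

# The same key must always land in the same bucket across runs:
assert bucket_for("user-42", 8) == bucket_for("user-42", 8)
```

If `functions.hash` and the internal shuffle/bucketing hash were ever allowed to diverge, code relying on their equality would silently read from the wrong buckets, which is why the ticket asks for the scaladoc to state the contract explicitly.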
[jira] [Commented] (SPARK-22469) Accuracy problem in comparison with string and numeric
[ https://issues.apache.org/jira/browse/SPARK-22469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243600#comment-16243600 ] Apache Spark commented on SPARK-22469: -- User 'liutang123' has created a pull request for this issue: https://github.com/apache/spark/pull/19692 > Accuracy problem in comparison with string and numeric > --- > > Key: SPARK-22469 > URL: https://issues.apache.org/jira/browse/SPARK-22469 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Lijia Liu > > {code:sql} > select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. > {code} > IIUC, we can cast string as double like Hive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22469) Accuracy problem in comparison with string and numeric
[ https://issues.apache.org/jira/browse/SPARK-22469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22469: Assignee: Apache Spark > Accuracy problem in comparison with string and numeric > --- > > Key: SPARK-22469 > URL: https://issues.apache.org/jira/browse/SPARK-22469 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Lijia Liu >Assignee: Apache Spark > > {code:sql} > select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. > {code} > IIUC, we can cast string as double like Hive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22469) Accuracy problem in comparison with string and numeric
[ https://issues.apache.org/jira/browse/SPARK-22469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22469: Assignee: (was: Apache Spark) > Accuracy problem in comparison with string and numeric > --- > > Key: SPARK-22469 > URL: https://issues.apache.org/jira/browse/SPARK-22469 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Lijia Liu > > {code:sql} > select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. > {code} > IIUC, we can cast string as double like Hive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22469) Accuracy problem in comparison with string and numeric
Lijia Liu created SPARK-22469: - Summary: Accuracy problem in comparison with string and numeric Key: SPARK-22469 URL: https://issues.apache.org/jira/browse/SPARK-22469 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Lijia Liu {code:sql} select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. {code} IIUC, we can cast string as double like Hive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
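The Hive behaviour the ticket proposes (cast the string operand to double before comparing) can be illustrated in plain Python; this is an illustration of the coercion rule only, not Spark or Hive code:

```python
# Illustration (plain Python, not Spark) of the coercion the ticket
# proposes: for '1.5' > 0.5, Hive casts the string side to double.
# The ticket reports that Spark 2.2 instead yields NULL.

def hive_style_compare(s: str, n: float):
    """Cast the string operand to double, as Hive does, then compare."""
    try:
        return float(s) > n
    except ValueError:
        return None  # NULL when the string is not a valid number

print(hive_style_compare("1.5", 0.5))  # True, matching Hive
print(hive_style_compare("abc", 0.5))  # None (NULL)
```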
[jira] [Created] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
James Porritt created SPARK-22468: - Summary: subtract creating empty DataFrame that isn't initialised properly Key: SPARK-22468 URL: https://issues.apache.org/jira/browse/SPARK-22468 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Reporter: James Porritt I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: x = a.subtract(b) y = b.subtract(a) I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. The error I will get is: File "", line 642, in if y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add' Sometimes the error will complain about it not having a 'size' parameter. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22281) Handle R method breaking signature changes
[ https://issues.apache.org/jira/browse/SPARK-22281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-22281. -- Resolution: Fixed Assignee: Felix Cheung Fix Version/s: 2.3.0 2.2.1 > Handle R method breaking signature changes > -- > > Key: SPARK-22281 > URL: https://issues.apache.org/jira/browse/SPARK-22281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1, 2.3.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.1, 2.3.0 > > > As discussed here > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555 > this WARNING on R-devel > * checking for code/documentation mismatches ... WARNING > Codoc mismatches from documentation object 'attach': > attach > Code: function(what, pos = 2L, name = deparse(substitute(what), > backtick = FALSE), warn.conflicts = TRUE) > Docs: function(what, pos = 2L, name = deparse(substitute(what)), > warn.conflicts = TRUE) > Mismatches in argument default values: > Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: > deparse(substitute(what)) > Codoc mismatches from documentation object 'glm': > glm > Code: function(formula, family = gaussian, data, weights, subset, > na.action, start = NULL, etastart, mustart, offset, > control = list(...), model = TRUE, method = "glm.fit", > x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = > NULL, ...) > Docs: function(formula, family = gaussian, data, weights, subset, > na.action, start = NULL, etastart, mustart, offset, > control = list(...), model = TRUE, method = "glm.fit", > x = FALSE, y = TRUE, contrasts = NULL, ...) > Argument names in code not in docs: > singular.ok > Mismatches in argument names: > Position: 16 Code: singular.ok Docs: contrasts > Position: 17 Code: contrasts Docs: ... > Checked the latest release R 3.4.1 and the signature change wasn't there. 
> This likely indicates an upcoming change in the next R release that could trigger this new warning when we attempt to publish the package.
> It is not clear what we can do now, since we work with multiple versions of R and they will have different signatures.
[jira] [Assigned] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14922: Assignee: (was: Apache Spark)
> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q
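Until predicate-based partition specs are supported, one workaround is to enumerate the matching partitions and issue one equality-based DROP PARTITION statement per match, which Spark does accept. A minimal sketch of that expansion in plain Python (the partition list and predicate are hard-coded for illustration; in practice they would come from SHOW PARTITIONS):

```python
# Known partitions of the table (hard-coded here for illustration).
partitions = [
    {"c": "US", "d": "1"},
    {"c": "US", "d": "3"},
    {"c": "UK", "d": "1"},
]

# Predicate from the Hive example: c='US' AND d<'2'
# (string comparison, matching how Hive compares string partition values).
matching = [p for p in partitions if p["c"] == "US" and p["d"] < "2"]

# Expand into the equality-based drops that Spark accepts.
statements = [
    "ALTER TABLE ptestfilter DROP PARTITION (c='{c}', d='{d}')".format(**p)
    for p in matching
]
```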
[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243488#comment-16243488 ] Apache Spark commented on SPARK-14922: -- User 'DazhuangSu' has created a pull request for this issue: https://github.com/apache/spark/pull/19691
> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q
[jira] [Commented] (SPARK-22467) Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
[ https://issues.apache.org/jira/browse/SPARK-22467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243481#comment-16243481 ] Apache Spark commented on SPARK-22467: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/19690
> Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
>
>
> Key: SPARK-22467
> URL: https://issues.apache.org/jira/browse/SPARK-22467
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: liuxian
>
> We should add a switch to control whether `stdout_stream` and `stderr_stream` output to disk.
> In my environment, due to disk I/O blocking, the `stdout_stream` output is very slow, so it cannot be cleaned up in a timely manner, and this causes the executor process to block.
[jira] [Assigned] (SPARK-22467) Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
[ https://issues.apache.org/jira/browse/SPARK-22467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22467: Assignee: Apache Spark
> Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
>
>
> Key: SPARK-22467
> URL: https://issues.apache.org/jira/browse/SPARK-22467
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: liuxian
> Assignee: Apache Spark
>
> We should add a switch to control whether `stdout_stream` and `stderr_stream` output to disk.
> In my environment, due to disk I/O blocking, the `stdout_stream` output is very slow, so it cannot be cleaned up in a timely manner, and this causes the executor process to block.
[jira] [Created] (SPARK-22467) Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
liuxian created SPARK-22467:
---
Summary: Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
Key: SPARK-22467
URL: https://issues.apache.org/jira/browse/SPARK-22467
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian

We should add a switch to control whether `stdout_stream` and `stderr_stream` output to disk.
In my environment, due to disk I/O blocking, the `stdout_stream` output is very slow, so it cannot be cleaned up in a timely manner, and this causes the executor process to block.
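If implemented, such a switch would presumably be exposed as a Spark configuration property. A sketch of what setting it might look like in spark-defaults.conf — note that the property name below is hypothetical, invented purely for illustration; no such key exists in Spark:

```
# Hypothetical property (NOT a real Spark configuration key):
# when false, skip writing `stdout_stream`/`stderr_stream` to disk,
# avoiding the disk I/O blocking described in this issue.
spark.executor.logs.redirectToDisk  false
```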
[jira] [Assigned] (SPARK-22467) Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
[ https://issues.apache.org/jira/browse/SPARK-22467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22467: Assignee: (was: Apache Spark)
> Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
>
>
> Key: SPARK-22467
> URL: https://issues.apache.org/jira/browse/SPARK-22467
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: liuxian
>
> We should add a switch to control whether `stdout_stream` and `stderr_stream` output to disk.
> In my environment, due to disk I/O blocking, the `stdout_stream` output is very slow, so it cannot be cleaned up in a timely manner, and this causes the executor process to block.