[jira] [Assigned] (SPARK-22476) Add new function dayofweek in R
[ https://issues.apache.org/jira/browse/SPARK-22476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22476: Assignee: (was: Apache Spark) > Add new function dayofweek in R > --- > > Key: SPARK-22476 > URL: https://issues.apache.org/jira/browse/SPARK-22476 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > SQL was added in SPARK-20909. > Scala and Python for {{dayofweek}} were added in SPARK-22456. > It looks like we should add it in R too. > Please refer to both JIRAs for more details. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22476) Add new function dayofweek in R
[ https://issues.apache.org/jira/browse/SPARK-22476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245305#comment-16245305 ] Apache Spark commented on SPARK-22476: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19706
[jira] [Assigned] (SPARK-22476) Add new function dayofweek in R
[ https://issues.apache.org/jira/browse/SPARK-22476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22476: Assignee: Apache Spark
[jira] [Created] (SPARK-22476) Add new function dayofweek in R
Hyukjin Kwon created SPARK-22476: Summary: Add new function dayofweek in R Key: SPARK-22476 URL: https://issues.apache.org/jira/browse/SPARK-22476 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Priority: Minor SQL was added in SPARK-20909. Scala and Python for {{dayofweek}} were added in SPARK-22456. It looks like we should add it in R too. Please refer to both JIRAs for more details.
[jira] [Commented] (SPARK-22308) Support unit tests of spark code using ScalaTest using suites other than FunSuite
[ https://issues.apache.org/jira/browse/SPARK-22308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245283#comment-16245283 ] Apache Spark commented on SPARK-22308: -- User 'nkronenfeld' has created a pull request for this issue: https://github.com/apache/spark/pull/19705 > Support unit tests of spark code using ScalaTest using suites other than > FunSuite > - > > Key: SPARK-22308 > URL: https://issues.apache.org/jira/browse/SPARK-22308 > Project: Spark > Issue Type: Improvement > Components: Documentation, Spark Core, SQL, Tests >Affects Versions: 2.2.0 >Reporter: Nathan Kronenfeld >Assignee: Nathan Kronenfeld >Priority: Minor > Labels: scalatest, test-suite, test_issue > > External codebases that have Spark code can test it using SharedSparkContext, > no matter how they write their ScalaTest suites - based on FunSuite, FunSpec, > FlatSpec, or WordSpec. > SharedSQLContext only supports FunSuite.
[jira] [Assigned] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22456: Assignee: Michael Styles > Add new function dayofweek > -- > > Key: SPARK-22456 > URL: https://issues.apache.org/jira/browse/SPARK-22456 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles >Assignee: Michael Styles > Fix For: 2.3.0 > > > Add new function *dayofweek* to return the day of the week of the given > argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21657) Spark has exponential time complexity to explode(array of structs)
[ https://issues.apache.org/jira/browse/SPARK-21657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245240#comment-16245240 ] Ruslan Dautkhanov commented on SPARK-21657: --- Thank you [~uzadude] - great investigative work. Would be great if this patch can make it to the 2.3 release. > Spark has exponential time complexity to explode(array of structs) > -- > > Key: SPARK-21657 > URL: https://issues.apache.org/jira/browse/SPARK-21657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ruslan Dautkhanov > Labels: cache, caching, collections, nested_types, performance, > pyspark, sparksql, sql > Attachments: ExponentialTimeGrowth.PNG, > nested-data-generator-and-test.py > > > It can take up to half a day to explode a modest-sized nested collection > (0.5m) on recent Xeon processors. > See the attached pyspark script that reproduces this problem. > {code} > cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + > table_name).cache() > print cached_df.count() > {code} > This script generates a number of tables, with the same total number of > records across all nested collections (see the `scaling` variable in the loops). > The `scaling` variable scales up how many nested elements are in each record, but by > the same factor scales down the number of records in the table, so the total number > of records stays the same. > Time grows exponentially (notice the log-10 vertical axis scale): > !ExponentialTimeGrowth.PNG! > At a scaling of 50,000 (see the attached pyspark script), it took 7 hours to > explode the nested collections (\!) of 8k records. > After 1000 elements in a nested collection, time grows exponentially.
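The benchmark scheme described in the report (nested size and record count scaled inversely so the total element count stays constant) can be sketched in plain Python; the names `make_table`, `TOTAL_ELEMENTS`, and the sizes here are hypothetical stand-ins for the attached generator script, not its actual contents:

```python
# Hypothetical sketch of the scaling scheme from the report: as `scaling`
# grows, each record holds more nested elements ("amft") but the table holds
# proportionally fewer records, so the total element count is constant.
TOTAL_ELEMENTS = 100_000  # assumed total; the attached script uses its own sizes

def make_table(scaling):
    # scaling = nested elements per record; records shrink by the same factor
    n_records = TOTAL_ELEMENTS // scaling
    return [{"individ": i, "amft": list(range(scaling))} for i in range(n_records)]

for scaling in (10, 100, 1000):
    table = make_table(scaling)
    total = sum(len(r["amft"]) for r in table)
    assert total == TOTAL_ELEMENTS  # same work per table, only nesting differs
```

With the total held constant, any super-linear growth in explode time as `scaling` increases points at the per-collection cost, which is what the attached plot shows.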
[jira] [Resolved] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22456. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19672 [https://github.com/apache/spark/pull/19672]
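The semantics the ticket describes (an integer in the range 1-7, where 1 represents Sunday) can be illustrated with a minimal pure-Python sketch; this mirrors only the described contract, not Spark's actual Catalyst implementation:

```python
from datetime import date

def dayofweek(d):
    # Contract from the ticket: 1 = Sunday .. 7 = Saturday.
    # date.isoweekday() gives Monday=1 .. Sunday=7, so shift Sunday to 1.
    return d.isoweekday() % 7 + 1

print(dayofweek(date(2017, 11, 5)))   # a Sunday  -> 1
print(dayofweek(date(2017, 11, 11)))  # a Saturday -> 7
```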
[jira] [Assigned] (SPARK-22466) SPARK_CONF_DIR is not set by Spark's launch scripts with a default value
[ https://issues.apache.org/jira/browse/SPARK-22466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22466: Assignee: Kent Yao > SPARK_CONF_DIR is not set by Spark's launch scripts with a default value > - > > Key: SPARK-22466 > URL: https://issues.apache.org/jira/browse/SPARK-22466 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.1.1, 2.2.0 >Reporter: Kent Yao >Assignee: Kent Yao > Fix For: 2.3.0 > > > If we don't explicitly set `SPARK_CONF_DIR` in `spark-env.sh`, it is only set > in `sbin/spark-config.sh` with a default value, which is used for the Spark > daemons; applications launched with spark-submit will miss this environment > variable at runtime.
[jira] [Resolved] (SPARK-22466) SPARK_CONF_DIR is not set by Spark's launch scripts with a default value
[ https://issues.apache.org/jira/browse/SPARK-22466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22466. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19688 [https://github.com/apache/spark/pull/19688]
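The defaulting behavior the ticket wants (use `SPARK_CONF_DIR` if exported, otherwise fall back to `$SPARK_HOME/conf`) can be sketched as a small function; the actual fix lands in the shell launch scripts, and the function name here is purely illustrative:

```python
import os

def resolve_spark_conf_dir(env, spark_home):
    # Use the caller's SPARK_CONF_DIR if exported; otherwise default to
    # $SPARK_HOME/conf, matching what sbin/spark-config.sh does for daemons.
    return env.get("SPARK_CONF_DIR") or os.path.join(spark_home, "conf")

print(resolve_spark_conf_dir({}, "/opt/spark"))                                # /opt/spark/conf
print(resolve_spark_conf_dir({"SPARK_CONF_DIR": "/etc/spark"}, "/opt/spark"))  # /etc/spark
```

The point of the ticket is that this fallback should also apply to applications launched via spark-submit, not only to the daemons.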
[jira] [Commented] (SPARK-22417) createDataFrame from a pandas.DataFrame reads datetime64 values as longs
[ https://issues.apache.org/jira/browse/SPARK-22417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245204#comment-16245204 ] Apache Spark commented on SPARK-22417: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/19704 > createDataFrame from a pandas.DataFrame reads datetime64 values as longs > > > Key: SPARK-22417 > URL: https://issues.apache.org/jira/browse/SPARK-22417 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.2.1, 2.3.0 > > > When trying to create a Spark DataFrame from an existing Pandas DataFrame > using {{createDataFrame}}, columns with datetime64 values are converted as > long values. This is only when the schema is not specified. > {code} > In [2]: import pandas as pd >...: from datetime import datetime >...: > In [3]: pdf = pd.DataFrame({"ts": [datetime(2017, 10, 31, 1, 1, 1)]}) > In [4]: df = spark.createDataFrame(pdf) > In [5]: df.show() > +---+ > | ts| > +---+ > |15094116610| > +---+ > In [6]: df.schema > Out[6]: StructType(List(StructField(ts,LongType,true))) > {code} > Spark should interpret a datetime64\[D\] value to DateType and other > datetime64 values to TImestampType. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
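The root of the behavior described above is that a datetime64 column is stored internally as an integer offset from the epoch; without a schema declaring the column a timestamp, that raw integer is what leaks through as LongType. A stdlib-only sketch of the two views of the same instant (the nanosecond resolution is assumed here, matching pandas' default datetime64[ns]):

```python
from datetime import datetime, timezone

# The instant from the report's pandas example, pinned to UTC so the
# integer view is deterministic.
dt = datetime(2017, 10, 31, 1, 1, 1, tzinfo=timezone.utc)

seconds = int(dt.timestamp())     # integer (epoch-seconds) view of the instant
nanos = seconds * 1_000_000_000   # datetime64[ns] stores this integer internally

# Interpreted back *as a timestamp*, the integer round-trips cleanly;
# left uninterpreted, it surfaces as a bare long column.
roundtrip = datetime.fromtimestamp(seconds, tz=timezone.utc)
assert roundtrip == dt
```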
[jira] [Commented] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245191#comment-16245191 ] Apache Spark commented on SPARK-22403: -- User 'wypoon' has created a pull request for this issue: https://github.com/apache/spark/pull/19703 > StructuredKafkaWordCount example fails in YARN cluster mode > --- > > Key: SPARK-22403 > URL: https://issues.apache.org/jira/browse/SPARK-22403 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Wing Yew Poon > > When I run the StructuredKafkaWordCount example in YARN client mode, it runs > fine. However, when I run it in YARN cluster mode, the application errors > during initialization, and dies after the default number of YARN application > attempts. In the AM log, I see > {noformat} > 17/10/30 11:34:52 INFO execution.SparkSqlParser: Parsing command: CAST(value > AS STRING) > 17/10/30 11:34:53 ERROR streaming.StreamMetadata: Error writing stream > metadata StreamMetadata(b71ca714-a7a1-467f-96aa-023375964429) to > /data/yarn/nm/usercache/systest/appcache/application_1508800814252_0047/container_1508800814252_0047_01_01/tmp/temporary-b5ced4ae-32e0-4725-b905-aad679aec9b5/metadata > org.apache.hadoop.security.AccessControlException: Permission denied: > user=systest, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:397) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:256) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:194) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1842) > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1826) > at > 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkAncestorAccess(FSDirectory.java:1785) > at > org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.resolvePathForStartFile(FSDirWriteFileOp.java:315) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2313) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2257) > ... > at > org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:280) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1235) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1214) > at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1152) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:458) > at > org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:455) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:469) > at > org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:396) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1103) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1083) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:972) > at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:960) > at > org.apache.spark.sql.execution.streaming.StreamMetadata$.write(StreamMetadata.scala:76) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:116) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$6.apply(StreamExecution.scala:114) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.StreamExecution.(StreamExecution.scala:114) > at > org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:240) > at > 
org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278) > at > org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282) > at > org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount$.main(StructuredKafkaWordCount.scala:79) > at > org.apache.spark.examples.sql.streaming.StructuredKafkaWordCount.main(StructuredKafkaWordCount.scala) > {noformat} > Looking at StreamingQueryManager#createQuery, we have > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/StreamingQueryManager.scala#L198 > {code} > val checkpointLocation = userSpecifiedCheckpointLocation.map { ..
[jira] [Assigned] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22403: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-22403) StructuredKafkaWordCount example fails in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-22403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22403: Assignee: Apache Spark
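The fallback chain quoted from StreamingQueryManager#createQuery can be sketched as follows; the function and parameter names are illustrative stand-ins, not Spark's actual code:

```python
import tempfile

def resolve_checkpoint_location(user_specified, conf_default, use_temp):
    # Sketch of the chain: user-specified checkpointLocation, else the
    # session's configured default, else a *local* temporary directory when
    # useTempCheckpointLocation is set (the case the example hits).
    if user_specified is not None:
        return user_specified
    if conf_default is not None:
        return conf_default
    if use_temp:
        # In YARN cluster mode this locally-derived path is resolved against
        # HDFS, where writing under "/" triggers the AccessControlException.
        return tempfile.mkdtemp(prefix="temporary-")
    raise ValueError("checkpointLocation must be specified")
```

The client-vs-cluster difference falls out of the last branch: the temp path is valid on the driver's local filesystem but not as an HDFS path the application user can write to.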
[jira] [Resolved] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saniya Tech resolved SPARK-22460. - Resolution: Not A Problem > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that the timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years into > the future. > Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
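The year-49819 output in the report is the signature of a unit mismatch: the writer records milliseconds since the epoch, and the reader treats the same long as seconds, inflating the offset by roughly 1000x. A stdlib sketch (the millisecond value below is an arbitrary November 2017 instant, not the exact one from the report):

```python
from datetime import datetime, timezone

millis = 1509999979419  # an instant in November 2017, in *milliseconds*

# Correct interpretation: divide by 1000 to get epoch-seconds.
correct = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
assert correct.year == 2017

# Misreading milliseconds as seconds multiplies the epoch offset by ~1000,
# landing tens of thousands of years out -- the report's year 49819.
approx_year = 1970 + millis // 31_556_952  # seconds per average Gregorian year
assert approx_year > 40_000
```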
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245163#comment-16245163 ] Saniya Tech commented on SPARK-22460: - Based on the feedback I am going to close this ticket and try to resolve the issue in the spark-avro code-base. Thanks!
[jira] [Comment Edited] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245158#comment-16245158 ] Liang-Chi Hsieh edited comment on SPARK-22460 at 11/9/17 3:33 AM: -- From the output of: {code} print(s"${rawOutput.collect().head}\n") {code} The modified field is already parsed as timestamp type. was (Author: viirya): From the output of print(s"${rawOutput.collect().head}\n"), the modified field is already parsed as timestamp type.
[jira] [Comment Edited] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245158#comment-16245158 ] Liang-Chi Hsieh edited comment on SPARK-22460 at 11/9/17 3:32 AM: -- From the output of print(s"${rawOutput.collect().head}\n"), the modified field is already parsed as timestamp type. was (Author: viirya): From the output of {{print(s"${rawOutput.collect().head}\n")}}, the modified field is already parsed as timestamp type. > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future. 
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245158#comment-16245158 ] Liang-Chi Hsieh commented on SPARK-22460: - From the output of {{print(s"${rawOutput.collect().head}\n")}}, the modified field is already parsed as timestamp type. > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future. 
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10365) Support Parquet logical type TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10365: Assignee: Apache Spark > Support Parquet logical type TIMESTAMP_MICROS > - > > Key: SPARK-10365 > URL: https://issues.apache.org/jira/browse/SPARK-10365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > Didn't assign target version for this ticket because neither the most recent > parquet-mr release (1.8.1) nor the master branch has supported > {{TIMESTAMP_MICROS}} yet. > It would be nice to map Spark SQL {{TimestampType}} to {{TIMESTAMP_MICROS}} > since Parquet {{INT96}} will probably be deprecated and is only used for > compatibility reason for now. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
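Since Spark SQL represents {{TimestampType}} internally as microseconds since the epoch, {{TIMESTAMP_MICROS}} is a lossless int64 mapping. A hedged sketch (plain Python arithmetic, with a made-up sample value) of why a millisecond encoding would not be:

```python
# Sketch: microsecond-precision epoch values survive a TIMESTAMP_MICROS
# round trip unchanged, but a millisecond encoding truncates them.
micros = 1_510_085_179_419_123  # hypothetical timestamp with sub-ms digits

as_micros = micros          # TIMESTAMP_MICROS: the int64 is stored as-is
as_millis = micros // 1000  # a millisecond encoding drops the last digits

print(as_micros == micros)         # True: lossless
print(as_millis * 1000 == micros)  # False: 123 microseconds were lost
```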
[jira] [Assigned] (SPARK-10365) Support Parquet logical type TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10365: Assignee: (was: Apache Spark) > Support Parquet logical type TIMESTAMP_MICROS > - > > Key: SPARK-10365 > URL: https://issues.apache.org/jira/browse/SPARK-10365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > > Didn't assign target version for this ticket because neither the most recent > parquet-mr release (1.8.1) nor the master branch has supported > {{TIMESTAMP_MICROS}} yet. > It would be nice to map Spark SQL {{TimestampType}} to {{TIMESTAMP_MICROS}} > since Parquet {{INT96}} will probably be deprecated and is only used for > compatibility reason for now. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10365) Support Parquet logical type TIMESTAMP_MICROS
[ https://issues.apache.org/jira/browse/SPARK-10365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245082#comment-16245082 ] Apache Spark commented on SPARK-10365: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19702 > Support Parquet logical type TIMESTAMP_MICROS > - > > Key: SPARK-10365 > URL: https://issues.apache.org/jira/browse/SPARK-10365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > > Didn't assign target version for this ticket because neither the most recent > parquet-mr release (1.8.1) nor the master branch has supported > {{TIMESTAMP_MICROS}} yet. > It would be nice to map Spark SQL {{TimestampType}} to {{TIMESTAMP_MICROS}} > since Parquet {{INT96}} will probably be deprecated and is only used for > compatibility reason for now. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245063#comment-16245063 ] Apache Spark commented on SPARK-22211: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/19701 > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Assignee: Henry Robinson > Fix For: 2.2.1, 2.3.0 > > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
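The unsoundness described above can be sketched outside Spark. The following plain-Python stand-in for a full outer join on matching keys (not Spark's implementation; names and data are illustrative) shows how limiting one input before the join manufactures (null, x) rows that the unoptimized plan never produces:

```python
# Sketch: a tiny full outer join on equal keys, plain Python.
def full_outer(left, right):
    rows = [(x, x) for x in left if x in right]          # matched pairs
    rows += [(x, None) for x in left if x not in right]  # left-only rows
    rows += [(None, y) for y in right if y not in left]  # right-only rows
    return rows

left, right = [1, 2, 3], [1, 2, 3]

correct = full_outer(left, right)     # every key matches: no null rows
pushed = full_outer(left[:1], right)  # LocalLimit 1 pushed below the join

print(correct)  # [(1, 1), (2, 2), (3, 3)]
print(pushed)   # [(1, 1), (None, 2), (None, 3)]
```

Taking limit(1) of `pushed` can return (None, 2), a row that does not exist in the correct join at all, which matches the (null, 148) result in the notebook output above.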
[jira] [Commented] (SPARK-22211) LimitPushDown optimization for FullOuterJoin generates wrong results
[ https://issues.apache.org/jira/browse/SPARK-22211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245029#comment-16245029 ] Dongjoon Hyun commented on SPARK-22211: --- Hi, All. This breaks `branch-2.2`. - SBT: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-sbt-hadoop-2.7/408/ - MAVEN: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.2-test-maven-hadoop-2.7/423 > LimitPushDown optimization for FullOuterJoin generates wrong results > > > Key: SPARK-22211 > URL: https://issues.apache.org/jira/browse/SPARK-22211 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 > Environment: on community.cloude.databrick.com > Runtime Version 3.2 (includes Apache Spark 2.2.0, Scala 2.11) >Reporter: Benyi Wang >Assignee: Henry Robinson > Fix For: 2.2.1, 2.3.0 > > > LimitPushDown pushes LocalLimit to one side for FullOuterJoin, but this may > generate a wrong result: > Assume we use limit(1) and LocalLimit will be pushed to left side, and id=999 > is selected, but at right side we have 100K rows including 999, the result > will be > - one row is (999, 999) > - the rest rows are (null, xxx) > Once you call show(), the row (999,999) has only 1/10th chance to be > selected by CollectLimit. > The actual optimization might be, > - push down limit > - but convert the join to Broadcast LeftOuterJoin or RightOuterJoin. 
> Here is my notebook: > https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/349451637617406/2750346983121008/656075277290/latest.html > {code:java} > import scala.util.Random._ > val dl = shuffle(1 to 10).toDF("id") > val dr = shuffle(1 to 10).toDF("id") > println("data frame dl:") > dl.explain > println("data frame dr:") > dr.explain > val j = dl.join(dr, dl("id") === dr("id"), "outer").limit(1) > j.explain > j.show(false) > {code} > {code} > data frame dl: > == Physical Plan == > LocalTableScan [id#10] > data frame dr: > == Physical Plan == > LocalTableScan [id#16] > == Physical Plan == > CollectLimit 1 > +- SortMergeJoin [id#10], [id#16], FullOuter >:- *Sort [id#10 ASC NULLS FIRST], false, 0 >: +- Exchange hashpartitioning(id#10, 200) >: +- *LocalLimit 1 >:+- LocalTableScan [id#10] >+- *Sort [id#16 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#16, 200) > +- LocalTableScan [id#16] > import scala.util.Random._ > dl: org.apache.spark.sql.DataFrame = [id: int] > dr: org.apache.spark.sql.DataFrame = [id: int] > j: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, id: int] > ++---+ > |id |id | > ++---+ > |null|148| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22475) show histogram in DESC COLUMN command
Zhenhua Wang created SPARK-22475: Summary: show histogram in DESC COLUMN command Key: SPARK-22475 URL: https://issues.apache.org/jira/browse/SPARK-22475 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Zhenhua Wang -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244887#comment-16244887 ] Nick Dimiduk commented on SPARK-17593: -- So the fix in Hadoop 2.8 is for any variant of the s3* FileSystem? Or is it only for s3a? bq. as it really needs Spark to move to listFiles(recursive) Do we still need this change to be shipped in Spark? Thanks. > list files on s3 very slow > -- > > Key: SPARK-17593 > URL: https://issues.apache.org/jira/browse/SPARK-17593 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: spark 2.0.0, hadoop 2.7.2 ( hadoop 2.7.3) >Reporter: Gaurav Shah >Priority: Minor > > lets say we have following partitioned data: > {code} > events_v3 > -- event_date=2015-01-01 > event_hour=0 > -- verb=follow > part1.parquet.gz > event_hour=1 > -- verb=click > part1.parquet.gz > -- event_date=2015-01-02 > event_hour=5 > -- verb=follow > part1.parquet.gz > event_hour=10 > -- verb=click > part1.parquet.gz > {code} > To read (or write ) parquet partitioned data via spark it makes call to > `ListingFileCatalog.listLeafFiles` . Which recursively tries to list all > files and folders. > In this case if we had 300 dates, we would have created 300 jobs each trying > to get filelist from date_directory. This process takes about 10 minutes to > finish ( with 2 executors). vs if I use a ruby script to get list of all > files recursively in the same folder it takes about 1 minute, on the same > machine with just 1 thread. > I am confused as to why it would take so much time extra for listing files. 
> spark code: > {code:scala} > val sparkSession = org.apache.spark.sql.SparkSession.builder > .config("spark.sql.hive.metastorePartitionPruning",true) > .config("spark.sql.parquet.filterPushdown", true) > .config("spark.sql.hive.verifyPartitionPath", false) > .config("spark.sql.hive.convertMetastoreParquet.mergeSchema",false) > .config("parquet.enable.summary-metadata",false) > .config("spark.sql.sources.partitionDiscovery.enabled",false) > .getOrCreate() > val df = > sparkSession.read.option("mergeSchema","false").format("parquet").load("s3n://bucket_name/events_v3") > df.createOrReplaceTempView("temp_events") > sparkSession.sql( > """ > |select verb,count(*) from temp_events where event_date = > "2016-08-05" group by verb > """.stripMargin).show() > {code} > ruby code: > {code:ruby} > gem 'aws-sdk', '~> 2' > require 'aws-sdk' > client = Aws::S3::Client.new(:region=>'us-west-1') > next_continuation_token = nil > total = 0 > loop do > a= client.list_objects_v2({ > bucket: "bucket", # required > max_keys: 1000, > prefix: "events_v3/", > continuation_token: next_continuation_token , > fetch_owner: false, > }) > puts a.contents.last.key > total += a.contents.size > next_continuation_token = a.next_continuation_token > break unless a.is_truncated > end > puts "total" > puts total > {code} > tried looking into following bug: > https://issues.apache.org/jira/browse/HADOOP-12810 > but hadoop 2.7.3 doesn't solve that problem > stackoverflow reference: > http://stackoverflow.com/questions/39525288/spark-parquet-write-gets-slow-as-partitions-grow -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
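The request-count gap between per-directory discovery and a single paged prefix listing can be sketched with a toy key set (plain Python; the partition layout mirrors the example above but the counts are made up):

```python
# Sketch: per-directory listing issues roughly one request per leaf
# directory, while a flat prefix listing pages through all keys at once.
keys = [
    f"events_v3/event_date=2015-01-{day:02d}/event_hour={hour}/part1.parquet.gz"
    for day in range(1, 31) for hour in range(24)
]

# Recursive discovery: at least one list call per directory.
directories = {k.rsplit("/", 1)[0] for k in keys}
per_dir_requests = len(directories)

# Flat prefix listing: ceil(n_keys / page_size) calls, like list_objects_v2.
PAGE_SIZE = 1000
flat_requests = -(-len(keys) // PAGE_SIZE)  # ceiling division

print(per_dir_requests, flat_requests)  # 720 vs 1
```

This is why a single-threaded recursive prefix walk (the Ruby script above) can beat many parallel per-directory listings: the bottleneck is request round trips, not bytes.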
[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244836#comment-16244836 ] Arseniy Tashoyan commented on SPARK-22471: -- I am coming up with a simple fix, backportable to 2.2. > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
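One shape such a fix could take (a hypothetical sketch, not Spark's actual SQLListener code; class and method names here are invented) is to trim the stage-metrics map eagerly whenever it crosses a cap, instead of waiting for an execution to complete:

```python
# Hypothetical sketch of the fix idea: evict the oldest stage metrics as
# soon as the map exceeds a cap, not only on execution completion.
from collections import OrderedDict

class BoundedStageMetrics:
    def __init__(self, max_stages=100):
        self.max_stages = max_stages
        self.stage_metrics = OrderedDict()  # stage_id -> metrics

    def on_stage_submitted(self, stage_id, metrics):
        self.stage_metrics[stage_id] = metrics
        # Trim on every insertion, so a single long execution with many
        # stages cannot grow the map without bound.
        while len(self.stage_metrics) > self.max_stages:
            self.stage_metrics.popitem(last=False)  # drop the oldest entry

m = BoundedStageMetrics(max_stages=3)
for sid in range(10):
    m.on_stage_submitted(sid, {"tasks": sid})
print(list(m.stage_metrics))  # [7, 8, 9]
```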
[jira] [Assigned] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22471: Assignee: (was: Apache Spark) > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244821#comment-16244821 ] Apache Spark commented on SPARK-22471: -- User 'tashoyan' has created a pull request for this issue: https://github.com/apache/spark/pull/19700 > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22471: Assignee: Apache Spark > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan >Assignee: Apache Spark > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20648) Make Jobs and Stages pages use the new app state store
[ https://issues.apache.org/jira/browse/SPARK-20648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20648: Assignee: Apache Spark > Make Jobs and Stages pages use the new app state store > -- > > Key: SPARK-20648 > URL: https://issues.apache.org/jira/browse/SPARK-20648 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making both the Jobs and Stages pages use the new app state > store. Because these two pages are very tightly coupled, it's easier to > modify both in one go. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20648) Make Jobs and Stages pages use the new app state store
[ https://issues.apache.org/jira/browse/SPARK-20648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20648: Assignee: (was: Apache Spark) > Make Jobs and Stages pages use the new app state store > -- > > Key: SPARK-20648 > URL: https://issues.apache.org/jira/browse/SPARK-20648 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making both the Jobs and Stages pages use the new app state > store. Because these two pages are very tightly coupled, it's easier to > modify both in one go. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20648) Make Jobs and Stages pages use the new app state store
[ https://issues.apache.org/jira/browse/SPARK-20648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244557#comment-16244557 ] Apache Spark commented on SPARK-20648: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19698 > Make Jobs and Stages pages use the new app state store > -- > > Key: SPARK-20648 > URL: https://issues.apache.org/jira/browse/SPARK-20648 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > > See spec in parent issue (SPARK-18085) for more details. > This task tracks making both the Jobs and Stages pages use the new app state > store. Because these two pages are very tightly coupled, it's easier to > modify both in one go. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244466#comment-16244466 ] Sean Owen commented on SPARK-22472: --- This doesn't look right ... https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L596 {code} def defaultValue(jt: String): String = jt match { case JAVA_BOOLEAN => "false" case JAVA_BYTE => "(byte)-1" case JAVA_SHORT => "(short)-1" case JAVA_INT => "-1" case JAVA_LONG => "-1L" case JAVA_FLOAT => "-1.0f" case JAVA_DOUBLE => "-1.0" case _ => "null" } {code} The default for any uninitialized primitive type is 0 in the JVM. [~cloud_fan] I think you were the last to touch this, but didn't write it. Is this really -1 on purpose for another reason? > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
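The `-2` row in the report is exactly what that default produces: a null slot materialized as the codegen default `-1` for longs, then doubled. A plain-Python mimicry of the behavior (not Spark code):

```python
# Sketch: read a nullable column into a primitive slot pre-filled with the
# codegen default for longs (-1), then apply the mapped function v * 2.
CODEGEN_DEFAULT_LONG = -1

rows = [None, 1, 5]
doubled = [(CODEGEN_DEFAULT_LONG if v is None else v) * 2 for v in rows]
print(doubled)  # [-2, 2, 10] -- the -2 matches the first row above
```

With a default of 0, the JVM's actual default for uninitialized primitives, the first row would silently become 0 instead; either way the null is lost, which is the underlying bug.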
[jira] [Commented] (SPARK-22222) Fix the ARRAY_MAX in BufferHolder and add a test
[ https://issues.apache.org/jira/browse/SPARK-22222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244455#comment-16244455 ] Apache Spark commented on SPARK-22222: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/19697 > Fix the ARRAY_MAX in BufferHolder and add a test > > > Key: SPARK-22222 > URL: https://issues.apache.org/jira/browse/SPARK-22222 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Feng Liu >Assignee: Feng Liu > Fix For: 2.3.0 > > > This is actually a followup for SPARK-22033, which set the `ARRAY_MAX` to > `Int.MaxValue - 8`. It is not a valid number because it will cause the > following line to fail when such a large byte array is allocated: > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java#L86 > We need to make sure the new length is a multiple of 8. > We need to add one test for the fix. Note that the test should work > independently of the heap size of the test JVM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
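The multiple-of-8 requirement above can be sketched with simple alignment arithmetic (an illustrative sketch only; the exact ceiling Spark's fix chose may differ from the one computed here):

```python
# Sketch: grown buffer lengths must stay word-aligned (a multiple of 8)
# and below the JVM's array-size ceiling, which Int.MaxValue - 8 is not.
INT_MAX = 2**31 - 1
ARRAY_MAX = INT_MAX - (INT_MAX % 8)  # largest multiple of 8 fitting an int

def rounded_size(needed):
    """Round `needed` up to a multiple of 8, capped at ARRAY_MAX."""
    aligned = (needed + 7) & ~7
    return min(aligned, ARRAY_MAX)

print(ARRAY_MAX % 8)                       # 0: the cap itself is aligned
print(rounded_size(13), rounded_size(16))  # 16 16
```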
[jira] [Commented] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244452#comment-16244452 ] Marcelo Vanzin commented on SPARK-22471: My patch for SPARK-20652 greatly reduces the memory usage of the SQL listener. But it's not backportable to 2.2. > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244438#comment-16244438 ] Kazuaki Ishizaki commented on SPARK-22472: -- From [the initial version|https://github.com/apache/spark/commit/9e66a53c9955285a85c19f55c3ef62db2e1b868a#diff-94a1f59bcc9b6758c4ca874652437634R227] of this conversion (about two years ago), this conversion returns {{-1}}... > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22133) Document Mesos reject offer duration configurations
[ https://issues.apache.org/jira/browse/SPARK-22133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-22133: - Assignee: windkithk > Document Mesos reject offer duration configurations > --- > > Key: SPARK-22133 > URL: https://issues.apache.org/jira/browse/SPARK-22133 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand >Assignee: windkithk > Fix For: 2.3.0 > > > Mesos has multiple configurable timeouts {{spark.mesos.rejectOfferDuration}}, > {{spark.mesos.rejectOfferDurationForUnmetConstraints}}, and > {{spark.mesos.rejectOfferDurationForReachedMaxCores}} that can have a large > effect on Spark performance when sharing a Mesos cluster with other > frameworks and users. These configurations aren't documented; add > documentation and information for non-Mesos experts on how these settings > should be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22133) Document Mesos reject offer duration configurations
[ https://issues.apache.org/jira/browse/SPARK-22133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22133: -- Priority: Minor (was: Major) > Document Mesos reject offer duration configurations > --- > > Key: SPARK-22133 > URL: https://issues.apache.org/jira/browse/SPARK-22133 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand >Assignee: windkithk >Priority: Minor > Fix For: 2.3.0 > > > Mesos has multiple configurable timeouts {{spark.mesos.rejectOfferDuration}}, > {{spark.mesos.rejectOfferDurationForUnmetConstraints}}, and > {{spark.mesos.rejectOfferDurationForReachedMaxCores}} that can have a large > effect on Spark performance when sharing a Mesos cluster with other > frameworks and users. These configurations aren't documented; add > documentation and information for non-Mesos experts on how these settings > should be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22133) Document Mesos reject offer duration configurations
[ https://issues.apache.org/jira/browse/SPARK-22133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22133. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19555 [https://github.com/apache/spark/pull/19555] > Document Mesos reject offer duration configurations > --- > > Key: SPARK-22133 > URL: https://issues.apache.org/jira/browse/SPARK-22133 > Project: Spark > Issue Type: Improvement > Components: Mesos >Affects Versions: 2.3.0 >Reporter: Arthur Rand > Fix For: 2.3.0 > > > Mesos has multiple configurable timeouts {{spark.mesos.rejectOfferDuration}}, > {{spark.mesos.rejectOfferDurationForUnmetConstraints}}, and > {{spark.mesos.rejectOfferDurationForReachedMaxCores}} that can have a large > effect on Spark performance when sharing a Mesos cluster with other > frameworks and users. These configurations aren't documented; add > documentation and information for non-Mesos experts on how these settings > should be used. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
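For readers landing here from search: the three settings named in this ticket are ordinary time-valued Spark configuration entries, settable in spark-defaults.conf or via --conf. The values below are placeholders to show the format only, not recommended settings; consult the "Running Spark on Mesos" documentation page this ticket added to:

```
# Illustrative values only, not recommendations.
spark.mesos.rejectOfferDuration                      120s
spark.mesos.rejectOfferDurationForUnmetConstraints   120s
spark.mesos.rejectOfferDurationForReachedMaxCores    120s
```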
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki edited comment on SPARK-22472 at 11/8/17 5:30 PM: --- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if an Option object is {{null}} or an Option value is {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for these cases. was (Author: kiszk): Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for {{empty}}. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki edited comment on SPARK-22472 at 11/8/17 5:22 PM: --- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for {{empty}}. was (Author: kiszk): Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{null}} or {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for {{null}} or {{empty}}. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22474) cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]
[ https://issues.apache.org/jira/browse/SPARK-22474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikael Valot updated SPARK-22474: - Affects Version/s: 2.1.2 > cannot read a parquet file containing a Seq[Map[MyCaseClass, String]] > - > > Key: SPARK-22474 > URL: https://issues.apache.org/jira/browse/SPARK-22474 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.2, 2.2.0 >Reporter: Mikael Valot > > The following code run in spark-shell throws an exception. It is working fine > in Spark 2.0.2 > {code:java} > case class MyId(v: String) > case class MyClass(infos: Seq[Map[MyId, String]]) > val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah" > seq.toDS().write.parquet("/tmp/myclass") > spark.read.parquet("/tmp/myclass").as[MyClass].collect() > Caused by: org.apache.spark.sql.AnalysisException: Map key type is expected > to be a primitive type, but found: required group key { > optional binary v (UTF8); > }; > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:201) > at scala.Option.fold(Option.scala:158) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:87) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetSchemaConverter$$convert(ParquetSchemaConverter.scala:84) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) > at scala.Option.fold(Option.scala:158) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetArrayConverter.(ParquetRowConverter.scala:483) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:298) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:183) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:180) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.(ParquetRowConverter.scala:180) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRecordMaterializer.(ParquetRecordMaterializer.scala:38) > at > org.apache.spark.sql.execution.datasources.
[jira] [Created] (SPARK-22474) cannot read a parquet file containing a Seq[Map[MyCaseClass, String]]
Mikael Valot created SPARK-22474: Summary: cannot read a parquet file containing a Seq[Map[MyCaseClass, String]] Key: SPARK-22474 URL: https://issues.apache.org/jira/browse/SPARK-22474 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Mikael Valot The following code run in spark-shell throws an exception. It is working fine in Spark 2.0.2 {code:java} case class MyId(v: String) case class MyClass(infos: Seq[Map[MyId, String]]) val seq = Seq(MyClass(Seq(Map(MyId("1234") -> "blah" seq.toDS().write.parquet("/tmp/myclass") spark.read.parquet("/tmp/myclass").as[MyClass].collect() Caused by: org.apache.spark.sql.AnalysisException: Map key type is expected to be a primitive type, but found: required group key { optional binary v (UTF8); }; at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:246) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$2.apply(ParquetSchemaConverter.scala:201) at scala.Option.fold(Option.scala:158) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:87) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$2.apply(ParquetSchemaConverter.scala:84) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetSchemaConverter$$convert(ParquetSchemaConverter.scala:84) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$$anonfun$convertGroupField$1.apply(ParquetSchemaConverter.scala:201) at scala.Option.fold(Option.scala:158) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertGroupField(ParquetSchemaConverter.scala:201) at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:109) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$ParquetArrayConverter.(ParquetRowConverter.scala:483) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:298) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:183) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$$anonfun$6.apply(ParquetRowConverter.scala:180) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.(ParquetRowConverter.scala:180) at org.apache.spark.sql.execution.datasources.parquet.ParquetRecordMaterializer.(ParquetRecordMaterializer.scala:38) at org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.prepareForRead(ParquetReadSupport.scala:95) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:175) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:190) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecord
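The AnalysisException above says that Parquet requires map keys to be primitive types. Until the reader supports case-class keys again, one workaround (a suggestion on the part of this note, not a fix shipped for this ticket) is to re-key the maps on the primitive field of the case class before writing. Stripped of the Spark plumbing, the re-keying step itself is simple; in plain Java it looks like:

```java
import java.util.HashMap;
import java.util.Map;

public class FlattenMapKeys {
    // Stand-in for the Scala case class MyId(v: String) from the report.
    static final class MyId {
        final String v;
        MyId(String v) { this.v = v; }
    }

    // Re-key a map from the wrapper type to its primitive String field,
    // which Parquet accepts as a map key.
    static Map<String, String> flattenKeys(Map<MyId, String> in) {
        Map<String, String> out = new HashMap<>();
        in.forEach((k, v) -> out.put(k.v, v));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(flattenKeys(Map.of(new MyId("1234"), "blah")));
    }
}
```

The wrapper must be re-applied on read, so this only helps when the key type wraps a single primitive value.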
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki edited comment on SPARK-22472 at 11/8/17 5:11 PM: --- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{null}} or {{empty}}, {{-1}} is passed to a lambda function. I will check why {{-1}} was used as a value for {{null}} or {{empty}}. was (Author: kiszk): Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{null}} or {{empty}}, {{-1}} is passed to a lambda function. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki commented on SPARK-22472: -- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {null} or {empty}, {-1} is passed to a lambda function. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244327#comment-16244327 ] Kazuaki Ishizaki edited comment on SPARK-22472 at 11/8/17 5:10 PM: --- Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {{null}} or {{empty}}, {{-1}} is passed to a lambda function. was (Author: kiszk): Thank you for reporting this behavior. When I checked the generated code and source code, it is currently-expected behavior. In other words, if a value is {null} or {empty}, {-1} is passed to a lambda function. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244319#comment-16244319 ] Vladislav Kuzemchik commented on SPARK-22472: - I'm using Option[Long] as a workaround, but it is scary to leave things as-is and hope to catch the problem in review whenever anyone else uses Datasets. I think Spark should warn (or even fail, with some config parameter set) when converting a nullable DataFrame column into a non-optional type. Currently, if you do that with a non-primitive type, you will most likely get an NPE and have to handle that case anyway. In my opinion, the current implicit behavior causes much more harm: we are talking about bad results without any notification. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244310#comment-16244310 ] Marco Gaido edited comment on SPARK-22472 at 11/8/17 4:58 PM: -- Two things: 1 - if you use {{as\[Option\[Long\]\]}}, it works fine; 2 - actually if you collect the Dataset, the value for {{null}} is {{0}}, but with the transformation, there is this bad result. Then I am not sure about the right approach here. Because maybe the best thing would be to force using {{Option\[Long\]}} when the value is {{nullable}}, but this might be too restrictive and may break compatibility. was (Author: mgaido): Two things: 1 - if you use `as[Option[Long]]`, it works fine; 2 - actually if you collect the Dataset, the value for `null` is `0`, but with the transformation, there is this bad result. Then I am not sure about the right approach here. Because maybe the best thing would be to force using `Option[Long]` when the value is `nullable`, but this might be too restrictive and may break compatibility. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244310#comment-16244310 ] Marco Gaido commented on SPARK-22472: - Two things: 1 - if you use `as[Option[Long]]`, it works fine; 2 - actually if you collect the Dataset, the value for `null` is `0`, but with the transformation, there is this bad result. Then I am not sure about the right approach here. Because maybe the best thing would be to force using `Option[Long]` when the value is `nullable`, but this might be too restrictive and may break compatibility. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244273#comment-16244273 ] Vladislav Kuzemchik commented on SPARK-22472: - My assumption is that it is because it goes into Tungsten to do multiplication on Dataset, and probably tries to get value of null pointer. I would expect 0 or exception too. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244263#comment-16244263 ] Sean Owen commented on SPARK-22472: --- Not sure it's random, but, somehow the value is being construed as -1 before being doubled. It's odd because {{s.as[Long].show(false)}} is fine. Try {{s.as[Long].map(_.toString).show(false)}} to see it more directly. I admit I don't know much about the internals; is someone more informed able to weigh in on why null would become -1 here? I think part of the issue is that you're converting null to a primitive long, but would expect 0 if anything, or an exception. > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
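The mechanism Sean describes can be made concrete with a toy model (emphatically not Spark's Tungsten row format): if a null field's primitive slot holds a sentinel such as -1, then any code path that reads the slot without first consulting the null flag sees the sentinel, and v * 2 yields -2, matching the first table in the report:

```java
// Toy model of a row with one nullable long column: a primitive slot
// plus a null flag. Illustration only; Spark's internal row differs.
public class ToyRow {
    private final long slot;
    private final boolean isNull;

    ToyRow(Long boxed) {
        this.isNull = (boxed == null);
        this.slot = isNull ? -1L : boxed; // sentinel written for nulls
    }

    boolean isNullAt() { return isNull; }

    // "Unsafe" read: returns the slot contents whether or not the value
    // is null, analogous to reading a null column as a primitive long.
    long getLong() { return slot; }

    public static void main(String[] args) {
        ToyRow nullRow = new ToyRow(null);
        // Skipping the isNullAt() check turns null into -1, so v * 2 == -2.
        System.out.println(nullRow.getLong() * 2); // prints -2
    }
}
```

In this model the SQL path corresponds to checking isNullAt() first (null in, null out), while the primitive Dataset path corresponds to calling getLong() unconditionally.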
[jira] [Assigned] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
[ https://issues.apache.org/jira/browse/SPARK-22473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22473: Assignee: Apache Spark > Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date > -- > > Key: SPARK-22473 > URL: https://issues.apache.org/jira/browse/SPARK-22473 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Trivial > > In `spark-sql` module tests there are deprecation warnings caused by the > usage of deprecated methods of `java.sql.Date` and the usage of the > deprecated `AsyncAssertions.Waiter` class. > This issue is to track their replacement with their respective non-deprecated > versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
[ https://issues.apache.org/jira/browse/SPARK-22473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22473: Assignee: (was: Apache Spark) > Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date > -- > > Key: SPARK-22473 > URL: https://issues.apache.org/jira/browse/SPARK-22473 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > > In `spark-sql` module tests there are deprecations warnings caused by the > usage of deprecated methods of `java.sql.Date` and the usage of the > deprecated `AsyncAssertions.Waiter` class. > This issue is to track their replacement with their respective non-deprecated > versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
[ https://issues.apache.org/jira/browse/SPARK-22473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244260#comment-16244260 ] Apache Spark commented on SPARK-22473: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19696 > Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date > -- > > Key: SPARK-22473 > URL: https://issues.apache.org/jira/browse/SPARK-22473 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Trivial > > In `spark-sql` module tests there are deprecations warnings caused by the > usage of deprecated methods of `java.sql.Date` and the usage of the > deprecated `AsyncAssertions.Waiter` class. > This issue is to track their replacement with their respective non-deprecated > versions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22473) Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date
Marco Gaido created SPARK-22473: --- Summary: Replace deprecated AsyncAssertions.Waiter and methods of java.sql.Date Key: SPARK-22473 URL: https://issues.apache.org/jira/browse/SPARK-22473 Project: Spark Issue Type: Task Components: Tests Affects Versions: 2.3.0 Reporter: Marco Gaido Priority: Trivial In `spark-sql` module tests there are deprecation warnings caused by the use of deprecated `java.sql.Date` methods and of the deprecated `AsyncAssertions.Waiter` class. This issue tracks their replacement with their non-deprecated equivalents. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
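For reference, the usual replacement pattern for the deprecated `java.sql.Date` accessors is to go through `java.time`. The exact call sites in the `spark-sql` tests are not shown here, so this is a generic sketch of the migration:

```java
import java.sql.Date;
import java.time.LocalDate;

public class DateMigration {
    public static void main(String[] args) {
        Date d = Date.valueOf("2017-11-08");
        // Deprecated accessors use historical offsets:
        //   d.getYear() returns year - 1900, d.getMonth() is zero-based.
        // Non-deprecated replacement: convert once to java.time.LocalDate.
        LocalDate ld = d.toLocalDate();
        System.out.println(ld.getYear());        // 2017
        System.out.println(ld.getMonthValue());  // 11 (one-based)
    }
}
```

The same `toLocalDate()` bridge covers most of the deprecated getters, so test code can stay on `java.sql.Date` at the API boundary while doing field access through `java.time`.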
[jira] [Updated] (SPARK-22472) Datasets generate random values for null primitive types
[ https://issues.apache.org/jira/browse/SPARK-22472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladislav Kuzemchik updated SPARK-22472: Affects Version/s: 2.2.0 > Datasets generate random values for null primitive types > > > Key: SPARK-22472 > URL: https://issues.apache.org/jira/browse/SPARK-22472 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Vladislav Kuzemchik > > Not sure if it ever were reported. > {code} > scala> val s = > sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v") > s: org.apache.spark.sql.DataFrame = [v: bigint] > scala> s.show(false) > ++ > |v | > ++ > |null| > |1 | > |5 | > ++ > scala> s.as[Long].map(v => v*2).show(false) > +-+ > |value| > +-+ > |-2 | > |2| > |10 | > +-+ > scala> s.select($"v"*2).show(false) > +---+ > |(v * 2)| > +---+ > |null | > |2 | > |10 | > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arseniy Tashoyan updated SPARK-22471: - Attachment: SQLListener_retained_size.png > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22472) Datasets generate random values for null primitive types
Vladislav Kuzemchik created SPARK-22472: --- Summary: Datasets generate random values for null primitive types Key: SPARK-22472 URL: https://issues.apache.org/jira/browse/SPARK-22472 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1 Reporter: Vladislav Kuzemchik Not sure if this was ever reported before. {code}
scala> val s = sc.parallelize(Seq[Option[Long]](None,Some(1L),Some(5))).toDF("v")
s: org.apache.spark.sql.DataFrame = [v: bigint]

scala> s.show(false)
+----+
|v   |
+----+
|null|
|1   |
|5   |
+----+

scala> s.as[Long].map(v => v*2).show(false)
+-----+
|value|
+-----+
|-2   |
|2    |
|10   |
+-----+

scala> s.select($"v"*2).show(false)
+-------+
|(v * 2)|
+-------+
|null   |
|2      |
|10     |
+-------+
{code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-22471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arseniy Tashoyan updated SPARK-22471: - Attachment: SQLListener_stageIdToStageMetrics_retained_size.png > SQLListener consumes much memory causing OutOfMemoryError > - > > Key: SPARK-22471 > URL: https://issues.apache.org/jira/browse/SPARK-22471 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.2.0 > Environment: Spark 2.2.0, Linux >Reporter: Arseniy Tashoyan > Labels: memory-leak, sql > Attachments: SQLListener_retained_size.png, > SQLListener_stageIdToStageMetrics_retained_size.png > > Original Estimate: 72h > Remaining Estimate: 72h > > _SQLListener_ may grow very large when Spark runs complex multi-stage > requests. The listener tracks metrics for all stages in > __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to cleanup > this hash map regularly, but this is not enough. Precisely, the method > _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not > have metrics for very old data; this method runs on each execution completion. > However, if an execution has many stages, _SQLListener_ keeps adding new > entries to __stageIdToStageMetrics_ without calling > _trimExecutionsIfNecessary_. The hash map may grow to enormous size. > Strictly speaking, it is not a memory leak, because finally > _trimExecutionsIfNecessary_ cleans the hash map. However, the driver program > has high odds to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22471) SQLListener consumes much memory causing OutOfMemoryError
Arseniy Tashoyan created SPARK-22471: Summary: SQLListener consumes much memory causing OutOfMemoryError Key: SPARK-22471 URL: https://issues.apache.org/jira/browse/SPARK-22471 Project: Spark Issue Type: Bug Components: SQL, Web UI Affects Versions: 2.2.0 Environment: Spark 2.2.0, Linux Reporter: Arseniy Tashoyan _SQLListener_ may grow very large when Spark runs complex multi-stage requests. The listener tracks metrics for all stages in the __stageIdToStageMetrics_ hash map. _SQLListener_ has some means to clean up this hash map regularly, but they are not enough. Specifically, the method _trimExecutionsIfNecessary_ ensures that __stageIdToStageMetrics_ does not retain metrics for very old data; this method runs on each execution completion. However, while an execution with many stages is running, _SQLListener_ keeps adding new entries to __stageIdToStageMetrics_ without ever calling _trimExecutionsIfNecessary_, so the hash map can grow to an enormous size. Strictly speaking this is not a memory leak, because _trimExecutionsIfNecessary_ eventually cleans the hash map; in practice, however, the driver program is very likely to crash with OutOfMemoryError (and it does). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
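One way to avoid this kind of unbounded growth is to cap the metrics map on every insertion rather than only at execution completion. The sketch below is purely illustrative (the cap of 1000 stages and the evict-oldest policy are hypothetical choices, not Spark's actual fix); it shows how a size bound enforced per `put` keeps memory flat even inside a single long-running execution:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedStageMetrics {
    public static void main(String[] args) {
        final int maxStages = 1000;  // hypothetical cap on retained stage metrics
        // Insertion-ordered map that evicts its oldest entry once over capacity,
        // i.e. trimming happens on every put, not only at execution end.
        Map<Integer, long[]> stageIdToMetrics =
            new LinkedHashMap<Integer, long[]>(16, 0.75f, false) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, long[]> eldest) {
                    return size() > maxStages;
                }
            };
        // Simulate one execution with 10,000 stages reporting metrics.
        for (int stageId = 0; stageId < 10_000; stageId++) {
            stageIdToMetrics.put(stageId, new long[]{0L, 0L});
        }
        System.out.println(stageIdToMetrics.size());  // 1000
    }
}
```

A real fix would also have to decide which stages are safe to drop (e.g. only stages of completed executions), which this toy version ignores.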
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243913#comment-16243913 ] Kazuaki Ishizaki commented on SPARK-22460: -- I ran similar code with JSON, and it decodes the value correctly. {code}
//
// De-serialize
val rawOutput = spark.read.json(path)
val output = rawOutput.as[TestRecord]
print(s"${data.head}\n")
print(s"${data.head.modified.getTime}\n")
print(s"${rawOutput.collect().head}\n")
print(s"${output.collect().head}\n")
{code} {code}
TestRecord(One,2017-11-08 04:57:15.537)
1510145835537
[2017-11-08T04:57:15.537-08:00,One]
TestRecord(One,2017-11-08 04:57:15.537)
{code} > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future.
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22460) Spark De-serialization of Timestamp field is Incorrect
[ https://issues.apache.org/jira/browse/SPARK-22460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243834#comment-16243834 ] Liang-Chi Hsieh commented on SPARK-22460: - In Spark SQL, when a long field is cast to a timestamp field, the long value is interpreted as seconds. Although a timestamp field is also stored internally as a long, that long is interpreted as microseconds. That said, you can't directly cast a long field holding microseconds/milliseconds to a timestamp field and get a correct timestamp... Currently spark-avro doesn't support conversion from Avro's long type to Catalyst's Date/Timestamp types, even with an explicitly given schema, e.g., {{spark.read.schema(dataSchema).avro(path)}}. To get correct Date/Timestamp fields, I think this should be fixed in spark-avro by adding support for the Date/Timestamp data types, so that spark-avro can interpret loaded data as proper date/timestamp fields. > Spark De-serialization of Timestamp field is Incorrect > -- > > Key: SPARK-22460 > URL: https://issues.apache.org/jira/browse/SPARK-22460 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Saniya Tech > > We are trying to serialize Timestamp fields to Avro using spark-avro > connector. I can see the Timestamp fields are getting correctly serialized as > long (milliseconds since Epoch). I verified that the data is correctly read > back from the Avro files. It is when we encode the Dataset as a case class > that timestamp field is incorrectly converted to a long value as seconds > since Epoch. As can be seen below, this shifts the timestamp many years in > the future.
> Code used to reproduce the issue: > {code:java} > import java.sql.Timestamp > import com.databricks.spark.avro._ > import org.apache.spark.sql.{Dataset, Row, SaveMode, SparkSession} > case class TestRecord(name: String, modified: Timestamp) > import spark.implicits._ > val data = Seq( > TestRecord("One", new Timestamp(System.currentTimeMillis())) > ) > // Serialize: > val parameters = Map("recordName" -> "TestRecord", "recordNamespace" -> > "com.example.domain") > val path = s"s3a://some-bucket/output/" > val ds = spark.createDataset(data) > ds.write > .options(parameters) > .mode(SaveMode.Overwrite) > .avro(path) > // > // De-serialize > val output = spark.read.avro(path).as[TestRecord] > {code} > Output from the test: > {code:java} > scala> data.head > res4: TestRecord = TestRecord(One,2017-11-06 20:06:19.419) > scala> output.collect().head > res5: TestRecord = TestRecord(One,49819-12-16 17:23:39.0) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
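The unit mismatch explains the far-future year in the report: milliseconds since the epoch, read as if they were seconds, land roughly 48,000 years ahead. A small illustration using a millisecond value close to the reporter's timestamp (the exact value and a UTC zone are assumptions for the sake of the example):

```java
import java.time.Instant;
import java.time.ZoneOffset;

public class EpochUnits {
    public static void main(String[] args) {
        long millis = 1509998779419L;  // ~2017-11-06, milliseconds since epoch
        // Interpreted with the correct unit:
        System.out.println(
            Instant.ofEpochMilli(millis).atZone(ZoneOffset.UTC).getYear());   // 2017
        // Interpreted as *seconds*, the way a plain long-to-timestamp cast does:
        System.out.println(
            Instant.ofEpochSecond(millis).atZone(ZoneOffset.UTC).getYear());  // ~49820
    }
}
```

This matches the symptom in the description, where `TestRecord(One,2017-11-06 ...)` round-trips as a date in the year 49819.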
[jira] [Commented] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
[ https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243789#comment-16243789 ] Apache Spark commented on SPARK-22377: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19695 > Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof > -- > > Key: SPARK-22377 > URL: https://issues.apache.org/jira/browse/SPARK-22377 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0, 2.2.0 >Reporter: Xin Lu > > It looks like multiple workers in the amplab jenkins cannot execute lsof. > Example log below: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > spark-build/dev/create-release/release-build.sh: line 344: lsof: command not > found > usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] > I looked at the jobs and it looks like only amp-jenkins-worker-01 works so > you are getting a successful build every week or so. Unclear if the snapshot > is actually released. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
[ https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22377: Assignee: (was: Apache Spark) > Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof > -- > > Key: SPARK-22377 > URL: https://issues.apache.org/jira/browse/SPARK-22377 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0, 2.2.0 >Reporter: Xin Lu > > It looks like multiple workers in the amplab jenkins cannot execute lsof. > Example log below: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > spark-build/dev/create-release/release-build.sh: line 344: lsof: command not > found > usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] > I looked at the jobs and it looks like only amp-jenkins-worker-01 works so > you are getting a successful build every week or so. Unclear if the snapshot > is actually released. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22377) Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof
[ https://issues.apache.org/jira/browse/SPARK-22377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22377: Assignee: Apache Spark > Maven nightly snapshot jenkins jobs are broken on multiple workers due to lsof > -- > > Key: SPARK-22377 > URL: https://issues.apache.org/jira/browse/SPARK-22377 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.1.0, 2.2.0 >Reporter: Xin Lu >Assignee: Apache Spark > > It looks like multiple workers in the amplab jenkins cannot execute lsof. > Example log below: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.1-maven-snapshots/182/console > spark-build/dev/create-release/release-build.sh: line 344: lsof: command not > found > usage: kill [ -s signal | -p ] [ -a ] pid ... >kill -l [ signal ] > I looked at the jobs and it looks like only amp-jenkins-worker-01 works so > you are getting a successful build every week or so. Unclear if the snapshot > is actually released. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22446. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19662 [https://github.com/apache/spark/pull/19662] > Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" > exception incorrectly for filtered data. > --- > > Key: SPARK-22446 > URL: https://issues.apache.org/jira/browse/SPARK-22446 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.0, 2.2.0 > Environment: spark-shell, local mode, macOS Sierra 10.12.6 >Reporter: Greg Bellchambers > Fix For: 2.3.0 > > > In the following, the `indexer` UDF defined inside the > `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an > "Unseen label" error, despite the label not being present in the transformed > DataFrame. > Here is the definition of the indexer UDF in the transform method: > {code:java} > val indexer = udf { label: String => > if (labelToIndex.contains(label)) { > labelToIndex(label) > } else { > throw new SparkException(s"Unseen label: $label.") > } > } > {code} > We can demonstrate the error with a very simple example DataFrame. > {code:java} > scala> import org.apache.spark.ml.feature.StringIndexer > import org.apache.spark.ml.feature.StringIndexer > scala> // first we create a DataFrame with three cities > scala> val df = List( > | ("A", "London", "StrA"), > | ("B", "Bristol", null), > | ("C", "New York", "StrC") > | ).toDF("ID", "CITY", "CONTENT") > df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 
1 more > field] > scala> df.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | B| Bristol| null| > | C|New York| StrC| > +---++---+ > scala> // then we remove the row with null in CONTENT column, which removes > Bristol > scala> val dfNoBristol = finalStatic.filter($"CONTENT".isNotNull) > dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: > string, CITY: string ... 1 more field] > scala> dfNoBristol.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | C|New York| StrC| > +---++---+ > scala> // now create a StringIndexer for the CITY column and fit to > dfNoBristol > scala> val model = { > | new StringIndexer() > | .setInputCol("CITY") > | .setOutputCol("CITYIndexed") > | .fit(dfNoBristol) > | } > model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb > scala> // the StringIndexerModel has only two labels: "London" and "New York" > scala> str.labels foreach println > London > New York > scala> // transform our DataFrame to add an index column > scala> val dfWithIndex = model.transform(dfNoBristol) > dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 > more fields] > scala> dfWithIndex.show > +---++---+---+ > | ID|CITY|CONTENT|CITYIndexed| > +---++---+---+ > | A| London| StrA|0.0| > | C|New York| StrC|1.0| > +---++---+---+ > {code} > The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` > equal to 1.0 and perform an action. The `indexer` UDF in `transform` method > throws an exception reporting unseen label "Bristol". 
This is irrational > behaviour as far as the user of the API is concerned, because there is no > such value as "Bristol" when do show all rows of `dfWithIndex`: > {code:java} > scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count > 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40) > org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$5: (string) => double) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.sc
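A toy model of the indexer UDF makes the failure mode easy to see: if the engine evaluates the UDF before the user's null filter (as a pushed-down or reordered predicate can), the UDF encounters the filtered-out "Bristol" row and throws. The class below is an illustrative sketch, not Spark's StringIndexerModel:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyIndexer {
    private final Map<String, Double> labelToIndex = new HashMap<>();

    ToyIndexer(List<String> labels) {
        for (int i = 0; i < labels.size(); i++) {
            labelToIndex.put(labels.get(i), (double) i);
        }
    }

    // Mirrors the indexer UDF: unknown labels throw, like "Unseen label: ..."
    double index(String label) {
        Double v = labelToIndex.get(label);
        if (v == null) throw new IllegalStateException("Unseen label: " + label);
        return v;
    }

    public static void main(String[] args) {
        ToyIndexer indexer = new ToyIndexer(List.of("London", "New York"));
        // Rows as (CITY, CONTENT); the user intends to drop null-CONTENT rows first.
        String[][] rows = {{"London", "StrA"}, {"Bristol", null}, {"New York", "StrC"}};
        // If evaluation is reordered so the UDF runs before the null filter,
        // it still sees the "Bristol" row the user believed was gone:
        for (String[] row : rows) {
            try {
                indexer.index(row[0]);
            } catch (IllegalStateException e) {
                System.out.println(e.getMessage());  // Unseen label: Bristol
            }
        }
    }
}
```

Seen this way, the exception is not about the visible contents of `dfWithIndex` at all; it is about which rows the physical plan feeds to the UDF.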
[jira] [Assigned] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22446: --- Assignee: Liang-Chi Hsieh > Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" > exception incorrectly for filtered data. > --- > > Key: SPARK-22446 > URL: https://issues.apache.org/jira/browse/SPARK-22446 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.0, 2.2.0 > Environment: spark-shell, local mode, macOS Sierra 10.12.6 >Reporter: Greg Bellchambers >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > In the following, the `indexer` UDF defined inside the > `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an > "Unseen label" error, despite the label not being present in the transformed > DataFrame. > Here is the definition of the indexer UDF in the transform method: > {code:java} > val indexer = udf { label: String => > if (labelToIndex.contains(label)) { > labelToIndex(label) > } else { > throw new SparkException(s"Unseen label: $label.") > } > } > {code} > We can demonstrate the error with a very simple example DataFrame. > {code:java} > scala> import org.apache.spark.ml.feature.StringIndexer > import org.apache.spark.ml.feature.StringIndexer > scala> // first we create a DataFrame with three cities > scala> val df = List( > | ("A", "London", "StrA"), > | ("B", "Bristol", null), > | ("C", "New York", "StrC") > | ).toDF("ID", "CITY", "CONTENT") > df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more > field] > scala> df.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | B| Bristol| null| > | C|New York| StrC| > +---++---+ > scala> // then we remove the row with null in CONTENT column, which removes > Bristol > scala> val dfNoBristol = finalStatic.filter($"CONTENT".isNotNull) > dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: > string, CITY: string ... 
1 more field] > scala> dfNoBristol.show > +---++---+ > | ID|CITY|CONTENT| > +---++---+ > | A| London| StrA| > | C|New York| StrC| > +---++---+ > scala> // now create a StringIndexer for the CITY column and fit to > dfNoBristol > scala> val model = { > | new StringIndexer() > | .setInputCol("CITY") > | .setOutputCol("CITYIndexed") > | .fit(dfNoBristol) > | } > model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb > scala> // the StringIndexerModel has only two labels: "London" and "New York" > scala> str.labels foreach println > London > New York > scala> // transform our DataFrame to add an index column > scala> val dfWithIndex = model.transform(dfNoBristol) > dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 > more fields] > scala> dfWithIndex.show > +---++---+---+ > | ID|CITY|CONTENT|CITYIndexed| > +---++---+---+ > | A| London| StrA|0.0| > | C|New York| StrC|1.0| > +---++---+---+ > {code} > The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` > equal to 1.0 and perform an action. The `indexer` UDF in `transform` method > throws an exception reporting unseen label "Bristol". 
This is irrational > behaviour as far as the user of the API is concerned, because there is no > such value as "Bristol" when do show all rows of `dfWithIndex`: > {code:java} > scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count > 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40) > org.apache.spark.SparkException: Failed to execute user defined > function($anonfun$5: (string) => double) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) >
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however, I can't seem to reduce it to a sample. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File 
"/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File 
"/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code:python}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/pyt
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {{x = a.subtract(b) y = b.subtract(a)}} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {{x = a.subtract(b) y = b.subtract(a)}} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
One of the errors I will get is: {{ File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'}} Another error is: {{ File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.p
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {code:python}x = a.subtract(b) y = b.subtract(a){code} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File 
"/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745){noformat} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {{x = a.subtract(b) y = b.subtract(a)}} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
One of the errors I will get is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'{noformat} Another error is: {noformat}File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/
[jira] [Updated] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
[ https://issues.apache.org/jira/browse/SPARK-22468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Porritt updated SPARK-22468: -- Description: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: {{x = a.subtract(b) y = b.subtract(a)}} I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. One of the errors I will get is: {{ File "", line 642, in if not y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add'}} Another error is: {{ File "", line 642, in if not y.rdd.isEmpty(): File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/python/lib/pyspark.zip/pyspark/rdd.py", line 2458, in _jrdd File "/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ File 
"/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o5751.asJavaRDD. Trace: py4j.Py4JException: Method asJavaRDD([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:272) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745)}} Sometimes the error will complain about it not having a 'size' parameter. was: I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: x = a.subtract(b) y = b.subtract(a) I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. 
The error I will get is: File "", line 642, in if y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add' Sometimes the error will complain about it not having a 'size' parameter. > subtract creating empty DataFrame that isn't initialised properly > -- > > Key: SPARK-22468 > URL: https://issues.apache.org/jira/browse/SPARK-22468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: James Porritt > > I have an issue whereby a subtract between two DataFrames that will correctly > end up with an empty DataFrame, seemingly has the DataFra
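The two-way subtract-then-isEmpty pattern from the report above can be sketched without Spark. This is a plain-Python analogue (not the PySpark API) of what `a.subtract(b)` computes, a set difference of rows, with rows modelled as hashable tuples:

```python
# Plain-Python sketch of the two-way subtract pattern from the report.
# Spark's DataFrame.subtract returns the rows of the left frame that do
# not appear in the right one; rows here are hashable tuples.

def subtract(a, b):
    """Rows of `a` that do not appear in `b` (set difference)."""
    b_set = set(b)
    return [row for row in a if row not in b_set]

a = [(1, "x"), (2, "y"), (3, "z")]
b = [(2, "y"), (3, "z")]

x = subtract(a, b)   # non-empty: [(1, "x")]
y = subtract(b, a)   # empty list

if not y:            # analogous to the report's y.rdd.isEmpty() check
    print("y is empty")
```

In plain Python an empty result is always safe to test; the report is that in PySpark 2.2.0 the empty side of the pair sometimes fails inside `rdd.isEmpty()` instead.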
[jira] [Assigned] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
[ https://issues.apache.org/jira/browse/SPARK-22470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22470: Assignee: Apache Spark > Doc that functions.hash is also used internally for shuffle and bucketing > - > > Key: SPARK-22470 > URL: https://issues.apache.org/jira/browse/SPARK-22470 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Assignee: Apache Spark > > https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that > appears to be the same hash function as what Spark uses internally for > shuffle and bucketing. > One of my users would like to bake this assumption into code, but is unsure > if it's a guarantee or a coincidence that they're the same function. Would > it be considered an API break if at some point the two functions were > different, or if the implementation of both changed together? > We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
[ https://issues.apache.org/jira/browse/SPARK-22470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22470: Assignee: (was: Apache Spark) > Doc that functions.hash is also used internally for shuffle and bucketing > - > > Key: SPARK-22470 > URL: https://issues.apache.org/jira/browse/SPARK-22470 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that > appears to be the same hash function as what Spark uses internally for > shuffle and bucketing. > One of my users would like to bake this assumption into code, but is unsure > if it's a guarantee or a coincidence that they're the same function. Would > it be considered an API break if at some point the two functions were > different, or if the implementation of both changed together? > We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
[ https://issues.apache.org/jira/browse/SPARK-22470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243720#comment-16243720 ] Apache Spark commented on SPARK-22470: -- User 'ash211' has created a pull request for this issue: https://github.com/apache/spark/pull/19694 > Doc that functions.hash is also used internally for shuffle and bucketing > - > > Key: SPARK-22470 > URL: https://issues.apache.org/jira/browse/SPARK-22470 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.2.0 >Reporter: Andrew Ash > > https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that > appears to be the same hash function as what Spark uses internally for > shuffle and bucketing. > One of my users would like to bake this assumption into code, but is unsure > if it's a guarantee or a coincidence that they're the same function. Would > it be considered an API break if at some point the two functions were > different, or if the implementation of both changed together? > We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22470) Doc that functions.hash is also used internally for shuffle and bucketing
Andrew Ash created SPARK-22470: -- Summary: Doc that functions.hash is also used internally for shuffle and bucketing Key: SPARK-22470 URL: https://issues.apache.org/jira/browse/SPARK-22470 Project: Spark Issue Type: Documentation Components: Documentation, SQL Affects Versions: 2.2.0 Reporter: Andrew Ash https://issues.apache.org/jira/browse/SPARK-12480 added a hash function that appears to be the same hash function as what Spark uses internally for shuffle and bucketing. One of my users would like to bake this assumption into code, but is unsure if it's a guarantee or a coincidence that they're the same function. Would it be considered an API break if at some point the two functions were different, or if the implementation of both changed together? We should add a line to the scaladoc to clarify. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
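The guarantee the reporter is asking about matters because bucketing bakes hash values into file layout. A minimal sketch of that dependency, using `zlib.crc32` as a stand-in hash (it is NOT Spark's Murmur3-based `functions.hash`):

```python
# Sketch of why a *stable* hash function matters for bucketing.
# `stable_hash` is a stand-in (zlib.crc32), not Spark's internal hash.
import zlib

def stable_hash(key: str) -> int:
    return zlib.crc32(key.encode("utf-8"))

def bucket_for(key: str, num_buckets: int) -> int:
    # Bucket assignment is hash(key) mod num_buckets. Data written under
    # one hash function can only be read back bucket-aware if lookups use
    # the exact same function later -- hence the doc request.
    return stable_hash(key) % num_buckets

# The same key must always land in the same bucket across runs:
assert bucket_for("user-42", 8) == bucket_for("user-42", 8)
```

If `functions.hash` and the internal shuffle/bucketing hash were ever allowed to diverge, code relying on their equality would silently read from the wrong buckets, which is why the ticket asks for the scaladoc to state the contract explicitly.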
[jira] [Commented] (SPARK-22469) Accuracy problem in comparison with string and numeric
[ https://issues.apache.org/jira/browse/SPARK-22469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243600#comment-16243600 ] Apache Spark commented on SPARK-22469: -- User 'liutang123' has created a pull request for this issue: https://github.com/apache/spark/pull/19692 > Accuracy problem in comparison with string and numeric > --- > > Key: SPARK-22469 > URL: https://issues.apache.org/jira/browse/SPARK-22469 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Lijia Liu > > {code:sql} > select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. > {code} > IIUC, we can cast string as double like Hive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22469) Accuracy problem in comparison with string and numeric
[ https://issues.apache.org/jira/browse/SPARK-22469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22469: Assignee: Apache Spark > Accuracy problem in comparison with string and numeric > --- > > Key: SPARK-22469 > URL: https://issues.apache.org/jira/browse/SPARK-22469 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Lijia Liu >Assignee: Apache Spark > > {code:sql} > select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. > {code} > IIUC, we can cast string as double like Hive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22469) Accuracy problem in comparison with string and numeric
[ https://issues.apache.org/jira/browse/SPARK-22469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22469: Assignee: (was: Apache Spark) > Accuracy problem in comparison with string and numeric > --- > > Key: SPARK-22469 > URL: https://issues.apache.org/jira/browse/SPARK-22469 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Lijia Liu > > {code:sql} > select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. > {code} > IIUC, we can cast string as double like Hive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22469) Accuracy problem in comparison with string and numeric
Lijia Liu created SPARK-22469: - Summary: Accuracy problem in comparison with string and numeric Key: SPARK-22469 URL: https://issues.apache.org/jira/browse/SPARK-22469 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Lijia Liu {code:sql} select '1.5' > 0.5; // Result is NULL in Spark but is true in Hive. {code} IIUC, we can cast string as double like Hive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
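The Hive behaviour the ticket proposes (cast the string operand to double before comparing) can be illustrated in plain Python; this is an illustration of the coercion rule only, not Spark or Hive code:

```python
# Illustration (plain Python, not Spark) of the coercion the ticket
# proposes: for '1.5' > 0.5, Hive casts the string side to double.
# The ticket reports that Spark 2.2 instead yields NULL.

def hive_style_compare(s: str, n: float):
    """Cast the string operand to double, as Hive does, then compare."""
    try:
        return float(s) > n
    except ValueError:
        return None  # NULL when the string is not a valid number

print(hive_style_compare("1.5", 0.5))  # True, matching Hive
print(hive_style_compare("abc", 0.5))  # None (NULL)
```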
[jira] [Created] (SPARK-22468) subtract creating empty DataFrame that isn't initialised properly
James Porritt created SPARK-22468: - Summary: subtract creating empty DataFrame that isn't initialised properly Key: SPARK-22468 URL: https://issues.apache.org/jira/browse/SPARK-22468 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Reporter: James Porritt I have an issue whereby a subtract between two DataFrames that will correctly end up with an empty DataFrame, seemingly has the DataFrame not initialised properly. In my code I try and do a subtract both ways: x = a.subtract(b) y = b.subtract(a) I then do an .rdd.isEmpty() on both of them to check if I need to do something with the results. Often the 'y' subtract will fail if the 'x' subtract is non-empty. It's hard to reproduce however. The error I will get is: File "", line 642, in if y.rdd.isEmpty(): File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1377, in isEmpty File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2455, in _jrdd File "/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/rdd.py", line 2390, in _wrap_function File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1386, in __call__ File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1372, in _get_args File "/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_collections.py", line 501, in convert AttributeError: 'NoneType' object has no attribute 'add' Sometimes the error will complain about it not having a 'size' parameter. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22281) Handle R method breaking signature changes
[ https://issues.apache.org/jira/browse/SPARK-22281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-22281. -- Resolution: Fixed Assignee: Felix Cheung Fix Version/s: 2.3.0 2.2.1 > Handle R method breaking signature changes > -- > > Key: SPARK-22281 > URL: https://issues.apache.org/jira/browse/SPARK-22281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1, 2.3.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.2.1, 2.3.0 > > > As discussed here > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-1-2-RC2-tt22540.html#a22555 > this WARNING on R-devel > * checking for code/documentation mismatches ... WARNING > Codoc mismatches from documentation object 'attach': > attach > Code: function(what, pos = 2L, name = deparse(substitute(what), > backtick = FALSE), warn.conflicts = TRUE) > Docs: function(what, pos = 2L, name = deparse(substitute(what)), > warn.conflicts = TRUE) > Mismatches in argument default values: > Name: 'name' Code: deparse(substitute(what), backtick = FALSE) Docs: > deparse(substitute(what)) > Codoc mismatches from documentation object 'glm': > glm > Code: function(formula, family = gaussian, data, weights, subset, > na.action, start = NULL, etastart, mustart, offset, > control = list(...), model = TRUE, method = "glm.fit", > x = FALSE, y = TRUE, singular.ok = TRUE, contrasts = > NULL, ...) > Docs: function(formula, family = gaussian, data, weights, subset, > na.action, start = NULL, etastart, mustart, offset, > control = list(...), model = TRUE, method = "glm.fit", > x = FALSE, y = TRUE, contrasts = NULL, ...) > Argument names in code not in docs: > singular.ok > Mismatches in argument names: > Position: 16 Code: singular.ok Docs: contrasts > Position: 17 Code: contrasts Docs: ... > Checked the latest release R 3.4.1 and the signature change wasn't there. 
> This likely indicates an upcoming change in the next R release that could trigger this new warning when we attempt to publish the package.
> It is not clear what we can do now, since we work with multiple versions of R and they will have different signatures.
[jira] [Assigned] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14922: Assignee: (was: Apache Spark)
> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q
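Until predicate-based partition specs are supported, one workaround is to enumerate the matching partitions and issue one equality-based DROP PARTITION statement per match, which Spark does accept. A minimal sketch of that expansion in plain Python (the partition list and predicate are hard-coded for illustration; in practice they would come from SHOW PARTITIONS):

```python
# Known partitions of the table (hard-coded here for illustration).
partitions = [
    {"c": "US", "d": "1"},
    {"c": "US", "d": "3"},
    {"c": "UK", "d": "1"},
]

# Predicate from the Hive example: c='US' AND d<'2'
# (string comparison, matching how Hive compares string partition values).
matching = [p for p in partitions if p["c"] == "US" and p["d"] < "2"]

# Expand into the equality-based drops that Spark accepts.
statements = [
    "ALTER TABLE ptestfilter DROP PARTITION (c='{c}', d='{d}')".format(**p)
    for p in matching
]
```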
[jira] [Commented] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243488#comment-16243488 ] Apache Spark commented on SPARK-14922: -- User 'DazhuangSu' has created a pull request for this issue: https://github.com/apache/spark/pull/19691
> Alter Table Drop Partition Using Predicate-based Partition Spec
> ---
>
> Key: SPARK-14922
> URL: https://issues.apache.org/jira/browse/SPARK-14922
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> Below is allowed in Hive, but not allowed in Spark.
> {noformat}
> alter table ptestfilter drop partition (c='US', d<'2')
> {noformat}
> This example is copied from drop_partitions_filter.q
[jira] [Commented] (SPARK-22467) Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
[ https://issues.apache.org/jira/browse/SPARK-22467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16243481#comment-16243481 ] Apache Spark commented on SPARK-22467: -- User '10110346' has created a pull request for this issue: https://github.com/apache/spark/pull/19690
> Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
>
>
> Key: SPARK-22467
> URL: https://issues.apache.org/jira/browse/SPARK-22467
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: liuxian
>
> We should add a switch to control whether `stdout_stream` and `stderr_stream` output to disk.
> In my environment, due to disk I/O blocking, the `stdout_stream` output is very slow, so it cannot be cleaned up in a timely manner, and this causes the executor process to block.
[jira] [Assigned] (SPARK-22467) Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
[ https://issues.apache.org/jira/browse/SPARK-22467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22467: Assignee: Apache Spark
> Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
>
>
> Key: SPARK-22467
> URL: https://issues.apache.org/jira/browse/SPARK-22467
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: liuxian
> Assignee: Apache Spark
>
> We should add a switch to control whether `stdout_stream` and `stderr_stream` output to disk.
> In my environment, due to disk I/O blocking, the `stdout_stream` output is very slow, so it cannot be cleaned up in a timely manner, and this causes the executor process to block.
[jira] [Created] (SPARK-22467) Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
liuxian created SPARK-22467:
---
Summary: Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
Key: SPARK-22467
URL: https://issues.apache.org/jira/browse/SPARK-22467
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 2.3.0
Reporter: liuxian

We should add a switch to control whether `stdout_stream` and `stderr_stream` output to disk.
In my environment, due to disk I/O blocking, the `stdout_stream` output is very slow, so it cannot be cleaned up in a timely manner, and this causes the executor process to block.
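If implemented, such a switch would presumably be exposed as a Spark configuration property. A sketch of what setting it might look like in spark-defaults.conf — note that the property name below is hypothetical, invented purely for illustration; no such key exists in Spark:

```
# Hypothetical property (NOT a real Spark configuration key):
# when false, skip writing `stdout_stream`/`stderr_stream` to disk,
# avoiding the disk I/O blocking described in this issue.
spark.executor.logs.redirectToDisk  false
```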
[jira] [Assigned] (SPARK-22467) Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
[ https://issues.apache.org/jira/browse/SPARK-22467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22467: Assignee: (was: Apache Spark)
> Added a switch to support whether `stdout_stream` and `stderr_stream` output to disk
>
>
> Key: SPARK-22467
> URL: https://issues.apache.org/jira/browse/SPARK-22467
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.3.0
> Reporter: liuxian
>
> We should add a switch to control whether `stdout_stream` and `stderr_stream` output to disk.
> In my environment, due to disk I/O blocking, the `stdout_stream` output is very slow, so it cannot be cleaned up in a timely manner, and this causes the executor process to block.