[jira] [Resolved] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19872.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
                   2.1.1

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> ------------------------------------------------------------------
>
>                 Key: SPARK-19872
>                 URL: https://issues.apache.org/jira/browse/SPARK-19872
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>         Environment: Mac and EC2
>            Reporter: Brian Bruggeman
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.1, 2.2.0
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
>     yield self.loads(stream)
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
>     return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
>     yield self.loads(stream)
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
>     return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
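The garbage returned by the `use_unicode=False` run points at the cause: `\x80\x02` is the header of a pickle protocol-2 payload, i.e. after `repartition()` the collected bytes are pickled batches, yet the text deserializer still tries to UTF-8-decode them. The failure mode can be reproduced without Spark; this is a sketch of the mechanism, not pyspark's actual code:

```python
import pickle

# repartition() re-serializes each partition as pickled batches;
# 0x80 0x02 is the pickle protocol-2 header seen in the output above.
batch = pickle.dumps([b"a", b"b", b"c"], protocol=2)
assert batch[:2] == b"\x80\x02"

# UTF-8-decoding the pickled payload fails exactly like the traceback:
try:
    batch.decode("utf-8")
except UnicodeDecodeError as e:
    assert e.start == 0  # byte 0x80 in position 0: invalid start byte
```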
[jira] [Assigned] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-19872:
----------------------------------
    Assignee: Hyukjin Kwon
[jira] [Resolved] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
[ https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19561.
--------------------------------
       Resolution: Fixed
         Assignee: Jason White
    Fix Version/s: 2.2.0
                   2.1.1

> Pyspark Dataframes don't allow timestamps near epoch
> ----------------------------------------------------
>
>                 Key: SPARK-19561
>                 URL: https://issues.apache.org/jira/browse/SPARK-19561
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.1, 2.1.0
>            Reporter: Jason White
>            Assignee: Jason White
>             Fix For: 2.1.1, 2.2.0
>
> Pyspark does not allow timestamps at or near the epoch to be created in a DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299
> TimestampType.toInternal converts a datetime object to a number representing microseconds since the epoch. For all times more than 2148 seconds before or after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J automatically serializes it as a long.
> However, for times within this range (~35 minutes before or after the epoch), Py4J serializes it as an int. When creating the object on the Scala side, ints are not recognized and the value goes to null. This leads to null values in non-nullable fields, and corrupted Parquet files.
> The solution is trivial: force TimestampType.toInternal to always return a long.
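The ~35-minute window follows directly from the arithmetic. A minimal sketch (the function name mirrors `toInternal` but this is not pyspark's implementation; in Python 2 the fix amounts to wrapping the result in `long()`):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def to_internal_micros(dt):
    """Microseconds since the Unix epoch as an exact integer."""
    delta = dt - EPOCH
    return delta.days * 86400 * 10**6 + delta.seconds * 10**6 + delta.microseconds

# Within ~35 minutes of the epoch the value fits in a signed 32-bit int,
# which is the range where Py4J picked the int serializer and the Scala
# side dropped the value to null.
near = to_internal_micros(EPOCH + timedelta(minutes=30))
far = to_internal_micros(EPOCH + timedelta(seconds=2148))
assert near < 2**31   # 1.8e9 microseconds: serialized as int (the bug)
assert far >= 2**31   # 2.148e9 microseconds: serialized as long (works)
```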
[jira] [Resolved] (SPARK-19500) Fail to spill the aggregated hash map when radix sort is used
[ https://issues.apache.org/jira/browse/SPARK-19500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19500.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
                   2.0.3
                   2.1.1

Issue resolved by pull request 16844
[https://github.com/apache/spark/pull/16844]

> Fail to spill the aggregated hash map when radix sort is used
> -------------------------------------------------------------
>
>                 Key: SPARK-19500
>                 URL: https://issues.apache.org/jira/browse/SPARK-19500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>             Fix For: 2.1.1, 2.0.3, 2.2.0
>
> Radix sort requires that at most half of the array be occupied. But the aggregated hash map has an off-by-one bug that can leave one more item than half of the array; when this happens, the spill fails as:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 10.0 failed 4 times, most recent failure: Lost task 171.3 in stage 10.0 (TID 23899, 10.145.253.180, executor 24): java.lang.IllegalStateException: There is no space for new record
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:227)
>   at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:130)
>   at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:250)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:137)
>   ... 32 more
> Caused by: java.lang.IllegalStateException: There is no space for new record
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:227)
>   at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:130)
>   at org.apache.spar
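The invariant the bug violates can be sketched in a few lines. This is a toy model of the half-array rule, not Spark's actual `UnsafeInMemorySorter` logic; names and the capacity check are illustrative:

```python
def can_insert(num_records, array_len, use_radix_sort):
    """Capacity check for a pointer array shared with radix sort.

    Radix sort needs the second half of the array as scratch space, so at
    most half of the slots may hold records. An off-by-one in this check
    (admitting one record past the halfway mark) is exactly what makes the
    later spill throw "There is no space for new record".
    """
    usable = array_len // 2 if use_radix_sort else array_len
    return num_records + 1 <= usable  # room for the record being added

# With an 8-slot array and radix sort, only 4 slots are usable:
assert can_insert(3, 8, use_radix_sort=True)      # 4th record still fits
assert not can_insert(4, 8, use_radix_sort=True)  # a 5th breaks the invariant
assert can_insert(7, 8, use_radix_sort=False)     # without radix sort, all 8 usable
```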
[jira] [Resolved] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19481.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
                   2.1.1

Issue resolved by pull request 16825
[https://github.com/apache/spark/pull/16825]

> Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-19481
>                 URL: https://issues.apache.org/jira/browse/SPARK-19481
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Shell
>    Affects Versions: 2.1.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 2.1.1, 2.2.0
>
> The org.apache.spark.repl "cancelOnInterrupt" test leaks a SparkContext and makes the tests unstable. See:
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite&test_name=should+clone+and+clean+line+object+in+ClosureCleaner
[jira] [Assigned] (SPARK-19500) Fail to spill the aggregated hash map when radix sort is used
[ https://issues.apache.org/jira/browse/SPARK-19500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-19500:
----------------------------------
    Assignee: Davies Liu
[jira] [Created] (SPARK-19500) Fail to spill the aggregated hash map when radix sort is used
Davies Liu created SPARK-19500:
-------------------------------

             Summary: Fail to spill the aggregated hash map when radix sort is used
                 Key: SPARK-19500
                 URL: https://issues.apache.org/jira/browse/SPARK-19500
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Davies Liu
[jira] [Resolved] (SPARK-19415) Improve the implicit type conversion between numeric type and string to avoid precision loss
[ https://issues.apache.org/jira/browse/SPARK-19415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19415.
--------------------------------
       Resolution: Duplicate
    Fix Version/s: 2.2.0

> Improve the implicit type conversion between numeric type and string to avoid precision loss
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19415
>                 URL: https://issues.apache.org/jira/browse/SPARK-19415
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Davies Liu
>             Fix For: 2.2.0
>
> Currently, Spark SQL converts both the numeric type and the string into DoubleType when the two children of an expression do not match (for example, comparing a LongType against a StringType); this causes precision loss in some cases.
> Some databases do a better job on this (for example, SQL Server [1]); we should follow that.
> [1] https://msdn.microsoft.com/en-us/library/ms191530.aspx
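The precision loss is a property of IEEE 754 doubles, which carry only a 53-bit significand, so distinct 64-bit longs above 2^53 collide once both sides of a comparison are widened to DoubleType. A minimal demonstration of the mechanism (plain Python, not Spark SQL):

```python
# Distinct longs above 2**53 map to the same double:
big = 2**53 + 1          # exactly representable as a long, not as a double
assert float(big) == float(2**53)   # the widening collapses them
assert big != 2**53                 # ...although the longs differ

# So a predicate like `long_col = '<string literal>'`, evaluated by casting
# both children to DoubleType, can match rows whose long values differ
# from the literal.
```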
[jira] [Created] (SPARK-19415) Improve the implicit type conversion between numeric type and string to avoid precision loss
Davies Liu created SPARK-19415:
-------------------------------

             Summary: Improve the implicit type conversion between numeric type and string to avoid precision loss
                 Key: SPARK-19415
                 URL: https://issues.apache.org/jira/browse/SPARK-19415
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Davies Liu
[jira] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
Davies Liu commented on SPARK-18105:

Re: LZ4 failed to decompress a stream of shuffled data

There is a workaround merged into Spark 2.1 for this type of failure (decompress it and try again); can you try that?

This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)
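The "decompress it and try again" pattern the comment refers to can be illustrated loosely; this is not Spark's actual workaround code, and `zlib` stands in for LZ4 so the sketch runs without extra dependencies:

```python
import zlib

def robust_decompress(data):
    """Loose sketch of a decompress-with-fallback read path.

    Try to decode the payload as a compressed stream; if the codec rejects
    it, fall back to treating the bytes as already-decompressed data
    instead of failing the whole shuffle fetch.
    """
    try:
        return zlib.decompress(data)
    except zlib.error:
        return data  # assume the payload was not compressed after all

compressed = zlib.compress(b"shuffle block")
assert robust_decompress(compressed) == b"shuffle block"
assert robust_decompress(b"plain bytes") == b"plain bytes"
```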
[jira] [Reopened] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reopened SPARK-14480:
--------------------------------

This patch has a regression: a column that contains an escaped newline can no longer be parsed correctly.

> Remove meaningless StringIteratorReader for CSV data source for better performance
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14480
>                 URL: https://issues.apache.org/jira/browse/SPARK-14480
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.0
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it was made like this for better performance. However, there appear to be two problems.
> Firstly, it was actually not faster than processing line by line with {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a {{Reader}}.
> Secondly, this brought extra complexity because additional logic is needed to allow every line to be read byte by byte. So it was pretty difficult to figure out parsing issues (e.g. SPARK-14103). Actually, almost all the code in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem are below:
> h4. Results
> - Original code with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New code with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more detail,
> h4. Method
> - The TPC-H lineitem table is being tested.
> - The results are collected for only 100 rows.
> - End-to-end tests and parsing-time tests are performed 10 times each, and averages are calculated.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] lineitem table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: " + (System.nanoTime - s) / 1e6 + "ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}
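The trade-off being benchmarked (feeding the parser one line at a time versus wrapping the whole input in a stream) can be sketched in Python; here `csv` stands in for the CSV parser, the timing harness loosely mirrors the Scala `time` helper above, and all names are illustrative:

```python
import csv
import io
import time

def timed(f):
    """Rough Python analogue of the Scala `time` helper."""
    start = time.perf_counter_ns()
    result = f()
    print("time: %.3f ms" % ((time.perf_counter_ns() - start) / 1e6))
    return result

lines = ["%d|item-%d|%d" % (i, i, i * 10) for i in range(1000)]

# Variant 1: wrap all lines in a single stream and let the parser
# consume it byte by byte (the StringIteratorReader-style approach).
def parse_stream():
    reader = csv.reader(io.StringIO("\n".join(lines)), delimiter="|")
    return list(reader)

# Variant 2: feed the parser one already-split line at a time
# (the plain-Iterator approach the patch switches to).
def parse_lines():
    return [next(csv.reader([line], delimiter="|")) for line in lines]

# Both variants must produce identical rows; only the timings differ.
assert timed(parse_stream) == timed(parse_lines)
```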
[jira] [Closed] (SPARK-19375) na.fill() should not change the data type of column
[ https://issues.apache.org/jira/browse/SPARK-19375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu closed SPARK-19375.
------------------------------
    Resolution: Duplicate

> na.fill() should not change the data type of column
> ---------------------------------------------------
>
>                 Key: SPARK-19375
>                 URL: https://issues.apache.org/jira/browse/SPARK-19375
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0
>            Reporter: Davies Liu
>
> When these functions are called, a column of numeric type will be converted into DoubleType, which causes unexpected type casting and also precision loss for LongType:
> {code}
> def fill(value: Double)
> def fill(value: Double, cols: Array[String])
> def fill(value: Double, cols: Seq[String])
> {code}
> Workaround:
> {code}
> fill(Map("name" -> v))
> {code}
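The LongType precision loss mentioned above is observable with plain arithmetic: widening a large long to a double and back does not round-trip, because a double has only a 53-bit significand. A minimal illustration (not Spark code):

```python
# Round-tripping a large LongType value through DoubleType changes it:
original = 2**62 + 1
round_tripped = int(float(original))
assert round_tripped != original   # the widening altered the value
assert round_tripped == 2**62      # rounded to the nearest representable double
```

This is why a `fill(value: Double)` applied to a long column can silently corrupt values even when no nulls are filled in.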
[jira] [Created] (SPARK-19375) na.fill() should not change the data type of column
Davies Liu created SPARK-19375: -- Summary: na.fill() should not change the data type of column Key: SPARK-19375 URL: https://issues.apache.org/jira/browse/SPARK-19375 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 1.6.3, 1.5.2, 1.4.1, 1.3.1, 2.2.0 Reporter: Davies Liu When these functions are called, a column of numeric type will be converted into DoubleType, which will cause unexpected type casting and precision loss for LongType. {code} def fill(value: Double) def fill(value: Double, cols: Array[String]) def fill(value: Double, cols: Seq[String]) {code} Workaround: {code} fill(Map("name" -> v)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
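The precision loss mentioned in SPARK-19375 is inherent to routing a long through a double: a 64-bit IEEE double has only a 53-bit mantissa, so long values above 2^53 cannot round-trip. A plain-Python sketch of the effect (illustrative only, no Spark involved):

```python
# 2**53 is the limit of exactly-representable integers in a double.
big = 2**53 + 1
# Converting to float (a Java double) silently drops the low bit...
assert float(big) == float(2**53)
# ...so converting back does not recover the original long value.
assert int(float(big)) != big
```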
[jira] [Updated] (SPARK-19370) Flaky test: MetadataCacheSuite
[ https://issues.apache.org/jira/browse/SPARK-19370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-19370: --- Affects Version/s: 2.1.0 > Flaky test: MetadataCacheSuite > -- > > Key: SPARK-19370 > URL: https://issues.apache.org/jira/browse/SPARK-19370 > Project: Spark > Issue Type: Test >Affects Versions: 2.1.0 >Reporter: Davies Liu >Assignee: Shixiong Zhu > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull > {code} > MetadataCacheSuite: > - SPARK-16336 Suggest doing table refresh when encountering > FileNotFoundException > Exception encountered when invoking run on a nested suite - There are 1 > possibly leaked file streams. *** ABORTED *** > java.lang.RuntimeException: There are 1 possibly leaked file streams. > at > org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) > at > org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) > at > org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19370) Flaky test: MetadataCacheSuite
Davies Liu created SPARK-19370: -- Summary: Flaky test: MetadataCacheSuite Key: SPARK-19370 URL: https://issues.apache.org/jira/browse/SPARK-19370 Project: Spark Issue Type: Test Reporter: Davies Liu Assignee: Shixiong Zhu {code} MetadataCacheSuite: - SPARK-16336 Suggest doing table refresh when encountering FileNotFoundException Exception encountered when invoking run on a nested suite - There are 1 possibly leaked file streams. *** ABORTED *** java.lang.RuntimeException: There are 1 possibly leaked file streams. at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) at org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) at org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19370) Flaky test: MetadataCacheSuite
[ https://issues.apache.org/jira/browse/SPARK-19370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-19370: --- Description: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull {code} MetadataCacheSuite: - SPARK-16336 Suggest doing table refresh when encountering FileNotFoundException Exception encountered when invoking run on a nested suite - There are 1 possibly leaked file streams. *** ABORTED *** java.lang.RuntimeException: There are 1 possibly leaked file streams. at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) at org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) at org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) ... {code} was: {code} MetadataCacheSuite: - SPARK-16336 Suggest doing table refresh when encountering FileNotFoundException Exception encountered when invoking run on a nested suite - There are 1 possibly leaked file streams. *** ABORTED *** java.lang.RuntimeException: There are 1 possibly leaked file streams. 
at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) at org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) at org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) ... {code} > Flaky test: MetadataCacheSuite > -- > > Key: SPARK-19370 > URL: https://issues.apache.org/jira/browse/SPARK-19370 > Project: Spark > Issue Type: Test >Affects Versions: 2.1.0 >Reporter: Davies Liu >Assignee: Shixiong Zhu > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull > {code} > MetadataCacheSuite: > - SPARK-16336 Suggest doing table refresh when encountering > FileNotFoundException > Exception encountered when invoking run on a nested suite - There are 1 > possibly leaked file streams. *** ABORTED *** > java.lang.RuntimeException: There are 1 possibly leaked file streams. 
> at > org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) > at > org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) > at > org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17912. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 15467 [https://github.com/apache/spark/pull/15467] > Refactor code generation to get data for ColumnVector/ColumnarBatch > --- > > Key: SPARK-17912 > URL: https://issues.apache.org/jira/browse/SPARK-17912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kazuaki Ishizaki > Fix For: 3.0.0 > > > Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is > becoming pervasive. The code generation part can be reused by multiple > components (e.g. parquet reader, data cache, and so on). > This JIRA refactors the code generation part as a trait for ease of reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17912: --- Fix Version/s: (was: 3.0.0) 2.2.0 > Refactor code generation to get data for ColumnVector/ColumnarBatch > --- > > Key: SPARK-17912 > URL: https://issues.apache.org/jira/browse/SPARK-17912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kazuaki Ishizaki > Fix For: 2.2.0 > > > Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is > becoming pervasive. The code generation part can be reused by multiple > components (e.g. parquet reader, data cache, and so on). > This JIRA refactors the code generation part as a trait for ease of reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830644#comment-15830644 ] Davies Liu commented on SPARK-17602: The Python workers are reused by default; could you re-run the benchmark with worker reuse enabled? > PySpark - Performance Optimization Large Size of Broadcast Variable > --- > > Key: SPARK-17602 > URL: https://issues.apache.org/jira/browse/SPARK-17602 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 > Environment: Linux >Reporter: Xiao Ming Bao > Attachments: PySpark – Performance Optimization for Large Size of > Broadcast variable.pdf > > Original Estimate: 120h > Remaining Estimate: 120h > > Problem: currently, at the executor side, the broadcast variable is written to > disk as a file, and each Python worker process reads the broadcast variable from local disk and > deserializes it to a Python object before executing a task; when the broadcast > variable is large, this read/deserialization takes a lot of time. > And when the Python worker is NOT reused and the number of tasks is large, > performance is very poor since the worker needs to > read and deserialize the variable for each task. > Brief of the solution: > transfer the broadcast variable to the daemon Python process via file (or > socket/mmap) and deserialize the file to an object in the daemon process; after > a worker Python process is forked by the daemon process, the worker > automatically has the deserialized object and can use it directly because > of Linux's copy-on-write memory mechanism. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
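The proposal in SPARK-17602 relies on standard fork semantics: deserialize once in the daemon, and forked workers inherit the object through copy-on-write pages. A minimal Unix-only sketch of that mechanism (plain {{os.fork}} and {{pickle}}; this is not actual PySpark code):

```python
import os
import pickle

# Daemon side: deserialize the "broadcast value" once, before forking.
serialized = pickle.dumps(list(range(100000)))
value = pickle.loads(serialized)

pid = os.fork()
if pid == 0:
    # Worker side: `value` is inherited through copy-on-write pages,
    # so no per-task read/deserialization is needed.
    os._exit(0 if value[-1] == 99999 else 1)

# Daemon side: reap the worker and check it saw the object.
_, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status)
```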
[jira] [Resolved] (SPARK-19019) PySpark does not work with Python 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19019. Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16429 [https://github.com/apache/spark/pull/16429] > PySpark does not work with Python 3.6.0 > --- > > Key: SPARK-19019 > URL: https://issues.apache.org/jira/browse/SPARK-19019 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Hyukjin Kwon >Priority: Critical > Fix For: 2.1.1, 2.2.0 > > > Currently, PySpark does not work with Python 3.6.0. > Running {{./bin/pyspark}} simply throws the error as below: > {code} > Traceback (most recent call last): > File ".../spark/python/pyspark/shell.py", line 30, in > import pyspark > File ".../spark/python/pyspark/__init__.py", line 46, in > from pyspark.context import SparkContext > File ".../spark/python/pyspark/context.py", line 36, in > from pyspark.java_gateway import launch_gateway > File ".../spark/python/pyspark/java_gateway.py", line 31, in > from py4j.java_gateway import java_import, JavaGateway, GatewayClient > File "", line 961, in _find_and_load > File "", line 950, in _find_and_load_unlocked > File "", line 646, in _load_unlocked > File "", line 616, in _load_backward_compatible > File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 18, in > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", > line 62, in > import pkgutil > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", > line 22, in > ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') > File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple > cls = _old_namedtuple(*args, **kwargs) > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > The problem is in > 
https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394 > as the error says, and the cause seems to be that the arguments of > {{namedtuple}} are now completely keyword-only arguments from Python 3.6.0 > (See https://bugs.python.org/issue25628). > We currently copy this function via {{types.FunctionType}}, which does not set > the default values of keyword-only arguments (meaning > {{namedtuple.__kwdefaults__}}), and this seems to cause internally missing > values in the function (non-bound arguments). > This ends up as below: > {code} > import types > import collections > def _copy_func(f): > return types.FunctionType(f.__code__, f.__globals__, f.__name__, > f.__defaults__, f.__closure__) > _old_namedtuple = _copy_func(collections.namedtuple) > {code} > If we call as below: > {code} > >>> _old_namedtuple("a", "b") > Traceback (most recent call last): > File "", line 1, in > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > It throws an exception as above because {{__kwdefaults__}} for the required > keyword arguments seems unset in the copied function. So, if we give explicit > values for these, > {code} > >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None) > > {code} > It works fine. > It seems we should now properly set these on the hijacked function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
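The essence of the fix is to carry {{__kwdefaults__}} over when copying the function. A simplified sketch of the idea (not the literal patch; on Python 3.7+ the keyword-only parameters are {{rename}}/{{defaults}}/{{module}} rather than 3.6's {{verbose}}/{{rename}}/{{module}}, but the mechanism is the same):

```python
import collections
import types

def _copy_func(f):
    fn = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                            f.__defaults__, f.__closure__)
    # The crucial addition: without this line the keyword-only arguments
    # of namedtuple have no defaults in the copy, and any call that omits
    # them raises TypeError on Python 3.6+.
    fn.__kwdefaults__ = f.__kwdefaults__
    return fn

_old_namedtuple = _copy_func(collections.namedtuple)
# Now a plain positional call works again.
Point = _old_namedtuple("Point", "x y")
```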
[jira] [Resolved] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19180. Resolution: Fixed Fix Version/s: 2.0.3 2.1.1 Issue resolved by pull request 16555 [https://github.com/apache/spark/pull/16555] > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.1.1, 2.0.3, 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
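The SPARK-19180 bug is a wrong element stride: a JVM short occupies 2 bytes, so element i must land at byte offset 2*i, while {{putShorts}} used a stride of 4 (the int stride). A hedged Python illustration of the correct layout ({{put_shorts}} is a made-up helper mimicking the off-heap write, not Spark code):

```python
import struct

def put_shorts(buf, row_id, values):
    # Correct stride: each short is 2 bytes, so element i of the batch
    # lives at byte offset 2 * (row_id + i). The bug multiplied by 4.
    for i, v in enumerate(values):
        struct.pack_into("<h", buf, 2 * (row_id + i), v)

buf = bytearray(8)          # room for exactly four 2-byte shorts
put_shorts(buf, 0, [10, 20, 30, 40])
values = list(struct.unpack_from("<4h", buf))   # reads back [10, 20, 30, 40]
```

With the buggy stride of 4, the same four writes would need 16 bytes and would leave 2-byte gaps of garbage between elements when read back at the proper short stride.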
[jira] [Updated] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
[ https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18589: --- Priority: Critical (was: Minor) > persist() resolves "java.lang.RuntimeException: Invalid PythonUDF > (...), requires attributes from more than one child" > -- > > Key: SPARK-18589 > URL: https://issues.apache.org/jira/browse/SPARK-18589 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Davies Liu >Priority: Critical > > Smells like another optimizer bug that's similar to SPARK-17100 and > SPARK-18254. I'm seeing this on 2.0.2 and on master at commit > {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}. > I don't have a minimal repro for this yet, but the error I'm seeing is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling o247.count. > : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires > attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.sc
[jira] [Assigned] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
[ https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-18589: -- Assignee: Davies Liu > persist() resolves "java.lang.RuntimeException: Invalid PythonUDF > (...), requires attributes from more than one child" > -- > > Key: SPARK-18589 > URL: https://issues.apache.org/jira/browse/SPARK-18589 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Davies Liu >Priority: Minor > > Smells like another optimizer bug that's similar to SPARK-17100 and > SPARK-18254. I'm seeing this on 2.0.2 and on master at commit > {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}. > I don't have a minimal repro for this yet, but the error I'm seeing is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling o247.count. > : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires > attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:93)
[jira] [Resolved] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-18281. Resolution: Fixed Fix Version/s: 2.0.3 2.1.1 Issue resolved by pull request 16263 [https://github.com/apache/spark/pull/16263] > toLocalIterator yields time out error on pyspark2 > - > > Key: SPARK-18281 > URL: https://issues.apache.org/jira/browse/SPARK-18281 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04.5 LTS > Driver: AWS M4.XLARGE > Slaves: AWS M4.4.XLARGE > mesos 1.0.1 > spark 2.0.1 > pyspark >Reporter: Luke Miner > Fix For: 2.1.1, 2.0.3 > > > I run the example straight out of the api docs for toLocalIterator and it > gives a timeout exception: > {code} > from pyspark import SparkContext > sc = SparkContext() > rdd = sc.parallelize(range(10)) > [x for x in rdd.toLocalIterator()] > {code} > conf file: > spark.driver.maxResultSize 6G > spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G > -XX:+HeapDumpOnOutOfMemoryError > spark.executor.memory 16G > spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz > spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem > spark.hadoop.fs.s3a.buffer.dir /raid0/spark > spark.hadoop.fs.s3n.buffer.dir /raid0/spark > spark.hadoop.fs.s3a.connection.timeout 50 > spark.hadoop.fs.s3n.multipart.uploads.enabled true > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 > spark.hadoop.parquet.block.size 2147483648 > spark.hadoop.parquet.enable.summary-metadata false > spark.jars.packages > com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 > spark.local.dir /raid0/spark > spark.mesos.coarse false > spark.mesos.constraints priority:1 > spark.network.timeout 600 > spark.rpc.message.maxSize 500 > spark.speculation false > spark.sql.parquet.mergeSchema false > spark.sql.planner.externalSort true > spark.submit.deployMode client > spark.task.cpus 1 > Exception here: 
{code} > --- > timeout Traceback (most recent call last) > in () > 2 sc = SparkContext() > 3 rdd = sc.parallelize(range(10)) > > 4 [x for x in rdd.toLocalIterator()] > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in > _load_from_socket(port, serializer) > 140 try: > 141 rf = sock.makefile("rb", 65536) > --> 142 for item in serializer.load_stream(rf): > 143 yield item > 144 finally: > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > load_stream(self, stream) > 137 while True: > 138 try: > --> 139 yield self._read_with_length(stream) > 140 except EOFError: > 141 return > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > _read_with_length(self, stream) > 154 > 155 def _read_with_length(self, stream): > --> 156 length = read_int(stream) > 157 if length == SpecialLengths.END_OF_DATA_SECTION: > 158 raise EOFError > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > read_int(stream) > 541 > 542 def read_int(stream): > --> 543 length = stream.read(4) > 544 if not length: > 545 raise EOFError > /usr/lib/python2.7/socket.pyc in read(self, size) > 378 # fragmentation issues on many platforms. > 379 try: > --> 380 data = self._sock.recv(left) > 381 except error, e: > 382 if e.args[0] == EINTR: > timeout: timed out > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
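The traceback above bottoms out in PySpark's length-prefixed framing: {{read_int}} blocks on a 4-byte big-endian length header, and the socket timeout fires before the JVM writes the next frame. A self-contained sketch of that framing, mirroring the shape of the code in {{pyspark/serializers.py}} (here fed from an in-memory stream instead of a socket):

```python
import io
import struct

def read_int(stream):
    # Each frame is preceded by a 4-byte big-endian signed length header.
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError
    return struct.unpack("!i", data)[0]

# Simulate one frame carrying the 5-byte payload b"hello".
stream = io.BytesIO(struct.pack("!i", 5) + b"hello")
length = read_int(stream)
payload = stream.read(length)
```

On a real socket, `stream.read(4)` is where the reported `timeout: timed out` surfaces whenever the JVM side has not yet produced the next partition.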
[jira] [Comment Edited] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x
[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746370#comment-15746370 ] Davies Liu edited comment on SPARK-18676 at 12/13/16 9:47 PM: -- I had a working prototype, but it introduces some weird behavior; for example, the actual plan will not match the one shown in explain or the web UI. Currently, I'm still not sure whether that's the right direction or not. was (Author: davies): I had a working prototype, but in introduce some weird behavior, for example, the actual plan will not match the one showed in explain or web ui. Currently, I'm still not sure that the right direction or not. > Spark 2.x query plan data size estimation can crash join queries versus 1.x > --- > > Key: SPARK-18676 > URL: https://issues.apache.org/jira/browse/SPARK-18676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Michael Allman > > Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly > modified the way Spark SQL estimates the output data size of query plans. > I've found that—with the new table query partition pruning support in > 2.1—this has led in some cases to underestimation of join plan child size > statistics to a degree that makes executing such queries impossible without > disabling automatic broadcast conversion. > In one case we debugged, the query planner had estimated the size of a join > child to be 3,854 bytes. In the execution of this child query, Spark reads 20 > million rows in 1 GB of data from parquet files and shuffles 722.9 MB of > data, outputting 17 million rows. In planning the original join query, Spark > converts the child to a {{BroadcastExchange}}. This query execution fails > unless automatic broadcast conversion is disabled. > This particular query is complex and very specific to our data and schema. I > have not yet developed a reproducible test case that can be shared. 
I realize > this ticket does not give the Spark team a lot to work with to reproduce and > test this issue, but I'm available to help. At the moment I can suggest > running a join where one side is an aggregation selecting a few fields over a > large table with a wide schema including many string columns. > This issue exists in Spark 2.0, but we never encountered it because in that > version it only manifests itself for partitioned relations read from the > filesystem, and we rarely use this feature. We've encountered this issue in > 2.1 because 2.1 does partition pruning for metastore tables now. > As a back stop, we've patched our branch of Spark 2.1 to revert the > reductions in default data type size for string, binary and user-defined > types. We also removed the override of the statistics method in {{UnaryNode}} > which reduces the output size of a plan based on the ratio of that plan's > output schema size versus its children's. We have not had this problem since. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
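The failure mode described above comes down to a threshold comparison on an estimated size: when the planner's estimate for a join child falls below the auto-broadcast threshold, the child is broadcast regardless of how large it actually is. A minimal sketch of that decision, assuming the 10 MB default of {{spark.sql.autoBroadcastJoinThreshold}} (the function name is illustrative, not Spark's API):

```python
# Illustrative sketch of a size-based join strategy choice; the real logic
# lives in Spark's JoinSelection strategy, not in a function like this.
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the Spark default

def choose_join_strategy(estimated_child_size_bytes: int) -> str:
    """Pick a join strategy from a child plan's *estimated* output size."""
    if estimated_child_size_bytes <= AUTO_BROADCAST_THRESHOLD:
        return "broadcast-hash-join"
    return "shuffle-join"

# The case debugged in this ticket: the estimate said 3,854 bytes,
# while the child actually produced ~1 GB of data.
print(choose_join_strategy(3_854))        # broadcast chosen on the bad estimate
print(choose_join_strategy(1 * 1024**3))  # what the true size would have chosen
```

Because only the estimate feeds the decision, an underestimate of five orders of magnitude flips the strategy, which is exactly the crash described here.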
[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x
[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746370#comment-15746370 ] Davies Liu commented on SPARK-18676: I had a working prototype, but it introduces some weird behavior: for example, the actual plan will not match the one shown in explain or the web UI. Currently, I'm still not sure whether that's the right direction or not. > Spark 2.x query plan data size estimation can crash join queries versus 1.x > --- > > Key: SPARK-18676 > URL: https://issues.apache.org/jira/browse/SPARK-18676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Michael Allman > > Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly > modified the way Spark SQL estimates the output data size of query plans. > I've found that—with the new table query partition pruning support in > 2.1—this has led in some cases to underestimation of join plan child size > statistics to a degree that makes executing such queries impossible without > disabling automatic broadcast conversion. > In one case we debugged, the query planner had estimated the size of a join > child to be 3,854 bytes. In the execution of this child query, Spark reads 20 > million rows in 1 GB of data from parquet files and shuffles 722.9 MB of > data, outputting 17 million rows. In planning the original join query, Spark > converts the child to a {{BroadcastExchange}}. This query execution fails > unless automatic broadcast conversion is disabled. > This particular query is complex and very specific to our data and schema. I > have not yet developed a reproducible test case that can be shared. I realize > this ticket does not give the Spark team a lot to work with to reproduce and > test this issue, but I'm available to help. 
At the moment I can suggest > running a join where one side is an aggregation selecting a few fields over a > large table with a wide schema including many string columns. > This issue exists in Spark 2.0, but we never encountered it because in that > version it only manifests itself for partitioned relations read from the > filesystem, and we rarely use this feature. We've encountered this issue in > 2.1 because 2.1 does partition pruning for metastore tables now. > As a back stop, we've patched our branch of Spark 2.1 to revert the > reductions in default data type size for string, binary and user-defined > types. We also removed the override of the statistics method in {{UnaryNode}} > which reduces the output size of a plan based on the ratio of that plan's > output schema size versus its children's. We have not had this problem since. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x
[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15733569#comment-15733569 ] Davies Liu commented on SPARK-18676: Yes, it can; see WholeStageCodegen.doExecute() as an example. The physical plan is used to generate the RDD (which is the actual physical plan); we can build another SparkPlan to generate the RDD in the middle. > Spark 2.x query plan data size estimation can crash join queries versus 1.x > --- > > Key: SPARK-18676 > URL: https://issues.apache.org/jira/browse/SPARK-18676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Michael Allman > > Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly > modified the way Spark SQL estimates the output data size of query plans. > I've found that—with the new table query partition pruning support in > 2.1—this has led in some cases to underestimation of join plan child size > statistics to a degree that makes executing such queries impossible without > disabling automatic broadcast conversion. > In one case we debugged, the query planner had estimated the size of a join > child to be 3,854 bytes. In the execution of this child query, Spark reads 20 > million rows in 1 GB of data from parquet files and shuffles 722.9 MB of > data, outputting 17 million rows. In planning the original join query, Spark > converts the child to a {{BroadcastExchange}}. This query execution fails > unless automatic broadcast conversion is disabled. > This particular query is complex and very specific to our data and schema. I > have not yet developed a reproducible test case that can be shared. I realize > this ticket does not give the Spark team a lot to work with to reproduce and > test this issue, but I'm available to help. 
At the moment I can suggest > running a join where one side is an aggregation selecting a few fields over a > large table with a wide schema including many string columns. > This issue exists in Spark 2.0, but we never encountered it because in that > version it only manifests itself for partitioned relations read from the > filesystem, and we rarely use this feature. We've encountered this issue in > 2.1 because 2.1 does partition pruning for metastore tables now. > As a back stop, we've patched our branch of Spark 2.1 to revert the > reductions in default data type size for string, binary and user-defined > types. We also removed the override of the statistics method in {{UnaryNode}} > which reduces the output size of a plan based on the ratio of that plan's > output schema size versus its children's. We have not had this problem since. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16589: --- Fix Version/s: (was: 2.1.0) 2.1.1 > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Andrew Ray > Labels: correctness > Fix For: 2.0.3, 2.1.1 > > > Chaining cartesian calls in PySpark results in the number of records lower > than expected. It can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after initial > cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
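The expected counts in the reproduction above are easy to verify outside Spark: a three-way cartesian product of a 10-element set has 10^3 = 1000 tuples, all distinct, so the 355 that the chained {{cartesian}} returns is clearly wrong. A plain-Python check of the correct answer:

```python
from itertools import product

data = range(10)

# Pure-Python analogue of rdd.cartesian(rdd).cartesian(rdd):
# the result elements have the nested shape ((a, b), c).
triples = [((a, b), c) for (a, b), c in product(product(data, data), data)]

print(len(triples))        # the count Spark should return: 10 * 10 * 10
print(len(set(triples)))   # all triples are distinct, so distinct() should match
```

This matches the {{## 1000}} results the reporter gets once a reserialization or identity {{map}} is inserted, confirming the bug is in how PySpark handles the serialization of chained cartesian results rather than in the expected arithmetic.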
[jira] [Resolved] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16589. Resolution: Fixed Fix Version/s: 2.1.0 2.0.3 > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Andrew Ray > Labels: correctness > Fix For: 2.0.3, 2.1.0 > > > Chaining cartesian calls in PySpark results in the number of records lower > than expected. It can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after initial > cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16589: --- Assignee: Andrew Ray > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Andrew Ray > Labels: correctness > > Chaining cartesian calls in PySpark results in the number of records lower > than expected. It can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after initial > cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x
[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726247#comment-15726247 ] Davies Liu commented on SPARK-18676: What do the schema and plan of the child look like? It's possible that the schema of the parquet file is wide and highly compressed, and only a few columns are used in the query; then the estimation will be much smaller than the actual data size. The estimation for strings could also be wrong. Using the on-disk bytes of the parquet file as the metric for broadcasting is bad; we also saw cases where the parquet file is only a few MB, but the broadcast is a few GB. The estimation could easily be wrong for many reasons; maybe we could switch to a shuffle join when we realize that the actual data is larger than estimated. Would that work? > Spark 2.x query plan data size estimation can crash join queries versus 1.x > --- > > Key: SPARK-18676 > URL: https://issues.apache.org/jira/browse/SPARK-18676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Michael Allman > > Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly > modified the way Spark SQL estimates the output data size of query plans. > I've found that—with the new table query partition pruning support in > 2.1—this has led in some cases to underestimation of join plan child size > statistics to a degree that makes executing such queries impossible without > disabling automatic broadcast conversion. > In one case we debugged, the query planner had estimated the size of a join > child to be 3,854 bytes. In the execution of this child query, Spark reads 20 > million rows in 1 GB of data from parquet files and shuffles 722.9 MB of > data, outputting 17 million rows. In planning the original join query, Spark > converts the child to a {{BroadcastExchange}}. This query execution fails > unless automatic broadcast conversion is disabled. 
> This particular query is complex and very specific to our data and schema. I > have not yet developed a reproducible test case that can be shared. I realize > this ticket does not give the Spark team a lot to work with to reproduce and > test this issue, but I'm available to help. At the moment I can suggest > running a join where one side is an aggregation selecting a few fields over a > large table with a wide schema including many string columns. > This issue exists in Spark 2.0, but we never encountered it because in that > version it only manifests itself for partitioned relations read from the > filesystem, and we rarely use this feature. We've encountered this issue in > 2.1 because 2.1 does partition pruning for metastore tables now. > As a back stop, we've patched our branch of Spark 2.1 to revert the > reductions in default data type size for string, binary and user-defined > types. We also removed the override of the statistics method in `LogicalPlan` > which reduces the output size of a plan based on the ratio of that plan's > output schema size versus its children's. We have not had this problem since. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18719) Document spark.ui.showConsoleProgress
[ https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18719: --- Assignee: Nicholas > Document spark.ui.showConsoleProgress > - > > Key: SPARK-18719 > URL: https://issues.apache.org/jira/browse/SPARK-18719 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Assignee: Nicholas >Priority: Minor > Fix For: 2.2.0 > > > There is currently no documentation for {{spark.ui.showConsoleProgress}}. The > only way to find out about it is via Stack Overflow or by searching through > the code. > We should add documentation for this setting to [the config table on our > Configuration > docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18719) Document spark.ui.showConsoleProgress
[ https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-18719. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16151 [https://github.com/apache/spark/pull/16151] > Document spark.ui.showConsoleProgress > - > > Key: SPARK-18719 > URL: https://issues.apache.org/jira/browse/SPARK-18719 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Priority: Minor > Fix For: 2.2.0 > > > There is currently no documentation for {{spark.ui.showConsoleProgress}}. The > only way to find out about it is via Stack Overflow or by searching through > the code. > We should add documentation for this setting to [the config table on our > Configuration > docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18719) Document spark.ui.showConsoleProgress
[ https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18719: --- Assignee: Nicholas Chammas (was: Nicholas) > Document spark.ui.showConsoleProgress > - > > Key: SPARK-18719 > URL: https://issues.apache.org/jira/browse/SPARK-18719 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 2.2.0 > > > There is currently no documentation for {{spark.ui.showConsoleProgress}}. The > only way to find out about it is via Stack Overflow or by searching through > the code. > We should add documentation for this setting to [the config table on our > Configuration > docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18188) Add checksum for block of broadcast
[ https://issues.apache.org/jira/browse/SPARK-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18188: --- Description: There has been a long-standing issue: https://issues.apache.org/jira/browse/SPARK-4105; without any checksum for the blocks, it's very hard for us to identify where the bug came from. Shuffle blocks are compressed separately (and have a checksum in them), but broadcast blocks are compressed together; we should add a checksum for each of them separately. was: There is an understanding issue for a long time: https://issues.apache.org/jira/browse/SPARK-4105, without any checksum for the blocks, it's very hard for us to identify where is the bug came from. We should have a way the check a block from remote node or disk that is correct or not. > Add checksum for block of broadcast > --- > > Key: SPARK-18188 > URL: https://issues.apache.org/jira/browse/SPARK-18188 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > There has been a long-standing issue: > https://issues.apache.org/jira/browse/SPARK-4105; without any checksum for > the blocks, it's very hard for us to identify where the bug came from. > Shuffle blocks are compressed separately (and have a checksum in them), but > broadcast blocks are compressed together; we should add a checksum for each > of them separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
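The per-block checksum proposed in SPARK-18188 can be illustrated with a small stdlib sketch, with zlib's CRC32 standing in for whichever checksum Spark ultimately uses (the function names here are illustrative, not Spark's API): record a checksum when a block is written, verify it when the block is fetched, and surface corruption explicitly instead of as a mysterious decompression failure.

```python
import zlib

def make_block(payload: bytes) -> tuple[int, bytes]:
    """Compress a block and record its CRC32, as a block writer might."""
    compressed = zlib.compress(payload)
    return zlib.crc32(compressed), compressed

def verify_block(checksum: int, compressed: bytes) -> bytes:
    """Check the checksum before decompressing; raise a clear error on mismatch."""
    if zlib.crc32(compressed) != checksum:
        raise IOError("block checksum mismatch: corrupted in transit or on disk")
    return zlib.decompress(compressed)

checksum, block = make_block(b"broadcast variable contents")
assert verify_block(checksum, block) == b"broadcast variable contents"

# Flip one byte to simulate the kind of corruption SPARK-4105 reports.
corrupted = bytes([block[0] ^ 0xFF]) + block[1:]
try:
    verify_block(checksum, corrupted)
except IOError as exc:
    print(exc)  # the corruption is named, and caught before decompression
```

The point of the ticket is exactly this diagnostic value: with a checksum, a corrupted broadcast block is identified as such at the fetch boundary, rather than surfacing later as an opaque codec error.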
[jira] [Updated] (SPARK-18188) Add checksum for block of broadcast
[ https://issues.apache.org/jira/browse/SPARK-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18188: --- Summary: Add checksum for block of broadcast (was: Add checksum for block in Spark) > Add checksum for block of broadcast > --- > > Key: SPARK-18188 > URL: https://issues.apache.org/jira/browse/SPARK-18188 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > There has been a long-standing issue: > https://issues.apache.org/jira/browse/SPARK-4105; without any checksum for > the blocks, it's very hard for us to identify where the bug came from. > We should have a way to check whether a block read from a remote node or > disk is correct or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-4105: - Assignee: Davies Liu (was: Josh Rosen) > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Critical > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's another occurrence of a similar error: > {co
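The {{FAILED_TO_UNCOMPRESS(5)}} stack trace above is what a damaged compressed block looks like when there is no checksum: the damage is only discovered when the codec chokes on the data, far from wherever the corruption actually happened. The same failure mode is reproducible with zlib (standing in for Snappy, which is not in the Python stdlib), here using a block cut short as a failed fetch might leave it:

```python
import zlib

# An intact compressed block round-trips cleanly.
payload = b"shuffle block payload " * 100
block = zlib.compress(payload)
assert zlib.decompress(block) == payload

# Simulate a block truncated in transit; decompression is where it fails.
truncated = block[:len(block) // 2]
try:
    zlib.decompress(truncated)
except zlib.error as exc:
    # Analogous to snappy's FAILED_TO_UNCOMPRESS(5): without a checksum,
    # the error surfaces only at decompression time, with no hint of whether
    # the fetch, the disk, or the writer was at fault.
    print("decompression failed:", exc)
```

This is precisely the diagnostic gap that the checksum work in SPARK-18188 is meant to close.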
[jira] [Commented] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt
[ https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654590#comment-15654590 ] Davies Liu commented on SPARK-18097: I have no idea why the schema is corrupt, we could catch the exception and try to drop the table blindly. > Can't drop a table from Hive if the schema is corrupt > - > > Key: SPARK-18097 > URL: https://issues.apache.org/jira/browse/SPARK-18097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > > When the schema of Hive table is broken, we can't drop the table using Spark > SQL, for example > {code} > Error in SQL statement: QueryExecutionException: FAILED: > IllegalArgumentException Error: > expected at the position 10 of > 'ss:string:struct<>' but ':' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:336) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:480) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754) > at > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:104) > at > 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339) > at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288) > at > org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:353) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:351) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:351) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:228) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72) > at > org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:227) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:255) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:126) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:267) > at > org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:753) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationS
[jira] [Resolved] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-18254. Resolution: Fixed Assignee: Eyal Farago (was: Davies Liu) > UDFs don't see aliased column names > --- > > Key: SPARK-18254 > URL: https://issues.apache.org/jira/browse/SPARK-18254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Eyal Farago > Labels: correctness > > Dunno if I'm misinterpreting something here, but this seems like a bug in how > UDFs work, or in how they interface with the optimizer. > Here's a basic reproduction. I'm using {{length_udf()}} just for > illustration; it could be any UDF that accesses fields that have been aliased. > {code} > import pyspark > from pyspark.sql import Row > from pyspark.sql.functions import udf, col, struct > def length(full_name): > # The non-aliased names, FIRST and LAST, show up here, instead of > # first_name and last_name. > # print(full_name) > return len(full_name.first_name) + len(full_name.last_name) > if __name__ == '__main__': > spark = ( > pyspark.sql.SparkSession.builder > .getOrCreate()) > length_udf = udf(length) > names = spark.createDataFrame([ > Row(FIRST='Nick', LAST='Chammas'), > Row(FIRST='Walter', LAST='Williams'), > ]) > names_cleaned = ( > names > .select( > col('FIRST').alias('first_name'), > col('LAST').alias('last_name'), > ) > .withColumn('full_name', struct('first_name', 'last_name')) > .select('full_name')) > # We see the schema we expect here. > names_cleaned.printSchema() > # However, here we get an AttributeError. length_udf() cannot > # find first_name or last_name. 
> (names_cleaned > .withColumn('length', length_udf('full_name')) > .show()) > {code} > When I run this I get a long stack trace, but the relevant portions seem to > be: > {code} > File ".../udf-alias.py", line 10, in length > return len(full_name.first_name) + len(full_name.last_name) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1502, in __getattr__ > raise AttributeError(item) > AttributeError: first_name > {code} > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1497, in __getattr__ > idx = self.__fields__.index(item) > ValueError: 'first_name' is not in list > {code} > Here are the relevant execution plans: > {code} > names_cleaned.explain() > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > {code} > (names_cleaned > .withColumn('length', length_udf('full_name')) > .explain()) > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10, pythonUDF0#21 AS length#17] > +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, > pythonUDF0#21] >+- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > It looks like from the second execution plan that {{BatchEvalPython}} somehow > gets the unaliased column names, whereas the {{Project}} right above it gets > the aliased names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
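The {{AttributeError}} in the reproduction arises because the Row the UDF receives carries the original, pre-alias field names ({{FIRST}}/{{LAST}}) while the UDF looks fields up by the aliased names. On affected versions, one defensive workaround is to access struct fields by position rather than by name. The mechanics can be sketched with a stdlib {{namedtuple}} standing in for {{pyspark.sql.Row}} (which also supports positional indexing):

```python
from collections import namedtuple

# Stand-in for the Row the UDF actually receives: its field names are the
# pre-alias ones (FIRST/LAST), not first_name/last_name.
Row = namedtuple("Row", ["FIRST", "LAST"])
full_name = Row(FIRST="Nick", LAST="Chammas")

def length_by_name(row):
    # Mirrors the failing UDF: looks up aliased names the row doesn't have.
    return len(row.first_name) + len(row.last_name)

def length_by_position(row):
    # Positional access sidesteps the name mismatch entirely.
    return len(row[0]) + len(row[1])

try:
    length_by_name(full_name)
except AttributeError as exc:
    print("fails as in the ticket:", exc)

print(length_by_position(full_name))  # 4 + 7 = 11
```

This is a workaround sketch, not the fix itself; the resolution assigned to Eyal Farago addresses the name mismatch in Spark so that aliased names reach the UDF.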
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
    [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634416#comment-15634416 ]

Davies Liu commented on SPARK-18254:
------------------------------------

Could you also try 2.0.2?
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
    [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634333#comment-15634333 ]

Davies Liu commented on SPARK-18254:
------------------------------------

I tried the following in master (2.1), and it works:

{code}
from pyspark.sql.functions import udf, col, struct
from pyspark.sql.types import IntegerType

myadd = udf(lambda s: s.a + s.b, IntegerType())
df = spark.range(10).selectExpr("id as a", "id as b") \
    .select(struct(col("a"), col("b")).alias('s'))
df = df.select(df.s, myadd(df.s).alias("a"))
df.explain(True)
rs = df.collect()
{code}

[~nchammas] Could you also try yours on master?
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
    [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634228#comment-15634228 ]

Davies Liu commented on SPARK-18254:
------------------------------------

I suspect it's a bug in ExtractPythonUDFs, not in operator push down; not verified yet.
[jira] [Assigned] (SPARK-18254) UDFs don't see aliased column names
     [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-18254:
----------------------------------
    Assignee: Davies Liu
[jira] [Created] (SPARK-18233) Failed to deserialize the task
Davies Liu created SPARK-18233:
-------------------------------
             Summary: Failed to deserialize the task
                 Key: SPARK-18233
                 URL: https://issues.apache.org/jira/browse/SPARK-18233
             Project: Spark
          Issue Type: Bug
            Reporter: Davies Liu

{code}
16/11/02 18:36:32 ERROR Executor: Exception in task 652.0 in stage 27.0 (TID 21101)
java.io.InvalidClassException: org.apache.spark.executor.TaskMet; serializable and externalizable flags conflict
        at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:698)
        at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:831)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1602)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Closed] (SPARK-4549) Support BigInt -> Decimal in convertToCatalyst in SparkSQL
     [ https://issues.apache.org/jira/browse/SPARK-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu closed SPARK-4549.
-----------------------------
    Resolution: Incomplete

> Support BigInt -> Decimal in convertToCatalyst in SparkSQL
> ----------------------------------------------------------
>
>                 Key: SPARK-4549
>                 URL: https://issues.apache.org/jira/browse/SPARK-4549
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Jianshi Huang
>            Priority: Minor
>
> Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt to Decimal.
> Jianshi
[jira] [Commented] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
    [ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626935#comment-15626935 ]

Davies Liu commented on SPARK-18212:
------------------------------------

cc [~zsxwing]

> Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-18212
>                 URL: https://issues.apache.org/jira/browse/SPARK-18212
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>            Reporter: Davies Liu
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/
[jira] [Created] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
Davies Liu created SPARK-18212:
-------------------------------
             Summary: Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
                 Key: SPARK-18212
                 URL: https://issues.apache.org/jira/browse/SPARK-18212
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
            Reporter: Davies Liu

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/
[jira] [Created] (SPARK-18188) Add checksum for block in Spark
Davies Liu created SPARK-18188:
-------------------------------
             Summary: Add checksum for block in Spark
                 Key: SPARK-18188
                 URL: https://issues.apache.org/jira/browse/SPARK-18188
             Project: Spark
          Issue Type: Improvement
            Reporter: Davies Liu
            Assignee: Davies Liu

There has been an outstanding issue for a long time: https://issues.apache.org/jira/browse/SPARK-4105. Without any checksum for the blocks, it's very hard for us to identify where a bug came from. We should have a way to check whether a block read from a remote node or from disk is correct or not.
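The proposal above boils down to a running checksum computed as a block's bytes stream past. A minimal sketch of the idea (not Spark's eventual implementation, which lives on the JVM side; the function name and chunking here are illustrative):

```python
import zlib

def block_checksum(chunks):
    """Running CRC32 over a block's byte chunks, fed in the order a stream
    would deliver them; returns an unsigned 32-bit value."""
    crc = 0
    for chunk in chunks:
        # zlib.crc32 accepts the previous value, so arbitrary chunking
        # produces the same checksum as hashing the block in one piece.
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

# The writer records the checksum; a reader fetching the block from a remote
# node or from disk recomputes it, and a mismatch localizes the corruption.
data = b"shuffle block payload"
expected = block_checksum([data])
corrupted = b"shuffle block pAyload"
print(block_checksum([corrupted]) == expected)  # False
```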
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
    [ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612686#comment-15612686 ]

Davies Liu commented on SPARK-18105:
------------------------------------

It turned out that the bug in LZ4 was a false alarm, so I closed the upstream issue. Can't reproduce the behavior now.

> LZ4 failed to decompress a stream of shuffled data
> --------------------------------------------------
>
>                 Key: SPARK-18105
>                 URL: https://issues.apache.org/jira/browse/SPARK-18105
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>            Priority: Blocker
>
> When lz4 is used to compress the shuffle files, it may fail to decompress them with "stream is corrupt":
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>         at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>         at org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>         at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>         at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>         at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>         at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>         at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>         at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>         at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>         at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>         at org.apache.spark.scheduler.Task.run(Task.scala:86)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89
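The failure mode in this issue can be reproduced in miniature. The sketch below uses zlib as a stand-in for LZ4 (lz4 is not in the Python standard library), but the shape is the same: flip one byte inside a compressed shuffle block and the decompressor rejects the stream, which is what surfaces as "Stream is corrupted" in the executor logs.

```python
import zlib

# Compress a payload, then corrupt a single byte mid-stream, the way a bad
# disk sector or network transfer might.
payload = b"shuffle data " * 100
compressed = bytearray(zlib.compress(payload))
compressed[len(compressed) // 2] ^= 0xFF  # flip one byte

try:
    zlib.decompress(bytes(compressed))
except zlib.error as e:
    # Analogous to java.io.IOException: Stream is corrupted
    print("decompress failed:", e)
```

This is also why the checksum proposal in SPARK-18188 matters: without one, a corrupt-stream error here cannot tell you whether the bytes went bad on the writer's disk, in transit, or in the decompressor.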
[jira] [Updated] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
     [ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-18105:
-------------------------------
    Priority: Major  (was: Blocker)
[jira] [Updated] (SPARK-16078) from_utc_timestamp/to_utc_timestamp may give different result in different timezone
     [ https://issues.apache.org/jira/browse/SPARK-16078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-16078:
-------------------------------
    Fix Version/s: 1.6.3

> from_utc_timestamp/to_utc_timestamp may give different result in different timezone
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-16078
>                 URL: https://issues.apache.org/jira/browse/SPARK-16078
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>             Fix For: 1.6.3, 2.0.0
>
> from_utc_timestamp/to_utc_timestamp should return a deterministic result in any (system default) timezone.
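Conceptually, {{from_utc_timestamp}} reinterprets a UTC wall-clock value in a target zone, and the fix above makes that result independent of whatever timezone the JVM happens to default to. A stdlib-only model of the intended semantics (fixed offsets stand in for named zones; this is an illustration, not Spark's implementation):

```python
from datetime import datetime, timedelta, timezone

def from_utc(ts, offset_hours):
    """Treat the naive timestamp ts as UTC and render it as the wall clock
    at a fixed offset, returning a naive timestamp again. Because the UTC
    interpretation is explicit, the system default timezone never enters."""
    target = timezone(timedelta(hours=offset_hours))
    return (ts.replace(tzinfo=timezone.utc)
              .astimezone(target)
              .replace(tzinfo=None))

t = datetime(2016, 6, 20, 12, 0, 0)
print(from_utc(t, -7))  # 2016-06-20 05:00:00
print(from_utc(t, 9))   # 2016-06-20 21:00:00
```

The bug arose precisely when an implementation routed through the system default zone instead of pinning the UTC interpretation as above.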
[jira] [Updated] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
     [ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-18105:
-------------------------------
    Description:
When lz4 is used to compress the shuffle files, it may fail to decompress them with "stream is corrupt":
{code}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
        at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
        at org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
        at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
        at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
        at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
        at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
        at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
https://github.com/jpountz/lz4-java/issues/89

  was:
When lz4 is used to compress the shuffle files, it may fail to decompress them with "stream is corrupt"
https://github.com/jpountz/lz4-java/issues/89
[jira] [Commented] (SPARK-18100) Improve the performance of get_json_object using Gson
[ https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609078#comment-15609078 ] Davies Liu commented on SPARK-18100: [~viirya] Jackson does not support it either > Improve the performance of get_json_object using Gson > - > > Key: SPARK-18100 > URL: https://issues.apache.org/jira/browse/SPARK-18100 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > Based on some benchmark here: > http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, > which said Gson could be much faster than Jackson, maybe it could be used to > improve the performance of get_json_object -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
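For readers unfamiliar with the function being optimized: `get_json_object` takes a JSON string and a path like `$.a.b` and returns the value at that path, or NULL when the path is missing or the string is not valid JSON. The following is a rough plain-Python sketch of those semantics for illustration only; it is not the Spark implementation (Spark returns the result as a string, while this sketch returns the parsed Python value for simplicity).

```python
import json

def get_json_object_sketch(json_str, path):
    """Rough sketch of Spark SQL's get_json_object semantics:
    '$.a.b' walks field a, then field b; missing path or bad JSON -> None.
    Illustration only -- not the actual Spark (Jackson-based) implementation."""
    try:
        obj = json.loads(json_str)
    except ValueError:
        return None  # invalid JSON input yields NULL in Spark
    for key in path.lstrip("$.").split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None  # path not found
        obj = obj[key]
    return obj

print(get_json_object_sketch('{"a": {"b": 1}}', "$.a.b"))  # 1
```

The benchmark question in the ticket is purely about which JSON parser backs this walk (Gson vs. Jackson); the extraction semantics stay the same either way.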
[jira] [Updated] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18105: --- Priority: Blocker (was: Major) > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > https://github.com/jpountz/lz4-java/issues/89
[jira] [Created] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
Davies Liu created SPARK-18105: -- Summary: LZ4 failed to decompress a stream of shuffled data Key: SPARK-18105 URL: https://issues.apache.org/jira/browse/SPARK-18105 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu Assignee: Davies Liu When lz4 is used to compress the shuffle files, it may fail to decompress it as "stream is corrupt" https://github.com/jpountz/lz4-java/issues/89
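While the decompression bug is open, one workaround that is sometimes applied (not stated in this ticket; the choice of codec here is an assumption) is to move the shuffle compression off LZ4 entirely via Spark's `spark.io.compression.codec` setting, which accepts `lz4`, `lzf`, or `snappy`:

```shell
# Hypothetical workaround, not from the ticket: switch Spark's block
# compression codec away from LZ4 until the "stream is corrupted" bug is
# fixed. The application jar name below is a placeholder.
spark-submit \
  --conf spark.io.compression.codec=snappy \
  your-application.jar
```

The same property can be set in `spark-defaults.conf`; the trade-off is somewhat lower compression ratio and different CPU cost than LZ4.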
[jira] [Created] (SPARK-18102) Failed to deserialize the result of task
Davies Liu created SPARK-18102: -- Summary: Failed to deserialize the result of task Key: SPARK-18102 URL: https://issues.apache.org/jira/browse/SPARK-18102 Project: Spark Issue Type: Bug Reporter: Davies Liu {code} 16/10/25 15:17:04 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message. java.lang.ClassNotFoundException: org.apache.spark.util*SerializableBuffer not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@3d98d138 at com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:115) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108) at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:259) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:308) at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:258) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:257) at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:578) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel
[jira] [Updated] (SPARK-18100) Improve the performance of get_json_object using Gson
[ https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18100: --- Issue Type: Improvement (was: Bug) > Improve the performance of get_json_object using Gson > - > > Key: SPARK-18100 > URL: https://issues.apache.org/jira/browse/SPARK-18100 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > Based on some benchmark here: > http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, > which said Gson could be much faster than Jackson, maybe it could be used to > improve the performance of get_json_object
[jira] [Created] (SPARK-18100) Improve the performance of get_json_object using Gson
Davies Liu created SPARK-18100: -- Summary: Improve the performance of get_json_object using Gson Key: SPARK-18100 URL: https://issues.apache.org/jira/browse/SPARK-18100 Project: Spark Issue Type: Bug Reporter: Davies Liu Based on some benchmark here: http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, which said Gson could be much faster than Jackson, maybe it could be used to improve the performance of get_json_object
[jira] [Updated] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt
[ https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18097: --- Description: When the schema of Hive table is broken, we can't drop the table using Spark SQL, for example {code} Error in SQL statement: QueryExecutionException: FAILED: IllegalArgumentException Error: > expected at the position 10 of 'ss:string:struct<>' but ':' is found. at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:336) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:480) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:104) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:353) at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:351) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269) at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:351) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:228) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72) at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:227) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:255) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:126) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:267) at org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:753) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) at org.apache.spark.sql.Dataset.(Dataset.scala:186) at org.apache.spark.sql.Dataset.(Dataset.scala:167) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) at org.apache.spark.sql.SparkSession.
[jira] [Updated] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt
[ https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18097: --- Description: When the schema of Hive table is broken, we can't drop the table using Spark SQL, for example {code} Error in SQL statement: QueryExecutionException: FAILED: IllegalArgumentException Error: > expected at the position 10 of 'ss:string:struct<>' but ':' is found. {code} was: When the schema of Hive table is broken, we can't drop the table using Spark SQL, for example {code} Error in SQL statement: QueryExecutionException: FAILED: IllegalArgumentException Error: > expected at the position 4443 of 'struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_EDDIRECT:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_THIRDPARTY:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DOMINION:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,EBIZAUTOS:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,VAST_HOSTED:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CO:string:struct' but ':' is found. 
{code} > Can't drop a table from Hive if the schema is corrupt > - > > Key: SPARK-18097 > URL: https://issues.apache.org/jira/browse/SPARK-18097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > > When the schema of Hive table is broken, we can't drop the table using Spark > SQL, for example > {code} > Error in SQL statement: QueryExecutionException: FAILED: > IllegalArgumentException Error: > expected at the position 10 of > 'ss:string:struct<>' but ':' is found. > {code}
[jira] [Created] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt
Davies Liu created SPARK-18097: -- Summary: Can't drop a table from Hive if the schema is corrupt Key: SPARK-18097 URL: https://issues.apache.org/jira/browse/SPARK-18097 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Davies Liu When the schema of Hive table is broken, we can't drop the table using Spark SQL, for example {code} Error in SQL statement: QueryExecutionException: FAILED: IllegalArgumentException Error: > expected at the position 4443 of 'struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_EDDIRECT:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_THIRDPARTY:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DOMINION:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,EBIZAUTOS:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,VAST_HOSTED:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CO:string:struct' but ':' is found. {code}
[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18055: --- Description: Try to apply flatMap() on Dataset column which of of type com.A.B Here's a schema of a dataset: {code} root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string {code} flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesnt work on dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. 
There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code} was: Try to apply flatMap() on Dataset column which of of type com.A.B Here's a schema of a dataset: root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesnt work on dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. 
There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code} > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class > com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.e
[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18055: --- Description: Try to apply flatMap() on Dataset column which of of type com.A.B Here's a schema of a dataset: {code} root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string {code} flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesnt work on dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.A.B in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. 
There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code} was: Try to apply flatMap() on Dataset column which of of type com.A.B Here's a schema of a dataset: {code} root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string {code} flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesnt work on dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. 
There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code} > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at
[jira] [Created] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
Davies Liu created SPARK-18055: -- Summary: Dataset.flatMap can't work with types from customized jar Key: SPARK-18055 URL: https://issues.apache.org/jira/browse/SPARK-18055 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Davies Liu Try to apply flatMap() on a Dataset column which is of type com.A.B Here's a schema of a dataset: root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesn't work on the Dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code}
[jira] [Updated] (SPARK-18053) ARRAY equality is broken in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18053: --- Assignee: Wenchen Fan > ARRAY equality is broken in Spark 2.0 > - > > Key: SPARK-18053 > URL: https://issues.apache.org/jira/browse/SPARK-18053 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Cheng Lian >Assignee: Wenchen Fan > Labels: correctness > > The following Spark shell reproduces this issue: > {code} > case class Test(a: Seq[Int]) > Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t") > sql("SELECT a FROM t WHERE a = array(1)").show() > // +---+ > // | a| > // +---+ > // +---+ > sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show() > // +---+ > // | a| > // +---+ > // |[1]| > // +---+ > {code}
[jira] [Created] (SPARK-18037) Event listener should be aware of multiple tries of same stage
Davies Liu created SPARK-18037: -- Summary: Event listener should be aware of multiple tries of same stage Key: SPARK-18037 URL: https://issues.apache.org/jira/browse/SPARK-18037 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu A stage could be resubmitted before all the tasks from the previous attempt have finished; the event listener then mixes the attempts up, producing a confusing count of active tasks (it can go negative).
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592514#comment-15592514 ] Davies Liu commented on SPARK-10915: [~jason.white] When an aggregate function is applied, the order of the input rows is not defined (even if you have an ORDER BY before the aggregate). If the order matters, you will have to use collect_list and a UDF. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas.
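The collect_list-plus-UDF pattern recommended in the comment above can be shown in plain Python terms (not PySpark API; the user/timestamp data below is made up for illustration): because rows reach an aggregate in no defined order, first collect each group's values, then sort and aggregate inside your own function.

```python
from collections import defaultdict

# Plain-Python sketch of the suggested pattern: collect per-group values
# (the role of Spark's collect_list), then impose the order inside the
# "UDF" before aggregating, since the aggregate's input order is undefined.
rows = [
    {"user": "a", "ts": 3, "event": "buy"},
    {"user": "a", "ts": 1, "event": "view"},
    {"user": "a", "ts": 2, "event": "cart"},
]

groups = defaultdict(list)
for row in rows:  # analogous to groupBy("user").agg(collect_list(...))
    groups[row["user"]].append((row["ts"], row["event"]))

def first_event(pairs):
    """The 'UDF': sort by timestamp ourselves, then take the earliest event."""
    return sorted(pairs)[0][1]

result = {user: first_event(pairs) for user, pairs in groups.items()}
print(result)  # {'a': 'view'}
```

The key point is that the sort happens inside the per-group function, not in a preceding ORDER BY, which Spark is free to ignore across the shuffle.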
[jira] [Created] (SPARK-18032) Spark test failed as OOM in jenkins
Davies Liu created SPARK-18032: -- Summary: Spark test failed as OOM in jenkins Key: SPARK-18032 URL: https://issues.apache.org/jira/browse/SPARK-18032 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Josh Rosen I saw some tests fail with OOM recently, for example: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/1998/console#l10n-footer Maybe we should increase the heap size, since we keep adding more stuff to Spark and its tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality
Davies Liu created SPARK-18031: -- Summary: Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality Key: SPARK-18031 URL: https://issues.apache.org/jira/browse/SPARK-18031 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite&test_name=basic+functionality -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
Davies Liu created SPARK-18030: -- Summary: Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite Key: SPARK-18030 URL: https://issues.apache.org/jira/browse/SPARK-18030 Project: Spark Issue Type: Bug Components: Streaming Reporter: Davies Liu Assignee: Tathagata Das https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17388) Support for inferring type date/timestamp/decimal for partition column
[ https://issues.apache.org/jira/browse/SPARK-17388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17388. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14947 [https://github.com/apache/spark/pull/14947] > Support for inferring type date/timestamp/decimal for partition column > -- > > Key: SPARK-17388 > URL: https://issues.apache.org/jira/browse/SPARK-17388 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > Fix For: 2.1.0 > > > Currently, Spark only supports inferring {{IntegerType}}, {{LongType}}, > {{DoubleType}} and {{StringType}}. > {{DecimalType}} is tried, but a value is never actually inferred as > {{DecimalType}} because {{DoubleType}} is tried first. > Also, {{DateType}} and {{TimestampType}} could be inferred. It seems > pretty common to use both for a partition column. > It'd be great if values could be inferred as these types rather than just {{StringType}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
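The requested behavior amounts to a try-in-order cast chain over the raw partition-directory string, with the narrower or more specific types tried before the ones that would swallow them. The sketch below is a hedged plain-Python illustration of that idea, not Spark's implementation; it tries decimal before any floating type (the opposite of the problematic order the issue describes) and recognizes ISO dates before falling back to string:

```python
# Illustrative type-inference chain for partition values: try the more
# specific parsers first, fall back to the raw string when none apply.
from datetime import date
from decimal import Decimal, InvalidOperation

def infer_partition_value(raw):
    # Order matters: int before Decimal, Decimal before any float type
    # (otherwise decimals would be claimed by the float parser), dates
    # next, string as the final fallback.
    for cast in (int, Decimal, date.fromisoformat):
        try:
            return cast(raw)
        except (ValueError, InvalidOperation):
            pass
    return raw  # StringType-like fallback

assert infer_partition_value("10") == 10
assert infer_partition_value("1.5") == Decimal("1.5")
assert infer_partition_value("2016-10-01") == date(2016, 10, 1)
assert infer_partition_value("foo") == "foo"
```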
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583638#comment-15583638 ] Davies Liu commented on SPARK-10915: Currently all the aggregate functions are implemented in Scala and execute one row at a time. This will not work for a Python UDAF; the per-row overhead between the JVM and the Python process would make it extremely slow. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support Python-defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582918#comment-15582918 ] Davies Liu commented on SPARK-10915: Python UDFs are executed in batch mode to achieve reasonable performance. A UDAF would be much harder to implement in batch mode, especially when it's used together with other aggregate functions. One possible solution could be to apply a Python UDF after collect_list; you can already do this as a workaround today. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support Python-defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17845: --- Fix Version/s: 2.1.0 > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > {code} > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > In Spark's DataFrame API, we use integer values to indicate relative position: > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue means "UNBOUNDED FOLLOWING" > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsBetween(0, Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetween(Long.MinValue, Long.MaxValue) > {code} > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > 1. The API is not self-evident. There is no way for a new user to figure out > how to indicate an unbounded frame by looking at just the API. The user has > to read the doc to figure this out. > 2. It is weird that Long.MinValue or Long.MaxValue has a special meaning. > 3. 
Different languages have different min/max values, e.g. in Python we use > -sys.maxsize and +sys.maxsize. > To make this API less confusing, we have a few options: > Option 1. Add the following (additional) methods: > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) // this one exists already > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > {code} > This is obviously very verbose, but is very similar to how these functions > are done in SQL, and is perhaps the most obvious to end users, especially if > they come from SQL background. > Option 2. Decouple the specification for frame begin and frame end into two > functions. Assume the boundary is unlimited unless specified. > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsToCurrent() or Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsFromCurrent() or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > // no need to specify > {code} > If we go with option 2, we should throw exceptions if users specify multiple > from's or to's. 
A variant of option 2 is to require explicit specification > of begin/end even in the case of unbounded boundary, e.g.: > {code} > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
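Option 2's semantics (separate calls for the frame start and end, both defaulting to unbounded, with an error on double specification as the JIRA suggests) can be sketched in a few lines of Python. All names here are illustrative; this is a design sketch, not the API Spark ultimately shipped:

```python
# Hypothetical sketch of "Option 2": decoupled rows_from/rows_to builder
# calls, unbounded unless specified, raising if a boundary is set twice.

def _bound_sql(n, unbounded_word):
    """Render one frame bound as ANSI SQL text; None means unbounded."""
    if n is None:
        return f"UNBOUNDED {unbounded_word}"
    if n == 0:
        return "CURRENT ROW"
    return f"{-n} PRECEDING" if n < 0 else f"{n} FOLLOWING"

class RowFrame:
    def __init__(self):
        self._start = None       # None = unbounded
        self._end = None
        self._start_set = False
        self._end_set = False

    def rows_from(self, n):
        if self._start_set:
            raise ValueError("multiple from's specified")
        self._start, self._start_set = n, True
        return self

    def rows_to(self, n):
        if self._end_set:
            raise ValueError("multiple to's specified")
        self._end, self._end_set = n, True
        return self

    def sql(self):
        return ("ROWS BETWEEN " + _bound_sql(self._start, "PRECEDING")
                + " AND " + _bound_sql(self._end, "FOLLOWING"))

assert RowFrame().rows_from(-3).rows_to(3).sql() == \
    "ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING"
assert RowFrame().rows_to(-3).sql() == \
    "ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING"
assert RowFrame().sql() == \
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING"
```

The sketch makes the trade-off concrete: unbounded ends need no sentinel value at all, at the cost of a stateful builder that must police repeated calls.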
[jira] [Resolved] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17845. Resolution: Fixed > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > {code} > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > In Spark's DataFrame API, we use integer values to indicate relative position: > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue means "UNBOUNDED FOLLOWING" > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsBetween(0, Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetween(Long.MinValue, Long.MaxValue) > {code} > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > 1. The API is not self-evident. There is no way for a new user to figure out > how to indicate an unbounded frame by looking at just the API. The user has > to read the doc to figure this out. > 2. It is weird that Long.MinValue or Long.MaxValue has a special meaning. > 3. 
Different languages have different min/max values, e.g. in Python we use > -sys.maxsize and +sys.maxsize. > To make this API less confusing, we have a few options: > Option 1. Add the following (additional) methods: > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) // this one exists already > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > {code} > This is obviously very verbose, but is very similar to how these functions > are done in SQL, and is perhaps the most obvious to end users, especially if > they come from SQL background. > Option 2. Decouple the specification for frame begin and frame end into two > functions. Assume the boundary is unlimited unless specified. > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsToCurrent() or Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsFromCurrent() or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > // no need to specify > {code} > If we go with option 2, we should throw exceptions if users specify multiple > from's or to's. 
A variant of option 2 is to require explicit specification > of begin/end even in the case of unbounded boundary, e.g.: > {code} > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566858#comment-15566858 ] Davies Liu commented on SPARK-15621: [~rezasafi] We usually do not backport this kind of improvement; it's too large and risky for a maintenance release (2.0.x), sorry about that. > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Davies Liu > Fix For: 2.1.0 > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with > the identity `lambda x: x` udf function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17856) JVM Crash during tests: pyspark.mllib.linalg.distributed
Davies Liu created SPARK-17856: -- Summary: JVM Crash during tests: pyspark.mllib.linalg.distributed Key: SPARK-17856 URL: https://issues.apache.org/jira/browse/SPARK-17856 Project: Spark Issue Type: Bug Reporter: Davies Liu https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66531/consoleFull -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17806) Incorrect result when work with data from parquet
[ https://issues.apache.org/jira/browse/SPARK-17806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17806. Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 Issue resolved by pull request 15390 [https://github.com/apache/spark/pull/15390] > Incorrect result when work with data from parquet > - > > Key: SPARK-17806 > URL: https://issues.apache.org/jira/browse/SPARK-17806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov >Assignee: Davies Liu >Priority: Blocker > Labels: correctness > Fix For: 2.0.2, 2.1.0 > > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.{StructField, StructType} > import org.apache.spark.sql.types.DataTypes._ > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq( > """{"a":1,"b":1,"c":1}""", > """{"a":1,"b":1,"c":2}""" > )) > sc.read.schema(StructType(Seq( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("c", LongType) > ))).json(jsonRDD).write.parquet("/tmp/test") > val df = sc.read.load("/tmp/test") > df.join(df, Seq("a", "b", "c"), "left_outer").show() > {code} > returns: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 1| > | 1| 1| 2| > | 1| 1| 2| > +---+---+---+ > {code} > Expected result: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 2| > +---+---+---+ > {code} > If I use this code without saving to parquet it works fine. If you change > type of `c` column to `IntegerType` it also works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1759#comment-1759 ] Davies Liu commented on SPARK-17738: I will look into that. > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17806) Incorrect result when work with data from parquet
[ https://issues.apache.org/jira/browse/SPARK-17806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-17806: -- Assignee: Davies Liu > Incorrect result when work with data from parquet > - > > Key: SPARK-17806 > URL: https://issues.apache.org/jira/browse/SPARK-17806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov >Assignee: Davies Liu >Priority: Blocker > Labels: correctness > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.{StructField, StructType} > import org.apache.spark.sql.types.DataTypes._ > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq( > """{"a":1,"b":1,"c":1}""", > """{"a":1,"b":1,"c":2}""" > )) > sc.read.schema(StructType(Seq( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("c", LongType) > ))).json(jsonRDD).write.parquet("/tmp/test") > val df = sc.read.load("/tmp/test") > df.join(df, Seq("a", "b", "c"), "left_outer").show() > {code} > returns: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 1| > | 1| 1| 2| > | 1| 1| 2| > +---+---+---+ > {code} > Expected result: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 2| > +---+---+---+ > {code} > If I use this code without saving to parquet it works fine. If you change > type of `c` column to `IntegerType` it also works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17806) Incorrect result when work with data from parquet
[ https://issues.apache.org/jira/browse/SPARK-17806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17806: --- Priority: Blocker (was: Critical) > Incorrect result when work with data from parquet > - > > Key: SPARK-17806 > URL: https://issues.apache.org/jira/browse/SPARK-17806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov >Priority: Blocker > Labels: correctness > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.{StructField, StructType} > import org.apache.spark.sql.types.DataTypes._ > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq( > """{"a":1,"b":1,"c":1}""", > """{"a":1,"b":1,"c":2}""" > )) > sc.read.schema(StructType(Seq( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("c", LongType) > ))).json(jsonRDD).write.parquet("/tmp/test") > val df = sc.read.load("/tmp/test") > df.join(df, Seq("a", "b", "c"), "left_outer").show() > {code} > returns: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 1| > | 1| 1| 2| > | 1| 1| 2| > +---+---+---+ > {code} > Expected result: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 2| > +---+---+---+ > {code} > If I use this code without saving to parquet it works fine. If you change > type of `c` column to `IntegerType` it also works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549494#comment-15549494 ] Davies Liu commented on SPARK-16922: Thanks for the feedback, that's reasonable. > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15390. Resolution: Fixed > Memory management issue in complex DataFrame join and filter > > > Key: SPARK-15390 > URL: https://issues.apache.org/jira/browse/SPARK-15390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: branch-2.0, 16 workers >Reporter: Joseph K. Bradley >Assignee: Davies Liu > Fix For: 2.0.1 > > > See [SPARK-15389] for a description of the code which produces this bug. I > am filing this as a separate JIRA since the bug in 2.0 is different. > In 2.0, the code fails with some memory management error. Here is the > stacktrace: > {code} > OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support > was removed in 8.0 > 16/05/18 19:23:16 ERROR Uncaught throwable from user code: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange SinglePartition, None > +- WholeStageCodegen >: +- TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L]) >: +- Project >:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None >: :- INPUT >: +- Project [id#110L] >: +- Filter (degree#115 > 200) >: +- TungstenAggregate(key=[id#110L], > functions=[(count(1),mode=Final,isDistinct=false)], > output=[id#110L,degree#115]) >:+- INPUT >:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint])) >: +- WholeStageCodegen >: : +- Project [row#66.id AS id#70L] >: : +- Filter isnotnull(row#66.id) >: :+- INPUT >: +- Scan ExistingRDD[row#66,uniq_id#67] >+- Exchange hashpartitioning(id#110L, 200), None > +- WholeStageCodegen > : +- TungstenAggregate(key=[id#110L], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[id#110L,count#136L]) > : +- Filter isnotnull(id#110L) > :+- INPUT > +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L] > +- WholeStageCodegen 
>: +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT > (src#2L = dst#3L)) >: +- INPUT >+- InMemoryTableScan [src#2L,dst#3L], > [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation > [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, > offheap=false, deserialized=true, replication=1), WholeStageCodegen, None > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240) > at > 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNe
[jira] [Updated] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15390: --- Fix Version/s: 2.0.1 > Memory management issue in complex DataFrame join and filter > > > Key: SPARK-15390 > URL: https://issues.apache.org/jira/browse/SPARK-15390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: branch-2.0, 16 workers >Reporter: Joseph K. Bradley >Assignee: Davies Liu > Fix For: 2.0.1 > > > See [SPARK-15389] for a description of the code which produces this bug. I > am filing this as a separate JIRA since the bug in 2.0 is different. > In 2.0, the code fails with some memory management error. Here is the > stacktrace: > {code} > OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support > was removed in 8.0 > 16/05/18 19:23:16 ERROR Uncaught throwable from user code: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange SinglePartition, None > +- WholeStageCodegen >: +- TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L]) >: +- Project >:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None >: :- INPUT >: +- Project [id#110L] >: +- Filter (degree#115 > 200) >: +- TungstenAggregate(key=[id#110L], > functions=[(count(1),mode=Final,isDistinct=false)], > output=[id#110L,degree#115]) >:+- INPUT >:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint])) >: +- WholeStageCodegen >: : +- Project [row#66.id AS id#70L] >: : +- Filter isnotnull(row#66.id) >: :+- INPUT >: +- Scan ExistingRDD[row#66,uniq_id#67] >+- Exchange hashpartitioning(id#110L, 200), None > +- WholeStageCodegen > : +- TungstenAggregate(key=[id#110L], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[id#110L,count#136L]) > : +- Filter isnotnull(id#110L) > :+- INPUT > +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L] > +- 
WholeStageCodegen >: +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT > (src#2L = dst#3L)) >: +- INPUT >+- InMemoryTableScan [src#2L,dst#3L], > [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation > [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, > offheap=false, deserialized=true, replication=1), WholeStageCodegen, None > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withN
[jira] [Comment Edited] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546675#comment-15546675 ] Davies Liu edited comment on SPARK-15390 at 10/4/16 9:11 PM: - @Iulian Dragos I think this is a different issue, fixed by https://github.com/apache/spark/pull/14373 in 2.0.1. was (Author: davies): @Iulian Dragos I think this is a different issue, fixed by https://github.com/apache/spark/pull/14373 and https://github.com/apache/spark/pull/14464/files in 2.0.1.
[jira] [Updated] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15390: --- Fix Version/s: (was: 2.0.0)
[jira] [Commented] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546675#comment-15546675 ] Davies Liu commented on SPARK-15390: @Iulian Dragos I think this is a different issue, fixed by https://github.com/apache/spark/pull/14373 and https://github.com/apache/spark/pull/14464/files in 2.0.1.
[jira] [Resolved] (SPARK-17679) Remove unnecessary Py4J ListConverter patch
[ https://issues.apache.org/jira/browse/SPARK-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17679. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15254 [https://github.com/apache/spark/pull/15254] > Remove unnecessary Py4J ListConverter patch > --- > > Key: SPARK-17679 > URL: https://issues.apache.org/jira/browse/SPARK-17679 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Jason White >Priority: Minor > Fix For: 2.1.0 > > > In SPARK-6949 davies documented a couple of bugs with Py4J that prevented > Spark from registering a converter for date and datetime objects. Patched in > https://github.com/apache/spark/pull/5570. > Specifically https://github.com/bartdag/py4j/issues/160 dealt with > ListConverter automatically converting bytearrays into ArrayList instead of > leaving it alone. > Py4J #160 has since been fixed in Py4J, since the 0.9 release a couple of > months after Spark #5570. According to spark-core's pom.xml, we're using > 0.10.3. > We should remove this patch on ListConverter since the upstream package no > longer has this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
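The bytearray hazard described above can be illustrated in plain Python. This is a hypothetical sketch of the conversion pitfall only, not Py4J's actual ListConverter code; both helper functions below are illustrative names:

```python
# bytearray supports the sequence protocol, so a converter that treats
# every sequence as a Java List would wrongly explode raw bytes into a
# list of ints instead of passing them through as a byte array.
payload = bytearray(b"\x00\x80\xff")

def naive_list_convert(obj):
    # hypothetical over-eager rule: anything iterable becomes a list
    return list(obj)

def patched_convert(obj):
    # leave byte-like objects alone; convert only real sequences
    if isinstance(obj, (bytes, bytearray)):
        return obj
    return list(obj)

assert naive_list_convert(payload) == [0, 128, 255]            # bytes mangled
assert patched_convert(payload) == bytearray(b"\x00\x80\xff")  # preserved
```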
[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17738: --- Fix Version/s: (was: 2.2.0) 2.1.0 > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/
[jira] [Resolved] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17738. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15305 [https://github.com/apache/spark/pull/15305]
[jira] [Created] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
Davies Liu created SPARK-17738: -- Summary: Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract Key: SPARK-17738 URL: https://issues.apache.org/jira/browse/SPARK-17738 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Davies Liu https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/
[jira] [Updated] (SPARK-17494) Floor/ceil of decimal returns wrong result if it's in compact format
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17494: --- Summary: Floor/ceil of decimal returns wrong result if it's in compact format (was: Floor function rounds up during join) > Floor/ceil of decimal returns wrong result if it's in compact format > > > Key: SPARK-17494 > URL: https://issues.apache.org/jira/browse/SPARK-17494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Gokhan Civan >Assignee: Davies Liu > Labels: correctness > > If you create tables as follows: > create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num; > create table b as select 'A' as str; > Then > select floor(num) from a; > returns 10 > but > select floor(num) from a join b on a.str = b.str; > returns 11
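For reference, the floor/ceil semantics that the compact-decimal path violated can be checked against Python's standard decimal module. This is a plain-Python sketch of the expected behavior, not Spark's Decimal implementation:

```python
import math
from decimal import Decimal

num = Decimal("10.5")
# floor rounds toward negative infinity; it must never round 10.5 up to 11
assert math.floor(num) == 10
assert math.ceil(num) == 11
# and for a negative value, floor goes down to -11
assert math.floor(Decimal("-10.5")) == -11
```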
[jira] [Updated] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17100: --- Fix Version/s: (was: 2.2.0) 2.1.0 > pyspark filter on a udf column after join gives > java.lang.UnsupportedOperationException > --- > > Key: SPARK-17100 > URL: https://issues.apache.org/jira/browse/SPARK-17100 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3. >Reporter: Tim Sell >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > Attachments: bug.py, test_bug.py > > > In pyspark, when filtering on a udf derived column after some join types, > the optimized logical plan results in a > java.lang.UnsupportedOperationException. > I could not replicate this in scala code from the shell, just python. It is a > pyspark regression from spark 1.6.2. > This can be replicated with: bin/spark-submit bug.py > {code:python:title=bug.py} > import pyspark.sql.functions as F > from pyspark.sql import Row, SparkSession > if __name__ == '__main__': > spark = SparkSession.builder.appName("test").getOrCreate() > left = spark.createDataFrame([Row(a=1)]) > right = spark.createDataFrame([Row(a=1)]) > df = left.join(right, on='a', how='left_outer') > df = df.withColumn('b', F.udf(lambda x: 'x')(df.a)) > df = df.filter('b = "x"') > df.explain(extended=True) > {code} > The output is: > {code} > == Parsed Logical Plan == > 'Filter ('b = x) > +- Project [a#0L, (a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Analyzed Logical Plan == > a: bigint, b: string > Filter (b#8 = x) > +- Project [a#0L, (a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Optimized Logical Plan == > java.lang.UnsupportedOperationException: Cannot evaluate expression: > (input[0, bigint, true]) > == Physical Plan == > 
java.lang.UnsupportedOperationException: Cannot evaluate expression: > (input[0, bigint, true]) > {code} > It fails when the join is: > * how='outer', on=column expression > * how='left_outer', on=string or column expression > * how='right_outer', on=string or column expression > It passes when the join is: > * how='inner', on=string or column expression > * how='outer', on=string > I made some tests to demonstrate each of these. > Run with bin/spark-submit test_bug.py
[jira] [Resolved] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17100. Resolution: Fixed Fix Version/s: 2.2.0 2.0.1 Issue resolved by pull request 15103 [https://github.com/apache/spark/pull/15103]
[jira] [Updated] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16439: --- Fix Version/s: (was: 2.2.0) 2.1.0 > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Maciej Bryński >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect
[jira] [Assigned] (SPARK-17494) Floor function rounds up during join
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-17494: -- Assignee: Davies Liu
[jira] [Updated] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16439: --- Assignee: Davies Liu (was: Maciej Bryński)
[jira] [Resolved] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16439. Resolution: Fixed Fix Version/s: (was: 2.0.0) 2.2.0 2.0.1 Issue resolved by pull request 15106 [https://github.com/apache/spark/pull/15106]
[jira] [Reopened] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-16439: We could bring the separator back for better readability.
[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491744#comment-15491744 ] Davies Liu commented on SPARK-16439: The separator was added on purpose; otherwise it's very difficult to read the numbers (especially the number of rows), since we need to count the digits to realize how large the number is. I think we should still keep it and fix the locale issue (using English should be enough); I will send a PR to add it back.
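The readability argument in the comment above can be sketched with Python's format specification, which groups digits without depending on the system locale. This is an illustration of the idea only; Spark's UI does its own Java-side formatting:

```python
# grouping digits with a separator makes the magnitude legible at a glance
rows = 123456789
assert f"{rows:,}" == "123,456,789"   # English-style grouping, no locale needed
assert f"{rows:,}".count(",") == 2

# without separators the reader has to count digits to gauge the magnitude
assert len(str(rows)) == 9
```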