[jira] [Resolved] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19872.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
                   2.1.1

> UnicodeDecodeError in Pyspark on sc.textFile read with repartition
> ------------------------------------------------------------------
>
>                 Key: SPARK-19872
>                 URL: https://issues.apache.org/jira/browse/SPARK-19872
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.0
>         Environment: Mac and EC2
>            Reporter: Brian Bruggeman
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.1, 2.2.0
>
> I'm receiving the following traceback:
> {code}
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
>     yield self.loads(stream)
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
>     return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
> {code}
> I created a text file (test.txt) with standard Linux newlines:
> {code}
> a
> b
> d
> e
> f
> g
> h
> i
> j
> k
> l
> {code}
> I then ran pyspark:
> {code}
> $ pyspark
> Python 2.7.13 (default, Dec 18 2016, 07:03:39)
> [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 17/03/08 13:59:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 17/03/08 13:59:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 2.7.13 (default, Dec 18 2016 07:03:39)
> SparkSession available as 'spark'.
> >>> sc.textFile('test.txt').collect()
> [u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l']
> >>> sc.textFile('test.txt', use_unicode=False).collect()
> ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l']
> >>> sc.textFile('test.txt', use_unicode=False).repartition(10).collect()
> ['\x80\x02]q\x01(U\x01aU\x01bU\x01cU\x01dU\x01eU\x01fU\x01ge.', '\x80\x02]q\x01(U\x01hU\x01iU\x01jU\x01kU\x01le.']
> >>> sc.textFile('test.txt').repartition(10).collect()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 810, in collect
>     return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/rdd.py", line 140, in _load_from_socket
>     for item in serializer.load_stream(rf):
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 539, in load_stream
>     yield self.loads(stream)
>   File "/usr/local/Cellar/apache-spark/2.1.0/libexec/python/pyspark/serializers.py", line 534, in loads
>     return s.decode("utf-8") if self.use_unicode else s
>   File "/Users/brianbruggeman/.envs/dg/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
> {code}
> This really looks like a bug in the `serializers.py` code.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
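The garbage returned by the `use_unicode=False` run points at the cause: `\x80\x02` is the header of a pickle protocol-2 payload, i.e. after `repartition()` the collected bytes are pickled batches, yet the text deserializer still tries to UTF-8-decode them. The failure mode can be reproduced without Spark; this is a sketch of the mechanism, not pyspark's actual code:

```python
import pickle

# repartition() re-serializes each partition as pickled batches;
# 0x80 0x02 is the pickle protocol-2 header seen in the output above.
batch = pickle.dumps([b"a", b"b", b"c"], protocol=2)
assert batch[:2] == b"\x80\x02"

# UTF-8-decoding the pickled payload fails exactly like the traceback:
try:
    batch.decode("utf-8")
except UnicodeDecodeError as e:
    assert e.start == 0  # byte 0x80 in position 0: invalid start byte
```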
[jira] [Assigned] (SPARK-19872) UnicodeDecodeError in Pyspark on sc.textFile read with repartition
[ https://issues.apache.org/jira/browse/SPARK-19872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-19872:
----------------------------------
    Assignee: Hyukjin Kwon
[jira] [Resolved] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
[ https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19561.
--------------------------------
       Resolution: Fixed
         Assignee: Jason White
    Fix Version/s: 2.2.0
                   2.1.1

> Pyspark Dataframes don't allow timestamps near epoch
> ----------------------------------------------------
>
>                 Key: SPARK-19561
>                 URL: https://issues.apache.org/jira/browse/SPARK-19561
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.0.1, 2.1.0
>            Reporter: Jason White
>            Assignee: Jason White
>             Fix For: 2.1.1, 2.2.0
>
> Pyspark does not allow timestamps at or near the epoch to be created in a DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299
> TimestampType.toInternal converts a datetime object to a number representing microseconds since the epoch. For all times more than 2148 seconds before or after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J automatically serializes it as a long.
> However, for times within this range (~35 minutes before or after the epoch), Py4J serializes it as an int. When creating the object on the Scala side, ints are not recognized and the value goes to null. This leads to null values in non-nullable fields, and corrupted Parquet files.
> The solution is trivial: force TimestampType.toInternal to always return a long.
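The ~35-minute window follows directly from the arithmetic. A minimal sketch (the function name mirrors `toInternal` but this is not pyspark's implementation; in Python 2 the fix amounts to wrapping the result in `long()`):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def to_internal_micros(dt):
    """Microseconds since the Unix epoch as an exact integer."""
    delta = dt - EPOCH
    return delta.days * 86400 * 10**6 + delta.seconds * 10**6 + delta.microseconds

# Within ~35 minutes of the epoch the value fits in a signed 32-bit int,
# which is the range where Py4J picked the int serializer and the Scala
# side dropped the value to null.
near = to_internal_micros(EPOCH + timedelta(minutes=30))
far = to_internal_micros(EPOCH + timedelta(seconds=2148))
assert near < 2**31   # 1.8e9 microseconds: serialized as int (the bug)
assert far >= 2**31   # 2.148e9 microseconds: serialized as long (works)
```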
[jira] [Resolved] (SPARK-19500) Fail to spill the aggregated hash map when radix sort is used
[ https://issues.apache.org/jira/browse/SPARK-19500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19500.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
                   2.0.3
                   2.1.1

Issue resolved by pull request 16844
[https://github.com/apache/spark/pull/16844]

> Fail to spill the aggregated hash map when radix sort is used
> -------------------------------------------------------------
>
>                 Key: SPARK-19500
>                 URL: https://issues.apache.org/jira/browse/SPARK-19500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>             Fix For: 2.1.1, 2.0.3, 2.2.0
>
> Radix sort requires that at most half of the array be occupied. But the aggregated hash map has an off-by-one bug that can leave one more item than half of the array; when this happens, the spill fails as:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 171 in stage 10.0 failed 4 times, most recent failure: Lost task 171.3 in stage 10.0 (TID 23899, 10.145.253.180, executor 24): java.lang.IllegalStateException: There is no space for new record
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:227)
>   at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:130)
>   at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:250)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$2.hasNext(WholeStageCodegenExec.scala:396)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:137)
>   ... 32 more
> Caused by: java.lang.IllegalStateException: There is no space for new record
>   at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:227)
>   at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:130)
>   at org.apache.spar
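The invariant the bug violates can be sketched in a few lines. This is a toy model of the half-array rule, not Spark's actual `UnsafeInMemorySorter` logic; names and the capacity check are illustrative:

```python
def can_insert(num_records, array_len, use_radix_sort):
    """Capacity check for a pointer array shared with radix sort.

    Radix sort needs the second half of the array as scratch space, so at
    most half of the slots may hold records. An off-by-one in this check
    (admitting one record past the halfway mark) is exactly what makes the
    later spill throw "There is no space for new record".
    """
    usable = array_len // 2 if use_radix_sort else array_len
    return num_records + 1 <= usable  # room for the record being added

# With an 8-slot array and radix sort, only 4 slots are usable:
assert can_insert(3, 8, use_radix_sort=True)      # 4th record still fits
assert not can_insert(4, 8, use_radix_sort=True)  # a 5th breaks the invariant
assert can_insert(7, 8, use_radix_sort=False)     # without radix sort, all 8 usable
```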
[jira] [Resolved] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19481.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0
                   2.1.1

Issue resolved by pull request 16825
[https://github.com/apache/spark/pull/16825]

> Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-19481
>                 URL: https://issues.apache.org/jira/browse/SPARK-19481
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Shell
>    Affects Versions: 2.1.0
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 2.1.1, 2.2.0
>
> The org.apache.spark.repl "cancelOnInterrupt" test leaks a SparkContext and makes the tests unstable. See:
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite&test_name=should+clone+and+clean+line+object+in+ClosureCleaner
[jira] [Assigned] (SPARK-19500) Fail to spill the aggregated hash map when radix sort is used
[ https://issues.apache.org/jira/browse/SPARK-19500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-19500:
----------------------------------
    Assignee: Davies Liu
[jira] [Created] (SPARK-19500) Fail to spill the aggregated hash map when radix sort is used
Davies Liu created SPARK-19500:
-------------------------------

             Summary: Fail to spill the aggregated hash map when radix sort is used
                 Key: SPARK-19500
                 URL: https://issues.apache.org/jira/browse/SPARK-19500
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Davies Liu
[jira] [Resolved] (SPARK-19415) Improve the implicit type conversion between numeric type and string to avoid precision loss
[ https://issues.apache.org/jira/browse/SPARK-19415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-19415.
--------------------------------
       Resolution: Duplicate
    Fix Version/s: 2.2.0

> Improve the implicit type conversion between numeric type and string to avoid precision loss
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19415
>                 URL: https://issues.apache.org/jira/browse/SPARK-19415
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Davies Liu
>             Fix For: 2.2.0
>
> Currently, Spark SQL converts both the numeric type and the string into DoubleType when the two children of an expression do not match (for example, comparing a LongType against a StringType); this causes precision loss in some cases.
> Some databases do a better job on this (for example, SQL Server [1]); we should follow that.
> [1] https://msdn.microsoft.com/en-us/library/ms191530.aspx
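The precision loss is a property of IEEE 754 doubles, which carry only a 53-bit significand, so distinct 64-bit longs above 2^53 collide once both sides of a comparison are widened to DoubleType. A minimal demonstration of the mechanism (plain Python, not Spark SQL):

```python
# Distinct longs above 2**53 map to the same double:
big = 2**53 + 1          # exactly representable as a long, not as a double
assert float(big) == float(2**53)   # the widening collapses them
assert big != 2**53                 # ...although the longs differ

# So a predicate like `long_col = '<string literal>'`, evaluated by casting
# both children to DoubleType, can match rows whose long values differ
# from the literal.
```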
[jira] [Created] (SPARK-19415) Improve the implicit type conversion between numeric type and string to avoid precision loss
Davies Liu created SPARK-19415:
-------------------------------

             Summary: Improve the implicit type conversion between numeric type and string to avoid precision loss
                 Key: SPARK-19415
                 URL: https://issues.apache.org/jira/browse/SPARK-19415
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Davies Liu
[jira] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
Davies Liu commented on SPARK-18105:

Re: LZ4 failed to decompress a stream of shuffled data

There is a workaround merged into Spark 2.1 for this type of failure (decompress it and try again); can you try that?

This message was sent by Atlassian JIRA
(v6.3.15#6346-sha1:dbc023d)
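The "decompress it and try again" pattern the comment refers to can be illustrated loosely; this is not Spark's actual workaround code, and `zlib` stands in for LZ4 so the sketch runs without extra dependencies:

```python
import zlib

def robust_decompress(data):
    """Loose sketch of a decompress-with-fallback read path.

    Try to decode the payload as a compressed stream; if the codec rejects
    it, fall back to treating the bytes as already-decompressed data
    instead of failing the whole shuffle fetch.
    """
    try:
        return zlib.decompress(data)
    except zlib.error:
        return data  # assume the payload was not compressed after all

compressed = zlib.compress(b"shuffle block")
assert robust_decompress(compressed) == b"shuffle block"
assert robust_decompress(b"plain bytes") == b"plain bytes"
```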
[jira] [Reopened] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reopened SPARK-14480:
--------------------------------

This patch has a regression: a column that contains an escaped newline can no longer be parsed correctly.

> Remove meaningless StringIteratorReader for CSV data source for better performance
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14480
>                 URL: https://issues.apache.org/jira/browse/SPARK-14480
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>             Fix For: 2.1.0
>
> Currently, the CSV data source reads and parses CSV data byte by byte (not line by line).
> In {{CSVParser.scala}}, there is a {{Reader}} wrapping an {{Iterator}}. I think it was made like this for better performance. However, there appear to be two problems.
> Firstly, it was actually not faster than processing line by line with {{Iterator}}, due to the additional logic needed to wrap the {{Iterator}} in a {{Reader}}.
> Secondly, this brought extra complexity because additional logic is needed to allow every line to be read byte by byte. So it was pretty difficult to figure out parsing issues (e.g. SPARK-14103). Actually, almost all the code in {{CSVParser}} might not be needed.
> I made a rough patch and tested this. The test results for the first problem are below:
> h4. Results
> - Original code with {{Reader}} wrapping {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 14116265034 | 2008277960 |
> - New code with {{Iterator}}
> ||End-to-end (ns)||Parse Time (ns)||
> | 13451699644 | 1549050564 |
> In more detail,
> h4. Method
> - The TPC-H lineitem table is being tested.
> - The results are collected for only 100 rows.
> - End-to-end tests and parsing-time tests are performed 10 times each, and averages are calculated.
> h4. Environment
> - Machine: MacBook Pro Retina
> - CPU: 4
> - Memory: 8GB
> h4. Dataset
> - [TPC-H|http://www.tpc.org/tpch/] lineitem table created with factor 1 ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen])
> - Size: 724.66 MB
> h4. Test Codes
> - Function to measure time
> {code}
> def time[A](f: => A) = {
>   val s = System.nanoTime
>   val ret = f
>   println("time: " + (System.nanoTime - s) / 1e6 + "ms")
>   ret
> }
> {code}
> - End-to-end test
> {code}
> val path = "lineitem.tbl"
> val df = sqlContext
>   .read
>   .format("csv")
>   .option("header", "false")
>   .option("delimiter", "|")
>   .load(path)
> time(df.take(100))
> {code}
> - Parsing time test for original (in {{BulkCsvParser}})
> {code}
> ...
> // `reader` is a wrapper for an Iterator.
> private val reader = new StringIteratorReader(iter)
> parser.beginParsing(reader)
> ...
> time(parser.parseNext())
> ...
> {code}
> - Parsing time test for new (in {{BulkCsvParser}})
> {code}
> ...
> time(parser.parseLine(iter.next()))
> ...
> {code}
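The trade-off being benchmarked (feeding the parser one line at a time versus wrapping the whole input in a stream) can be sketched in Python; here `csv` stands in for the CSV parser, the timing harness loosely mirrors the Scala `time` helper above, and all names are illustrative:

```python
import csv
import io
import time

def timed(f):
    """Rough Python analogue of the Scala `time` helper."""
    start = time.perf_counter_ns()
    result = f()
    print("time: %.3f ms" % ((time.perf_counter_ns() - start) / 1e6))
    return result

lines = ["%d|item-%d|%d" % (i, i, i * 10) for i in range(1000)]

# Variant 1: wrap all lines in a single stream and let the parser
# consume it byte by byte (the StringIteratorReader-style approach).
def parse_stream():
    reader = csv.reader(io.StringIO("\n".join(lines)), delimiter="|")
    return list(reader)

# Variant 2: feed the parser one already-split line at a time
# (the plain-Iterator approach the patch switches to).
def parse_lines():
    return [next(csv.reader([line], delimiter="|")) for line in lines]

# Both variants must produce identical rows; only the timings differ.
assert timed(parse_stream) == timed(parse_lines)
```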
[jira] [Closed] (SPARK-19375) na.fill() should not change the data type of column
[ https://issues.apache.org/jira/browse/SPARK-19375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu closed SPARK-19375.
------------------------------
    Resolution: Duplicate

> na.fill() should not change the data type of column
> ---------------------------------------------------
>
>                 Key: SPARK-19375
>                 URL: https://issues.apache.org/jira/browse/SPARK-19375
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.0, 2.2.0
>            Reporter: Davies Liu
>
> When these functions are called, a column of numeric type will be converted into DoubleType, which causes unexpected type casting and also precision loss for LongType:
> {code}
> def fill(value: Double)
> def fill(value: Double, cols: Array[String])
> def fill(value: Double, cols: Seq[String])
> {code}
> Workaround:
> {code}
> fill(Map("name" -> v))
> {code}
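The LongType precision loss mentioned above is observable with plain arithmetic: widening a large long to a double and back does not round-trip, because a double has only a 53-bit significand. A minimal illustration (not Spark code):

```python
# Round-tripping a large LongType value through DoubleType changes it:
original = 2**62 + 1
round_tripped = int(float(original))
assert round_tripped != original   # the widening altered the value
assert round_tripped == 2**62      # rounded to the nearest representable double
```

This is why a `fill(value: Double)` applied to a long column can silently corrupt values even when no nulls are filled in.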
[jira] [Created] (SPARK-19375) na.fill() should not change the data type of column
Davies Liu created SPARK-19375: -- Summary: na.fill() should not change the data type of column Key: SPARK-19375 URL: https://issues.apache.org/jira/browse/SPARK-19375 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 1.6.3, 1.5.2, 1.4.1, 1.3.1, 2.2.0 Reporter: Davies Liu When these functions are called, a column of numeric type will be converted into DoubleType, which will cause unexpected type casting and precision loss for LongType. {code} def fill(value: Double) def fill(value: Double, cols: Array[String]) def fill(value: Double, cols: Seq[String]) {code} Workaround: {code} fill(Map("name" -> v)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
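The precision loss mentioned in SPARK-19375 is inherent to routing a long through a double: a 64-bit IEEE double has only a 53-bit mantissa, so long values above 2^53 cannot round-trip. A plain-Python sketch of the effect (illustrative only, no Spark involved):

```python
# 2**53 is the limit of exactly-representable integers in a double.
big = 2**53 + 1
# Converting to float (a Java double) silently drops the low bit...
assert float(big) == float(2**53)
# ...so converting back does not recover the original long value.
assert int(float(big)) != big
```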
[jira] [Updated] (SPARK-19370) Flaky test: MetadataCacheSuite
[ https://issues.apache.org/jira/browse/SPARK-19370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-19370: --- Affects Version/s: 2.1.0 > Flaky test: MetadataCacheSuite > -- > > Key: SPARK-19370 > URL: https://issues.apache.org/jira/browse/SPARK-19370 > Project: Spark > Issue Type: Test >Affects Versions: 2.1.0 >Reporter: Davies Liu >Assignee: Shixiong Zhu > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull > {code} > MetadataCacheSuite: > - SPARK-16336 Suggest doing table refresh when encountering > FileNotFoundException > Exception encountered when invoking run on a nested suite - There are 1 > possibly leaked file streams. *** ABORTED *** > java.lang.RuntimeException: There are 1 possibly leaked file streams. > at > org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) > at > org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) > at > org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19370) Flaky test: MetadataCacheSuite
Davies Liu created SPARK-19370: -- Summary: Flaky test: MetadataCacheSuite Key: SPARK-19370 URL: https://issues.apache.org/jira/browse/SPARK-19370 Project: Spark Issue Type: Test Reporter: Davies Liu Assignee: Shixiong Zhu {code} MetadataCacheSuite: - SPARK-16336 Suggest doing table refresh when encountering FileNotFoundException Exception encountered when invoking run on a nested suite - There are 1 possibly leaked file streams. *** ABORTED *** java.lang.RuntimeException: There are 1 possibly leaked file streams. at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) at org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) at org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19370) Flaky test: MetadataCacheSuite
[ https://issues.apache.org/jira/browse/SPARK-19370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-19370: --- Description: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull {code} MetadataCacheSuite: - SPARK-16336 Suggest doing table refresh when encountering FileNotFoundException Exception encountered when invoking run on a nested suite - There are 1 possibly leaked file streams. *** ABORTED *** java.lang.RuntimeException: There are 1 possibly leaked file streams. at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) at org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) at org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) ... {code} was: {code} MetadataCacheSuite: - SPARK-16336 Suggest doing table refresh when encountering FileNotFoundException Exception encountered when invoking run on a nested suite - There are 1 possibly leaked file streams. *** ABORTED *** java.lang.RuntimeException: There are 1 possibly leaked file streams. 
at org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) at org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) at org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) at org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) ... {code} > Flaky test: MetadataCacheSuite > -- > > Key: SPARK-19370 > URL: https://issues.apache.org/jira/browse/SPARK-19370 > Project: Spark > Issue Type: Test >Affects Versions: 2.1.0 >Reporter: Davies Liu >Assignee: Shixiong Zhu > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.4/2703/consoleFull > {code} > MetadataCacheSuite: > - SPARK-16336 Suggest doing table refresh when encountering > FileNotFoundException > Exception encountered when invoking run on a nested suite - There are 1 > possibly leaked file streams. *** ABORTED *** > java.lang.RuntimeException: There are 1 possibly leaked file streams. 
> at > org.apache.spark.DebugFilesystem$.assertNoOpenStreams(DebugFilesystem.scala:47) > at > org.apache.spark.sql.test.SharedSQLContext$class.afterEach(SharedSQLContext.scala:90) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) > at > org.apache.spark.sql.MetadataCacheSuite.afterEach(MetadataCacheSuite.scala:29) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:264) > at > org.apache.spark.sql.MetadataCacheSuite.runTest(MetadataCacheSuite.scala:29) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17912. Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 15467 [https://github.com/apache/spark/pull/15467] > Refactor code generation to get data for ColumnVector/ColumnarBatch > --- > > Key: SPARK-17912 > URL: https://issues.apache.org/jira/browse/SPARK-17912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kazuaki Ishizaki > Fix For: 3.0.0 > > > Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is > becoming pervasive. The code generation part can be reused by multiple > components (e.g. parquet reader, data cache, and so on). > This JIRA refactors the code generation part as a trait for ease of reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17912) Refactor code generation to get data for ColumnVector/ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-17912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17912: --- Fix Version/s: (was: 3.0.0) 2.2.0 > Refactor code generation to get data for ColumnVector/ColumnarBatch > --- > > Key: SPARK-17912 > URL: https://issues.apache.org/jira/browse/SPARK-17912 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kazuaki Ishizaki > Fix For: 2.2.0 > > > Code generation to get data from {{ColumnVector}} and {{ColumnarBatch}} is > becoming pervasive. The code generation part can be reused by multiple > components (e.g. parquet reader, data cache, and so on). > This JIRA refactors the code generation part as a trait for ease of reuse. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830644#comment-15830644 ] Davies Liu commented on SPARK-17602: The Python workers are reused by default; could you re-run the benchmark with worker reuse enabled? > PySpark - Performance Optimization Large Size of Broadcast Variable > --- > > Key: SPARK-17602 > URL: https://issues.apache.org/jira/browse/SPARK-17602 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.6.2, 2.0.0 > Environment: Linux >Reporter: Xiao Ming Bao > Attachments: PySpark – Performance Optimization for Large Size of > Broadcast variable.pdf > > Original Estimate: 120h > Remaining Estimate: 120h > > Problem: currently, at the executor side, the broadcast variable is written to > disk as a file, and each Python worker process reads the broadcast variable from local disk and > deserializes it to a Python object before executing a task; when the broadcast > variable is large, this read/deserialization takes a lot of time. > And when the Python worker is NOT reused and the number of tasks is large, > performance is very poor since the worker needs to > read and deserialize the variable for each task. > Brief of the solution: > transfer the broadcast variable to the daemon Python process via file (or > socket/mmap) and deserialize the file to an object in the daemon process; after > a worker Python process is forked by the daemon process, the worker > automatically has the deserialized object and can use it directly because > of Linux's copy-on-write memory mechanism. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
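The proposal in SPARK-17602 relies on standard fork semantics: deserialize once in the daemon, and forked workers inherit the object through copy-on-write pages. A minimal Unix-only sketch of that mechanism (plain {{os.fork}} and {{pickle}}; this is not actual PySpark code):

```python
import os
import pickle

# Daemon side: deserialize the "broadcast value" once, before forking.
serialized = pickle.dumps(list(range(100000)))
value = pickle.loads(serialized)

pid = os.fork()
if pid == 0:
    # Worker side: `value` is inherited through copy-on-write pages,
    # so no per-task read/deserialization is needed.
    os._exit(0 if value[-1] == 99999 else 1)

# Daemon side: reap the worker and check it saw the object.
_, status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(status)
```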
[jira] [Resolved] (SPARK-19019) PySpark does not work with Python 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-19019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19019. Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16429 [https://github.com/apache/spark/pull/16429] > PySpark does not work with Python 3.6.0 > --- > > Key: SPARK-19019 > URL: https://issues.apache.org/jira/browse/SPARK-19019 > Project: Spark > Issue Type: Bug > Components: PySpark >Reporter: Hyukjin Kwon >Priority: Critical > Fix For: 2.1.1, 2.2.0 > > > Currently, PySpark does not work with Python 3.6.0. > Running {{./bin/pyspark}} simply throws the error as below: > {code} > Traceback (most recent call last): > File ".../spark/python/pyspark/shell.py", line 30, in > import pyspark > File ".../spark/python/pyspark/__init__.py", line 46, in > from pyspark.context import SparkContext > File ".../spark/python/pyspark/context.py", line 36, in > from pyspark.java_gateway import launch_gateway > File ".../spark/python/pyspark/java_gateway.py", line 31, in > from py4j.java_gateway import java_import, JavaGateway, GatewayClient > File "", line 961, in _find_and_load > File "", line 950, in _find_and_load_unlocked > File "", line 646, in _load_unlocked > File "", line 616, in _load_backward_compatible > File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line > 18, in > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", > line 62, in > import pkgutil > File > "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", > line 22, in > ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg') > File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple > cls = _old_namedtuple(*args, **kwargs) > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > The problem is in > 
https://github.com/apache/spark/blob/3c68944b229aaaeeaee3efcbae3e3be9a2914855/python/pyspark/serializers.py#L386-L394 > as the error says, and the cause seems to be that the arguments of > {{namedtuple}} are now completely keyword-only arguments from Python 3.6.0 > (See https://bugs.python.org/issue25628). > We currently copy this function via {{types.FunctionType}}, which does not set > the default values of keyword-only arguments (meaning > {{namedtuple.__kwdefaults__}}), and this seems to cause internally missing > values in the function (non-bound arguments). > This ends up as below: > {code} > import types > import collections > def _copy_func(f): > return types.FunctionType(f.__code__, f.__globals__, f.__name__, > f.__defaults__, f.__closure__) > _old_namedtuple = _copy_func(collections.namedtuple) > {code} > If we call as below: > {code} > >>> _old_namedtuple("a", "b") > Traceback (most recent call last): > File "", line 1, in > TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', > 'rename', and 'module' > {code} > It throws an exception as above because {{__kwdefaults__}} for the required > keyword arguments seems unset in the copied function. So, if we give explicit > values for these, > {code} > >>> _old_namedtuple("a", "b", verbose=False, rename=False, module=None) > > {code} > It works fine. > It seems we should now properly set these on the hijacked function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
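The essence of the fix is to carry {{__kwdefaults__}} over when copying the function. A simplified sketch of the idea (not the literal patch; on Python 3.7+ the keyword-only parameters are {{rename}}/{{defaults}}/{{module}} rather than 3.6's {{verbose}}/{{rename}}/{{module}}, but the mechanism is the same):

```python
import collections
import types

def _copy_func(f):
    fn = types.FunctionType(f.__code__, f.__globals__, f.__name__,
                            f.__defaults__, f.__closure__)
    # The crucial addition: without this line the keyword-only arguments
    # of namedtuple have no defaults in the copy, and any call that omits
    # them raises TypeError on Python 3.6+.
    fn.__kwdefaults__ = f.__kwdefaults__
    return fn

_old_namedtuple = _copy_func(collections.namedtuple)
# Now a plain positional call works again.
Point = _old_namedtuple("Point", "x y")
```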
[jira] [Resolved] (SPARK-19180) the offset of short is 4 in OffHeapColumnVector's putShorts
[ https://issues.apache.org/jira/browse/SPARK-19180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19180. Resolution: Fixed Fix Version/s: 2.0.3 2.1.1 Issue resolved by pull request 16555 [https://github.com/apache/spark/pull/16555] > the offset of short is 4 in OffHeapColumnVector's putShorts > --- > > Key: SPARK-19180 > URL: https://issues.apache.org/jira/browse/SPARK-19180 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: yucai > Fix For: 2.1.1, 2.0.3, 2.2.0 > > > the offset of short is 4 in OffHeapColumnVector's putShorts, actually it > should be 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
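The SPARK-19180 bug is a wrong element stride: a JVM short occupies 2 bytes, so element i must land at byte offset 2*i, while {{putShorts}} used a stride of 4 (the int stride). A hedged Python illustration of the correct layout ({{put_shorts}} is a made-up helper mimicking the off-heap write, not Spark code):

```python
import struct

def put_shorts(buf, row_id, values):
    # Correct stride: each short is 2 bytes, so element i of the batch
    # lives at byte offset 2 * (row_id + i). The bug multiplied by 4.
    for i, v in enumerate(values):
        struct.pack_into("<h", buf, 2 * (row_id + i), v)

buf = bytearray(8)          # room for exactly four 2-byte shorts
put_shorts(buf, 0, [10, 20, 30, 40])
values = list(struct.unpack_from("<4h", buf))   # reads back [10, 20, 30, 40]
```

With the buggy stride of 4, the same four writes would need 16 bytes and would leave 2-byte gaps of garbage between elements when read back at the proper short stride.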
[jira] [Updated] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
[ https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18589: --- Priority: Critical (was: Minor) > persist() resolves "java.lang.RuntimeException: Invalid PythonUDF > (...), requires attributes from more than one child" > -- > > Key: SPARK-18589 > URL: https://issues.apache.org/jira/browse/SPARK-18589 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Davies Liu >Priority: Critical > > Smells like another optimizer bug that's similar to SPARK-17100 and > SPARK-18254. I'm seeing this on 2.0.2 and on master at commit > {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}. > I don't have a minimal repro for this yet, but the error I'm seeing is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling o247.count. > : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires > attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.sc
[jira] [Assigned] (SPARK-18589) persist() resolves "java.lang.RuntimeException: Invalid PythonUDF (...), requires attributes from more than one child"
[ https://issues.apache.org/jira/browse/SPARK-18589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-18589: -- Assignee: Davies Liu > persist() resolves "java.lang.RuntimeException: Invalid PythonUDF > (...), requires attributes from more than one child" > -- > > Key: SPARK-18589 > URL: https://issues.apache.org/jira/browse/SPARK-18589 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Davies Liu >Priority: Minor > > Smells like another optimizer bug that's similar to SPARK-17100 and > SPARK-18254. I'm seeing this on 2.0.2 and on master at commit > {{fb07bbe575aabe68422fd3a31865101fb7fa1722}}. > I don't have a minimal repro for this yet, but the error I'm seeing is: > {code} > py4j.protocol.Py4JJavaError: An error occurred while calling o247.count. > : java.lang.RuntimeException: Invalid PythonUDF <...>(...), requires > attributes from more than one child. 
> at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:150) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract$2.apply(ExtractPythonUDFs.scala:149) > at scala.collection.immutable.Stream.foreach(Stream.scala:594) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.org$apache$spark$sql$execution$python$ExtractPythonUDFs$$extract(ExtractPythonUDFs.scala:149) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:114) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$$anonfun$apply$2.applyOrElse(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:113) > at > org.apache.spark.sql.execution.python.ExtractPythonUDFs$.apply(ExtractPythonUDFs.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$prepareForExecution$1.apply(QueryExecution.scala:93) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.execution.QueryExecution.prepareForExecution(QueryExecution.scala:93)
[jira] [Resolved] (SPARK-18281) toLocalIterator yields time out error on pyspark2
[ https://issues.apache.org/jira/browse/SPARK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-18281. Resolution: Fixed Fix Version/s: 2.0.3 2.1.1 Issue resolved by pull request 16263 [https://github.com/apache/spark/pull/16263] > toLocalIterator yields time out error on pyspark2 > - > > Key: SPARK-18281 > URL: https://issues.apache.org/jira/browse/SPARK-18281 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.1 > Environment: Ubuntu 14.04.5 LTS > Driver: AWS M4.XLARGE > Slaves: AWS M4.4.XLARGE > mesos 1.0.1 > spark 2.0.1 > pyspark >Reporter: Luke Miner > Fix For: 2.1.1, 2.0.3 > > > I run the example straight out of the api docs for toLocalIterator and it > gives a timeout exception: > {code} > from pyspark import SparkContext > sc = SparkContext() > rdd = sc.parallelize(range(10)) > [x for x in rdd.toLocalIterator()] > {code} > conf file: > spark.driver.maxResultSize 6G > spark.executor.extraJavaOptions -XX:+UseG1GC -XX:MaxPermSize=1G > -XX:+HeapDumpOnOutOfMemoryError > spark.executor.memory 16G > spark.executor.uri foo/spark-2.0.1-bin-hadoop2.7.tgz > spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem > spark.hadoop.fs.s3a.buffer.dir /raid0/spark > spark.hadoop.fs.s3n.buffer.dir /raid0/spark > spark.hadoop.fs.s3a.connection.timeout 50 > spark.hadoop.fs.s3n.multipart.uploads.enabled true > spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2 > spark.hadoop.parquet.block.size 2147483648 > spark.hadoop.parquet.enable.summary-metadata false > spark.jars.packages > com.databricks:spark-avro_2.11:3.0.1,com.amazonaws:aws-java-sdk-pom:1.10.34 > spark.local.dir /raid0/spark > spark.mesos.coarse false > spark.mesos.constraints priority:1 > spark.network.timeout 600 > spark.rpc.message.maxSize 500 > spark.speculation false > spark.sql.parquet.mergeSchema false > spark.sql.planner.externalSort true > spark.submit.deployMode client > spark.task.cpus 1 > Exception here: 
{code} > --- > timeout Traceback (most recent call last) > in () > 2 sc = SparkContext() > 3 rdd = sc.parallelize(range(10)) > > 4 [x for x in rdd.toLocalIterator()] > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/rdd.pyc in > _load_from_socket(port, serializer) > 140 try: > 141 rf = sock.makefile("rb", 65536) > --> 142 for item in serializer.load_stream(rf): > 143 yield item > 144 finally: > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > load_stream(self, stream) > 137 while True: > 138 try: > --> 139 yield self._read_with_length(stream) > 140 except EOFError: > 141 return > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > _read_with_length(self, stream) > 154 > 155 def _read_with_length(self, stream): > --> 156 length = read_int(stream) > 157 if length == SpecialLengths.END_OF_DATA_SECTION: > 158 raise EOFError > /foo/spark-2.0.1-bin-hadoop2.7/python/pyspark/serializers.pyc in > read_int(stream) > 541 > 542 def read_int(stream): > --> 543 length = stream.read(4) > 544 if not length: > 545 raise EOFError > /usr/lib/python2.7/socket.pyc in read(self, size) > 378 # fragmentation issues on many platforms. > 379 try: > --> 380 data = self._sock.recv(left) > 381 except error, e: > 382 if e.args[0] == EINTR: > timeout: timed out > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
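The traceback above bottoms out in PySpark's length-prefixed framing: {{read_int}} blocks on a 4-byte big-endian length header, and the socket timeout fires before the JVM writes the next frame. A self-contained sketch of that framing, mirroring the shape of the code in {{pyspark/serializers.py}} (here fed from an in-memory stream instead of a socket):

```python
import io
import struct

def read_int(stream):
    # Each frame is preceded by a 4-byte big-endian signed length header.
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError
    return struct.unpack("!i", data)[0]

# Simulate one frame carrying the 5-byte payload b"hello".
stream = io.BytesIO(struct.pack("!i", 5) + b"hello")
length = read_int(stream)
payload = stream.read(length)
```

On a real socket, `stream.read(4)` is where the reported `timeout: timed out` surfaces whenever the JVM side has not yet produced the next partition.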
[jira] [Comment Edited] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x
[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746370#comment-15746370 ] Davies Liu edited comment on SPARK-18676 at 12/13/16 9:47 PM: -- I had a working prototype, but it introduces some weird behavior; for example, the actual plan will not match the one shown in explain or the web UI. Currently, I'm still not sure whether that's the right direction or not. was (Author: davies): I had a working prototype, but in introduce some weird behavior, for example, the actual plan will not match the one showed in explain or web ui. Currently, I'm still not sure that the right direction or not. > Spark 2.x query plan data size estimation can crash join queries versus 1.x > --- > > Key: SPARK-18676 > URL: https://issues.apache.org/jira/browse/SPARK-18676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Michael Allman > > Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly > modified the way Spark SQL estimates the output data size of query plans. > I've found that—with the new table query partition pruning support in > 2.1—this has led in some cases to underestimation of join plan child size > statistics to a degree that makes executing such queries impossible without > disabling automatic broadcast conversion. > In one case we debugged, the query planner had estimated the size of a join > child to be 3,854 bytes. In the execution of this child query, Spark reads 20 > million rows in 1 GB of data from parquet files and shuffles 722.9 MB of > data, outputting 17 million rows. In planning the original join query, Spark > converts the child to a {{BroadcastExchange}}. This query execution fails > unless automatic broadcast conversion is disabled. > This particular query is complex and very specific to our data and schema. I > have not yet developed a reproducible test case that can be shared. 
I realize > this ticket does not give the Spark team a lot to work with to reproduce and > test this issue, but I'm available to help. At the moment I can suggest > running a join where one side is an aggregation selecting a few fields over a > large table with a wide schema including many string columns. > This issue exists in Spark 2.0, but we never encountered it because in that > version it only manifests itself for partitioned relations read from the > filesystem, and we rarely use this feature. We've encountered this issue in > 2.1 because 2.1 does partition pruning for metastore tables now. > As a back stop, we've patched our branch of Spark 2.1 to revert the > reductions in default data type size for string, binary and user-defined > types. We also removed the override of the statistics method in {{UnaryNode}} > which reduces the output size of a plan based on the ratio of that plan's > output schema size versus its children's. We have not had this problem since. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
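The failure mode described above comes down to a threshold comparison on an estimated size: when the planner's estimate for a join child falls below the auto-broadcast threshold, the child is broadcast regardless of how large it actually is. A minimal sketch of that decision, assuming the 10 MB default of {{spark.sql.autoBroadcastJoinThreshold}} (the function name is illustrative, not Spark's API):

```python
# Illustrative sketch of a size-based join strategy choice; the real logic
# lives in Spark's JoinSelection strategy, not in a function like this.
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, the Spark default

def choose_join_strategy(estimated_child_size_bytes: int) -> str:
    """Pick a join strategy from a child plan's *estimated* output size."""
    if estimated_child_size_bytes <= AUTO_BROADCAST_THRESHOLD:
        return "broadcast-hash-join"
    return "shuffle-join"

# The case debugged in this ticket: the estimate said 3,854 bytes,
# while the child actually produced ~1 GB of data.
print(choose_join_strategy(3_854))        # broadcast chosen on the bad estimate
print(choose_join_strategy(1 * 1024**3))  # what the true size would have chosen
```

Because only the estimate feeds the decision, an underestimate of five orders of magnitude flips the strategy, which is exactly the crash described here.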
[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x
[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746370#comment-15746370 ] Davies Liu commented on SPARK-18676: I had a working prototype, but it introduces some weird behavior: for example, the actual plan will not match the one shown in explain or the web UI. Currently, I'm still not sure whether that's the right direction or not. > Spark 2.x query plan data size estimation can crash join queries versus 1.x > --- > > Key: SPARK-18676 > URL: https://issues.apache.org/jira/browse/SPARK-18676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Michael Allman > > Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly > modified the way Spark SQL estimates the output data size of query plans. > I've found that—with the new table query partition pruning support in > 2.1—this has led in some cases to underestimation of join plan child size > statistics to a degree that makes executing such queries impossible without > disabling automatic broadcast conversion. > In one case we debugged, the query planner had estimated the size of a join > child to be 3,854 bytes. In the execution of this child query, Spark reads 20 > million rows in 1 GB of data from parquet files and shuffles 722.9 MB of > data, outputting 17 million rows. In planning the original join query, Spark > converts the child to a {{BroadcastExchange}}. This query execution fails > unless automatic broadcast conversion is disabled. > This particular query is complex and very specific to our data and schema. I > have not yet developed a reproducible test case that can be shared. I realize > this ticket does not give the Spark team a lot to work with to reproduce and > test this issue, but I'm available to help. 
At the moment I can suggest > running a join where one side is an aggregation selecting a few fields over a > large table with a wide schema including many string columns. > This issue exists in Spark 2.0, but we never encountered it because in that > version it only manifests itself for partitioned relations read from the > filesystem, and we rarely use this feature. We've encountered this issue in > 2.1 because 2.1 does partition pruning for metastore tables now. > As a back stop, we've patched our branch of Spark 2.1 to revert the > reductions in default data type size for string, binary and user-defined > types. We also removed the override of the statistics method in {{UnaryNode}} > which reduces the output size of a plan based on the ratio of that plan's > output schema size versus its children's. We have not had this problem since. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x
[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15733569#comment-15733569 ] Davies Liu commented on SPARK-18676: Yes, it can; see WholeStageCodegen.doExecute() as an example. The physical plan is used to generate the RDD (which is the actual physical plan); we can build another SparkPlan to generate the RDD in the middle. > Spark 2.x query plan data size estimation can crash join queries versus 1.x > --- > > Key: SPARK-18676 > URL: https://issues.apache.org/jira/browse/SPARK-18676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Michael Allman > > Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly > modified the way Spark SQL estimates the output data size of query plans. > I've found that—with the new table query partition pruning support in > 2.1—this has led in some cases to underestimation of join plan child size > statistics to a degree that makes executing such queries impossible without > disabling automatic broadcast conversion. > In one case we debugged, the query planner had estimated the size of a join > child to be 3,854 bytes. In the execution of this child query, Spark reads 20 > million rows in 1 GB of data from parquet files and shuffles 722.9 MB of > data, outputting 17 million rows. In planning the original join query, Spark > converts the child to a {{BroadcastExchange}}. This query execution fails > unless automatic broadcast conversion is disabled. > This particular query is complex and very specific to our data and schema. I > have not yet developed a reproducible test case that can be shared. I realize > this ticket does not give the Spark team a lot to work with to reproduce and > test this issue, but I'm available to help. 
At the moment I can suggest > running a join where one side is an aggregation selecting a few fields over a > large table with a wide schema including many string columns. > This issue exists in Spark 2.0, but we never encountered it because in that > version it only manifests itself for partitioned relations read from the > filesystem, and we rarely use this feature. We've encountered this issue in > 2.1 because 2.1 does partition pruning for metastore tables now. > As a back stop, we've patched our branch of Spark 2.1 to revert the > reductions in default data type size for string, binary and user-defined > types. We also removed the override of the statistics method in {{UnaryNode}} > which reduces the output size of a plan based on the ratio of that plan's > output schema size versus its children's. We have not had this problem since. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16589: --- Fix Version/s: (was: 2.1.0) 2.1.1 > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Andrew Ray > Labels: correctness > Fix For: 2.0.3, 2.1.1 > > > Chaining cartesian calls in PySpark results in the number of records lower > than expected. It can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after initial > cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
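The expected counts in the reproduction above are easy to verify outside Spark: a three-way cartesian product of a 10-element set has 10^3 = 1000 tuples, all distinct, so the 355 that the chained {{cartesian}} returns is clearly wrong. A plain-Python check of the correct answer:

```python
from itertools import product

data = range(10)

# Pure-Python analogue of rdd.cartesian(rdd).cartesian(rdd):
# the result elements have the nested shape ((a, b), c).
triples = [((a, b), c) for (a, b), c in product(product(data, data), data)]

print(len(triples))        # the count Spark should return: 10 * 10 * 10
print(len(set(triples)))   # all triples are distinct, so distinct() should match
```

This matches the {{## 1000}} results the reporter gets once a reserialization or identity {{map}} is inserted, confirming the bug is in how PySpark handles the serialization of chained cartesian results rather than in the expected arithmetic.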
[jira] [Resolved] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16589. Resolution: Fixed Fix Version/s: 2.1.0 2.0.3 > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Andrew Ray > Labels: correctness > Fix For: 2.0.3, 2.1.0 > > > Chaining cartesian calls in PySpark results in the number of records lower > than expected. It can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after initial > cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16589) Chained cartesian produces incorrect number of records
[ https://issues.apache.org/jira/browse/SPARK-16589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16589: --- Assignee: Andrew Ray > Chained cartesian produces incorrect number of records > -- > > Key: SPARK-16589 > URL: https://issues.apache.org/jira/browse/SPARK-16589 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0 >Reporter: Maciej Szymkiewicz >Assignee: Andrew Ray > Labels: correctness > > Chaining cartesian calls in PySpark results in the number of records lower > than expected. It can be reproduced as follows: > {code} > rdd = sc.parallelize(range(10), 1) > rdd.cartesian(rdd).cartesian(rdd).count() > ## 355 > rdd.cartesian(rdd).cartesian(rdd).distinct().count() > ## 251 > {code} > It looks like it is related to serialization. If we reserialize after initial > cartesian: > {code} > rdd.cartesian(rdd)._reserialize(BatchedSerializer(PickleSerializer(), > 1)).cartesian(rdd).count() > ## 1000 > {code} > or insert identity map: > {code} > rdd.cartesian(rdd).map(lambda x: x).cartesian(rdd).count() > ## 1000 > {code} > it yields correct results. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18676) Spark 2.x query plan data size estimation can crash join queries versus 1.x
[ https://issues.apache.org/jira/browse/SPARK-18676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726247#comment-15726247 ] Davies Liu commented on SPARK-18676: What do the schema and plan of the child look like? It's possible that the schema of the parquet file is wide and highly compressed, and only a few columns are used in the query; then the estimation will be much smaller than the actual data size. The estimation for strings could also be wrong. Using the on-disk bytes of the parquet file as the metric for broadcasting is bad; we also saw cases where the parquet file is only a few MB, but the broadcast is a few GB. The estimation could easily be wrong for many reasons; maybe we could switch to a shuffle join when we realize that the actual data is larger than estimated. Would that work? > Spark 2.x query plan data size estimation can crash join queries versus 1.x > --- > > Key: SPARK-18676 > URL: https://issues.apache.org/jira/browse/SPARK-18676 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Michael Allman > > Commit [c481bdf|https://github.com/apache/spark/commit/c481bdf] significantly > modified the way Spark SQL estimates the output data size of query plans. > I've found that—with the new table query partition pruning support in > 2.1—this has led in some cases to underestimation of join plan child size > statistics to a degree that makes executing such queries impossible without > disabling automatic broadcast conversion. > In one case we debugged, the query planner had estimated the size of a join > child to be 3,854 bytes. In the execution of this child query, Spark reads 20 > million rows in 1 GB of data from parquet files and shuffles 722.9 MB of > data, outputting 17 million rows. In planning the original join query, Spark > converts the child to a {{BroadcastExchange}}. This query execution fails > unless automatic broadcast conversion is disabled. 
> This particular query is complex and very specific to our data and schema. I > have not yet developed a reproducible test case that can be shared. I realize > this ticket does not give the Spark team a lot to work with to reproduce and > test this issue, but I'm available to help. At the moment I can suggest > running a join where one side is an aggregation selecting a few fields over a > large table with a wide schema including many string columns. > This issue exists in Spark 2.0, but we never encountered it because in that > version it only manifests itself for partitioned relations read from the > filesystem, and we rarely use this feature. We've encountered this issue in > 2.1 because 2.1 does partition pruning for metastore tables now. > As a back stop, we've patched our branch of Spark 2.1 to revert the > reductions in default data type size for string, binary and user-defined > types. We also removed the override of the statistics method in `LogicalPlan` > which reduces the output size of a plan based on the ratio of that plan's > output schema size versus its children's. We have not had this problem since. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18719) Document spark.ui.showConsoleProgress
[ https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18719: --- Assignee: Nicholas > Document spark.ui.showConsoleProgress > - > > Key: SPARK-18719 > URL: https://issues.apache.org/jira/browse/SPARK-18719 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Assignee: Nicholas >Priority: Minor > Fix For: 2.2.0 > > > There is currently no documentation for {{spark.ui.showConsoleProgress}}. The > only way to find out about it is via Stack Overflow or by searching through > the code. > We should add documentation for this setting to [the config table on our > Configuration > docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18719) Document spark.ui.showConsoleProgress
[ https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-18719. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16151 [https://github.com/apache/spark/pull/16151] > Document spark.ui.showConsoleProgress > - > > Key: SPARK-18719 > URL: https://issues.apache.org/jira/browse/SPARK-18719 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Priority: Minor > Fix For: 2.2.0 > > > There is currently no documentation for {{spark.ui.showConsoleProgress}}. The > only way to find out about it is via Stack Overflow or by searching through > the code. > We should add documentation for this setting to [the config table on our > Configuration > docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18719) Document spark.ui.showConsoleProgress
[ https://issues.apache.org/jira/browse/SPARK-18719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18719: --- Assignee: Nicholas Chammas (was: Nicholas) > Document spark.ui.showConsoleProgress > - > > Key: SPARK-18719 > URL: https://issues.apache.org/jira/browse/SPARK-18719 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > Fix For: 2.2.0 > > > There is currently no documentation for {{spark.ui.showConsoleProgress}}. The > only way to find out about it is via Stack Overflow or by searching through > the code. > We should add documentation for this setting to [the config table on our > Configuration > docs|https://spark.apache.org/docs/latest/configuration.html#spark-ui]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18188) Add checksum for block of broadcast
[ https://issues.apache.org/jira/browse/SPARK-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18188: --- Description: There has been a long-standing issue: https://issues.apache.org/jira/browse/SPARK-4105; without any checksum for the blocks, it's very hard for us to identify where the bug came from. Shuffle blocks are compressed separately (and have a checksum in them), but broadcast blocks are compressed together; we should add a checksum for each of them separately. was: There is an understanding issue for a long time: https://issues.apache.org/jira/browse/SPARK-4105, without any checksum for the blocks, it's very hard for us to identify where is the bug came from. We should have a way the check a block from remote node or disk that is correct or not. > Add checksum for block of broadcast > --- > > Key: SPARK-18188 > URL: https://issues.apache.org/jira/browse/SPARK-18188 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > There has been a long-standing issue: > https://issues.apache.org/jira/browse/SPARK-4105; without any checksum for > the blocks, it's very hard for us to identify where the bug came from. > Shuffle blocks are compressed separately (and have a checksum in them), but > broadcast blocks are compressed together; we should add a checksum for each > of them separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
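The per-block checksum proposed in SPARK-18188 can be illustrated with a small stdlib sketch, with zlib's CRC32 standing in for whichever checksum Spark ultimately uses (the function names here are illustrative, not Spark's API): record a checksum when a block is written, verify it when the block is fetched, and surface corruption explicitly instead of as a mysterious decompression failure.

```python
import zlib

def make_block(payload: bytes) -> tuple[int, bytes]:
    """Compress a block and record its CRC32, as a block writer might."""
    compressed = zlib.compress(payload)
    return zlib.crc32(compressed), compressed

def verify_block(checksum: int, compressed: bytes) -> bytes:
    """Check the checksum before decompressing; raise a clear error on mismatch."""
    if zlib.crc32(compressed) != checksum:
        raise IOError("block checksum mismatch: corrupted in transit or on disk")
    return zlib.decompress(compressed)

checksum, block = make_block(b"broadcast variable contents")
assert verify_block(checksum, block) == b"broadcast variable contents"

# Flip one byte to simulate the kind of corruption SPARK-4105 reports.
corrupted = bytes([block[0] ^ 0xFF]) + block[1:]
try:
    verify_block(checksum, corrupted)
except IOError as exc:
    print(exc)  # the corruption is named, and caught before decompression
```

The point of the ticket is exactly this diagnostic value: with a checksum, a corrupted broadcast block is identified as such at the fetch boundary, rather than surfacing later as an opaque codec error.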
[jira] [Updated] (SPARK-18188) Add checksum for block of broadcast
[ https://issues.apache.org/jira/browse/SPARK-18188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18188: --- Summary: Add checksum for block of broadcast (was: Add checksum for block in Spark) > Add checksum for block of broadcast > --- > > Key: SPARK-18188 > URL: https://issues.apache.org/jira/browse/SPARK-18188 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu >Assignee: Davies Liu > > There has been a long-standing issue: > https://issues.apache.org/jira/browse/SPARK-4105; without any checksum for > the blocks, it's very hard for us to identify where the bug came from. > We should have a way to check whether a block read from a remote node or > disk is correct or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-4105: - Assignee: Davies Liu (was: Josh Rosen) > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Davies Liu >Priority: Critical > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > Here's another occurrence of a similar error: > {co
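The {{FAILED_TO_UNCOMPRESS(5)}} stack trace above is what a damaged compressed block looks like when there is no checksum: the damage is only discovered when the codec chokes on the data, far from wherever the corruption actually happened. The same failure mode is reproducible with zlib (standing in for Snappy, which is not in the Python stdlib), here using a block cut short as a failed fetch might leave it:

```python
import zlib

# An intact compressed block round-trips cleanly.
payload = b"shuffle block payload " * 100
block = zlib.compress(payload)
assert zlib.decompress(block) == payload

# Simulate a block truncated in transit; decompression is where it fails.
truncated = block[:len(block) // 2]
try:
    zlib.decompress(truncated)
except zlib.error as exc:
    # Analogous to snappy's FAILED_TO_UNCOMPRESS(5): without a checksum,
    # the error surfaces only at decompression time, with no hint of whether
    # the fetch, the disk, or the writer was at fault.
    print("decompression failed:", exc)
```

This is precisely the diagnostic gap that the checksum work in SPARK-18188 is meant to close.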
[jira] [Commented] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt
[ https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654590#comment-15654590 ] Davies Liu commented on SPARK-18097: I have no idea why the schema is corrupt, we could catch the exception and try to drop the table blindly. > Can't drop a table from Hive if the schema is corrupt > - > > Key: SPARK-18097 > URL: https://issues.apache.org/jira/browse/SPARK-18097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > > When the schema of Hive table is broken, we can't drop the table using Spark > SQL, for example > {code} > Error in SQL statement: QueryExecutionException: FAILED: > IllegalArgumentException Error: > expected at the position 10 of > 'ss:string:struct<>' but ':' is found. > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:336) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:480) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) > at > org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754) > at > org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:104) > at > 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339) > at > org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288) > at > org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194) > at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:353) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:351) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:351) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:228) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72) > at > org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:227) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:255) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:126) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:267) > at > org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:753) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationS
[jira] [Resolved] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-18254. Resolution: Fixed Assignee: Eyal Farago (was: Davies Liu) > UDFs don't see aliased column names > --- > > Key: SPARK-18254 > URL: https://issues.apache.org/jira/browse/SPARK-18254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Eyal Farago > Labels: correctness > > Dunno if I'm misinterpreting something here, but this seems like a bug in how > UDFs work, or in how they interface with the optimizer. > Here's a basic reproduction. I'm using {{length_udf()}} just for > illustration; it could be any UDF that accesses fields that have been aliased. > {code} > import pyspark > from pyspark.sql import Row > from pyspark.sql.functions import udf, col, struct > def length(full_name): > # The non-aliased names, FIRST and LAST, show up here, instead of > # first_name and last_name. > # print(full_name) > return len(full_name.first_name) + len(full_name.last_name) > if __name__ == '__main__': > spark = ( > pyspark.sql.SparkSession.builder > .getOrCreate()) > length_udf = udf(length) > names = spark.createDataFrame([ > Row(FIRST='Nick', LAST='Chammas'), > Row(FIRST='Walter', LAST='Williams'), > ]) > names_cleaned = ( > names > .select( > col('FIRST').alias('first_name'), > col('LAST').alias('last_name'), > ) > .withColumn('full_name', struct('first_name', 'last_name')) > .select('full_name')) > # We see the schema we expect here. > names_cleaned.printSchema() > # However, here we get an AttributeError. length_udf() cannot > # find first_name or last_name. 
> (names_cleaned > .withColumn('length', length_udf('full_name')) > .show()) > {code} > When I run this I get a long stack trace, but the relevant portions seem to > be: > {code} > File ".../udf-alias.py", line 10, in length > return len(full_name.first_name) + len(full_name.last_name) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1502, in __getattr__ > raise AttributeError(item) > AttributeError: first_name > {code} > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1497, in __getattr__ > idx = self.__fields__.index(item) > ValueError: 'first_name' is not in list > {code} > Here are the relevant execution plans: > {code} > names_cleaned.explain() > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > {code} > (names_cleaned > .withColumn('length', length_udf('full_name')) > .explain()) > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10, pythonUDF0#21 AS length#17] > +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, > pythonUDF0#21] >+- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > It looks like from the second execution plan that {{BatchEvalPython}} somehow > gets the unaliased column names, whereas the {{Project}} right above it gets > the aliased names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
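The {{AttributeError}} in the reproduction arises because the Row the UDF receives carries the original, pre-alias field names ({{FIRST}}/{{LAST}}) while the UDF looks fields up by the aliased names. On affected versions, one defensive workaround is to access struct fields by position rather than by name. The mechanics can be sketched with a stdlib {{namedtuple}} standing in for {{pyspark.sql.Row}} (which also supports positional indexing):

```python
from collections import namedtuple

# Stand-in for the Row the UDF actually receives: its field names are the
# pre-alias ones (FIRST/LAST), not first_name/last_name.
Row = namedtuple("Row", ["FIRST", "LAST"])
full_name = Row(FIRST="Nick", LAST="Chammas")

def length_by_name(row):
    # Mirrors the failing UDF: looks up aliased names the row doesn't have.
    return len(row.first_name) + len(row.last_name)

def length_by_position(row):
    # Positional access sidesteps the name mismatch entirely.
    return len(row[0]) + len(row[1])

try:
    length_by_name(full_name)
except AttributeError as exc:
    print("fails as in the ticket:", exc)

print(length_by_position(full_name))  # 4 + 7 = 11
```

This is a workaround sketch, not the fix itself; the resolution assigned to Eyal Farago addresses the name mismatch in Spark so that aliased names reach the UDF.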
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
    [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634416#comment-15634416 ]

Davies Liu commented on SPARK-18254:
------------------------------------

Could you also try 2.0.2?
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
    [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634333#comment-15634333 ]

Davies Liu commented on SPARK-18254:
------------------------------------

I tried the following in master (2.1), and it works:

{code}
from pyspark.sql.functions import udf, col, struct
from pyspark.sql.types import IntegerType

myadd = udf(lambda s: s.a + s.b, IntegerType())
df = spark.range(10).selectExpr("id as a", "id as b") \
    .select(struct(col("a"), col("b")).alias('s'))
df = df.select(df.s, myadd(df.s).alias("a"))
df.explain(True)
rs = df.collect()
{code}

[~nchammas] Could you also try yours on master?
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
    [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634228#comment-15634228 ]

Davies Liu commented on SPARK-18254:
------------------------------------

I suspect it's a bug in ExtractPythonUDFs, not in operator push down; not verified yet.
[jira] [Assigned] (SPARK-18254) UDFs don't see aliased column names
     [ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu reassigned SPARK-18254:
----------------------------------
    Assignee: Davies Liu
[jira] [Created] (SPARK-18233) Failed to deserialize the task
Davies Liu created SPARK-18233:
-------------------------------
             Summary: Failed to deserialize the task
                 Key: SPARK-18233
                 URL: https://issues.apache.org/jira/browse/SPARK-18233
             Project: Spark
          Issue Type: Bug
            Reporter: Davies Liu

{code}
16/11/02 18:36:32 ERROR Executor: Exception in task 652.0 in stage 27.0 (TID 21101)
java.io.InvalidClassException: org.apache.spark.executor.TaskMet; serializable and externalizable flags conflict
        at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:698)
        at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:831)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1602)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:253)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Closed] (SPARK-4549) Support BigInt -> Decimal in convertToCatalyst in SparkSQL
     [ https://issues.apache.org/jira/browse/SPARK-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu closed SPARK-4549.
-----------------------------
    Resolution: Incomplete

> Support BigInt -> Decimal in convertToCatalyst in SparkSQL
> ----------------------------------------------------------
>
>                 Key: SPARK-4549
>                 URL: https://issues.apache.org/jira/browse/SPARK-4549
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Jianshi Huang
>            Priority: Minor
>
> Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt to Decimal.
> Jianshi
[jira] [Commented] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
    [ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15626935#comment-15626935 ]

Davies Liu commented on SPARK-18212:
------------------------------------

cc [~zsxwing]

> Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-18212
>                 URL: https://issues.apache.org/jira/browse/SPARK-18212
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>            Reporter: Davies Liu
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/
[jira] [Created] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
Davies Liu created SPARK-18212:
-------------------------------
             Summary: Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
                 Key: SPARK-18212
                 URL: https://issues.apache.org/jira/browse/SPARK-18212
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
            Reporter: Davies Liu

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/
[jira] [Created] (SPARK-18188) Add checksum for block in Spark
Davies Liu created SPARK-18188:
-------------------------------
             Summary: Add checksum for block in Spark
                 Key: SPARK-18188
                 URL: https://issues.apache.org/jira/browse/SPARK-18188
             Project: Spark
          Issue Type: Improvement
            Reporter: Davies Liu
            Assignee: Davies Liu

There has been an outstanding issue for a long time: https://issues.apache.org/jira/browse/SPARK-4105. Without any checksum for the blocks, it's very hard for us to identify where a bug came from. We should have a way to check whether a block read from a remote node or from disk is correct or not.
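The proposal above boils down to a running checksum computed as a block's bytes stream past. A minimal sketch of the idea (not Spark's eventual implementation, which lives on the JVM side; the function name and chunking here are illustrative):

```python
import zlib

def block_checksum(chunks):
    """Running CRC32 over a block's byte chunks, fed in the order a stream
    would deliver them; returns an unsigned 32-bit value."""
    crc = 0
    for chunk in chunks:
        # zlib.crc32 accepts the previous value, so arbitrary chunking
        # produces the same checksum as hashing the block in one piece.
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

# The writer records the checksum; a reader fetching the block from a remote
# node or from disk recomputes it, and a mismatch localizes the corruption.
data = b"shuffle block payload"
expected = block_checksum([data])
corrupted = b"shuffle block pAyload"
print(block_checksum([corrupted]) == expected)  # False
```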
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
    [ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15612686#comment-15612686 ]

Davies Liu commented on SPARK-18105:
------------------------------------

It turned out that the bug in LZ4 was a false alarm, so I closed the upstream issue. Can't reproduce the behavior now.

> LZ4 failed to decompress a stream of shuffled data
> --------------------------------------------------
>
>                 Key: SPARK-18105
>                 URL: https://issues.apache.org/jira/browse/SPARK-18105
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>            Priority: Blocker
>
> When lz4 is used to compress the shuffle files, it may fail to decompress them with "stream is corrupt":
> {code}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
>         at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
>         at org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
>         at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
>         at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
>         at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
>         at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
>         at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>         at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>         at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
>         at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>         at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>         at org.apache.spark.scheduler.Task.run(Task.scala:86)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> https://github.com/jpountz/lz4-java/issues/89
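The failure mode in this issue can be reproduced in miniature. The sketch below uses zlib as a stand-in for LZ4 (lz4 is not in the Python standard library), but the shape is the same: flip one byte inside a compressed shuffle block and the decompressor rejects the stream, which is what surfaces as "Stream is corrupted" in the executor logs.

```python
import zlib

# Compress a payload, then corrupt a single byte mid-stream, the way a bad
# disk sector or network transfer might.
payload = b"shuffle data " * 100
compressed = bytearray(zlib.compress(payload))
compressed[len(compressed) // 2] ^= 0xFF  # flip one byte

try:
    zlib.decompress(bytes(compressed))
except zlib.error as e:
    # Analogous to java.io.IOException: Stream is corrupted
    print("decompress failed:", e)
```

This is also why the checksum proposal in SPARK-18188 matters: without one, a corrupt-stream error here cannot tell you whether the bytes went bad on the writer's disk, in transit, or in the decompressor.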
[jira] [Updated] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
     [ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-18105:
-------------------------------
    Priority: Major  (was: Blocker)
[jira] [Updated] (SPARK-16078) from_utc_timestamp/to_utc_timestamp may give different result in different timezone
     [ https://issues.apache.org/jira/browse/SPARK-16078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-16078:
-------------------------------
    Fix Version/s: 1.6.3

> from_utc_timestamp/to_utc_timestamp may give different result in different timezone
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-16078
>                 URL: https://issues.apache.org/jira/browse/SPARK-16078
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>             Fix For: 1.6.3, 2.0.0
>
> from_utc_timestamp/to_utc_timestamp should return a deterministic result in any (system default) timezone.
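Conceptually, {{from_utc_timestamp}} reinterprets a UTC wall-clock value in a target zone, and the fix above makes that result independent of whatever timezone the JVM happens to default to. A stdlib-only model of the intended semantics (fixed offsets stand in for named zones; this is an illustration, not Spark's implementation):

```python
from datetime import datetime, timedelta, timezone

def from_utc(ts, offset_hours):
    """Treat the naive timestamp ts as UTC and render it as the wall clock
    at a fixed offset, returning a naive timestamp again. Because the UTC
    interpretation is explicit, the system default timezone never enters."""
    target = timezone(timedelta(hours=offset_hours))
    return (ts.replace(tzinfo=timezone.utc)
              .astimezone(target)
              .replace(tzinfo=None))

t = datetime(2016, 6, 20, 12, 0, 0)
print(from_utc(t, -7))  # 2016-06-20 05:00:00
print(from_utc(t, 9))   # 2016-06-20 21:00:00
```

The bug arose precisely when an implementation routed through the system default zone instead of pinning the UTC interpretation as above.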
[jira] [Updated] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
     [ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-18105:
-------------------------------
    Description:
When lz4 is used to compress the shuffle files, it may fail to decompress them with "stream is corrupt":
{code}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted
        at org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220)
        at org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at com.google.common.io.ByteStreams.read(ByteStreams.java:828)
        at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695)
        at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
        at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
        at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
        at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
https://github.com/jpountz/lz4-java/issues/89

  was:
When lz4 is used to compress the shuffle files, it may fail to decompress them with "stream is corrupt"
https://github.com/jpountz/lz4-java/issues/89
[jira] [Commented] (SPARK-18100) Improve the performance of get_json_object using Gson
[ https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609078#comment-15609078 ] Davies Liu commented on SPARK-18100: [~viirya] Jackson does not support it either > Improve the performance of get_json_object using Gson > - > > Key: SPARK-18100 > URL: https://issues.apache.org/jira/browse/SPARK-18100 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > Based on some benchmark here: > http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, > which said Gson could be much faster than Jackson, maybe it could be used to > improve the performance of get_json_object -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
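For readers unfamiliar with the function being optimized: `get_json_object` takes a JSON string and a path like `$.a.b` and returns the value at that path, or NULL when the path is missing or the string is not valid JSON. The following is a rough plain-Python sketch of those semantics for illustration only; it is not the Spark implementation (Spark returns the result as a string, while this sketch returns the parsed Python value for simplicity).

```python
import json

def get_json_object_sketch(json_str, path):
    """Rough sketch of Spark SQL's get_json_object semantics:
    '$.a.b' walks field a, then field b; missing path or bad JSON -> None.
    Illustration only -- not the actual Spark (Jackson-based) implementation."""
    try:
        obj = json.loads(json_str)
    except ValueError:
        return None  # invalid JSON input yields NULL in Spark
    for key in path.lstrip("$.").split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None  # path not found
        obj = obj[key]
    return obj

print(get_json_object_sketch('{"a": {"b": 1}}', "$.a.b"))  # 1
```

The benchmark question in the ticket is purely about which JSON parser backs this walk (Gson vs. Jackson); the extraction semantics stay the same either way.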
[jira] [Updated] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18105: --- Priority: Blocker (was: Major) > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Davies Liu >Assignee: Davies Liu >Priority: Blocker > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > https://github.com/jpountz/lz4-java/issues/89
[jira] [Created] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
Davies Liu created SPARK-18105: -- Summary: LZ4 failed to decompress a stream of shuffled data Key: SPARK-18105 URL: https://issues.apache.org/jira/browse/SPARK-18105 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu Assignee: Davies Liu When lz4 is used to compress the shuffle files, it may fail to decompress it as "stream is corrupt" https://github.com/jpountz/lz4-java/issues/89
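While the decompression bug is open, one workaround that is sometimes applied (not stated in this ticket; the choice of codec here is an assumption) is to move the shuffle compression off LZ4 entirely via Spark's `spark.io.compression.codec` setting, which accepts `lz4`, `lzf`, or `snappy`:

```shell
# Hypothetical workaround, not from the ticket: switch Spark's block
# compression codec away from LZ4 until the "stream is corrupted" bug is
# fixed. The application jar name below is a placeholder.
spark-submit \
  --conf spark.io.compression.codec=snappy \
  your-application.jar
```

The same property can be set in `spark-defaults.conf`; the trade-off is somewhat lower compression ratio and different CPU cost than LZ4.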
[jira] [Created] (SPARK-18102) Failed to deserialize the result of task
Davies Liu created SPARK-18102: -- Summary: Failed to deserialize the result of task Key: SPARK-18102 URL: https://issues.apache.org/jira/browse/SPARK-18102 Project: Spark Issue Type: Bug Reporter: Davies Liu {code} 16/10/25 15:17:04 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message. java.lang.ClassNotFoundException: org.apache.spark.util*SerializableBuffer not found in com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader@3d98d138 at com.databricks.backend.daemon.driver.ClassLoaders$MultiReplClassLoader.loadClass(ClassLoaders.scala:115) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108) at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:259) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:308) at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:258) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:257) at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:578) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:570) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:180) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel
[jira] [Updated] (SPARK-18100) Improve the performance of get_json_object using Gson
[ https://issues.apache.org/jira/browse/SPARK-18100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18100: --- Issue Type: Improvement (was: Bug) > Improve the performance of get_json_object using Gson > - > > Key: SPARK-18100 > URL: https://issues.apache.org/jira/browse/SPARK-18100 > Project: Spark > Issue Type: Improvement >Reporter: Davies Liu > > Based on some benchmark here: > http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, > which said Gson could be much faster than Jackson, maybe it could be used to > improve the performance of get_json_object
[jira] [Created] (SPARK-18100) Improve the performance of get_json_object using Gson
Davies Liu created SPARK-18100: -- Summary: Improve the performance of get_json_object using Gson Key: SPARK-18100 URL: https://issues.apache.org/jira/browse/SPARK-18100 Project: Spark Issue Type: Bug Reporter: Davies Liu Based on some benchmark here: http://www.doublecloud.org/2015/03/gson-vs-jackson-which-to-use-for-json-in-java/, which said Gson could be much faster than Jackson, maybe it could be used to improve the performance of get_json_object
[jira] [Updated] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt
[ https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18097: --- Description: When the schema of Hive table is broken, we can't drop the table using Spark SQL, for example {code} Error in SQL statement: QueryExecutionException: FAILED: IllegalArgumentException Error: > expected at the position 10 of 'ss:string:struct<>' but ':' is found. at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:336) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:480) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:447) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:481) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:754) at org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.initialize(ParquetHiveSerDe.java:104) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:339) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:194) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:353) at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:351) at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:280) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:269) at org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:351) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:228) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$tableExists$1.apply(HiveExternalCatalog.scala:228) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:72) at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:227) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:255) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:126) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:267) at org.apache.spark.sql.execution.command.ShowCreateTableCommand.run(tables.scala:753) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) at org.apache.spark.sql.Dataset.(Dataset.scala:186) at org.apache.spark.sql.Dataset.(Dataset.scala:167) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65) at org.apache.spark.sql.SparkSession.
[jira] [Updated] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt
[ https://issues.apache.org/jira/browse/SPARK-18097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18097: --- Description: When the schema of Hive table is broken, we can't drop the table using Spark SQL, for example {code} Error in SQL statement: QueryExecutionException: FAILED: IllegalArgumentException Error: > expected at the position 10 of 'ss:string:struct<>' but ':' is found. {code} was: When the schema of Hive table is broken, we can't drop the table using Spark SQL, for example {code} Error in SQL statement: QueryExecutionException: FAILED: IllegalArgumentException Error: > expected at the position 4443 of 'struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_EDDIRECT:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_THIRDPARTY:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DOMINION:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,EBIZAUTOS:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,VAST_HOSTED:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CO:string:struct' but ':' is found. 
{code} > Can't drop a table from Hive if the schema is corrupt > - > > Key: SPARK-18097 > URL: https://issues.apache.org/jira/browse/SPARK-18097 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > > When the schema of Hive table is broken, we can't drop the table using Spark > SQL, for example > {code} > Error in SQL statement: QueryExecutionException: FAILED: > IllegalArgumentException Error: > expected at the position 10 of > 'ss:string:struct<>' but ':' is found. > {code}
[jira] [Created] (SPARK-18097) Can't drop a table from Hive if the schema is corrupt
Davies Liu created SPARK-18097: -- Summary: Can't drop a table from Hive if the schema is corrupt Key: SPARK-18097 URL: https://issues.apache.org/jira/browse/SPARK-18097 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Davies Liu When the schema of Hive table is broken, we can't drop the table using Spark SQL, for example {code} Error in SQL statement: QueryExecutionException: FAILED: IllegalArgumentException Error: > expected at the position 4443 of 'struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_EDDIRECT:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DMI_THIRDPARTY:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,DOMINION:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,EBIZAUTOS:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CODE:struct,EXT_COLOR_NAME:struct,INT_COLOR_CODE:struct,INT_COLOR_NAME:struct,OEM_CODE:struct,TRIM:struct>>,VAST_HOSTED:struct,similarities:struct,AVG_OPTION_DETAIL:struct,EXT_COLOR_CO:string:struct' but ':' is found. {code}
[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18055: --- Description: Try to apply flatMap() on Dataset column which of of type com.A.B Here's a schema of a dataset: {code} root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string {code} flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesnt work on dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. 
There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code} was: Try to apply flatMap() on Dataset column which of of type com.A.B Here's a schema of a dataset: root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesnt work on dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. 
There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code} > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class > com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.e
[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18055: --- Description: Try to apply flatMap() on Dataset column which of of type com.A.B Here's a schema of a dataset: {code} root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string {code} flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesnt work on dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.A.B in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. 
There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code} was: Try to apply flatMap() on Dataset column which of of type com.A.B Here's a schema of a dataset: {code} root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string {code} flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesnt work on dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. 
There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code} > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at
[jira] [Created] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
Davies Liu created SPARK-18055: -- Summary: Dataset.flatMap can't work with types from customized jar Key: SPARK-18055 URL: https://issues.apache.org/jira/browse/SPARK-18055 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Davies Liu Try to apply flatMap() on a Dataset column which is of type com.A.B Here's a schema of a dataset: root |-- id: string (nullable = true) |-- outputs: array (nullable = true) ||-- element: string flatMap works on RDD {code} ds.rdd.flatMap(_.outputs) {code} flatMap doesn't work on the Dataset and gives the following error {code} ds.flatMap(_.outputs) {code} The exception: {code} scala.ScalaReflectionException: class com.relateiq.company.CompanySourceOutputMessage in JavaMirror … not found at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) at line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) at org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) {code} Spoke to Michael Armbrust and he confirmed it as a Dataset bug. There is a workaround using explode() {code} ds.select(explode(col("outputs"))) {code}
[jira] [Updated] (SPARK-18053) ARRAY equality is broken in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-18053: --- Assignee: Wenchen Fan > ARRAY equality is broken in Spark 2.0 > - > > Key: SPARK-18053 > URL: https://issues.apache.org/jira/browse/SPARK-18053 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Cheng Lian >Assignee: Wenchen Fan > Labels: correctness > > The following Spark shell reproduces this issue: > {code} > case class Test(a: Seq[Int]) > Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t") > sql("SELECT a FROM t WHERE a = array(1)").show() > // +---+ > // | a| > // +---+ > // +---+ > sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show() > // +---+ > // | a| > // +---+ > // |[1]| > // +---+ > {code}
[jira] [Created] (SPARK-18037) Event listener should be aware of multiple tries of same stage
Davies Liu created SPARK-18037: -- Summary: Event listener should be aware of multiple tries of same stage Key: SPARK-18037 URL: https://issues.apache.org/jira/browse/SPARK-18037 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu A stage could be resubmitted before all the tasks from the previous attempt have finished; the event listener then mixes the attempts up, producing a confusing count of active tasks (it can go negative).
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15592514#comment-15592514 ] Davies Liu commented on SPARK-10915: [~jason.white] When an aggregate function is applied, the order of the input rows is not defined (even if you have an ORDER BY before the aggregate). If the order matters, you will have to use collect_list and a UDF. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support python defined lambdas.
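The collect_list-plus-UDF pattern recommended in the comment above can be shown in plain Python terms (not PySpark API; the user/timestamp data below is made up for illustration): because rows reach an aggregate in no defined order, first collect each group's values, then sort and aggregate inside your own function.

```python
from collections import defaultdict

# Plain-Python sketch of the suggested pattern: collect per-group values
# (the role of Spark's collect_list), then impose the order inside the
# "UDF" before aggregating, since the aggregate's input order is undefined.
rows = [
    {"user": "a", "ts": 3, "event": "buy"},
    {"user": "a", "ts": 1, "event": "view"},
    {"user": "a", "ts": 2, "event": "cart"},
]

groups = defaultdict(list)
for row in rows:  # analogous to groupBy("user").agg(collect_list(...))
    groups[row["user"]].append((row["ts"], row["event"]))

def first_event(pairs):
    """The 'UDF': sort by timestamp ourselves, then take the earliest event."""
    return sorted(pairs)[0][1]

result = {user: first_event(pairs) for user, pairs in groups.items()}
print(result)  # {'a': 'view'}
```

The key point is that the sort happens inside the per-group function, not in a preceding ORDER BY, which Spark is free to ignore across the shuffle.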
[jira] [Created] (SPARK-18032) Spark test failed as OOM in jenkins
Davies Liu created SPARK-18032: -- Summary: Spark test failed as OOM in jenkins Key: SPARK-18032 URL: https://issues.apache.org/jira/browse/SPARK-18032 Project: Spark Issue Type: Bug Components: Tests Reporter: Davies Liu Assignee: Josh Rosen I saw some tests fail with OOM recently, for example: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/1998/console#l10n-footer Maybe we should increase the heap size, since we keep adding more stuff to Spark and its tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18031) Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality
Davies Liu created SPARK-18031: -- Summary: Flaky test: org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite basic functionality Key: SPARK-18031 URL: https://issues.apache.org/jira/browse/SPARK-18031 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.scheduler.ExecutorAllocationManagerSuite&test_name=basic+functionality -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18030) Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite
Davies Liu created SPARK-18030: -- Summary: Flaky test: org.apache.spark.sql.streaming.FileStreamSourceSuite Key: SPARK-18030 URL: https://issues.apache.org/jira/browse/SPARK-18030 Project: Spark Issue Type: Bug Components: Streaming Reporter: Davies Liu Assignee: Tathagata Das https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.streaming.FileStreamSourceSuite&test_name=when+schema+inference+is+turned+on%2C+should+read+partition+data -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17388) Support for inferring type date/timestamp/decimal for partition column
[ https://issues.apache.org/jira/browse/SPARK-17388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17388. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14947 [https://github.com/apache/spark/pull/14947] > Support for inferring type date/timestamp/decimal for partition column > -- > > Key: SPARK-17388 > URL: https://issues.apache.org/jira/browse/SPARK-17388 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > Fix For: 2.1.0 > > > Currently, Spark only supports inferring {{IntegerType}}, {{LongType}}, > {{DoubleType}} and {{StringType}}. > {{DecimalType}} is tried, but a value is never actually inferred as > {{DecimalType}} because {{DoubleType}} is tried first. > Also, {{DateType}} and {{TimestampType}} could be inferred. It seems > pretty common to use both for a partition column. > It'd be great if values could be inferred as these types rather than just {{StringType}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
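The requested behavior amounts to a try-in-order cast chain over the raw partition-directory string, with the narrower or more specific types tried before the ones that would swallow them. The sketch below is a hedged plain-Python illustration of that idea, not Spark's implementation; it tries decimal before any floating type (the opposite of the problematic order the issue describes) and recognizes ISO dates before falling back to string:

```python
# Illustrative type-inference chain for partition values: try the more
# specific parsers first, fall back to the raw string when none apply.
from datetime import date
from decimal import Decimal, InvalidOperation

def infer_partition_value(raw):
    # Order matters: int before Decimal, Decimal before any float type
    # (otherwise decimals would be claimed by the float parser), dates
    # next, string as the final fallback.
    for cast in (int, Decimal, date.fromisoformat):
        try:
            return cast(raw)
        except (ValueError, InvalidOperation):
            pass
    return raw  # StringType-like fallback

assert infer_partition_value("10") == 10
assert infer_partition_value("1.5") == Decimal("1.5")
assert infer_partition_value("2016-10-01") == date(2016, 10, 1)
assert infer_partition_value("foo") == "foo"
```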
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15583638#comment-15583638 ] Davies Liu commented on SPARK-10915: Currently all the aggregate functions are implemented in Scala and execute one row at a time. This will not work for a Python UDAF; the per-row overhead between the JVM and the Python process would make it extremely slow. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support Python-defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python
[ https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582918#comment-15582918 ] Davies Liu commented on SPARK-10915: Python UDFs are executed in batch mode to achieve reasonable performance. A UDAF would be much harder to implement in batch mode, especially when it's used together with other aggregate functions. One possible solution could be to apply a Python UDF after collect_list; you can already do this as a workaround today. > Add support for UDAFs in Python > --- > > Key: SPARK-10915 > URL: https://issues.apache.org/jira/browse/SPARK-10915 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Justin Uang > > This should support Python-defined lambdas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17845: --- Fix Version/s: 2.1.0 > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > {code} > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > In Spark's DataFrame API, we use integer values to indicate relative position: > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue means "UNBOUNDED FOLLOWING" > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsBetween(0, Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetween(Long.MinValue, Long.MaxValue) > {code} > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > 1. The API is not self-evident. There is no way for a new user to figure out > how to indicate an unbounded frame by looking at just the API. The user has > to read the doc to figure this out. > 2. It is weird that Long.MinValue or Long.MaxValue has a special meaning. > 3. 
Different languages have different min/max values, e.g. in Python we use > -sys.maxsize and +sys.maxsize. > To make this API less confusing, we have a few options: > Option 1. Add the following (additional) methods: > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) // this one exists already > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > {code} > This is obviously very verbose, but is very similar to how these functions > are done in SQL, and is perhaps the most obvious to end users, especially if > they come from SQL background. > Option 2. Decouple the specification for frame begin and frame end into two > functions. Assume the boundary is unlimited unless specified. > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsToCurrent() or Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsFromCurrent() or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > // no need to specify > {code} > If we go with option 2, we should throw exceptions if users specify multiple > from's or to's. 
A variant of option 2 is to require explicit specification > of begin/end even in the case of unbounded boundary, e.g.: > {code} > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
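Option 2's semantics (separate calls for the frame start and end, both defaulting to unbounded, with an error on double specification as the JIRA suggests) can be sketched in a few lines of Python. All names here are illustrative; this is a design sketch, not the API Spark ultimately shipped:

```python
# Hypothetical sketch of "Option 2": decoupled rows_from/rows_to builder
# calls, unbounded unless specified, raising if a boundary is set twice.

def _bound_sql(n, unbounded_word):
    """Render one frame bound as ANSI SQL text; None means unbounded."""
    if n is None:
        return f"UNBOUNDED {unbounded_word}"
    if n == 0:
        return "CURRENT ROW"
    return f"{-n} PRECEDING" if n < 0 else f"{n} FOLLOWING"

class RowFrame:
    def __init__(self):
        self._start = None       # None = unbounded
        self._end = None
        self._start_set = False
        self._end_set = False

    def rows_from(self, n):
        if self._start_set:
            raise ValueError("multiple from's specified")
        self._start, self._start_set = n, True
        return self

    def rows_to(self, n):
        if self._end_set:
            raise ValueError("multiple to's specified")
        self._end, self._end_set = n, True
        return self

    def sql(self):
        return ("ROWS BETWEEN " + _bound_sql(self._start, "PRECEDING")
                + " AND " + _bound_sql(self._end, "FOLLOWING"))

assert RowFrame().rows_from(-3).rows_to(3).sql() == \
    "ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING"
assert RowFrame().rows_to(-3).sql() == \
    "ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING"
assert RowFrame().sql() == \
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING"
```

The sketch makes the trade-off concrete: unbounded ends need no sentinel value at all, at the cost of a stateful builder that must police repeated calls.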
[jira] [Resolved] (SPARK-17845) Improve window function frame boundary API in DataFrame
[ https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17845. Resolution: Fixed > Improve window function frame boundary API in DataFrame > --- > > Key: SPARK-17845 > URL: https://issues.apache.org/jira/browse/SPARK-17845 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > ANSI SQL uses the following to specify the frame boundaries for window > functions: > {code} > ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > {code} > In Spark's DataFrame API, we use integer values to indicate relative position: > - 0 means "CURRENT ROW" > - -1 means "1 PRECEDING" > - Long.MinValue means "UNBOUNDED PRECEDING" > - Long.MaxValue means "UNBOUNDED FOLLOWING" > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetween(Long.MinValue, -3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetween(Long.MinValue, 0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING > Window.rowsBetween(0, Long.MaxValue) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetween(Long.MinValue, Long.MaxValue) > {code} > I think using numeric values to indicate relative positions is actually a > good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate > unbounded ends is pretty confusing: > 1. The API is not self-evident. There is no way for a new user to figure out > how to indicate an unbounded frame by looking at just the API. The user has > to read the doc to figure this out. > 2. It is weird that Long.MinValue or Long.MaxValue has a special meaning. > 3. 
Different languages have different min/max values, e.g. in Python we use > -sys.maxsize and +sys.maxsize. > To make this API less confusing, we have a few options: > Option 1. Add the following (additional) methods: > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsBetween(-3, +3) // this one exists already > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsBetweenUnboundedPrecedingAnd(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsBetweenUnboundedPrecedingAndCurrentRow() > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsBetweenCurrentRowAndUnboundedFollowing() > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing() > {code} > This is obviously very verbose, but is very similar to how these functions > are done in SQL, and is perhaps the most obvious to end users, especially if > they come from SQL background. > Option 2. Decouple the specification for frame begin and frame end into two > functions. Assume the boundary is unlimited unless specified. > {code} > // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING > Window.rowsFrom(-3).rowsTo(3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING > Window.rowsTo(-3) > // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW > Window.rowsToCurrent() or Window.rowsTo(0) > // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING > Window.rowsFromCurrent() or Window.rowsFrom(0) > // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING > // no need to specify > {code} > If we go with option 2, we should throw exceptions if users specify multiple > from's or to's. 
A variant of option 2 is to require explicit specification > of begin/end even in the case of unbounded boundary, e.g.: > {code} > Window.rowsFromBeginning().rowsTo(-3) > or > Window.rowsFromUnboundedPreceding().rowsTo(-3) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566858#comment-15566858 ] Davies Liu commented on SPARK-15621: [~rezasafi] We usually do not backport this kind of improvement; it's too large and risky for a maintenance release (2.0.x), sorry about that. > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Davies Liu > Fix For: 2.1.0 > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with > the identity `lambda x: x` udf function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17856) JVM Crash during tests: pyspark.mllib.linalg.distributed
Davies Liu created SPARK-17856: -- Summary: JVM Crash during tests: pyspark.mllib.linalg.distributed Key: SPARK-17856 URL: https://issues.apache.org/jira/browse/SPARK-17856 Project: Spark Issue Type: Bug Reporter: Davies Liu https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66531/consoleFull -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17806) Incorrect result when work with data from parquet
[ https://issues.apache.org/jira/browse/SPARK-17806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17806. Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 Issue resolved by pull request 15390 [https://github.com/apache/spark/pull/15390] > Incorrect result when work with data from parquet > - > > Key: SPARK-17806 > URL: https://issues.apache.org/jira/browse/SPARK-17806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov >Assignee: Davies Liu >Priority: Blocker > Labels: correctness > Fix For: 2.0.2, 2.1.0 > > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.{StructField, StructType} > import org.apache.spark.sql.types.DataTypes._ > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq( > """{"a":1,"b":1,"c":1}""", > """{"a":1,"b":1,"c":2}""" > )) > sc.read.schema(StructType(Seq( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("c", LongType) > ))).json(jsonRDD).write.parquet("/tmp/test") > val df = sc.read.load("/tmp/test") > df.join(df, Seq("a", "b", "c"), "left_outer").show() > {code} > returns: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 1| > | 1| 1| 2| > | 1| 1| 2| > +---+---+---+ > {code} > Expected result: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 2| > +---+---+---+ > {code} > If I use this code without saving to parquet it works fine. If you change > type of `c` column to `IntegerType` it also works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1759#comment-1759 ] Davies Liu commented on SPARK-17738: I will look into that. > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17806) Incorrect result when work with data from parquet
[ https://issues.apache.org/jira/browse/SPARK-17806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-17806: -- Assignee: Davies Liu > Incorrect result when work with data from parquet > - > > Key: SPARK-17806 > URL: https://issues.apache.org/jira/browse/SPARK-17806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov >Assignee: Davies Liu >Priority: Blocker > Labels: correctness > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.{StructField, StructType} > import org.apache.spark.sql.types.DataTypes._ > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq( > """{"a":1,"b":1,"c":1}""", > """{"a":1,"b":1,"c":2}""" > )) > sc.read.schema(StructType(Seq( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("c", LongType) > ))).json(jsonRDD).write.parquet("/tmp/test") > val df = sc.read.load("/tmp/test") > df.join(df, Seq("a", "b", "c"), "left_outer").show() > {code} > returns: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 1| > | 1| 1| 2| > | 1| 1| 2| > +---+---+---+ > {code} > Expected result: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 2| > +---+---+---+ > {code} > If I use this code without saving to parquet it works fine. If you change > type of `c` column to `IntegerType` it also works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17806) Incorrect result when work with data from parquet
[ https://issues.apache.org/jira/browse/SPARK-17806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17806: --- Priority: Blocker (was: Critical) > Incorrect result when work with data from parquet > - > > Key: SPARK-17806 > URL: https://issues.apache.org/jira/browse/SPARK-17806 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Vitaly Gerasimov >Priority: Blocker > Labels: correctness > > {code} > import org.apache.spark.SparkConf > import org.apache.spark.sql.SparkSession > import org.apache.spark.sql.types.{StructField, StructType} > import org.apache.spark.sql.types.DataTypes._ > val sc = SparkSession.builder().config(new > SparkConf().setMaster("local")).getOrCreate() > val jsonRDD = sc.sparkContext.parallelize(Seq( > """{"a":1,"b":1,"c":1}""", > """{"a":1,"b":1,"c":2}""" > )) > sc.read.schema(StructType(Seq( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("c", LongType) > ))).json(jsonRDD).write.parquet("/tmp/test") > val df = sc.read.load("/tmp/test") > df.join(df, Seq("a", "b", "c"), "left_outer").show() > {code} > returns: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 1| > | 1| 1| 2| > | 1| 1| 2| > +---+---+---+ > {code} > Expected result: > {code} > +---+---+---+ > | a| b| c| > +---+---+---+ > | 1| 1| 1| > | 1| 1| 2| > +---+---+---+ > {code} > If I use this code without saving to parquet it works fine. If you change > type of `c` column to `IntegerType` it also works fine. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16922) Query with Broadcast Hash join fails due to executor OOM in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15549494#comment-15549494 ] Davies Liu commented on SPARK-16922: Thanks for the feedback, that's reasonable. > Query with Broadcast Hash join fails due to executor OOM in Spark 2.0 > - > > Key: SPARK-16922 > URL: https://issues.apache.org/jira/browse/SPARK-16922 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > A query which used to work in Spark 1.6 fails with executor OOM in 2.0. > Stack trace - > {code} > at > org.apache.spark.unsafe.types.UTF8String.getBytes(UTF8String.java:229) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.hash$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$agg_VectorizedHashMap.findOrInsert(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Query plan in Spark 1.6 > {code} > == Physical Plan == > TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Final,isDistinct=false)], output=[field1#101,field3#3]) > +- TungstenExchange hashpartitioning(field1#101,200), None >+- TungstenAggregate(key=[field1#101], functions=[(sum((field2#74 / > 100.0)),mode=Partial,isDistinct=false)], output=[field1#101,sum#111]) > +- Project [field1#101,field2#74] > +- BroadcastHashJoin [field5#63L], [cast(cast(field4#97 as > decimal(20,0)) as bigint)], BuildRight > :- ConvertToUnsafe > : +- HiveTableScan [field2#74,field5#63L], MetastoreRelation > foo, table1, Some(a), [(ds#57 >= 2013-10-01),(ds#57 <= 2013-12-31)] > +- ConvertToUnsafe >+- HiveTableScan [field1#101,field4#97], MetastoreRelation > foo, table2, Some(b) > {code} > Query plan in 2.0 > {code} > == Physical Plan == > *HashAggregate(keys=[field1#160], functions=[sum((field2#133 / 100.0))]) > +- Exchange hashpartitioning(field1#160, 200) >+- *HashAggregate(keys=[field1#160], functions=[partial_sum((field2#133 / > 100.0))]) > +- *Project [field2#133, field1#160] > +- *BroadcastHashJoin [field5#122L], [cast(cast(field4#156 as > decimal(20,0)) as bigint)], Inner, BuildRight > :- *Filter isnotnull(field5#122L) > : +- HiveTableScan [field5#122L, field2#133], MetastoreRelation > foo, table1, a, [isnotnull(ds#116), (ds#116 >= 2013-10-01), (ds#116 <= > 2013-12-31)] > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(cast(input[0, string, false] as > decimal(20,0)) as bigint))) >+- *Filter isnotnull(field4#156) > +- HiveTableScan [field4#156, field1#160], > MetastoreRelation foo, table2, b > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-15390. Resolution: Fixed > Memory management issue in complex DataFrame join and filter > > > Key: SPARK-15390 > URL: https://issues.apache.org/jira/browse/SPARK-15390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: branch-2.0, 16 workers >Reporter: Joseph K. Bradley >Assignee: Davies Liu > Fix For: 2.0.1 > > > See [SPARK-15389] for a description of the code which produces this bug. I > am filing this as a separate JIRA since the bug in 2.0 is different. > In 2.0, the code fails with some memory management error. Here is the > stacktrace: > {code} > OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support > was removed in 8.0 > 16/05/18 19:23:16 ERROR Uncaught throwable from user code: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange SinglePartition, None > +- WholeStageCodegen >: +- TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L]) >: +- Project >:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None >: :- INPUT >: +- Project [id#110L] >: +- Filter (degree#115 > 200) >: +- TungstenAggregate(key=[id#110L], > functions=[(count(1),mode=Final,isDistinct=false)], > output=[id#110L,degree#115]) >:+- INPUT >:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint])) >: +- WholeStageCodegen >: : +- Project [row#66.id AS id#70L] >: : +- Filter isnotnull(row#66.id) >: :+- INPUT >: +- Scan ExistingRDD[row#66,uniq_id#67] >+- Exchange hashpartitioning(id#110L, 200), None > +- WholeStageCodegen > : +- TungstenAggregate(key=[id#110L], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[id#110L,count#136L]) > : +- Filter isnotnull(id#110L) > :+- INPUT > +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L] > +- WholeStageCodegen 
>: +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT > (src#2L = dst#3L)) >: +- INPUT >+- InMemoryTableScan [src#2L,dst#3L], > [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation > [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, > offheap=false, deserialized=true, replication=1), WholeStageCodegen, None > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240) > at > 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNe
[jira] [Updated] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15390: --- Fix Version/s: 2.0.1 > Memory management issue in complex DataFrame join and filter > > > Key: SPARK-15390 > URL: https://issues.apache.org/jira/browse/SPARK-15390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: branch-2.0, 16 workers >Reporter: Joseph K. Bradley >Assignee: Davies Liu > Fix For: 2.0.1 > > > See [SPARK-15389] for a description of the code which produces this bug. I > am filing this as a separate JIRA since the bug in 2.0 is different. > In 2.0, the code fails with some memory management error. Here is the > stacktrace: > {code} > OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support > was removed in 8.0 > 16/05/18 19:23:16 ERROR Uncaught throwable from user code: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: > Exchange SinglePartition, None > +- WholeStageCodegen >: +- TungstenAggregate(key=[], > functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#170L]) >: +- Project >:+- BroadcastHashJoin [id#70L], [id#110L], Inner, BuildLeft, None >: :- INPUT >: +- Project [id#110L] >: +- Filter (degree#115 > 200) >: +- TungstenAggregate(key=[id#110L], > functions=[(count(1),mode=Final,isDistinct=false)], > output=[id#110L,degree#115]) >:+- INPUT >:- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint])) >: +- WholeStageCodegen >: : +- Project [row#66.id AS id#70L] >: : +- Filter isnotnull(row#66.id) >: :+- INPUT >: +- Scan ExistingRDD[row#66,uniq_id#67] >+- Exchange hashpartitioning(id#110L, 200), None > +- WholeStageCodegen > : +- TungstenAggregate(key=[id#110L], > functions=[(count(1),mode=Partial,isDistinct=false)], > output=[id#110L,count#136L]) > : +- Filter isnotnull(id#110L) > :+- INPUT > +- Generate explode(array(src#2L, dst#3L)), false, false, [id#110L] > +- 
WholeStageCodegen >: +- Filter ((isnotnull(src#2L) && isnotnull(dst#3L)) && NOT > (src#2L = dst#3L)) >: +- INPUT >+- InMemoryTableScan [src#2L,dst#3L], > [isnotnull(src#2L),isnotnull(dst#3L),NOT (src#2L = dst#3L)], InMemoryRelation > [src#2L,dst#3L], true, 1, StorageLevel(disk=true, memory=true, > offheap=false, deserialized=true, replication=1), WholeStageCodegen, None > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50) > at > org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:113) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:233) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate.inputRDDs(TungstenAggregate.scala:134) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:348) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:287) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withN
[jira] [Comment Edited] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546675#comment-15546675 ] Davies Liu edited comment on SPARK-15390 at 10/4/16 9:11 PM: - @Iulian Dragos I think this is a different issue, fixed by https://github.com/apache/spark/pull/14373 in 2.0.1. was (Author: davies): @Iulian Dragos I think this is a different issue, fixed by https://github.com/apache/spark/pull/14373 and https://github.com/apache/spark/pull/14464/files in 2.0.1.
[jira] [Updated] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15390: --- Fix Version/s: (was: 2.0.0)
[jira] [Commented] (SPARK-15390) Memory management issue in complex DataFrame join and filter
[ https://issues.apache.org/jira/browse/SPARK-15390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15546675#comment-15546675 ] Davies Liu commented on SPARK-15390: @Iulian Dragos I think this is a different issue, fixed by https://github.com/apache/spark/pull/14373 and https://github.com/apache/spark/pull/14464/files in 2.0.1.
[jira] [Resolved] (SPARK-17679) Remove unnecessary Py4J ListConverter patch
[ https://issues.apache.org/jira/browse/SPARK-17679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17679. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15254 [https://github.com/apache/spark/pull/15254] > Remove unnecessary Py4J ListConverter patch > --- > > Key: SPARK-17679 > URL: https://issues.apache.org/jira/browse/SPARK-17679 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Jason White >Priority: Minor > Fix For: 2.1.0 > > > In SPARK-6949 davies documented a couple of bugs with Py4J that prevented > Spark from registering a converter for date and datetime objects. Patched in > https://github.com/apache/spark/pull/5570. > Specifically https://github.com/bartdag/py4j/issues/160 dealt with > ListConverter automatically converting bytearrays into ArrayList instead of > leaving it alone. > Py4J #160 has since been fixed in Py4J, since the 0.9 release a couple of > months after Spark #5570. According to spark-core's pom.xml, we're using > 0.10.3. > We should remove this patch on ListConverter since the upstream package no > longer has this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
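The bytearray hazard described above can be illustrated in plain Python. This is a hypothetical sketch of the conversion pitfall only, not Py4J's actual ListConverter code; both helper functions below are illustrative names:

```python
# bytearray supports the sequence protocol, so a converter that treats
# every sequence as a Java List would wrongly explode raw bytes into a
# list of ints instead of passing them through as a byte array.
payload = bytearray(b"\x00\x80\xff")

def naive_list_convert(obj):
    # hypothetical over-eager rule: anything iterable becomes a list
    return list(obj)

def patched_convert(obj):
    # leave byte-like objects alone; convert only real sequences
    if isinstance(obj, (bytes, bytearray)):
        return obj
    return list(obj)

assert naive_list_convert(payload) == [0, 128, 255]            # bytes mangled
assert patched_convert(payload) == bytearray(b"\x00\x80\xff")  # preserved
```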
[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17738: --- Fix Version/s: (was: 2.2.0) 2.1.0 > Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP > append/extract > -- > > Key: SPARK-17738 > URL: https://issues.apache.org/jira/browse/SPARK-17738 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/
[jira] [Resolved] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
[ https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17738. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15305 [https://github.com/apache/spark/pull/15305]
[jira] [Created] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract
Davies Liu created SPARK-17738: -- Summary: Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract Key: SPARK-17738 URL: https://issues.apache.org/jira/browse/SPARK-17738 Project: Spark Issue Type: Bug Reporter: Davies Liu Assignee: Davies Liu https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/
[jira] [Updated] (SPARK-17494) Floor/ceil of decimal returns wrong result if it's in compact format
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17494: --- Summary: Floor/ceil of decimal returns wrong result if it's in compact format (was: Floor function rounds up during join) > Floor/ceil of decimal returns wrong result if it's in compact format > > > Key: SPARK-17494 > URL: https://issues.apache.org/jira/browse/SPARK-17494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Gokhan Civan >Assignee: Davies Liu > Labels: correctness > > If you create tables as follows: > create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num; > create table b as select 'A' as str; > Then > select floor(num) from a; > returns 10 > but > select floor(num) from a join b on a.str = b.str; > returns 11
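For reference, the floor/ceil semantics that the compact-decimal path violated can be checked against Python's standard decimal module. This is a plain-Python sketch of the expected behavior, not Spark's Decimal implementation:

```python
import math
from decimal import Decimal

num = Decimal("10.5")
# floor rounds toward negative infinity; it must never round 10.5 up to 11
assert math.floor(num) == 10
assert math.ceil(num) == 11
# and for a negative value, floor goes down to -11
assert math.floor(Decimal("-10.5")) == -11
```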
[jira] [Updated] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17100: --- Fix Version/s: (was: 2.2.0) 2.1.0 > pyspark filter on a udf column after join gives > java.lang.UnsupportedOperationException > --- > > Key: SPARK-17100 > URL: https://issues.apache.org/jira/browse/SPARK-17100 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3. >Reporter: Tim Sell >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > Attachments: bug.py, test_bug.py > > > In pyspark, when filtering on a udf derived column after some join types, > the optimized logical plan results in a > java.lang.UnsupportedOperationException. > I could not replicate this in scala code from the shell, just python. It is a > pyspark regression from spark 1.6.2. > This can be replicated with: bin/spark-submit bug.py > {code:python:title=bug.py} > import pyspark.sql.functions as F > from pyspark.sql import Row, SparkSession > if __name__ == '__main__': > spark = SparkSession.builder.appName("test").getOrCreate() > left = spark.createDataFrame([Row(a=1)]) > right = spark.createDataFrame([Row(a=1)]) > df = left.join(right, on='a', how='left_outer') > df = df.withColumn('b', F.udf(lambda x: 'x')(df.a)) > df = df.filter('b = "x"') > df.explain(extended=True) > {code} > The output is: > {code} > == Parsed Logical Plan == > 'Filter ('b = x) > +- Project [a#0L, (a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Analyzed Logical Plan == > a: bigint, b: string > Filter (b#8 = x) > +- Project [a#0L, (a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Optimized Logical Plan == > java.lang.UnsupportedOperationException: Cannot evaluate expression: > (input[0, bigint, true]) > == Physical Plan == > 
java.lang.UnsupportedOperationException: Cannot evaluate expression: > (input[0, bigint, true]) > {code} > It fails when the join is: > * how='outer', on=column expression > * how='left_outer', on=string or column expression > * how='right_outer', on=string or column expression > It passes when the join is: > * how='inner', on=string or column expression > * how='outer', on=string > I made some tests to demonstrate each of these. > Run with bin/spark-submit test_bug.py
[jira] [Resolved] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17100. Resolution: Fixed Fix Version/s: 2.2.0 2.0.1 Issue resolved by pull request 15103 [https://github.com/apache/spark/pull/15103]
[jira] [Updated] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16439: --- Fix Version/s: (was: 2.2.0) 2.1.0 > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Maciej Bryński >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect
[jira] [Assigned] (SPARK-17494) Floor function rounds up during join
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-17494: -- Assignee: Davies Liu
[jira] [Updated] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16439: --- Assignee: Davies Liu (was: Maciej Bryński)
[jira] [Resolved] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-16439. Resolution: Fixed Fix Version/s: (was: 2.0.0) 2.2.0 2.0.1 Issue resolved by pull request 15106 [https://github.com/apache/spark/pull/15106]
[jira] [Reopened] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-16439: We could bring the separator back for better readability.
[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491744#comment-15491744 ] Davies Liu commented on SPARK-16439: The separator was added on purpose; otherwise it's very difficult to read the numbers (especially the number of rows), since we need to count the digits to realize how large the number is. I think we should still keep it and fix the locale issue (using English should be enough); I will send a PR to add it back.
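The readability argument in the comment above can be sketched with Python's format specification, which groups digits without depending on the system locale. This is an illustration of the idea only; Spark's UI does its own Java-side formatting:

```python
# grouping digits with a separator makes the magnitude legible at a glance
rows = 123456789
assert f"{rows:,}" == "123,456,789"   # English-style grouping, no locale needed
assert f"{rows:,}".count(",") == 2

# without separators the reader has to count digits to gauge the magnitude
assert len(str(rows)) == 9
```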