[
https://issues.apache.org/jira/browse/SPARK-19037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15787733#comment-15787733
]
J.P Feng commented on SPARK-19037:
----------------------------------
Error log when doing dropDuplicates with a sub-query in spark-shell:
scala> spark.sql("select * from mytest limit 10").dropDuplicates("name").show
120.073: [GC [PSYoungGen: 233234K->12801K(282112K)] 378713K->165495K(624128K), 1.8045200 secs] [Times: user=6.52 sys=7.43, real=1.80 secs]
[Stage 0:> (0 + 8) / 16]
124.182: [GC [PSYoungGen: 227841K->45026K(279552K)] 380535K->202214K(621568K), 0.9970190 secs] [Times: user=2.87 sys=4.96, real=1.00 secs]
[Stage 0:> (0 + 16) / 16]
16/12/30 21:58:21 ERROR (Executor): Exception in task 0.0 in stage 1.0 (TID 16)
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
16/12/30 21:58:21 WARN (TaskSetManager): Lost task 0.0 in stage 1.0 (TID 16, localhost, executor driver): java.lang.NullPointerException (same stack trace as above)
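Since the NPE originates inside generated code (GeneratedClass$GeneratedIterator via WholeStageCodegenExec), one way to check whether the codegen path is involved is to disable whole-stage code generation and re-run the query. This is a diagnostic sketch, not a fix; it assumes a running spark-shell session with the reporter's `mytest` table available:

```scala
// Diagnostic sketch (assumes a spark-shell session with table `mytest`):
// turn off whole-stage code generation, then re-run the failing query.
// If the NullPointerException disappears, the problem lies in the
// generated aggregate code path shown in the stack trace.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
spark.sql("select * from mytest limit 10").dropDuplicates("name").show()
```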
> Run count(distinct x) from sub query found some errors
> ------------------------------------------------------
>
> Key: SPARK-19037
> URL: https://issues.apache.org/jira/browse/SPARK-19037
> Project: Spark
> Issue Type: Bug
> Components: Spark Shell, SQL
> Affects Versions: 2.1.0
> Environment: spark 2.1.0, scala 2.11
> Reporter: J.P Feng
> Labels: distinct, sparkSQL, sub-query
>
> When I use spark-shell or spark-sql to execute count(distinct name) over a
> sub-query, errors occur:
> select count(distinct name) from (select * from mytest limit 10) as a
> If I run the same query in hive-server2, I get the correct result.
> If I just execute select count(name) from (select * from mytest limit 10)
> as a, I also get the right result.
> Besides, I found the same errors when I use distinct() or groupBy() with a
> sub-query.
> I think there may be a bug when doing key-reduce jobs with a sub-query.
> I will add the errors in a new comment.
> I also tested dropDuplicates in spark-shell:
> 1. spark.sql("select * from mytest limit 10").dropDuplicates("name").show
> throws an exception
> 2. spark.table("mytest").dropDuplicates("name").show
> returns the correct result
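The two dropDuplicates cases above can be reproduced in a self-contained spark-shell session. This is a sketch: the table name `mytest` and its `name` column come from the report, but here a small inline temp view stands in for the reporter's Hive table:

```scala
// Minimal reproduction sketch (assumes a spark-shell session, where
// `spark` is the SparkSession and spark.implicits are importable).
import spark.implicits._

// A small temp view standing in for the reporter's `mytest` table.
Seq(("a", 1), ("b", 2), ("a", 3))
  .toDF("name", "value")
  .createOrReplaceTempView("mytest")

// Case 1 (reported to throw a NullPointerException on 2.1.0):
// dropDuplicates on top of a LIMIT sub-query.
spark.sql("select * from mytest limit 10").dropDuplicates("name").show()

// Case 2 (reported to work): dropDuplicates directly on the table.
spark.table("mytest").dropDuplicates("name").show()

// The failing SQL form from the issue title:
spark.sql("select count(distinct name) from (select * from mytest limit 10) as a").show()
```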
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)