[ https://issues.apache.org/jira/browse/SPARK-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sameer Agarwal updated SPARK-6009:
----------------------------------
Fix Version/s: 1.5.0
> IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND ()
> --------------------------------------------------------------------
>
> Key: SPARK-6009
> URL: https://issues.apache.org/jira/browse/SPARK-6009
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.0
> Environment: Centos 7, Hadoop 2.6.0, Hive 0.15.0
> java version "1.7.0_75"
> OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13)
> OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
> Reporter: Paul Barber
> Fix For: 1.5.0
>
>
> Running the following SparkSQL query over JDBC:
> {noformat}
> SELECT *
> FROM FAA
> WHERE Year >= 1998 AND Year <= 1999
> ORDER BY RAND () LIMIT 100000
> {noformat}
> This results in one or more workers throwing the following exception, with
> variations for {{mergeLo}} and {{mergeHi}}.
> {noformat}
> :java.lang.IllegalArgumentException: Comparison method violates its general contract!
> - at java.util.TimSort.mergeHi(TimSort.java:868)
> - at java.util.TimSort.mergeAt(TimSort.java:485)
> - at java.util.TimSort.mergeCollapse(TimSort.java:410)
> - at java.util.TimSort.sort(TimSort.java:214)
> - at java.util.Arrays.sort(Arrays.java:727)
> - at org.spark-project.guava.common.collect.Ordering.leastOf(Ordering.java:708)
> - at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
> - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1138)
> - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1135)
> - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
> - at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> - at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
> - at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
> - at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> - at org.apache.spark.scheduler.Task.run(Task.scala:56)
> - at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> - at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We have tested with both Spark 1.2.0 and Spark 1.2.1 and have seen the same
> error in both. The query sometimes succeeds, but fails more often than not.
> Whilst this sounds similar to bugs 3032 and 3656, we believe it is not the
> same.
> {{ORDER BY RAND ()}} uses TimSort to produce the random ordering by sorting
> a list of random values. Having spent some time looking at the issue with
> jdb, it appears that the problem is triggered by the random values changing
> during the sort. The code responsible is in
> {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala}}
> (class RowOrdering, function compare, line 250), where a new random number
> is taken for the same row, so the same pair of rows can compare differently
> from one call to the next.
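
The failure mode described above can be illustrated outside Spark. The
following is only a minimal Java sketch (Java 8 or later assumed), not Spark's
code and not the fix that went into 1.5.0. The class RandomOrderSketch and its
methods are hypothetical names introduced for illustration. It contrasts a
comparator that draws new random numbers on every call, which is the kind of
comparator TimSort may reject with "Comparison method violates its general
contract!", with an approach that draws one random key per row before sorting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; nothing here is part of Spark.
final class RandomOrderSketch {

    // Pairs a row with a random sort key that is drawn exactly once.
    static final class Keyed<T> {
        final double key;
        final T row;

        Keyed(double key, T row) {
            this.key = key;
            this.row = row;
        }
    }

    // Unsafe pattern: fresh random numbers are drawn on every call, so the
    // same pair of rows can compare differently from one call to the next.
    static <T> Comparator<T> unstableRandomComparator(Random rng) {
        return (a, b) -> Double.compare(rng.nextDouble(), rng.nextDouble());
    }

    // Safer pattern: draw one key per row up front, then sort by that fixed
    // key, so every comparison of the same two rows gives the same answer.
    static <T> List<T> orderByFixedRandomKeys(List<T> rows, Random rng) {
        List<Keyed<T>> keyed = new ArrayList<>(rows.size());
        for (T row : rows) {
            keyed.add(new Keyed<>(rng.nextDouble(), row));
        }
        keyed.sort(Comparator.comparingDouble(k -> k.key));

        List<T> ordered = new ArrayList<>(rows.size());
        for (Keyed<T> k : keyed) {
            ordered.add(k.row);
        }
        return ordered;
    }

    public static void main(String[] args) {
        Random rng = new Random();

        // Ask the unstable comparator about the same pair several times;
        // the answers are not guaranteed to agree with each other.
        Comparator<String> unstable = unstableRandomComparator(rng);
        for (int i = 0; i < 5; i++) {
            System.out.println("compare(a, b) = " + unstable.compare("a", "b"));
        }

        // Random ordering with keys fixed before the sort starts.
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        System.out.println(orderByFixedRandomKeys(rows, rng));
    }
}
```

Sorting a large list with the unstable comparator may or may not throw on any
given run, which matches the report that the query sometimes succeeds. Fixing
the keys before the sort removes the inconsistency regardless of the sort
algorithm used.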