[
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664035#comment-15664035
]
Babak Alipour commented on SPARK-17788:
---------------------------------------
The details were in the email thread.
Here's the full stack trace:
Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more
than 17179869176 bytes
at
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
at
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
at
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
at
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
> RangePartitioner results in few very large tasks and many small to empty
> tasks
> -------------------------------------------------------------------------------
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
> Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
> ​But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner
> trying to create equal ranges. [1]
> [1]
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of
> the data which roughly equates 1 billion records), other numbers in the
> dataset are as high as 2000. With the RangePartitioner trying to create equal
> ranges, some tasks are becoming almost empty while others are extremely
> large, due to the heavily skewed distribution.
> This is either a bug in Apache Spark or a major limitation of the framework.
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]