[jira] [Commented] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

Babak Alipour (JIRA) Mon, 14 Nov 2016 06:16:45 -0800

    [ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664035#comment-15664035 ]


Babak Alipour commented on SPARK-17788:
---------------------------------------

The details were in the email thread.
Here's the full stack trace: 

Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
        at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
        at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
        at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
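
For reference, the byte count in the exception message equals (2^31 - 1) * 8, which is just under 16 GiB. This is consistent with a page capped at the largest signed 32-bit count of 8-byte words, though that reading is an assumption here and is not stated in the trace. A minimal Python sketch that checks the arithmetic (the constant names are illustrative and are not taken from the Spark source):

```python
# Illustrative names only; these are not identifiers from the Spark code base.
MAX_INT32 = 2**31 - 1        # largest signed 32-bit integer
BYTES_PER_WORD = 8           # one 64-bit word

limit_from_message = 17179869176          # value quoted in the exception
computed_limit = MAX_INT32 * BYTES_PER_WORD

# Compare the computed value with the one in the stack trace.
print(computed_limit == limit_from_message)

# Express the limit in GiB to show how close it is to 16 GiB.
limit_gib = computed_limit / 2**30
print(limit_gib)
```

So the failing task asked for a single page larger than this limit, which fits the picture of one sort task receiving far more data than the others.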


> RangePartitioner results in few very large tasks and many small to empty tasks
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-17788
>                 URL: https://issues.apache.org/jira/browse/SPARK-17788
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0
>         Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>            Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of
> the data, which roughly equates to 1 billion records); other numbers in the
> dataset are as high as 2000. With the RangePartitioner trying to create equal
> ranges, some tasks become almost empty while others are extremely large,
> due to the heavily skewed distribution.
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E
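
The skew described in the report can be reproduced on synthetic data. The sketch below is a plain Python illustration and does not use Spark or reproduce RangePartitioner's actual algorithm. It assumes the value range [0, 2000] is cut into 10 equal-width ranges, as the report describes, and contrasts that with boundaries taken from a sorted random sample. All function names, the 100,000-row size, and the seeds are hypothetical choices made for the example.

```python
import bisect
import random

def make_skewed_values(n, seed=0):
    """Synthetic data: about 70% of values in [0, 1], the rest in [1, 2000]."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0) if rng.random() < 0.7 else rng.uniform(1.0, 2000.0)
            for _ in range(n)]

def equal_width_bounds(lo, hi, parts):
    """Upper bounds of the first parts-1 ranges when [lo, hi] is cut into equal widths."""
    step = (hi - lo) / parts
    return [lo + step * i for i in range(1, parts)]

def sampled_quantile_bounds(values, parts, sample_size=1000, seed=1):
    """Bounds taken at evenly spaced ranks of a sorted random sample."""
    sample = sorted(random.Random(seed).sample(values, sample_size))
    return [sample[len(sample) * i // parts] for i in range(1, parts)]

def partition_sizes(values, bounds):
    """Count how many values fall into each range defined by the sorted bounds."""
    sizes = [0] * (len(bounds) + 1)
    for v in values:
        sizes[bisect.bisect_left(bounds, v)] += 1
    return sizes

values = make_skewed_values(100_000)
width_sizes = partition_sizes(values, equal_width_bounds(0.0, 2000.0, 10))
quantile_sizes = partition_sizes(values, sampled_quantile_bounds(values, 10))

print("equal-width ranges:", width_sizes)
print("sampled quantiles: ", quantile_sizes)
```

With equal-width ranges, the first range [0, 200] absorbs every value in [0, 1] plus part of the tail, so one partition holds well over 70% of the rows while the others stay small. Boundaries chosen by rank in a sample spread the same rows far more evenly.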


