[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280271#comment-16280271 ]
Darren Govoni edited comment on SPARK-17788 at 12/6/17 2:57 PM:
----------------------------------------------------------------

I'm also running into this error on Spark 2.1.0:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 (TID 7544, xxx.xxx.xxx.xxx.xx, executor 2): java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes

was (Author: sesshomurai):
I'm also running into this error on Spark 2.1.0:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 (TID 7544, bdr-itwp-hdfs-2.dev.uspto.gov, executor 2): java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes

> RangePartitioner results in few very large tasks and many small to empty tasks
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-17788
>                 URL: https://issues.apache.org/jira/browse/SPARK-17788
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0
>        Environment: Ubuntu 14.04 64bit
>                     Java 1.8.0_101
>            Reporter: Babak Alipour
>            Assignee: Wenchen Fan
>             Fix For: 2.3.0
>
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in Spark (~140 GB for the entire table; this single field is a Double, ~1.4B records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
> The same error occurs for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner trying to create equal ranges. [1]
> [1] https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of the data, which roughly equates to 1 billion records); other values in the dataset go as high as 2000. With the RangePartitioner trying to create equal ranges, some tasks become almost empty while others grow extremely large, due to the heavily skewed distribution.
> This is either a bug in Apache Spark or a major limitation of the framework. I hope one of the devs can help solve this issue.
> P.S. Email thread on the Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E
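For anyone trying to reproduce the failure shape quoted above, here is a minimal sketch of the same pattern. It is not the reporter's exact job: the table name my_table, the column name score, and the output path are hypothetical stand-ins. Note that the limit in the error message, 17179869176 bytes, is (2^31 - 1) * 8, which appears to be the largest page the task memory manager will allocate; a single skewed range partition whose sort buffer exceeds that triggers the IllegalArgumentException.

{code:scala}
import org.apache.spark.sql.SparkSession

object SkewedSortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skewed-sort-sketch")
      .getOrCreate()

    // A global ORDER BY on the Double column triggers a range-partitioned
    // shuffle. When most values fall in a narrow band (here, ~70% in [0,1]),
    // a few reducers receive the bulk of the rows; one oversized task can
    // then fail with "Cannot allocate a page with more than ... bytes".
    val sorted = spark.sql("SELECT score FROM my_table ORDER BY score DESC")
    sorted.write.parquet("/tmp/sorted_scores")  // hypothetical output path
  }
}
{code}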
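Two mitigations commonly suggested for this kind of skew, offered here only as assumptions, not as the fix that eventually landed in 2.3.0:

{code:scala}
import org.apache.spark.sql.functions.desc

// 1) Raise shuffle parallelism so each range partition stays small enough
//    for its sort buffers. This may help when the skew is a narrow band of
//    distinct values; it cannot split up a single heavily repeated value.
//    The value 2000 is purely illustrative.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// 2) If a total ordering is not strictly required, sort within partitions
//    instead; this avoids the range-partitioned shuffle entirely.
val locallySorted = spark.table("my_table").sortWithinPartitions(desc("score"))
{code}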