[
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280271#comment-16280271
]
Darren Govoni edited comment on SPARK-17788 at 12/6/17 2:57 PM:
----------------------------------------------------------------
I'm also running into this error on Spark 2.1.0:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in
stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0
(TID 7544, xxx.xxx.xxx.xxx.xx, executor 2): java.lang.IllegalArgumentException:
Cannot allocate a page with more than 17179869176 bytes
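For reference, 17179869176 = (2^31 - 1) * 8: an on-heap Spark memory page is
backed by a long[], which can hold at most Integer.MAX_VALUE 8-byte words, and
TaskMemoryManager rejects any page request above that ceiling. A quick check in
the Scala REPL:

scala> Int.MaxValue.toLong * 8L
res0: Long = 17179869176

So the failure means a single sort task tried to buffer over ~16 GiB of rows in
one page, which is what the partition skew described below produces.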
> RangePartitioner results in few very large tasks and many small to empty
> tasks
> -------------------------------------------------------------------------------
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
> Reporter: Babak Alipour
> Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in
> Spark (~140 GB for the entire table; the field is a Double, ~1.4B records)
> and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner
> trying to create equal ranges. [1]
> [1]
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of
> the data, which equates to roughly 1 billion records), while other values in
> the dataset run as high as 2000. Because the RangePartitioner tries to create
> equal ranges, the heavily skewed distribution leaves some tasks almost empty
> while others become extremely large.
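> One way to see the imbalance (a minimal sketch with made-up data, runnable in
> spark-shell): identical sort keys can never be split across range boundaries,
> so a value that dominates the dataset pins most records to a single partition.
>
> // 70% of the keys share one repeated value; the rest spread up to 2000.
> val data = sc.parallelize(
>   Seq.fill(70000)(0.5) ++ Seq.tabulate(30000)(i => i * 2000.0 / 30000), 8)
>
> // sortBy builds a RangePartitioner from a sample of the data.
> data.sortBy(x => x, numPartitions = 8)
>   .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
>   .collect()
>   .foreach { case (i, n) => println(s"partition $i: $n records") }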
> This is either a bug in Apache Spark or a major limitation of the framework.
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E