[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15699276#comment-15699276 ]
Wenchen Fan commented on SPARK-17788:
-------------------------------------
After looking at the code, it seems the only way to trigger this exception is
to set `spark.buffer.pageSize` to a value larger than `((1L << 31) - 1) * 8L`.
[[email protected]], did you set this conf?
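
As a quick check of the arithmetic, the limit expression above evaluates to exactly the 17179869176 bytes named in the exception message quoted in the issue description. A minimal sketch in plain Java, assuming nothing beyond the expression itself; the class name `PageSizeLimit` and the helper `exceedsLimit` are hypothetical and are not Spark APIs:

```java
public class PageSizeLimit {
    // Limit expression quoted in the comment: ((1L << 31) - 1) * 8L
    static final long MAX_PAGE_SIZE_BYTES = ((1L << 31) - 1) * 8L;

    // Hypothetical helper: true when a requested page size is above the limit.
    static boolean exceedsLimit(long requestedBytes) {
        return requestedBytes > MAX_PAGE_SIZE_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(MAX_PAGE_SIZE_BYTES);                   // prints 17179869176
        System.out.println(exceedsLimit(MAX_PAGE_SIZE_BYTES));     // prints false
        System.out.println(exceedsLimit(MAX_PAGE_SIZE_BYTES + 1)); // prints true
    }
}
```

The `1L` literals matter here: with plain `int` arithmetic, `1 << 31` would overflow to a negative value.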
> RangePartitioner results in few very large tasks and many small to empty
> tasks
> -------------------------------------------------------------------------------
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
> Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner
> trying to create equal ranges. [1]
> [1]
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of
> the data, which roughly equates to 1 billion records); other numbers in the
> dataset are as high as 2000. With the RangePartitioner trying to create equal
> ranges, some tasks are becoming almost empty while others are extremely
> large, due to the heavily skewed distribution.
> This is either a bug in Apache Spark or a major limitation of the framework.
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E