[ https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280271#comment-16280271 ]
Darren Govoni edited comment on SPARK-17788 at 12/6/17 2:57 PM:
----------------------------------------------------------------

I'm also running into this error on Spark 2.1.0:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 (TID 7544, xxx.xxx.xxx.xxx.xx, executor 2): java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes

was (Author: sesshomurai):
I'm also running into this error on Spark 2.1.0:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 11.0 failed 4 times, most recent failure: Lost task 42.3 in stage 11.0 (TID 7544, bdr-itwp-hdfs-2.dev.uspto.gov, executor 2): java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes

> RangePartitioner results in few very large tasks and many small to empty tasks
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-17788
>                 URL: https://issues.apache.org/jira/browse/SPARK-17788
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 2.0.0
>        Environment: Ubuntu 14.04 64bit
>                     Java 1.8.0_101
>            Reporter: Babak Alipour
>            Assignee: Wenchen Fan
>             Fix For: 2.3.0
>
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in Spark (~140 GB for the entire table; this single field is a Double, ~1.4B records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes
> The same error occurs for:
> sql("SELECT " + field + " FROM MY_TABLE").sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner trying to create equal ranges. [1]
> [1] https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of the data, which roughly equates to 1 billion records); other values in the dataset go as high as 2000. With the RangePartitioner trying to create equal ranges, some tasks become almost empty while others grow extremely large, due to the heavily skewed distribution.
> This is either a bug in Apache Spark or a major limitation of the framework. I hope one of the devs can help solve this issue.
> P.S. Email thread on the Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E
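For anyone trying to reproduce the failure shape quoted above, here is a minimal sketch of the same pattern. It is not the reporter's exact job: the table name my_table, the column name score, and the output path are hypothetical stand-ins. Note that the limit in the error message, 17179869176 bytes, is (2^31 - 1) * 8, which appears to be the largest page the task memory manager will allocate; a single skewed range partition whose sort buffer exceeds that triggers the IllegalArgumentException.

{code:scala}
import org.apache.spark.sql.SparkSession

object SkewedSortSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skewed-sort-sketch")
      .getOrCreate()

    // A global ORDER BY on the Double column triggers a range-partitioned
    // shuffle. When most values fall in a narrow band (here, ~70% in [0,1]),
    // a few reducers receive the bulk of the rows; one oversized task can
    // then fail with "Cannot allocate a page with more than ... bytes".
    val sorted = spark.sql("SELECT score FROM my_table ORDER BY score DESC")
    sorted.write.parquet("/tmp/sorted_scores")  // hypothetical output path
  }
}
{code}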
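Two mitigations commonly suggested for this kind of skew, offered here only as assumptions, not as the fix that eventually landed in 2.3.0:

{code:scala}
import org.apache.spark.sql.functions.desc

// 1) Raise shuffle parallelism so each range partition stays small enough
//    for its sort buffers. This may help when the skew is a narrow band of
//    distinct values; it cannot split up a single heavily repeated value.
//    The value 2000 is purely illustrative.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

// 2) If a total ordering is not strictly required, sort within partitions
//    instead; this avoids the range-partitioned shuffle entirely.
val locallySorted = spark.table("my_table").sortWithinPartitions(desc("score"))
{code}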