[ https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957540#comment-14957540 ]

Glenn Strycker commented on SPARK-6235:
---------------------------------------

Until this issue and its sub-issue tickets are solved, are there any known 
work-arounds?  Increase the number of partitions, or decrease it?  Split an RDD 
into parts, run the command on each, and then union the results?  Turn off Kryo?  
Use DataFrames?  Help!!

I am hitting the 2GB bug while simply trying to (re)partition by key an RDD of 
modest size (84GB) and low skew (AFAIK).  My memory settings per executor, per 
master node, per JVM, etc. are all cranked up as far as they'll go, and I'm 
currently attempting to partition this RDD across 6800 partitions.  Unless my 
skew is really bad, I don't see why ~12MB per partition would cause a shuffle to 
hit the 2GB limit, unless the overhead of so many partitions is actually hurting 
rather than helping.  I'm going to try adjusting my partition count and see what 
happens, but I wanted to know whether there is a standard work-around for this 
2GB issue.
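
To make the question concrete, here is roughly what I'm running; the input path, 
the key extraction, and the 6800 partition count are placeholders for my actual 
job, not a recommended configuration:

    // Sketch of the job that hits the limit: partition a keyed RDD across many
    // partitions so each shuffle block stays well under the 2GB ByteBuffer cap.
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object RepartitionByKey {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-by-key"))

        // Placeholder input: in the real job this is the ~84GB keyed RDD.
        val keyed = sc.textFile("hdfs:///path/to/input")        // hypothetical path
          .map(line => (line.split('\t')(0), line))              // hypothetical key extraction

        // 6800 is just the partition count currently under test (~12MB/partition).
        val repartitioned = keyed.partitionBy(new HashPartitioner(6800))

        repartitioned.saveAsTextFile("hdfs:///path/to/output")   // hypothetical path
        sc.stop()
      }
    }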

> Address various 2G limits
> -------------------------
>
>                 Key: SPARK-6235
>                 URL: https://issues.apache.org/jira/browse/SPARK-6235
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Shuffle, Spark Core
>            Reporter: Reynold Xin
>
> An umbrella ticket to track the various 2G limits we have in Spark, due to the 
> use of byte arrays and ByteBuffers.


