Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/377#issuecomment-43033970
  
    Hey, I spent a lot of time last week trying to recreate this issue to better understand what is causing it. I can do a write-up on the JIRA, but the long and short of it is that this error is caused primarily by the issue described in https://issues.apache.org/jira/browse/SPARK-1777. That is a bigger problem, and it's not fixed in the general case by the fix here.
    
    The code for Spark itself (i.e. the permgen) is entirely outside of the heap space, so the premise here that Spark's code is taking up heap is not correct. There are extra on-heap data structures, e.g. ZipFile entries for Spark's classes, but those consume about 50MB in my experiments for both the Hadoop1 and Hadoop2 builds (only about 2MB extra for Hadoop2). I tested this by starting a spark-shell, manually running a few GCs, and then profiling the heap.
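    As a minimal sketch of that kind of check, assuming a plain spark-shell session (the measurement above used a heap profiler; this only shows the rough live-heap-after-GC idea, not a per-class breakdown):

        // Force a few collections, then look at live heap usage.
        // Rough numbers only; a profiler is needed to attribute the usage.
        System.gc(); System.gc(); System.gc()
        val rt = Runtime.getRuntime
        val usedMb = (rt.totalMemory - rt.freeMemory) / (1024 * 1024)
        println(s"live heap after GC: ~$usedMb MB")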
    
    The issue in this case is that the individual partitions are pretty large (50-150MB) compared to the size of the heap, and we unroll an entire partition when Spark is already "on the fringe" of available memory. I think adding these extra limits just coincidentally works for this exact input, but it won't help other users who are running into this problem.
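    To make the partition-size point concrete, here is an illustrative spark-shell sketch (the path and partition count are hypothetical, and `sc` is assumed to be the shell's SparkContext): with the default splits each partition unrolls as one 50-150MB block, while asking for more partitions keeps each unrolled block small relative to the heap.

        // Hypothetical input path; actual sizes depend on the file and block size.
        val coarse = sc.textFile("hdfs:///data/input.txt")        // few, large partitions
        val fine   = sc.textFile("hdfs:///data/input.txt", 64)    // more, smaller partitions
        // Caching 'fine' unrolls much smaller blocks at a time than 'coarse' would.
        fine.cache().count()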
    
    I also noticed a second issue: with character arrays (produced by textFile), we slightly underestimate the size, which exacerbates this problem. The root cause is unknown at this point - I have a fairly extensive debugging log and I'd like to get to the bottom of it after 1.0.
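    For reference, a back-of-the-envelope sketch of what a char[] (the backing array of a line from textFile) should occupy on a typical 64-bit JVM with compressed oops; the layout constants are assumptions rather than measurements, but an estimate that falls much below this figure would match the underestimate described above.

        // Assumed layout: 16-byte array header + 2 bytes per UTF-16 char,
        // rounded up to an 8-byte boundary.
        def approxCharArrayBytes(length: Int): Long = {
          val header = 16L
          val data   = 2L * length
          ((header + data + 7) / 8) * 8
        }
        println(approxCharArrayBytes(100))   // ~216 bytes for a 100-char line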

