Github user patmcdonough commented on the pull request:

    https://github.com/apache/spark/pull/377#issuecomment-43034662
  
    Thanks for the update. That all lines up very well with what I was seeing
    in tests, including the underestimate of the size as well as the point
    where the OOM occurs (while estimating the size).
    
    Looking forward to the long-term fix here. It sounds like this PR can be
    closed.
    
    
    On Tue, May 13, 2014 at 6:46 PM, Patrick Wendell
    <[email protected]> wrote:
    
    > Hey, I spent a lot of time trying to recreate this issue last week to
    > better understand what is causing it. I can do a write-up on the JIRA, but
    > the long and short of it is that this error is caused primarily by the
    > issue described in https://issues.apache.org/jira/browse/SPARK-1777. This
    > is a bigger problem, and it's not fixed in the general case by the fix
    > here.
    >
    > The code for Spark itself (i.e. the permgen) is entirely outside of the
    > heap space, so the premise here that the code for Spark is taking up heap
    > is not correct. There are extra on-heap data structures, e.g. ZipFile
    > entries for Spark's classes, but those consume about 50MB in my
    > experiments for both Hadoop1 and Hadoop2 builds (only about 2MB extra for
    > Hadoop2). I tested this by starting a spark-shell, manually running a few
    > GCs, and then profiling the heap (sketched below).
    >
    > The issue in this case is that the individual partitions are pretty large
    > (50-150MB) compared to the size of the heap, and we unroll an entire
    > partition when Spark is already "on the fringe" of available memory
    > (unrolling is sketched below). I think adding these extra limits just
    > coincidentally works for this exact input, but won't help other users who
    > are running into this problem.
    >
    > I also noticed a second issue: with character arrays (produced by
    > textFile), we slightly underestimate the size (a rough check is sketched
    > below), which exacerbates this problem. The root cause is unknown at this
    > point - I have a fairly extensive debugging log, and I'd like to get to
    > the bottom of it after 1.0.
    >
    > —
    > Reply to this email directly or view it on
    > GitHub<https://github.com/apache/spark/pull/377#issuecomment-43033970>.
    >
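
    A minimal sketch of the heap-measurement step described above, runnable
    in a spark-shell; this rough readout is only illustrative, and the ~50MB
    figure quoted above came from a profiler rather than from this:

        // Force several GCs, then read approximate on-heap usage before
        // attaching a profiler. Numbers vary by build and JVM.
        val rt = Runtime.getRuntime
        (1 to 5).foreach(_ => System.gc())
        val usedMB = (rt.totalMemory - rt.freeMemory) / (1024 * 1024)
        println(s"Approx. heap in use after GC: ${usedMB}MB")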
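
    A simplified sketch of the unrolling behavior referenced above; this is
    not Spark's actual caching code, just an illustration of why a 50-150MB
    partition needs that much free heap at once when it is materialized:

        import scala.collection.mutable.ArrayBuffer

        // Materialize an entire partition before storing it. When the heap
        // is already "on the fringe", the append below is where the OOM hits.
        def unrollPartition[T](partition: Iterator[T]): ArrayBuffer[T] = {
          val buf = new ArrayBuffer[T]
          partition.foreach(buf += _)
          buf
        }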
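
    A hypothetical way to check the character-array estimate mentioned above.
    SizeEstimator is internal to Spark at this point, so this assumes code
    compiled with access to org.apache.spark.util.SizeEstimator, and the
    40-byte overhead constant is an assumption, not a measured value:

        import org.apache.spark.util.SizeEstimator

        // Compare Spark's estimate of a char-heavy record against a
        // back-of-the-envelope size: ~2 bytes per char plus object overhead.
        val line = "x" * 1000
        val estimated = SizeEstimator.estimate(line)
        val roughActual = 2L * line.length + 40  // assumed object overhead
        println(s"estimated=$estimated, rough=$roughActual")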

