Thomas Graves created SPARK-20340:
-------------------------------------

             Summary: Size estimate very wrong in ExternalAppendOnlyMap from 
CoGroupedRDD, causes OOM
                 Key: SPARK-20340
                 URL: https://issues.apache.org/jira/browse/SPARK-20340
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.0
            Reporter: Thomas Graves


I had a user doing a basic join operation. The values are images' binary 
data (base64 encoded) and vary widely in size.
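
For context, a minimal sketch of the shape of the job (paths, separators, and identifiers are hypothetical, not the user's actual code); a plain pair-RDD join like this goes through CoGroupedRDD during the shuffle:

{code:scala}
import org.apache.spark.sql.SparkSession

object JoinRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("image-join").getOrCreate()
    val sc = spark.sparkContext

    // Key -> base64 image payload; individual values range from KBs to MBs.
    val images = sc.textFile("hdfs:///hypothetical/images")
      .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
    val metadata = sc.textFile("hdfs:///hypothetical/metadata")
      .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }

    // join is implemented on top of cogroup, i.e. CoGroupedRDD.
    images.join(metadata).saveAsTextFile("hdfs:///hypothetical/output")
    spark.stop()
  }
}
{code}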

The job failed with an out-of-memory error. It originally failed on YARN for 
using too much overhead memory; after setting 
spark.shuffle.io.preferDirectBufs to false, it failed with an 
out-of-heap-memory error instead.  I debugged it down to the shuffle: when 
CoGroupedRDD puts records into the ExternalAppendOnlyMap, the map computes an 
estimated size to decide when to spill.  SizeEstimator handles arrays such 
that if an array has more than 400 elements, it samples 100 of them.  In this 
case the estimate comes back GBs off from the actual size: it claims 1GB when 
the map is actually using close to 5GB. 
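
To make the failure mode concrete, here is a minimal sketch (assuming Spark on the classpath; it uses the developer API org.apache.spark.util.SizeEstimator, and the element counts and sizes are made up) of how sampling a large array of skewed elements can miss the outliers and extrapolate a total that is far from reality:

{code:scala}
import org.apache.spark.util.SizeEstimator

// Hypothetical skew: 1 element in 500 is ~1 MiB (a large base64 image),
// the rest are ~100 bytes. A 100-element sample of this array will usually
// contain zero or one of the big elements, so the extrapolated estimate
// can land far from the true footprint.
object SkewedEstimate {
  def main(args: Array[String]): Unit = {
    val arr: Array[Array[Byte]] = Array.tabulate(10000) { i =>
      if (i % 500 == 0) new Array[Byte](1 << 20) else new Array[Byte](100)
    }
    val actualPayload = arr.map(_.length.toLong).sum
    val estimate = SizeEstimator.estimate(arr)
    println(s"actual payload bytes:   $actualPayload")
    println(s"SizeEstimator estimate: $estimate")
  }
}
{code}

On data like the user's, where individual values range from KBs to MBs, this same estimate decides when ExternalAppendOnlyMap spills, so an underestimate turns directly into heap pressure instead of a spill.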

A temporary workaround is to increase the memory to something very large (10GB 
executors), but that isn't really acceptable here.  The user did the same 
thing in Pig, which easily handled the data with 1.5GB of memory.
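
For reference, the mitigation described above expressed as configuration (a sketch only; the values come from this report and simply paper over the bad estimate rather than fixing it):

{code:scala}
import org.apache.spark.SparkConf

object WorkaroundConf {
  // Values taken from the attempts described above; illustrative only.
  // (spark.executor.memory is normally passed at launch, e.g. via
  // spark-submit --executor-memory 10g.)
  val conf: SparkConf = new SparkConf()
    .set("spark.shuffle.io.preferDirectBufs", "false") // moved the OOM from direct buffers onto the heap
    .set("spark.executor.memory", "10g")               // oversized heap so the mis-estimated map still fits
}
{code}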

It seems risky to rely on an estimate for something this critical: if the 
estimate is wrong, you run out of memory and the job fails.

I'm still looking more closely at the user's data to get more insight.


