[jira] [Created] (SPARK-8453) Unioning two RDDs in PySpark doesn't spill to disk

Jason White (JIRA) Thu, 18 Jun 2015 13:32:26 -0700

Jason White created SPARK-8453:
----------------------------------

             Summary: Unioning two RDDs in PySpark doesn't spill to disk
                 Key: SPARK-8453
                 URL: https://issues.apache.org/jira/browse/SPARK-8453
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.4.0
            Reporter: Jason White



When two RDDs are unioned in PySpark, the spill limits do not appear to be 
honored. As a result, our YARN containers are frequently killed for exceeding 
their memory limits.

I have been able to reproduce this in the following simple scenario:
- spark.executor.instances: 1
- spark.executor.memory: 512m
- spark.executor.cores: 20
- spark.python.worker.reuse: false
- spark.shuffle.spill: true
- spark.yarn.executor.memoryOverhead: 5000
(I recognize this is not a good setup - I set things up this way to explore 
this problem and make the symptom easier to isolate)
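
For anyone trying to reproduce this, the settings above can be written down as a small Python sketch that renders them as `--conf key=value` arguments, the form accepted by `spark-submit` and the `pyspark` launcher. The names `REPRO_CONF` and `conf_flags` are made up for illustration, and the snippet only builds strings; it does not contact a cluster.

```python
# Settings copied verbatim from the reproduction scenario above.
REPRO_CONF = {
    "spark.executor.instances": "1",
    "spark.executor.memory": "512m",
    "spark.executor.cores": "20",
    "spark.python.worker.reuse": "false",
    "spark.shuffle.spill": "true",
    "spark.yarn.executor.memoryOverhead": "5000",
}


def conf_flags(conf):
    """Render a settings dict as a flat list of --conf key=value arguments."""
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


if __name__ == "__main__":
    # Print the flags so they can be appended to a launcher command line.
    print(" ".join(conf_flags(REPRO_CONF)))
```

Keeping the settings in one dict makes it easy to change a single value (for example the overhead) between runs while holding everything else fixed.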

I have a 1-billion-row dataset, split up evenly into 1000 partitions. Each 
partition contains exactly 1 million rows. Each row contains approximately 250 
characters, +/- 10.
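
To put those numbers in perspective, here is a rough size estimate. It assumes one byte per character and ignores newline terminators, neither of which is stated above, so treat the results as approximate.

```python
# Figures taken from the dataset description above.
ROWS = 1_000_000_000
PARTITIONS = 1000
CHARS_PER_ROW = 250  # approximate; the report says +/- 10

# Assumption: 1 byte per character, newline terminators ignored.
rows_per_partition = ROWS // PARTITIONS
bytes_per_partition = rows_per_partition * CHARS_PER_ROW
total_bytes = ROWS * CHARS_PER_ROW

if __name__ == "__main__":
    print(f"rows per partition: {rows_per_partition}")
    print(f"approx. MB per partition: {bytes_per_partition / 10**6:.0f}")
    print(f"approx. GB in total: {total_bytes / 10**9:.0f}")
```

Under those assumptions each partition is roughly 250 MB of text and the whole dataset is roughly 250 GB, far more than the executor could hold in memory at once.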

I executed the following in a PySpark shell:
```
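# `sc` is the SparkContext provided by the PySpark shell.
# Load the same dataset twice as two independent RDDs and count each one.
profiler = sc.textFile('/user/jasonwhite/profiler')
profiler_2 = sc.textFile('/user/jasonwhite/profiler')
profiler.count()
profiler_2.count()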
```
Total container memory utilization stayed between 2500 and 2800 MB for the 
whole run, with no spill. No problem.

Then I executed:
```
z = profiler.union(profiler_2)
z.count()
```
Memory utilization spiked immediately to between 4700 and 4900 MB and stayed 
there for the rest of the run, again with no spill. Big problem. We size our 
containers based in part on the Python spill limit, so when that limit is not 
honored, our containers are unexpectedly killed.
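
A back-of-the-envelope check of how close these runs come to the container limit is sketched below. It assumes the YARN container request is simply executor memory plus the configured overhead, before any rounding YARN applies to its allocation increment; the helper name `headroom_mb` is made up for illustration.

```python
# Values from the reproduction settings in this report.
EXECUTOR_MEMORY_MB = 512
MEMORY_OVERHEAD_MB = 5000

# Assumption: container limit = executor memory + overhead (no YARN rounding).
CONTAINER_LIMIT_MB = EXECUTOR_MEMORY_MB + MEMORY_OVERHEAD_MB


def headroom_mb(observed_peak_mb):
    """Return how many MB remain before the container limit is reached."""
    return CONTAINER_LIMIT_MB - observed_peak_mb


if __name__ == "__main__":
    # Upper ends of the two observed ranges: separate counts, then the union.
    for label, peak in [("separate counts", 2800), ("union count", 4900)]:
        print(f"{label}: peak {peak} MB, headroom {headroom_mb(peak)} MB")
```

With the deliberately oversized overhead used here, the union run still fits, but the headroom shrinks from about 2700 MB to about 600 MB. A container sized only around the expected spill limit would not have that cushion.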



