[
https://issues.apache.org/jira/browse/SPARK-8453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-8453:
--------------------------------
Labels: bulk-closed memory union (was: memory union)
> Unioning two RDDs in PySpark doesn't spill to disk
> --------------------------------------------------
>
> Key: SPARK-8453
> URL: https://issues.apache.org/jira/browse/SPARK-8453
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.4.0
> Reporter: Jason White
> Priority: Major
> Labels: bulk-closed, memory, union
>
> When unioning 2 RDDs together in PySpark, spill limits do not seem to be
> recognized. Our YARN containers are frequently killed for exceeding memory
> limits for this reason.
> I have been able to reproduce this in the following simple scenario:
> - spark.executor.instances: 1, spark.executor.memory: 512m,
> spark.executor.cores: 20, spark.python.worker.reuse: false,
> spark.shuffle.spill: true, spark.yarn.executor.memoryOverhead: 5000
> (I recognize this is not a good setup - I set things up this way to explore
> this problem and make the symptom easier to isolate)
> I have a 1-billion-row dataset, split up evenly into 1000 partitions. Each
> partition contains exactly 1 million rows. Each row contains approximately
> 250 characters, +/- 10.
> I executed the following in a PySpark shell:
> ```
> profiler = sc.textFile('/user/jasonwhite/profiler')
> profiler_2 = sc.textFile('/user/jasonwhite/profiler')
> profiler.count()
> profiler_2.count()
> ```
> Total container memory utilization was between 2500 & 2800 MB over the total
> execution, with no spill. No problem.
> Then I executed:
> ```
> z = profiler.union(profiler_2)
> z.count()
> ```
> Memory utilization spiked immediately to between 4700 & 4900 MB over the
> course of execution, also with no spill. Big problem. Since we are setting
> our container memory sizes based in part on the Python spill limit, when
> these spill limits are not properly recognized, our containers are
> unexpectedly killed.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]