[ https://issues.apache.org/jira/browse/SPARK-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Josh Rosen updated SPARK-4779:
------------------------------
Description:
When Spark is tight on memory, it starts reporting that files don't exist during shuffle,
causing tasks to fail and be recomputed, which destroys performance.
The same code works flawlessly with smaller datasets, presumably because there is
less memory pressure.
{code}
14/12/06 18:39:37 WARN scheduler.TaskSetManager: Lost task 292.0 in stage 3.0
(TID 1099, ip-10-13-192-209.ec2.internal):
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/root/spark/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/root/spark/python/pyspark/shuffle.py", line 370, in _external_items
self.mergeCombiners(self.serializer.load_stream(open(p)),
IOError: [Errno 2] No such file or directory:
'/mnt/spark/spark-local-20141206182702-8748/python/16070/66618000/1/18'
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:91)
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:87)
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:335)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
{code}
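For context on where that IOError comes from: the bottom Python frame is pyspark/shuffle.py re-opening spill files that the external merger wrote earlier in the same task. The snippet below is not Spark's implementation, only a minimal, hypothetical sketch of that spill-then-merge-back pattern (the class name, directory layout, and helpers are made up for illustration). It shows why a spill directory that disappears under memory/disk pressure surfaces as "No such file or directory" at merge time.
{code}
import os
import pickle
import tempfile


class SimpleExternalMerger:
    """Spill partial aggregates to disk, then merge them back in (illustrative only)."""

    def __init__(self, spill_root):
        self.spill_root = spill_root
        self.data = {}          # in-memory partial aggregates (key -> count)
        self.spill_count = 0

    def _spill_dir(self, n):
        # One sub-directory per spill, loosely mirroring the layout in the
        # error message (spark-local-*/python/<pid>/<spill>/<partition>).
        d = os.path.join(self.spill_root, str(n))
        os.makedirs(d, exist_ok=True)
        return d

    def spill(self):
        # Write the current in-memory aggregates out and release the memory.
        path = os.path.join(self._spill_dir(self.spill_count), "0")
        with open(path, "wb") as f:
            pickle.dump(self.data, f)
        self.data = {}
        self.spill_count += 1

    def merge_spills(self):
        # Re-open every spill file and fold it back into the result. If a
        # spill directory was removed in the meantime (e.g. local dirs cleaned
        # up while the task is still running), this open() fails with
        # "No such file or directory" -- the same symptom as the traceback above.
        merged = dict(self.data)
        for n in range(self.spill_count):
            path = os.path.join(self.spill_root, str(n), "0")
            with open(path, "rb") as f:
                for k, v in pickle.load(f).items():
                    merged[k] = merged.get(k, 0) + v
        return merged


if __name__ == "__main__":
    root = tempfile.mkdtemp(prefix="spark-local-demo-")
    m = SimpleExternalMerger(root)
    m.data = {"a": 1, "b": 2}
    m.spill()
    m.data = {"a": 3}
    print(m.merge_spills())   # {'a': 4, 'b': 2}
{code}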
was:
When Spark is tight on memory, it starts reporting that files don't exist during shuffle,
causing tasks to fail and be recomputed, which destroys performance.
The same code works flawlessly with smaller datasets, presumably because there is
less memory pressure.
14/12/06 18:39:37 WARN scheduler.TaskSetManager: Lost task 292.0 in stage 3.0
(TID 1099, ip-10-13-192-209.ec2.internal):
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/root/spark/python/pyspark/worker.py", line 79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
for obj in iterator:
File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
for item in iterator:
File "/root/spark/python/pyspark/shuffle.py", line 370, in _external_items
self.mergeCombiners(self.serializer.load_stream(open(p)),
IOError: [Errno 2] No such file or directory:
'/mnt/spark/spark-local-20141206182702-8748/python/16070/66618000/1/18'
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:91)
org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:87)
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:335)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
[~davies], do you think this could be a bug in PySpark's spilling code?
> PySpark Shuffle Fails Looking for Files that Don't Exist when low on Memory
> ---------------------------------------------------------------------------
>
> Key: SPARK-4779
> URL: https://issues.apache.org/jira/browse/SPARK-4779
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Shuffle
> Affects Versions: 1.1.0
> Environment: ec2 launched cluster with scripts
> 6 Nodes
> c3.2xlarge
> Reporter: Brad Willard
>
> When Spark is tight on memory, it starts reporting that files don't exist
> during shuffle, causing tasks to fail and be recomputed, which destroys performance.
> The same code works flawlessly with smaller datasets, presumably because there
> is less memory pressure.
> {code}
> 14/12/06 18:39:37 WARN scheduler.TaskSetManager: Lost task 292.0 in stage 3.0
> (TID 1099, ip-10-13-192-209.ec2.internal):
> org.apache.spark.api.python.PythonException: Traceback (most recent call last):
> File "/root/spark/python/pyspark/worker.py", line 79, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
> File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
> File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
> for obj in iterator:
> File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
> for item in iterator:
> File "/root/spark/python/pyspark/shuffle.py", line 370, in _external_items
> self.mergeCombiners(self.serializer.load_stream(open(p)),
> IOError: [Errno 2] No such file or directory:
> '/mnt/spark/spark-local-20141206182702-8748/python/16070/66618000/1/18'
> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:91)
> org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:87)
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
> scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
> scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:335)
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]