[ https://issues.apache.org/jira/browse/SPARK-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-4779:
------------------------------
    Description: 
When Spark is tight on memory, it starts reporting that shuffle files don't exist, causing tasks to fail and be recomputed, which destroys performance.

The same code works flawlessly with smaller datasets, presumably because there is less memory pressure.

{code}
14/12/06 18:39:37 WARN scheduler.TaskSetManager: Lost task 292.0 in stage 3.0 (TID 1099, ip-10-13-192-209.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/root/spark/python/pyspark/shuffle.py", line 370, in _external_items
    self.mergeCombiners(self.serializer.load_stream(open(p)),
IOError: [Errno 2] No such file or directory: '/mnt/spark/spark-local-20141206182702-8748/python/16070/66618000/1/18'

        org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:91)
        org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:87)
        org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
        org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:335)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
{code}
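For context on where the failure surfaces: the traceback dies in pyspark/shuffle.py while reopening a spill file during the merge phase. The sketch below is a toy, self-contained version of the general spill-and-merge pattern (hash-partition in-memory aggregates into per-partition files, then reopen every spill file to merge combiners). It is NOT PySpark's actual implementation; the class name, the sum-combiner, and the directory layout are all illustrative. It only shows that the merge step depends on every earlier spill file still existing on disk, which is exactly the assumption that breaks in this report.

```python
import os
import pickle
import shutil
import tempfile

class ExternalMergerSketch:
    """Toy spill-and-merge aggregator (illustrative only, not PySpark's code)."""

    def __init__(self, partitions=4):
        self.partitions = partitions
        self.data = {}                    # in-memory partial aggregates
        self.tmpdir = tempfile.mkdtemp()  # stands in for the spark-local-* scratch dir
        self.spills = 0

    def merge_value(self, key, value):
        # Combine in memory; a real merger would track memory usage here
        # and call spill() once it crosses a limit.
        self.data[key] = self.data.get(key, 0) + value

    def spill(self):
        # Hash-partition the in-memory map into one file per partition.
        path = os.path.join(self.tmpdir, str(self.spills))
        os.makedirs(path)
        buckets = [{} for _ in range(self.partitions)]
        for k, v in self.data.items():
            buckets[hash(k) % self.partitions][k] = v
        for i, bucket in enumerate(buckets):
            with open(os.path.join(path, str(i)), "wb") as f:
                pickle.dump(bucket, f)
        self.data = {}
        self.spills += 1

    def items(self):
        # Merge each partition across ALL spill files. If any spill file
        # has vanished (as in this bug), the open() below raises
        # IOError/OSError: [Errno 2] No such file or directory.
        for i in range(self.partitions):
            merged = {}
            for j in range(self.spills):
                p = os.path.join(self.tmpdir, str(j), str(i))
                with open(p, "rb") as f:  # <- the failing call in the traceback
                    for k, v in pickle.load(f).items():
                        merged[k] = merged.get(k, 0) + v
            for kv in merged.items():
                yield kv
        shutil.rmtree(self.tmpdir)

m = ExternalMergerSketch()
for k, v in [("a", 1), ("b", 2), ("a", 3)]:
    m.merge_value(k, v)
m.spill()
m.merge_value("b", 4)
m.spill()
print(sorted(m.items()))  # [('a', 4), ('b', 6)]
```

The point of the sketch: once a spill has happened, correctness of the final merge depends on every spill file surviving until the merge completes, so anything that deletes or fails to create those scratch files under memory/disk pressure shows up as exactly this IOError.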

  was:
When Spark is tight on memory, it starts reporting that shuffle files don't exist, causing tasks to fail and be recomputed, which destroys performance.

The same code works flawlessly with smaller datasets, presumably because there is less memory pressure.

14/12/06 18:39:37 WARN scheduler.TaskSetManager: Lost task 292.0 in stage 3.0 (TID 1099, ip-10-13-192-209.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/root/spark/python/pyspark/shuffle.py", line 370, in _external_items
    self.mergeCombiners(self.serializer.load_stream(open(p)),
IOError: [Errno 2] No such file or directory: '/mnt/spark/spark-local-20141206182702-8748/python/16070/66618000/1/18'

        org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:91)
        org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:87)
        org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
        org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
        scala.collection.Iterator$class.foreach(Iterator.scala:727)
        scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:335)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)


[~davies], do you think this could be a bug in PySpark's spilling code?

> PySpark Shuffle Fails Looking for Files that Don't Exist when low on Memory
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-4779
>                 URL: https://issues.apache.org/jira/browse/SPARK-4779
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Shuffle
>    Affects Versions: 1.1.0
>         Environment: ec2 launched cluster with scripts
> 6 Nodes
> c3.2xlarge
>            Reporter: Brad Willard
>
> When Spark is tight on memory, it starts reporting that shuffle files don't
> exist, causing tasks to fail and be recomputed, which destroys performance.
> The same code works flawlessly with smaller datasets, presumably because
> there is less memory pressure.
> {code}
> 14/12/06 18:39:37 WARN scheduler.TaskSetManager: Lost task 292.0 in stage 3.0 (TID 1099, ip-10-13-192-209.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/root/spark/python/pyspark/worker.py", line 79, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/root/spark/python/pyspark/serializers.py", line 196, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File "/root/spark/python/pyspark/serializers.py", line 127, in dump_stream
>     for obj in iterator:
>   File "/root/spark/python/pyspark/serializers.py", line 185, in _batched
>     for item in iterator:
>   File "/root/spark/python/pyspark/shuffle.py", line 370, in _external_items
>     self.mergeCombiners(self.serializer.load_stream(open(p)),
> IOError: [Errno 2] No such file or directory: '/mnt/spark/spark-local-20141206182702-8748/python/16070/66618000/1/18'
>         org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
>         org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:91)
>         org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:87)
>         org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>         scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
>         org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
>         scala.collection.Iterator$$anon$12.next(Iterator.scala:357)
>         scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:335)
>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
>         org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
>         org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
>         org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
