Miles Crawford created SPARK-14209:
--------------------------------------

             Summary: Application failure during preemption.
                 Key: SPARK-14209
                 URL: https://issues.apache.org/jira/browse/SPARK-14209
             Project: Spark
          Issue Type: Bug
          Components: Block Manager
    Affects Versions: 1.6.1
         Environment: Spark on YARN
            Reporter: Miles Crawford


We have a fair-sharing cluster set up.  When a new job arrives, existing jobs 
are successfully preempted down to fit.

A spate of these messages arrives:
        ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
Reason: Container container_1458935819920_0019_01_000143 on host: 
ip-10-12-46-235.us-west-2.compute.internal was preempted.

This seems fine - the problem is that soon thereafter, our whole application 
fails because it is unable to fetch blocks from the preempted containers:

org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
locations. Most recent failure cause:
    Caused by: java.io.IOException: Failed to connect to 
ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
        Caused by: java.net.ConnectException: Connection refused: 
ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681

Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf

Spark does not attempt to recreate these blocks - the tasks simply fail 
repeatedly until the maxTaskAttempts limit is reached.

It appears to me that there is some fault in the way preempted containers are 
being handled - shouldn't these blocks be recreated on demand?
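To make the failure mode concrete, here is a toy sketch of the retry loop described above. This is plain Python, not Spark internals; the names (`MAX_TASK_ATTEMPTS`, `run_task`, `fetch_block`, the executor/block ids) are invented for illustration, and `MAX_TASK_ATTEMPTS = 4` mirrors the default of Spark's `spark.task.maxFailures` setting:

```python
# Toy model (NOT Spark code) of the reported behavior: a task retries a
# block fetch from a fixed, now-preempted location until the attempt
# budget is exhausted, instead of recreating the block from lineage.

MAX_TASK_ATTEMPTS = 4  # analogous to spark.task.maxFailures (default 4)

def fetch_block(block_id, live_executors, location):
    """Simulate a remote block fetch; the executor at `location` is gone."""
    if location not in live_executors:
        raise ConnectionRefusedError("Failed to connect to " + location)
    return "data:" + block_id

def run_task(block_id, live_executors, location, recompute=None):
    """Retry the fetch; optionally fall back to recomputing the block."""
    for attempt in range(1, MAX_TASK_ATTEMPTS + 1):
        try:
            return fetch_block(block_id, live_executors, location)
        except ConnectionRefusedError:
            if recompute is not None:
                # Desired behavior: recreate the block on demand.
                return recompute(block_id)
    raise RuntimeError(
        "Task failed %d times fetching %s" % (MAX_TASK_ATTEMPTS, block_id))

live = {"executor-1"}  # executor 48 was preempted and never comes back

# Reported behavior: no recomputation, so the application eventually fails.
try:
    run_task("rdd_7_3", live, "executor-48")
except RuntimeError as e:
    print(e)

# Desired behavior: fall back to recomputing from lineage instead of failing.
print(run_task("rdd_7_3", live, "executor-48",
               recompute=lambda b: "recomputed:" + b))
```

The point of the sketch is the missing fallback branch: the observed behavior corresponds to calling `run_task` with no `recompute` argument.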



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
