Josh Rosen commented on SPARK-14209:

I believe that this issue should have been fixed by SPARK-17485 in Spark 2.x.

> Application failure during preemption.
> --------------------------------------
>                 Key: SPARK-14209
>                 URL: https://issues.apache.org/jira/browse/SPARK-14209
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 1.6.1
>         Environment: Spark on YARN
>            Reporter: Miles Crawford
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>       ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the pre-empted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
>     Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/
>         Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to