[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15514054#comment-15514054 ]

Josh Rosen commented on SPARK-14209:
------------------------------------

I have backported SPARK-17485 to Spark 1.6.x (for inclusion in Spark 1.6.3), so 
I believe this issue is now fixed and I'm going to resolve it.

I believe that this ticket may actually be discussing multiple issues that are 
related to fetch failures following executor loss but which have different 
underlying causes and fixes. Marcelo has pointed out several patches which 
affect fetching of shuffle blocks, whereas I think the original issue reported 
in this JIRA relates to BlockFetchException, an error which occurs due to 
failed fetches of NON-shuffle blocks (such as broadcasts or cached RDD blocks).
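For reference, the two failure modes show up under different exception classes 
in the driver log: shuffle fetch failures surface as 
org.apache.spark.shuffle.FetchFailedException, while non-shuffle fetches fail 
with org.apache.spark.storage.BlockFetchException. As a rough mitigation sketch 
for the non-shuffle case (it works around, rather than fixes, the underlying 
bug, and the RDD name is hypothetical), replicating cached blocks means that 
losing a single preempted executor still leaves a second copy to fetch:

    import org.apache.spark.storage.StorageLevel

    // Keep two replicas of each cached partition so that losing one
    // preempted executor does not make the block unfetchable.
    val cached = inputRdd.persist(StorageLevel.MEMORY_AND_DISK_2)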

If you're a user and are still experiencing Spark application failures due to 
executor preemption, please file a new JIRA ticket and make sure to include 
the Spark version and the portion of the driver log that contains the job 
failure message, since that will show which exception / stack trace ultimately 
triggered the failure and allow us to distinguish shuffle block fetch failures 
from other types of fetch failures.
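For context, the quoted report below describes the external shuffle service on 
a YARN fair-scheduler queue with preemption enabled; that service is typically 
enabled together with dynamic allocation, though the report doesn't say so 
explicitly, so treat that part as an assumption. A minimal sketch of the Spark 
side of such a setup, with illustrative values:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of a setup like the reporter's (values illustrative; the
    // fair-scheduler queues and preemption policy live in YARN's own
    // configuration, not here).
    val conf = new SparkConf()
      .setAppName("preemption-repro")                  // hypothetical name
      .set("spark.shuffle.service.enabled", "true")    // external shuffle service
      .set("spark.dynamicAllocation.enabled", "true")  // assumed, not stated in the report
      .set("spark.task.maxFailures", "4")              // presumably the "maxTaskAttempts" limit the report mentions
    val sc = new SparkContext(conf)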

> Application failure during preemption.
> --------------------------------------
>
>                 Key: SPARK-14209
>                 URL: https://issues.apache.org/jira/browse/SPARK-14209
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 1.6.1
>         Environment: Spark on YARN
>            Reporter: Miles Crawford
>
> We have a fair-sharing cluster set up, including the external shuffle 
> service.  When a new job arrives, existing jobs are successfully preempted 
> down to fit.
> A spate of these messages arrives:
>       ExecutorLostFailure (executor 48 exited unrelated to the running tasks) 
> Reason: Container container_1458935819920_0019_01_000143 on host: 
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application 
> fails because it is unable to fetch blocks from the preempted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1 
> locations. Most recent failure cause:
>     Caused by: java.io.IOException: Failed to connect to 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
>         Caused by: java.net.ConnectException: Connection refused: 
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over 
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are 
> being handled - shouldn't these blocks be recreated on demand?


