[ https://issues.apache.org/jira/browse/SPARK-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218283#comment-15218283 ]
Marcelo Vanzin commented on SPARK-14209:
----------------------------------------
That's weird. I don't know whether the funny logs are a symptom of the problem
or just something unrelated. For example, aside from the missing logs, there's
stuff like this:
{noformat}
2016-03-25 23:28:41,722 ERROR o.a.s.s.cluster.YarnClusterScheduler: Lost an
executor 51 (already removed): Pending loss reason.
{noformat}
That looks fine, except that the log line is generated by TaskSchedulerImpl,
not YarnClusterScheduler.
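That mismatch is presumably just the logger name being taken from the runtime
class of the scheduler object, so superclass code logs under the subclass's
name. A minimal, hedged sketch of that effect (hypothetical class names and
plain SLF4J, not Spark's own logging setup):
{noformat}
// Hedged sketch, not Spark's actual classes: when a logger's name is derived
// from this.getClass, messages emitted by base-class code get attributed to
// the runtime subclass, e.g. YarnClusterScheduler instead of TaskSchedulerImpl.
import org.slf4j.LoggerFactory

class TaskSchedulerImplLike {
  // Logger name resolves to the runtime class of `this`, not the class that
  // declares this field.
  private val log = LoggerFactory.getLogger(this.getClass)

  def reportLostExecutor(id: String): Unit =
    log.error(s"Lost an executor $id (already removed): Pending loss reason.")
}

class YarnClusterSchedulerLike extends TaskSchedulerImplLike

object LoggerNameDemo extends App {
  // Logged under "YarnClusterSchedulerLike" even though the call site lives
  // in TaskSchedulerImplLike.
  new YarnClusterSchedulerLike().reportLostExecutor("51")
}
{noformat}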
Tracing the code, it doesn't seem like there's a problem anywhere: when the
executor dies, the BlockManagerMaster should remove it from its internal state,
and although some tasks might fail in the window before that bookkeeping is
updated, what your logs show shouldn't happen unless something really
catastrophic is going on, like the BlockManagerMaster being deadlocked or
sitting on a ridiculously long message queue.
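To make that window concrete, here's a hedged sketch (made-up names, not the
real BlockManagerMaster API): block locations for a lost executor are only
dropped once the removal is processed, so a task scheduled before that still
targets the dead executor. A couple of failed attempts in that window would be
expected; failures for the rest of the job would not.
{noformat}
// Hedged sketch with made-up names; models the race window, not Spark internals.
import scala.collection.mutable

final case class BlockId(name: String)

class BlockLocationRegistry {
  private val locations = mutable.Map.empty[BlockId, Set[String]]

  def register(block: BlockId, executorId: String): Unit =
    locations(block) = locations.getOrElse(block, Set.empty) + executorId

  // Runs when the "executor removed" event is finally processed.
  def removeExecutor(executorId: String): Unit =
    for (block <- locations.keys.toList)
      locations(block) = locations(block) - executorId

  def locationsFor(block: BlockId): Set[String] =
    locations.getOrElse(block, Set.empty)
}

object RaceWindowDemo extends App {
  val registry = new BlockLocationRegistry
  registry.register(BlockId("rdd_7_3"), "executor-51")

  // Executor 51 has just been preempted, but removeExecutor hasn't run yet:
  // a task launched in this window still sees the dead executor as a location
  // and its fetch attempt fails with "Connection refused".
  println(registry.locationsFor(BlockId("rdd_7_3")))  // Set(executor-51)

  registry.removeExecutor("executor-51")
  println(registry.locationsFor(BlockId("rdd_7_3")))  // Set()
}
{noformat}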
> Application failure during preemption.
> --------------------------------------
>
> Key: SPARK-14209
> URL: https://issues.apache.org/jira/browse/SPARK-14209
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Affects Versions: 1.6.1
> Environment: Spark on YARN
> Reporter: Miles Crawford
>
> We have a fair-sharing cluster set up, including the external shuffle
> service. When a new job arrives, existing jobs are successfully preempted
> down to fit.
> A spate of these messages arrives:
> ExecutorLostFailure (executor 48 exited unrelated to the running tasks)
> Reason: Container container_1458935819920_0019_01_000143 on host:
> ip-10-12-46-235.us-west-2.compute.internal was preempted.
> This seems fine - the problem is that soon thereafter, our whole application
> fails because it is unable to fetch blocks from the preempted containers:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1
> locations. Most recent failure cause:
> Caused by: java.io.IOException: Failed to connect to
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Caused by: java.net.ConnectException: Connection refused:
> ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681
> Full stack: https://gist.github.com/milescrawford/33a1c1e61d88cc8c6daf
> Spark does not attempt to recreate these blocks - the tasks simply fail over
> and over until the maxTaskAttempts value is reached.
> It appears to me that there is some fault in the way preempted containers are
> being handled - shouldn't these blocks be recreated on demand?
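As an aside, a minimal generic sketch (not Spark's scheduler, and assuming the
"maxTaskAttempts" above refers to the bounded task-retry limit) of why
exhausted retries take down the whole application: every attempt targets the
same stale location, so the retry budget runs out and the job is aborted.
{noformat}
// Hedged, generic sketch of bounded task retries; not Spark internals.
import java.io.IOException

object RetryExhaustionDemo {
  def runWithRetries[T](maxFailures: Int)(attempt: Int => T): T = {
    var failures = 0
    while (true) {
      try return attempt(failures)
      catch {
        case e: IOException =>
          failures += 1
          // Once one task exhausts its attempts, the stage (and job) is failed.
          if (failures >= maxFailures)
            throw new RuntimeException(s"Task failed $failures times; aborting job", e)
      }
    }
    sys.error("unreachable")
  }

  def main(args: Array[String]): Unit =
    // Every attempt hits the same preempted executor, so all attempts fail.
    runWithRetries(4) { n =>
      throw new IOException(s"attempt $n: Connection refused: ip-10-12-46-235...")
    }
}
{noformat}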