[
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605117#comment-14605117
]
Shay Rojansky commented on SPARK-8374:
--------------------------------------
Any chance someone can look at this bug, at least to confirm it? This is a
pretty serious issue preventing Spark 1.4 use on YARN clusters where
preemption may happen...
> Job frequently hangs after YARN preemption
> ------------------------------------------
>
> Key: SPARK-8374
> URL: https://issues.apache.org/jira/browse/SPARK-8374
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.4.0
> Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
> Reporter: Shay Rojansky
> Priority: Critical
>
> After upgrading to Spark 1.4.0, jobs that get preempted very frequently will
> not reacquire executors and will therefore hang. To reproduce:
> 1. I run Spark job A that acquires all grid resources
> 2. I run Spark job B in a higher-priority queue that acquires all grid
> resources. Job A is fully preempted.
> 3. Kill job B, releasing all resources
> 4. Job A should at this point reacquire all grid resources, but occasionally
> it doesn't. Repeating the preemption scenario makes the bad behavior occur
> within a few attempts.
> (see logs at bottom).
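> The reproduction above can be sketched roughly as follows (the queue names,
> resource sizes, and job scripts are placeholders for illustration, not taken
> from the original report):
> {noformat}
> # Job A: fill the cluster from a low-priority queue
> spark-submit --master yarn --queue default \
>   --num-executors 500 --executor-memory 4g job_a.py
>
> # Job B: submit to a higher-priority preempting queue; job A is preempted
> spark-submit --master yarn --queue priority \
>   --num-executors 500 --executor-memory 4g job_b.py
>
> # Kill job B; job A should now reacquire its executors, but may hang
> yarn application -kill <job_B_application_id>
> {noformat}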
> Note that SPARK-7451 was supposed to fix some Spark YARN preemption issues;
> the work there may be related to this new issue.
> The 1.4.0 preemption situation is considerably worse than in 1.3.1 (we've
> downgraded to 1.3.1 just because of this issue).
> Logs
> ------
> When job B (the preemptor) first acquires an application master, the following
> is logged by job A (the preemptee):
> {noformat}
> ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc
> client disassociated
> INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
> WARN ReliableDeliverySupervisor: Association with remote system
> [akka.tcp://[email protected]:54167] has failed, address
> is now gated for [5000] ms. Reason is: [Disassociated].
> WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15,
> g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
> INFO DAGScheduler: Executor lost: 447 (epoch 0)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from
> BlockManagerMaster.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447,
> g023.grid.eaglerd.local, 41406)
> INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
> {noformat}
> (It's strange for errors/warnings to be logged for preemption)
> Later, when job B's AM starts requesting its resources, I get lots of the
> following in job A:
> {noformat}
> ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc
> client disassociated
> INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
> WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231,
> g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
> WARN ReliableDeliverySupervisor: Association with remote system
> [akka.tcp://[email protected]:34357] has failed, address
> is now gated for [5000] ms. Reason is: [Disassociated].
> {noformat}
> Finally, when I kill job B, job A emits lots of the following:
> {noformat}
> INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
> WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
> {noformat}
> And finally after some time:
> {noformat}
> WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats:
> 165964 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost an executor 466 (already removed): Executor
> heartbeat timed out after 165964 ms
> {noformat}
> At this point the job never requests/acquires more resources and hangs.
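> For reference, the 120000 ms timeout in the last excerpt matches Spark's
> default network timeout. The settings below (with their defaults) govern the
> heartbeat behavior seen above; they are listed only to aid diagnosis, not as
> a confirmed fix:
> {noformat}
> # spark-defaults.conf
> spark.executor.heartbeatInterval  10s
> spark.network.timeout             120s
> {noformat}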
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)