[
https://issues.apache.org/jira/browse/SPARK-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605117#comment-14605117
]
Shay Rojansky commented on SPARK-8374:
--------------------------------------
Any chance someone can look at this bug, at least to confirm it? This is a
pretty serious issue preventing Spark 1.4 use on YARN clusters where
preemption may happen...
> Job frequently hangs after YARN preemption
> ------------------------------------------
>
> Key: SPARK-8374
> URL: https://issues.apache.org/jira/browse/SPARK-8374
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.4.0
> Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04
> Reporter: Shay Rojansky
> Priority: Critical
>
> After upgrading to Spark 1.4.0, jobs that get preempted very frequently will
> not reacquire executors and will therefore hang. To reproduce:
> 1. I run Spark job A that acquires all grid resources
> 2. I run Spark job B in a higher-priority queue that acquires all grid
> resources. Job A is fully preempted.
> 3. Kill job B, releasing all resources
> 4. Job A should at this point reacquire all grid resources, but occasionally
> it doesn't. Repeating the preemption scenario makes the bad behavior occur
> within a few attempts.
> (see logs at bottom).
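> The reproduction above can be sketched roughly as follows (the queue names,
> resource sizes, and job scripts are placeholders for illustration, not taken
> from the original report):
> {noformat}
> # Job A: fill the cluster from a low-priority queue
> spark-submit --master yarn --queue default \
>   --num-executors 500 --executor-memory 4g job_a.py
>
> # Job B: submit to a higher-priority preempting queue; job A is preempted
> spark-submit --master yarn --queue priority \
>   --num-executors 500 --executor-memory 4g job_b.py
>
> # Kill job B; job A should now reacquire its executors, but may hang
> yarn application -kill <job_B_application_id>
> {noformat}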
> Note that SPARK-7451 was supposed to fix some Spark YARN preemption issues;
> the work there may be related to this new issue.
> The 1.4.0 preemption situation is considerably worse than in 1.3.1 (we've
> downgraded to 1.3.1 just because of this issue).
> Logs
> ------
> When job B (the preemptor) first acquires an application master, the following
> is logged by job A (the preemptee):
> {noformat}
> ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc
> client disassociated
> INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0
> WARN ReliableDeliverySupervisor: Association with remote system
> [akka.tcp://[email protected]:54167] has failed, address
> is now gated for [5000] ms. Reason is: [Disassociated].
> WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15,
> g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost)
> INFO DAGScheduler: Executor lost: 447 (epoch 0)
> INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from
> BlockManagerMaster.
> INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447,
> g023.grid.eaglerd.local, 41406)
> INFO BlockManagerMaster: Removed 447 successfully in removeExecutor
> {noformat}
> (It's strange for errors/warnings to be logged for preemption)
> Later, when job B's AM starts requesting its resources, I get lots of the
> following in job A:
> {noformat}
> ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc
> client disassociated
> INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0
> WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231,
> g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost)
> WARN ReliableDeliverySupervisor: Association with remote system
> [akka.tcp://[email protected]:34357] has failed, address
> is now gated for [5000] ms. Reason is: [Disassociated].
> {noformat}
> Finally, when I kill job B, job A emits lots of the following:
> {noformat}
> INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31
> WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist!
> {noformat}
> And finally after some time:
> {noformat}
> WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats:
> 165964 ms exceeds timeout 120000 ms
> ERROR YarnScheduler: Lost an executor 466 (already removed): Executor
> heartbeat timed out after 165964 ms
> {noformat}
> At this point the job never requests/acquires more resources and hangs.
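> For reference, the 120000 ms timeout in the last excerpt matches Spark's
> default network timeout. The settings below (with their defaults) govern the
> heartbeat behavior seen above; they are listed only to aid diagnosis, not as
> a confirmed fix:
> {noformat}
> # spark-defaults.conf
> spark.executor.heartbeatInterval  10s
> spark.network.timeout             120s
> {noformat}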
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)