[
https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sam Tran updated SPARK-27347:
-----------------------------
Description:
Ran into scenarios where {{--supervised}} Spark jobs were retried multiple
times when an agent would crash, come back, and re-register, even when those
jobs had already been relaunched on a different agent.
That is:
* a supervised driver is running on agent1
* agent1 crashes
* the driver is relaunched on another agent as `<task-id>-retry-1`
* agent1 comes back online and re-registers with the scheduler
* Spark relaunches the same job as `<task-id>-retry-2`
* now there are two jobs running simultaneously, and the first retry job is
effectively orphaned within ZooKeeper
This happens because, when an agent comes back and re-registers, it sends a
{{TASK_FAILED}} status update for its old driver task. The previous logic would
indiscriminately remove the {{submissionId}} from ZooKeeper's
{{launchedDrivers}} node and add it to the {{retryList}} node.
Then, when a new offer came in, the scheduler would launch another {{-retry-}}
task even though one was already running.
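To make the fix direction concrete, below is a minimal, self-contained Scala
sketch of the kind of guard that would prevent the duplicate retry: before
treating a terminal status update as a failure of the supervised driver, check
that the update's task ID matches the task currently considered launched for
that submission, and ignore updates for superseded task IDs. This is only an
illustration, not the actual {{MesosClusterScheduler}} code; {{LaunchedDriver}},
{{submissionIdFor}} and {{shouldRetry}} are hypothetical names, and the only
piece taken from the observed behavior is the `<task-id>-retry-N` naming
convention visible in the log below.
{code}
// Minimal, self-contained sketch of the guard described above. It is NOT the
// actual MesosClusterScheduler code; all names here are illustrative.
object SupervisedRetrySketch {

  // Hypothetical view of what the scheduler tracks for a launched supervised driver.
  final case class LaunchedDriver(submissionId: String, currentTaskId: String)

  // Task IDs follow the "<submissionId>-retry-N" convention seen in the log below,
  // so the submission ID is everything before the "-retry-" suffix (if any).
  def submissionIdFor(taskId: String): String = taskId.split("-retry-")(0)

  private val terminalStates =
    Set("TASK_FAILED", "TASK_LOST", "TASK_KILLED", "TASK_ERROR")

  /**
   * Decide whether a status update should enqueue a retry. The key point is the
   * task-ID comparison: a terminal update for a task that has already been
   * superseded by a newer "-retry-" incarnation is ignored, instead of moving
   * the submission from launchedDrivers to retryList a second time.
   */
  def shouldRetry(taskId: String,
                  state: String,
                  launchedDrivers: Map[String, LaunchedDriver]): Boolean =
    launchedDrivers.get(submissionIdFor(taskId)) match {
      case Some(driver) if driver.currentTaskId == taskId => terminalStates.contains(state)
      case Some(_) => false // update is for an older incarnation of this driver: ignore
      case None    => false // unknown submission: nothing to retry
    }
}
{code}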
A sample log looks something like this:
{code:java}
19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos:
... [offers] ...
19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer
5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver
driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
...
19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
...
19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed
out' reason=REASON_SLAVE_REMOVED
...
19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer
5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver
driver-20190115192138-0001 with taskId: value:
"driver-20190115192138-0001-retry-1"
...
19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
...
19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent
re-reregistered'
...
19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor
termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with
driver-20190115192138-0001 in status update
...
19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer
5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver
driver-20190115192138-0001 with taskId: value:
"driver-20190115192138-0001-retry-2"
...
19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update:
taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''
{code}
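Replaying the status updates from the log above through the sketch shows where
the duplicate retry would be suppressed. The bookkeeping below is assumed for
illustration; only the task IDs, states, and timestamps come from the log.
{code}
// Hypothetical replay of the log above through SupervisedRetrySketch.
object ReplaySketch extends App {
  import SupervisedRetrySketch._

  val id = "driver-20190115192138-0001"

  // 19:29:12 TASK_LOST for the original task while it is still the current one:
  // this legitimately triggers the first retry.
  val beforeRelaunch = Map(id -> LaunchedDriver(id, id))
  println(shouldRetry(id, "TASK_LOST", beforeRelaunch))              // true

  // 19:31:12 the driver is relaunched on another agent as "<id>-retry-1".
  val afterRelaunch = Map(id -> LaunchedDriver(id, s"$id-retry-1"))

  // 19:33:45 TASK_FAILED for the original task when agent1 re-registers: the
  // update refers to a superseded task ID, so it is ignored and no "-retry-2"
  // task is enqueued while retry-1 is still running.
  println(shouldRetry(id, "TASK_FAILED", afterRelaunch))             // false

  // A genuine failure of the currently running retry-1 task would still retry.
  println(shouldRetry(s"$id-retry-1", "TASK_FAILED", afterRelaunch)) // true
}
{code}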
> [MESOS] Fix supervised driver retry logic when agent crashes/restarts
> ---------------------------------------------------------------------
>
> Key: SPARK-27347
> URL: https://issues.apache.org/jira/browse/SPARK-27347
> Project: Spark
> Issue Type: Bug
> Components: Mesos
> Affects Versions: 2.2.1, 2.3.2, 2.4.0
> Reporter: Sam Tran
> Priority: Major
> Labels: mesos
>