[ https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-27347.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0
                   2.4.4
                   2.3.4

This is resolved via https://github.com/apache/spark/pull/24276

> Fix supervised driver retry logic when agent crashes/restarts
> -------------------------------------------------------------
>
>                 Key: SPARK-27347
>                 URL: https://issues.apache.org/jira/browse/SPARK-27347
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 2.2.1, 2.3.2, 2.4.0
>            Reporter: Sam Tran
>            Priority: Major
>             Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple
> times when an agent crashed, came back, and re-registered, even when those
> jobs had already been relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as {{<task-id>-retry-1}}
>  * agent1 comes back online and re-registers with the scheduler
>  * Spark relaunches the same job as {{<task-id>-retry-2}}
>  * now there are two jobs running simultaneously, and the first retry job is
> effectively orphaned within ZooKeeper (see the task-id sketch after this list)
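> To make the naming in the steps above concrete, here is a small sketch (purely
> illustrative; {{retryTaskId}} is a hypothetical name, not the actual
> {{MesosClusterScheduler}} code) of how a {{-retry-N}} suffix on the task id gives
> each relaunch a distinct Mesos task id, which is why {{-retry-1}} and
> {{-retry-2}} can end up running at the same time:
> {code:scala}
> object RetryTaskIdSketch {
>   // Illustrative only: derive the Mesos task id for the Nth relaunch of a
>   // supervised submission by appending a "-retry-N" suffix.
>   def retryTaskId(submissionId: String, retries: Int): String =
>     if (retries == 0) submissionId else s"$submissionId-retry-$retries"
>
>   def main(args: Array[String]): Unit = {
>     val submissionId = "driver-20190115192138-0001"
>     // Relaunch after agent1 is lost:
>     println(retryTaskId(submissionId, 1)) // driver-20190115192138-0001-retry-1
>     // Second relaunch triggered by agent1 re-registering, while "-retry-1"
>     // is still running on another agent:
>     println(retryTaskId(submissionId, 2)) // driver-20190115192138-0001-retry-2
>   }
> }
> {code}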
> This is because when an agent comes back and re-registers, it sends a
> {{TASK_FAILED}} status update for its old driver task. The previous logic would
> indiscriminately remove the {{submissionId}} from ZooKeeper's
> {{launchedDrivers}} node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another {{-retry}} task even
> though one was already running.
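> A minimal sketch of the kind of guard that prevents the double relaunch
> (hypothetical names and data structures, not the code from the linked PR): only
> treat a terminal status update as grounds for a retry when its task id matches
> the task currently tracked for that submission, so a stale {{TASK_FAILED}} from
> a re-registered agent is ignored.
> {code:scala}
> object StatusUpdateGuardSketch {
>   final case class TrackedDriver(submissionId: String, currentTaskId: String)
>
>   // launchedDrivers: submissionId -> the task currently believed to be running.
>   def shouldRetry(
>       launchedDrivers: Map[String, TrackedDriver],
>       failedTaskId: String,
>       submissionId: String): Boolean = {
>     launchedDrivers.get(submissionId) match {
>       // Only the task we are actually tracking may trigger a retry; a stale
>       // TASK_FAILED from a re-registered agent refers to an older task id.
>       case Some(driver) => driver.currentTaskId == failedTaskId
>       case None => false
>     }
>   }
>
>   def main(args: Array[String]): Unit = {
>     val tracked = Map(
>       "driver-20190115192138-0001" -> TrackedDriver(
>         "driver-20190115192138-0001", "driver-20190115192138-0001-retry-1"))
>     // Stale failure for the original task id from the re-registered agent:
>     // ignored, no second retry.
>     println(shouldRetry(tracked, "driver-20190115192138-0001",
>       "driver-20190115192138-0001")) // false
>     // Failure of the task we are actually tracking: eligible for retry.
>     println(shouldRetry(tracked, "driver-20190115192138-0001-retry-1",
>       "driver-20190115192138-0001")) // true
>   }
> }
> {code}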
> A sample log looks something like this:
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
> driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
> out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable 
> agent re-registered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal 
> executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
> driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}


