samvantran opened a new pull request #24276: [SPARK-27347][MESOS] Fix supervised driver retry logic for outdated tasks
URL: https://github.com/apache/spark/pull/24276
 
 
   ## What changes were proposed in this pull request?
   
   This patch fixes a bug where `--supervised` Spark jobs would be retried multiple 
times whenever an agent crashed, came back, and re-registered, even when those 
jobs had already been relaunched on a different agent.
   
   That is: 
   ```
   - supervised driver is running on agent1
   - agent1 crashes
   - driver is relaunched on another agent as `<task-id>-retry-1`
   - agent1 comes back online and re-registers with scheduler
   - spark relaunches the same job as `<task-id>-retry-2`
   - now there are two jobs running simultaneously
   ```
   
   This happens because when an agent comes back and re-registers, it sends a 
`TASK_FAILED` status update for its old driver task. The previous logic would 
indiscriminately remove the `submissionId` from ZooKeeper's `launchedDrivers` 
node and add it to the `retryList` node. Then, when a new offer came in, the 
scheduler would launch yet another `-retry-` task even though a retry was 
already running.
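
   The intended behavior is for the scheduler to ignore terminal status updates that 
refer to an outdated task, i.e. a task id that no longer matches the one currently 
recorded for that submission. Below is a simplified, illustrative sketch of that 
guard; `RetryGuardSketch`, `DriverState`, and `shouldRetry` are hypothetical names, 
not the actual `MesosClusterScheduler` code:

   ```scala
   // A minimal, hypothetical model of the guard; not the real MesosClusterScheduler.
   import scala.collection.mutable

   // What the scheduler remembers about a launched driver: its submission id and
   // the task id it is currently running under (possibly a -retry-N task).
   case class DriverState(submissionId: String, currentTaskId: String)

   object RetryGuardSketch {
     // Drivers the scheduler believes are currently launched, keyed by submission id
     // (roughly what is persisted under ZooKeeper's launchedDrivers node).
     val launchedDrivers = mutable.HashMap[String, DriverState]()

     // Recover the submission id from a task id such as
     // "driver-20190115192138-0001-retry-1".
     def submissionIdFrom(taskId: String): String = taskId.split("-retry-").head

     // Treat a terminal status update as a real failure only if it refers to the
     // task most recently launched for that submission; updates for outdated tasks
     // (e.g. from a re-registering agent) must not trigger another retry.
     def shouldRetry(failedTaskId: String): Boolean =
       launchedDrivers.get(submissionIdFrom(failedTaskId))
         .exists(_.currentTaskId == failedTaskId)

     def main(args: Array[String]): Unit = {
       launchedDrivers("driver-20190115192138-0001") =
         DriverState("driver-20190115192138-0001", "driver-20190115192138-0001-retry-1")

       // Late TASK_FAILED from the re-registered agent for the original task id:
       println(shouldRetry("driver-20190115192138-0001"))         // false -> no -retry-2
       // Failure of the task that is actually running:
       println(shouldRetry("driver-20190115192138-0001-retry-1")) // true  -> retry
     }
   }
   ```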
   
   For example logs, see the snippets at the bottom of this description.
   
   ## How was this patch tested?
   
   - Added a unit test to simulate the behavior described above (a rough sketch of the idea appears after this list)
   - Tested manually on a DC/OS cluster by:
     ```
     - launching a --supervised spark job
     - dcos node ssh <to the agent with the running spark-driver>
     - systemctl stop dcos-mesos-slave
     - docker kill <driver-container-id>
     - [ wait until spark job is relaunched ]
     - systemctl start dcos-mesos-slave
     - [ observe spark driver is not relaunched as `-retry-2` ]
     ```
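
   For the unit test, a rough ScalaTest sketch of the scenario is shown below; it 
exercises the simplified `RetryGuardSketch` model from above rather than the real 
`MesosClusterScheduler` test harness:

   ```scala
   import org.scalatest.funsuite.AnyFunSuite

   // Illustrative only: drives the simplified RetryGuardSketch model, not the
   // actual MesosClusterSchedulerSuite helpers.
   class OutdatedTaskRetrySuite extends AnyFunSuite {
     test("late TASK_FAILED for an outdated task does not trigger another retry") {
       RetryGuardSketch.launchedDrivers.clear()
       // The driver has already been relaunched as -retry-1 after agent1 crashed.
       RetryGuardSketch.launchedDrivers("driver-20190115192138-0001") =
         DriverState("driver-20190115192138-0001", "driver-20190115192138-0001-retry-1")

       // agent1 re-registers and reports the original (outdated) task as failed;
       // this must not produce a -retry-2 task.
       assert(!RetryGuardSketch.shouldRetry("driver-20190115192138-0001"))

       // A failure of the task that is actually running should still be retried.
       assert(RetryGuardSketch.shouldRetry("driver-20190115192138-0001-retry-1"))
     }
   }
   ```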
   
   Log snippets are included below. Notice that the `-retry-1` task is already running 
when the status update for the old task arrives:
   ```
   19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
   ... [offers] ...
   19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
   ...
   19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
   19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
   ...
   19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
out' reason=REASON_SLAVE_REMOVED
   ...
   19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
driver-20190115192138-0001 with taskId: value: 
"driver-20190115192138-0001-retry-1"
   ...
   19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
   19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
   ...
   19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent 
re-reregistered'
   ...
   19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor 
termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
   19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
driver-20190115192138-0001 in status update
   ...
   19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
driver-20190115192138-0001 with taskId: value: 
"driver-20190115192138-0001-retry-2"
   ...
   19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
   19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''
   ```
   
