[GitHub] [dolphinscheduler] hzyangkai opened a new pull request, #13030: [Improvement-12968][Master]improvement failover process

GitBox Mon, 28 Nov 2022 04:59:44 -0800


hzyangkai opened a new pull request, #13030:
URL: https://github.com/apache/dolphinscheduler/pull/13030


   All tasks keep running when the master crashes; Spark tasks in cluster mode 
keep running when the worker crashes.
   
   ## Purpose of the pull request
   
   Achieve the basic goals of the design document in issue #12968:
   
   1. When the worker crashes, type 3 tasks keep running, and type 1 and 2 
tasks are killed and restarted as before.
   2. When the master crashes, all three types of tasks keep running.
   3. When both the master and the worker crash, type 3 tasks keep running, and 
type 1 and 2 tasks are killed and restarted as before.
   
   Currently, only the Spark task in cluster mode is changed from type 2 to 
type 3.
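
The three goals above amount to a small decision table. The following is a minimal sketch of that table only, not code from this pull request: the class `FailoverPolicy`, its enums, and the `decide` method are hypothetical names introduced for illustration, and the meaning of types 1, 2 and 3 is the one defined in the design document of issue #12968.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical illustration of the failover goals; not part of the PR. */
public class FailoverPolicy {
    /** Task categories from the design document; the names are placeholders. */
    public enum TaskType { TYPE_1, TYPE_2, TYPE_3 }

    /** Which server(s) crashed. */
    public enum Crashed { MASTER, WORKER }

    /** What happens to a task that was running at crash time. */
    public enum Action { KEEP_RUNNING, KILL_AND_RESTART }

    public static Action decide(TaskType type, Set<Crashed> crashed) {
        if (crashed.isEmpty()) {
            throw new IllegalArgumentException("no component crashed");
        }
        // Goals 1 and 3: once the worker is involved, only type 3 survives.
        if (crashed.contains(Crashed.WORKER)) {
            return type == TaskType.TYPE_3 ? Action.KEEP_RUNNING : Action.KILL_AND_RESTART;
        }
        // Goal 2: a master-only crash leaves every task running.
        return Action.KEEP_RUNNING;
    }

    public static void main(String[] args) {
        for (TaskType t : TaskType.values()) {
            System.out.println(t + ", worker crash: " + decide(t, EnumSet.of(Crashed.WORKER)));
            System.out.println(t + ", master crash: " + decide(t, EnumSet.of(Crashed.MASTER)));
        }
    }
}
```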
   
   ## Brief change log
   1. WorkerTaskExecuteRunnable#execute: for type 3 tasks, the submit process 
exits once the task has been submitted; for type 1 and 2 tasks, the submit 
process exits once the task has finished.
   2. WorkerTaskExecuteRunnable#afterExecute: for type 3 tasks, it reports the 
running status along with the appid, then monitors the app status on YARN, and 
finally sends the final status to the master when the app on YARN finishes; for 
type 1 and 2 tasks, it reports the final status directly.
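
The reporting order described for `afterExecute` can be sketched roughly as follows. This is an illustrative model under stated assumptions, not the actual `WorkerTaskExecuteRunnable` code: `WorkerFlowSketch`, `AppStatusSource`, the report strings, and the set of terminal states (`FINISHED`, `FAILED`, `KILLED`) are all hypothetical, and a real implementation would sleep between polls and talk to YARN rather than to a callback.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the status reports sent to the master; not the PR's code. */
public class WorkerFlowSketch {
    public enum TaskType { TYPE_1, TYPE_2, TYPE_3 }

    /** Assumed stand-in for querying the state of an application on YARN. */
    @FunctionalInterface
    public interface AppStatusSource {
        String poll(String appId);
    }

    /** Assumed terminal states for the purpose of this sketch. */
    static boolean isFinal(String state) {
        return state.equals("FINISHED") || state.equals("FAILED") || state.equals("KILLED");
    }

    /**
     * Returns the reports sent to the master, in order.
     * submitExitCode is the exit code of the local submit process.
     */
    public static List<String> afterExecute(TaskType type, int submitExitCode,
                                            String appId, AppStatusSource yarn) {
        List<String> reports = new ArrayList<>();
        if (type == TaskType.TYPE_3) {
            // The submit process has already exited, but the app is still running
            // on the cluster: report RUNNING together with the appid first.
            reports.add("RUNNING appId=" + appId);
            // Monitor the app until it reaches a terminal state
            // (a real loop would wait between polls).
            String state = yarn.poll(appId);
            while (!isFinal(state)) {
                state = yarn.poll(appId);
            }
            // Only now send the final status.
            reports.add(state);
        } else {
            // Types 1 and 2: the submit process exits when the task finishes,
            // so the final status can be reported directly.
            reports.add(submitExitCode == 0 ? "SUCCESS" : "FAILURE");
        }
        return reports;
    }
}
```

Because a type 3 task reports its appid before the final status, the application can still be tracked by that appid if the worker goes away while the app is running, which is what lets such tasks survive a worker crash.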
   
   ## Verify this pull request
   
   Manually verified the change by testing locally.
   
   ### master crashes
   1. When the master crashes and is then restarted, the channel to the worker 
is rebuilt for all types of tasks, and they keep running.
   
   ### worker crashes
   1. When the worker is stopped with "dolphinscheduler-daemon.sh stop 
worker-server" and then restarted with "dolphinscheduler-daemon.sh start 
worker-server", type 1 and 2 tasks are killed by the worker's shutdown process 
and then restarted. Type 3 tasks (Spark tasks in cluster mode) keep running.
   2. When the worker is killed with "kill -9 pid" and then restarted with 
"dolphinscheduler-daemon.sh start worker-server", type 1 and 2 tasks keep 
running and a new task instance is then started as well. This is not 
reasonable, but it matches the original logic of DolphinScheduler. We should 
use scripts to stop the tasks.
   
   

