Ngone51 opened a new pull request #24408: Avoid Master falls into dead loop 
while launching executor failed in Worker
URL: https://github.com/apache/spark/pull/24408
 
 
   ## What changes were proposed in this pull request?
   
   This is a long-standing issue that I have run into before, and I have seen other 
people have trouble with it:
   [test cases stuck on "local-cluster mode" of 
ReplSuite?](http://apache-spark-developers-list.1001551.n3.nabble.com/test-cases-stuck-on-quot-local-cluster-mode-quot-of-ReplSuite-td3086.html)
   [Spark tests hang on local machine due to "testGuavaOptional" in 
JavaAPISuite](http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-tests-hang-on-local-machine-due-to-quot-testGuavaOptional-quot-in-JavaAPISuite-tc10999.html)
   
   When running tests in local-cluster mode with a wrong 
SPARK_HOME (spark.test.home), the tests get stuck and never respond. After 
looking into SPARK_WORKER_DIR, I found an endless number of executor directories 
under it, which explains what happens while the tests are stuck.
   
   The whole process looks like:
   
   1. Driver submits an app to Master and asks for N executors
   2. Master initializes the executor state to LAUNCHING and sends `LaunchExecutor` to 
Worker
   3. Worker receives `LaunchExecutor`, launches ExecutorRunner asynchronously, 
and sends `ExecutorStateChanged(state=RUNNING)` to Master immediately
   4. Master receives `ExecutorStateChanged(state=RUNNING)` and resets 
`_retryCount` to 0.
   5. ExecutorRunner throws an exception while launching the executor and sends 
`ExecutorStateChanged(state=FAILED)` to Worker, which forwards the message to 
Master
   6. Master receives `ExecutorStateChanged(state=FAILED)`. Because Master always 
resets `_retryCount` when it receives a RUNNING message, `_retryCount` never 
exceeds `maxExecutorRetries`, even if a Worker fails to launch an executor many 
times in a row. Master therefore keeps launching executors and falls into the 
dead loop.
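
The counter behaviour in steps 3 to 6 can be reproduced with a small toy model. This is only a sketch and not Spark code: `ToyMaster`, `simulate`, and the `eager_running` flag are hypothetical names, and the `>=` comparison against the retry limit is an assumption about how the limit is applied.

```python
from dataclasses import dataclass


@dataclass
class ToyMaster:
    """Toy stand-in for the Master's per-app retry bookkeeping (not Spark code)."""
    max_executor_retries: int = 10  # assumed limit, mirrors maxExecutorRetries
    retry_count: int = 0            # plays the role of _retryCount
    app_removed: bool = False

    def on_state_changed(self, state: str) -> None:
        if state == "RUNNING":
            # Step 4: any RUNNING message resets the counter.
            self.retry_count = 0
        elif state == "FAILED":
            # Step 6: count the failure and give up once the limit is reached.
            self.retry_count += 1
            if self.retry_count >= self.max_executor_retries:
                self.app_removed = True


def simulate(launch_attempts: int, eager_running: bool) -> ToyMaster:
    """Replay launch attempts that all fail inside the worker."""
    master = ToyMaster()
    for _ in range(launch_attempts):
        if master.app_removed:
            break
        if eager_running:
            # Step 3: RUNNING is reported before the launch has finished.
            master.on_state_changed("RUNNING")
        # Step 5: the launch actually fails.
        master.on_state_changed("FAILED")
    return master


buggy = simulate(launch_attempts=1000, eager_running=True)
print(buggy.retry_count, buggy.app_removed)  # 1 False
```

In this model, after 1000 failed launches the counter is still 1 and the app has not been removed, which matches the endless executor directories seen under SPARK_WORKER_DIR.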
   
   The problem lies in step 3. Worker sends 
`ExecutorStateChanged(state=RUNNING)` to Master immediately, while the executor is 
still launching. When Master receives that message, it assumes the executor 
has launched successfully and resets `_retryCount`. However, 
that is not true.
   
   This PR removes step 3 and requires Worker to send 
`ExecutorStateChanged(state=RUNNING)` only after the executor has actually launched 
successfully.
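
A minimal sketch of the proposed ordering, again as a toy model rather than the actual patch: `launch_executor`, `broken_start`, and `count_attempts` are hypothetical helpers, and the retry limit of 10 and the `>=` check are assumptions. RUNNING is reported only once the process start has succeeded, so consecutive failures accumulate and the retry limit takes effect.

```python
from typing import Callable, List


def launch_executor(start_process: Callable[[], None],
                    report: Callable[[str], None]) -> None:
    """Report RUNNING only after the process has really started."""
    try:
        start_process()
    except Exception:
        report("FAILED")
        return
    report("RUNNING")


def broken_start() -> None:
    # Stands in for a launch that fails, e.g. because SPARK_HOME is wrong.
    raise FileNotFoundError("executor launch failed")


def count_attempts(max_executor_retries: int = 10, cap: int = 1000) -> int:
    """Count launch attempts until the toy retry limit stops the loop."""
    retry_count = 0
    attempts = 0
    messages: List[str] = []
    while attempts < cap:  # cap only guards the toy loop against running forever
        attempts += 1
        messages.clear()
        launch_executor(broken_start, messages.append)
        for state in messages:
            if state == "RUNNING":
                retry_count = 0
            elif state == "FAILED":
                retry_count += 1
        if retry_count >= max_executor_retries:
            break
    return attempts


print(count_attempts())  # 10
```

With no premature RUNNING message, the loop stops after 10 failed attempts instead of running until the cap.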
   
   ## How was this patch tested?
   
   Tested manually.

