[GitHub] [spark] Ngone51 opened a new pull request #24408: Avoid Master falls into dead loop while launching executor failed in Worker

GitBox Thu, 18 Apr 2019 09:17:02 -0700

Ngone51 opened a new pull request #24408: Avoid Master falls into dead loop
while launching executor failed in Worker
URL: https://github.com/apache/spark/pull/24408

## What changes were proposed in this pull request?

This is a long-standing issue that I have run into before, and I have seen
other people have trouble with it:
[test cases stuck on "local-cluster mode" of
ReplSuite?](http://apache-spark-developers-list.1001551.n3.nabble.com/test-cases-stuck-on-quot-local-cluster-mode-quot-of-ReplSuite-td3086.html)
[Spark tests hang on local machine due to "testGuavaOptional" in
JavaAPISuite](http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-tests-hang-on-local-machine-due-to-quot-testGuavaOptional-quot-in-JavaAPISuite-tc10999.html)

When running tests in local-cluster mode with a wrong `SPARK_HOME`
(`spark.test.home`), the test gets stuck and never responds. After looking
into `SPARK_WORKER_DIR`, I found an ever-growing number of executor
directories under it, which explains what happens while the test is stuck.

The whole process looks like:

1. Driver submits an app to Master and asks for N executors.
2. Master initializes the executor state as LAUNCHING and sends
`LaunchExecutor` to Worker.
3. Worker receives `LaunchExecutor`, launches ExecutorRunner asynchronously,
and immediately sends `ExecutorStateChanged(state=RUNNING)` to Master.
4. Master receives `ExecutorStateChanged(state=RUNNING)` and resets
`_retryCount` to 0.
5. ExecutorRunner throws an exception while launching the executor and sends
`ExecutorStateChanged(state=FAILED)` to Worker, which forwards the message to
Master.
6. Master receives `ExecutorStateChanged(state=FAILED)`. Because Master always
resets `_retryCount` when it receives a RUNNING message, `_retryCount` never
reaches `maxExecutorRetries`, even if a Worker fails to launch the executor
many times in a row. Master therefore keeps launching executors and falls
into a dead loop.
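
The counter behaviour in steps 4 and 6 can be reproduced with a small
stand-alone model. This is only a sketch, not Spark's actual Master code:
`AppInfo`, `onStateChanged`, `launchAttempts`, and the `cap` parameter are
hypothetical names introduced for illustration, and the model assumes the
application is given up once the counter reaches `maxExecutorRetries`.

```scala
object RetryLoopSketch {
  sealed trait ExecState
  case object Running extends ExecState
  case object Failed extends ExecState

  // Hypothetical stand-in for the Master's per-application bookkeeping.
  final class AppInfo(val maxExecutorRetries: Int) {
    private var retryCount = 0
    var removed = false

    def onStateChanged(state: ExecState): Unit = state match {
      case Running =>
        retryCount = 0 // step 4: RUNNING resets the counter
      case Failed =>
        retryCount += 1 // step 6: a failure bumps the counter
        if (retryCount >= maxExecutorRetries) removed = true
    }
  }

  // Counts launch attempts until the app is removed; `cap` only keeps the
  // simulation finite when the counter never reaches the limit.
  def launchAttempts(runningSentEarly: Boolean, maxRetries: Int, cap: Int): Int = {
    val app = new AppInfo(maxRetries)
    var attempts = 0
    while (!app.removed && attempts < cap) {
      attempts += 1
      if (runningSentEarly) app.onStateChanged(Running) // step 3
      app.onStateChanged(Failed)                        // step 5
    }
    attempts
  }
}
```

With `runningSentEarly = true` the loop only stops at `cap`, because every
failure is preceded by a reset; with `false` it stops after `maxRetries`
failed attempts.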

The problem lies in step 3. Worker sends
`ExecutorStateChanged(state=RUNNING)` to Master immediately, while the
executor is still launching. When Master receives that message, it believes
the executor has launched successfully and resets `_retryCount`. However,
that is not true.

This PR proposes removing the immediate RUNNING notification in step 3 and
requiring Worker to send `ExecutorStateChanged(state=RUNNING)` only after the
executor has actually launched successfully.
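
A minimal sketch of the proposed message ordering on the Worker side follows.
It is not the diff from this PR: `launch`, `startProcess`, `sendToMaster`, and
`messagesFor` are hypothetical names, where `startProcess` stands in for the
ExecutorRunner's launch step and `sendToMaster` for forwarding
`ExecutorStateChanged` to Master.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.{Failure, Success, Try}

object LaunchOrderSketch {
  sealed trait ExecState
  case object Running extends ExecState
  case object Failed extends ExecState

  // Report RUNNING only once the launch step has returned without throwing;
  // a failed launch produces FAILED with no RUNNING before it.
  def launch(startProcess: () => Unit, sendToMaster: ExecState => Unit): Unit =
    Try(startProcess()) match {
      case Success(_) => sendToMaster(Running)
      case Failure(_) => sendToMaster(Failed)
    }

  // Helper for trying the sketch out: records the messages a launch sends.
  def messagesFor(startProcess: () => Unit): List[ExecState] = {
    val sent = ArrayBuffer.empty[ExecState]
    launch(startProcess, s => { sent += s; () })
    sent.toList
  }
}
```

Under this ordering a launch that throws (for example because of a wrong
`SPARK_HOME`) yields only a FAILED message, so Master's `_retryCount` is not
reset between consecutive failures.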

## How was this patch tested?

Tested manually.


