baohe-zhang opened a new pull request #28480:
URL: https://github.com/apache/spark/pull/28480


   ### What changes were proposed in this pull request?
   Currently, the executor backend and the driver backend communicate as follows when an executor is launched:
   
   1. The executor backend sends "RegisterExecutor" to the driver backend.
   2. The driver backend replies "true".
   3. The executor backend instantiates the executor once it receives "true" from the driver backend.
   4. After the executor is instantiated, the executor backend sends "LaunchedExecutor" (introduced in PR#25964) to the driver backend.
   5. The driver backend makes offers for the executor when it receives "LaunchedExecutor".
   
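The five steps can be sketched as a toy model. This is only an illustration of the message order: the class and method names below are hypothetical and are not Spark's actual API.

```python
# Toy model of the launch handshake. All names are illustrative assumptions,
# not Spark's real classes or methods.
class DriverBackend:
    def __init__(self):
        self.registered = set()
        self.offers_made_for = []

    def receive_register_executor(self, executor_id):
        # Steps 1-2: record the executor and reply "true".
        self.registered.add(executor_id)
        return True

    def receive_launched_executor(self, executor_id):
        # Step 5: only now does the driver make offers for this executor.
        self.offers_made_for.append(executor_id)


class ExecutorBackend:
    def __init__(self, executor_id, driver):
        self.executor_id = executor_id
        self.driver = driver
        self.executor = None

    def launch(self):
        # Steps 1-2: register and wait for the driver's reply.
        if self.driver.receive_register_executor(self.executor_id):
            # Step 3: instantiate the executor (a placeholder object here).
            self.executor = object()
            # Step 4: tell the driver the executor is ready.
            self.driver.receive_launched_executor(self.executor_id)


driver = DriverBackend()
backend = ExecutorBackend("exec-1", driver)
backend.launch()
```

In this model the driver makes no offers for an executor until "LaunchedExecutor" arrives, which is why the window between steps 3 and 4 matters.
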
   A race can occur in steps 3 and 4. If the driver backend is stopped (and its driver endpoint is therefore removed from the dispatcher) during step 3, then in step 4, when the executor backend tries to send "LaunchedExecutor" to the driver backend, the RPC dispatcher throws an uncaught SparkException for "Could not find CoarseGrainedScheduler". These exception logs are verbose and somewhat misleading.
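
A minimal sketch of the race, assuming a simplified dispatcher that looks endpoints up by name. `Dispatcher` and `EndpointNotFound` are hypothetical stand-ins for Spark's RPC dispatcher and its SparkException, not the real classes.

```python
class EndpointNotFound(Exception):
    """Stands in for the SparkException raised by the RPC dispatcher."""


class Dispatcher:
    """Hypothetical name-based dispatcher; not Spark's implementation."""

    def __init__(self):
        self.endpoints = {}

    def register(self, name, handler):
        self.endpoints[name] = handler

    def unregister(self, name):
        self.endpoints.pop(name, None)

    def send(self, name, message):
        # A message for a removed endpoint cannot be delivered.
        if name not in self.endpoints:
            raise EndpointNotFound(f"Could not find {name}")
        return self.endpoints[name](message)


dispatcher = Dispatcher()
dispatcher.register("CoarseGrainedScheduler", lambda message: True)

# Steps 1-2: registration succeeds while the driver endpoint exists.
registered = dispatcher.send("CoarseGrainedScheduler", "RegisterExecutor")

# Step 3 is still in progress when the driver backend stops and its
# endpoint is removed from the dispatcher.
dispatcher.unregister("CoarseGrainedScheduler")

# Step 4: the late "LaunchedExecutor" has no endpoint to be delivered to.
try:
    dispatcher.send("CoarseGrainedScheduler", "LaunchedExecutor")
    error = None
except EndpointNotFound as exc:
    error = str(exc)
```

The failure comes from the lookup, not from the executor: the executor backend did nothing wrong, which is part of why the resulting log is misleading.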
   
   This PR fixes the issue with the following changes.
   When CoarseGrainedSchedulerBackend#stop() is called:
   
   - A boolean variable, stopping, is set to true.
   - driverEndpoint is not stopped at this point. (The dispatcher stops it at the end.)
   
   While stopping is true, the driver backend:
   
   - replies with sendFailure to the executor backend when it receives "RegisterExecutor".
   - replies with "StopExecutor" to the executor backend (or sends "RemoveExecutor" to itself) when it receives "LaunchedExecutor".
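
The stopping behavior can be sketched as follows. This is a toy model under stated assumptions, not the PR's code: the class, method names, and reply values are hypothetical, and only the "StopExecutor" branch is modeled (the "RemoveExecutor" to self alternative is left out).

```python
class DriverBackend:
    """Hypothetical model of a driver backend with a stopping flag."""

    def __init__(self):
        self.stopping = False
        self.registered = set()
        self.offers_made_for = []

    def stop(self):
        # Only flip the flag. The endpoint stays registered, so late
        # messages still reach it instead of failing in the dispatcher.
        self.stopping = True

    def receive_register_executor(self, executor_id):
        if self.stopping:
            # Models the sendFailure reply to the executor backend.
            return ("failure", "driver backend is stopping")
        self.registered.add(executor_id)
        return ("ok", True)

    def receive_launched_executor(self, executor_id):
        if self.stopping:
            # Ask the late executor to shut down rather than offering it work.
            return "StopExecutor"
        self.offers_made_for.append(executor_id)
        return "offers made"


driver = DriverBackend()
first = driver.receive_register_executor("exec-1")        # before stop()
driver.stop()
late_launch = driver.receive_launched_executor("exec-1")  # arrives after stop()
late_register = driver.receive_register_executor("exec-2")
```

Because the endpoint is still reachable after stop(), a late message gets an explicit reply instead of surfacing as an uncaught exception in the dispatcher.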
   
   
   ### Why are the changes needed?
   The exceptions thrown as a result of this race are verbose and somewhat misleading.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   
   ### How was this patch tested?
   Re-ran the same command on a YARN cluster multiple times and did not see the issue again.
   




