kevin85421 opened a new pull request, #37411:
URL: https://github.com/apache/spark/pull/37411

   ### What changes were proposed in this pull request?
   This PR aims to provide a method to lower the timeout. Our solution is to 
ask master for worker’s heartbeat when Driver does not receive heartbeat from 
executor for `TimeoutThreshold` seconds.
   
   ![Screen Shot 2022-08-01 at 6 10 56 
PM](https://user-images.githubusercontent.com/20109646/182272331-2f972aa8-31c9-4c3e-8c88-b7cf2616fad8.png)
   
   In Databricks, driver and master processes are running on the master node. 
Executor and worker processes are running on the worker node but different 
JVMs. Therefore, the network connection between driver/executor and 
master/worker is equivalent because they are running on same physical nodes. 
   
   When Executor performs full GC, it cannot send any message during full GC. 
Next, Driver cannot receive heartbeat from Executor. Instead of removing the 
executor directly, driver will ask master for `workerLastHeartbeat`. Driver 
will determine whether it is network disconnection or other issues (e.g. GC) 
based on `workerLastHeartbeat`. If it is network disconnection, we will remove 
the executor. Otherwise, we will put the executor into a waitingList rather 
than expiring it immediately.
   
   * Result
   ![Screen Shot 2022-08-01 at 6 54 51 
PM](https://user-images.githubusercontent.com/20109646/182275086-474d7458-f2a8-473c-adb2-c13f5c942ea1.png)
   
   ### Why are the changes needed?
   Currently, the driver’s HeartbeatReceiver will expire an executor if it does 
not receive any heartbeat from the executor for 120 seconds. However, 120 
seconds is too long, but we will face other challenges when we try to lower the 
timeout threshold. To elaborate, when an executor is performing GC, it cannot 
reply any message.
   
   ![Screen Shot 2022-08-01 at 6 03 50 
PM](https://user-images.githubusercontent.com/20109646/182269820-4802877d-a4e4-4d20-969d-6ece37ffdb55.png)
   
   We will use the above figure to explain why we cannot lower 
`TimeoutThreshold` to 60 seconds directly. When Executor performs full GC, it 
cannot send any message, including heartbeat. Next, driver will remove the 
executor because driver cannot receive heartbeat from Executor for 60 seconds. 
In other words, we cannot distinguish between GC and network disconnection.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   ```
   build/sbt "core/testOnly *HeartbeatReceiverSuite"
   build/sbt "core/testOnly *MasterSuite"
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to