sumeetgajjar commented on pull request #32114:
URL: https://github.com/apache/spark/pull/32114#issuecomment-819046189


   > Hey guys, I'd like to propose a simpler (but might be a little bit ticky) 
fix if I understand the issue correctly. The idea is,
   > 
   > Instead of removing the executor directly, we set 
`executorLastSeen(executorId) = -1L` when we receives `ExecutorRemoved` in 
`HeartbeatReceiver`. And then,
   > 
   >     1. if `Heartbeat` comes first before `ExpireDeadHosts`, we remove the 
executor from `executorLastSeen` by checking the value "< 0" and avoid the 
re-register.
   > 
   >     2. if `ExpireDeadHosts` comes first before `Heartbeat`,  we set 
`executorLastSeen(executorId) = -2L`. We can't remove it this time in 
`ExpireDeadHosts` because if `Heartbeat` comes later we'd have the same issue 
again.
   > 
   > 
   > 2.1 if `Heartbeat` comes later, we remove the executor from 
`executorLastSeen` by checking the value "< 0" too and also avoid the 
re-register.
   > 
   > 2.2 if `Heartbeat` doesn't come (that means the executor stopped before 
sending the heartbeat), we remove the executor from `executorLastSeen` by 
checking the value = -2L in next `ExpireDeadHosts`.
   > 
   > In this way, we can avoid the extra cache and all changes should be 
limited to `HeartbeatReceiver`.
   > 
   > Any thoughts?
   
   Thanks for the comment @Ngone51 .
   There are two places from where the re-registration can be triggered
   - From `HeartbeatReceiver` - by responding 
`HeartbeatResponse(reregisterBlockManager = true)`. 
   Your solution will take care of this.
   - From `BlockManager` - e.g. 
[reportBlockStatus](https://github.com/apache/spark/blob/ee7d838aaf46f9d786e0388915b422fb78952893/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L766).
   Just modifying `HeartbeatReceiver` won't solve the re-registration issue 
here. We will also have to implement a similar kind of tracking inside 
`BlockManagerMasterEndpoint`. And now since both tracking are independent of 
each other, it might introduce some race condition (please correct me if I am 
wrong).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to