viirya commented on PR #44280:
URL: https://github.com/apache/spark/pull/44280#issuecomment-1849046025

   > BTW, `masterRef.send(WorkerLatestState(workerId, execs.toList, 
drivers.keys.toSeq))` already reported when the Worker received this `Duplicate 
registration` message here.
   
   `WorkerLatestState` looks still different to `WorkerSchedulerStateResponse` 
in their processing logic inside `Master`. I think they are different and 
`WorkerLatestState` cannot replace `WorkerSchedulerStateResponse`'s role during 
the recovery.
   
   > This PR focuses only on Worker recovery and there is no regression from 
the Driver or App perspective because they will be deleted like the existing 
behavior.
   
   Okay. This sounds correct. As this PR proposes to address the case 
`MasterChanged` is not sent/received correctly, `WorkerSchedulerStateResponse` 
is not processed too as the consequence for sure. It is just setting from 
`UNKNOWN` to `ALIVE` to bring the worker back.
   
   `WorkerSchedulerStateResponse` adds valid executors and drivers to the 
worker. And you are right that the drivers without worker will be deleted so it 
is not a regression. For executor, I don't see `completeRecovery` explicitly 
delete executors which are not of any worker. Although the workers in `UNKNOWN` 
status will be deleted, but their `executors` are not updated (because 
`WorkerSchedulerStateResponse` is not sent). I don't know how Spark will handle 
such case, but it seems not a regression too.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to