viirya commented on PR #44280: URL: https://github.com/apache/spark/pull/44280#issuecomment-1849046025
> BTW, `masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))` already reported when the Worker received this `Duplicate registration` message here. `WorkerLatestState` looks still different to `WorkerSchedulerStateResponse` in their processing logic inside `Master`. I think they are different and `WorkerLatestState` cannot replace `WorkerSchedulerStateResponse`'s role during the recovery. > This PR focuses only on Worker recovery and there is no regression from the Driver or App perspective because they will be deleted like the existing behavior. Okay. This sounds correct. As this PR proposes to address the case `MasterChanged` is not sent/received correctly, `WorkerSchedulerStateResponse` is not processed too as the consequence for sure. It is just setting from `UNKNOWN` to `ALIVE` to bring the worker back. `WorkerSchedulerStateResponse` adds valid executors and drivers to the worker. And you are right that the drivers without worker will be deleted so it is not a regression. For executor, I don't see `completeRecovery` explicitly delete executors which are not of any worker. Although the workers in `UNKNOWN` status will be deleted, but their `executors` are not updated (because `WorkerSchedulerStateResponse` is not sent). I don't know how Spark will handle such case, but it seems not a regression too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
