viirya commented on PR #44280: URL: https://github.com/apache/spark/pull/44280#issuecomment-1848900769
> In the EKS environment, the Master pod's network setup takes some time when there are thousands of Worker pods. The Master's error message in the PR description is evidence of what I mentioned. For example, in the 1k-worker case, 40~50 workers frequently remain in `UNKNOWN` status.

Hmm, so in a network-failure case where the master cannot send `MasterChanged` correctly to the worker, the worker can remain in `UNKNOWN` status on the master. Okay, that makes sense.

> For the following question, we use a K8s service for the master which provides a mapping to the specific pod. And `spark.worker.preferConfiguredMasterAddress=true` tells the Worker to always use the service name.

As long as the service mapping works, the worker will still send `RegisterWorker` to the recovering master. For a worker in `UNKNOWN` status, changing its status from `UNKNOWN` to `ALIVE` can get it registered with the recovering master. But the worker doesn't send `WorkerSchedulerStateResponse` to the master as expected (because it doesn't receive `MasterChanged` correctly). When the recovering master receives `WorkerSchedulerStateResponse`, it looks like it has some important steps to do, such as adding executors back to the application. I'm wondering: is it okay to skip `WorkerSchedulerStateResponse` and all of those steps?
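To make the concern concrete, here is a minimal toy model of the recovery handshake being discussed. All names (`RecoveringMaster`, `WorkerState`, the handler methods) are illustrative assumptions, not Spark's actual `Master` internals; the point is only to show the two paths a restored `UNKNOWN` worker can take, and that the `RegisterWorker`-only path skips the work the `WorkerSchedulerStateResponse` handler would normally do.

```python
from enum import Enum


class WorkerState(Enum):
    UNKNOWN = "UNKNOWN"  # restored from persisted state, not yet re-registered
    ALIVE = "ALIVE"      # worker has re-registered with the recovering master


class RecoveringMaster:
    """Toy model of a standalone master recovering its worker list.

    Illustrative sketch only -- this does not mirror Spark's real Master code.
    """

    def __init__(self, persisted_workers):
        # Workers restored during recovery start as UNKNOWN until they respond.
        self.workers = {w: WorkerState.UNKNOWN for w in persisted_workers}

    def on_worker_scheduler_state_response(self, worker_id):
        # Normal path: the worker replied to MasterChanged. In the real master
        # this handler also re-attaches the worker's executors to their
        # applications; here we only model the state flip.
        self.workers[worker_id] = WorkerState.ALIVE

    def on_register_worker(self, worker_id):
        # The path in question: MasterChanged was lost, so the worker re-sends
        # RegisterWorker (reaching the new master via the stable K8s service
        # address). Flipping UNKNOWN -> ALIVE here registers the worker, but
        # the executor re-attachment step above never runs.
        if self.workers.get(worker_id) is WorkerState.UNKNOWN:
            self.workers[worker_id] = WorkerState.ALIVE
```

Both paths leave the worker `ALIVE`, but only the first carries the scheduler-state bookkeeping, which is exactly why skipping `WorkerSchedulerStateResponse` is worth questioning.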
