viirya commented on PR #44280:
URL: https://github.com/apache/spark/pull/44280#issuecomment-1848900769

   > In the EKS environment, the Master pod's network setup takes some time when
   > there are thousands of Worker pods. The Master's error message in the PR
   > description is evidence of what I mentioned. For example, in the 1k-worker
   > case, 40~50 workers frequently remain in UNKNOWN status.
   
   Hmm, so in the network-failure case, where the master cannot deliver
   `MasterChanged` to the worker, the worker can remain in `UNKNOWN` status on
   the master. Okay, that makes sense.
   
   > For the following question, we use a K8s service for the master, which
   > provides a mapping to the specific pod. And
   > `spark.worker.preferConfiguredMasterAddress=true` tells the Worker to
   > always use the service name.
   
   Since the service mapping works, the worker will still send `RegisterWorker`
   to the recovering master. For a worker in `UNKNOWN` status, changing its
   status from `UNKNOWN` to `ALIVE` gets it registered with the recovering
   master.
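   The state transition in question can be sketched as below. This is a
   simplified, hypothetical model, not Spark's actual `Master.scala` code; the
   names `RecoveryModel`, `WorkerState`, and `onRegisterWorker` are all
   illustrative.

```scala
// Illustrative sketch of the master-side worker state during recovery.
// At recovery start every persisted worker is marked Unknown; the normal path
// back to Alive is WorkerSchedulerStateResponse, but the change discussed
// here would also let a re-sent RegisterWorker revive the worker.
object RecoveryModel {
  sealed trait WorkerState
  case object Unknown extends WorkerState // set when recovery begins
  case object Alive extends WorkerState

  // The proposed behavior: a re-registration from an Unknown worker
  // flips it back to Alive instead of leaving it stuck.
  def onRegisterWorker(state: WorkerState): WorkerState = state match {
    case Unknown => Alive
    case s       => s
  }

  def main(args: Array[String]): Unit = {
    println(onRegisterWorker(Unknown)) // Alive
  }
}
```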
   
   But the worker doesn't send the expected `WorkerSchedulerStateResponse` to
   the master (because it never received `MasterChanged`). When the recovering
   master receives `WorkerSchedulerStateResponse`, it appears to perform some
   important steps, such as adding executors back to their applications.
   
   I'm wondering: is it okay to skip `WorkerSchedulerStateResponse` and all of
   these steps?
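   To make the concern concrete, here is a minimal sketch of the bookkeeping
   the recovering master would do when folding a worker's reported executors
   back into its app state. This is an assumption-laden model, not Spark's
   real message or API; `SchedulerStateModel`, `AppExecutors`, and the message
   shape are invented for illustration.

```scala
// Sketch: what a recovering master rebuilds from the worker's scheduler-state
// report, and what is silently lost if only a bare RegisterWorker arrives.
object SchedulerStateModel {
  // Simplified report: the executors a worker says it is running,
  // as (appId, executorId) pairs.
  final case class WorkerSchedulerStateResponse(workerId: String,
                                                executors: Seq[(String, Int)])

  // Master-side view: app -> executor ids believed to be running.
  type AppExecutors = Map[String, Set[Int]]

  // Normal recovery path: fold the reported executors back into the apps.
  def onSchedulerStateResponse(apps: AppExecutors,
                               msg: WorkerSchedulerStateResponse): AppExecutors =
    msg.executors.foldLeft(apps) { case (acc, (appId, execId)) =>
      acc.updated(appId, acc.getOrElse(appId, Set.empty) + execId)
    }

  def main(args: Array[String]): Unit = {
    val recovered = onSchedulerStateResponse(
      Map("app-1" -> Set.empty[Int]),
      WorkerSchedulerStateResponse("worker-1", Seq(("app-1", 0), ("app-1", 1))))
    println(recovered("app-1")) // Set(0, 1)
    // If the worker only re-sends RegisterWorker, this fold never runs and
    // "app-1" keeps an empty executor set -- the gap the question is about.
  }
}
```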
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

