attilapiros opened a new pull request #31195:
URL: https://github.com/apache/spark/pull/31195


   ### What changes were proposed in this pull request?
   
   Missing POD detection is extended with a timestamp-based (and 
time-limit-based) check to avoid wrongly reporting PODs as missing.
   
   The two new timestamps:
   - `fullSnapshotTs` is introduced for `ExecutorPodsSnapshot`; it is only 
updated by the pod polling snapshot source
   - `registrationTs` is introduced for `ExecutorData`; it is 
initialized when the executor registers at the scheduler backend
   
   Moreover, a new config `spark.kubernetes.executor.missingPodDetectDelta` is 
used to specify the accepted delta between the two timestamps.
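   
   The check described above could be sketched roughly as follows. This is a 
minimal, self-contained illustration and not the code from this PR: the types 
`PodsSnapshot` and `RegisteredExecutor`, the helper `missingExecutors`, the 
direction of the subtraction, and the convention that `fullSnapshotTs == 0` 
means "no full snapshot yet" are all assumptions made for the example.
   
```scala
// Hypothetical, simplified stand-ins for illustration; not the actual Spark classes.
final case class PodsSnapshot(podExecutorIds: Set[Long], fullSnapshotTs: Long)
final case class RegisteredExecutor(id: Long, registrationTs: Long)

object MissingPodCheck {
  /**
   * Returns the IDs of executors that are registered at the scheduler backend,
   * absent from the snapshot, and registered more than `deltaMs` before the
   * last full snapshot was taken.
   */
  def missingExecutors(
      snapshot: PodsSnapshot,
      registered: Seq[RegisteredExecutor],
      deltaMs: Long): Seq[Long] = {
    if (snapshot.fullSnapshotTs == 0L) {
      // Assumption: 0 means no full (polled) snapshot has been seen yet,
      // so nothing can be declared missing.
      Seq.empty
    } else {
      registered.collect {
        case e if !snapshot.podExecutorIds.contains(e.id) &&
            snapshot.fullSnapshotTs - e.registrationTs > deltaMs =>
          e.id
      }
    }
  }
}
```
   
   With a full snapshot taken at 100000 ms that contains only executor 1 and a 
delta of 30000 ms, an absent executor registered at 10000 ms would be reported 
as missing, while an absent executor registered at 90000 ms would not, because 
its registration is too recent relative to the snapshot.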
   
   ### Why are the changes needed?
   
   Watching a POD (`ExecutorPodsWatchSnapshotSource`) only informs about 
changes to a single POD. This could wrongly lead the executor POD lifecycle 
manager to detect missing PODs (PODs known by the scheduler backend but 
missing from the POD snapshots).
   
   A key indicator of this error is seeing this log message:
   
   > "The executor with ID [some_id] was not found in the cluster but we didn't 
get a reason why. Marking the executor as failed. The executor may have been 
deleted but the driver missed the deletion event."
   
   So one problem is that the missing POD detection check runs even when only 
a single POD has changed, without a full, consistent snapshot of all the 
PODs (see `ExecutorPodsPollingSnapshotSource`).
   The other problem could be a race between the executor POD lifecycle 
manager and the scheduler backend: even with a full snapshot, the 
registration at the scheduler backend could precede the snapshot polling 
(and the processing of those polled snapshots).
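   
   The difference between the two snapshot sources can be modelled with a 
small sketch. It is an assumption-laden illustration, not Spark's actual 
`ExecutorPodsSnapshot` API: the `Snapshot` type, `withWatchUpdate`, `fromPoll` 
and the string POD states are hypothetical names chosen for the example. The 
point is that a watch event changes one POD and leaves the full-snapshot 
timestamp untouched, while a polling result replaces the whole state and 
stamps it.
   
```scala
// Hypothetical model for illustration; not Spark's actual snapshot classes.
final case class Snapshot(pods: Map[Long, String], fullSnapshotTs: Long) {
  // A watch event carries a single POD: merge it into the known state and
  // keep the timestamp of the last full snapshot unchanged.
  def withWatchUpdate(execId: Long, state: String): Snapshot =
    copy(pods = pods.updated(execId, state))
}

object Snapshot {
  // Before any polling has happened there is no full snapshot (timestamp 0).
  val empty: Snapshot = Snapshot(Map.empty, 0L)

  // A polling result lists all PODs, so it replaces the state entirely and
  // records when the full listing was obtained.
  def fromPoll(pods: Map[Long, String], nowMs: Long): Snapshot =
    Snapshot(pods, nowMs)
}
```
   
   In this model a snapshot built only from watch events keeps 
`fullSnapshotTs` at 0, so a check gated on that timestamp would not treat the 
absence of other PODs as evidence that they are missing.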
   
   ### Does this PR introduce any user-facing change?
   
   Yes. When a POD is missing, the reason message explaining the 
executor's exit is extended with both timestamps (the polling time and the 
executor registration time), and the new config is mentioned as well.
   
   ### How was this patch tested?
   
   The existing unit tests are extended.
   
   (cherry picked from commit 6bd7a6200f8beaab1c68b2469df05870ea788d49)

