attilapiros opened a new pull request #30675:
URL: https://github.com/apache/spark/pull/30675
### What changes were proposed in this pull request?
Missing POD detection is extended with a timestamp-based (and time-limit-based)
check to avoid wrongly reporting PODs as missing.
The two new timestamps:
- `fullSnapshotTs` is introduced for the `ExecutorPodsSnapshot`; it is only
updated by the pod polling snapshot source
- `registrationTs` is introduced for the `ExecutorData`; it is initialized
when the executor registers at the scheduler backend
Moreover, a new config `spark.kubernetes.executor.missingPodDetectDelta` is
used to specify the accepted delta between the two timestamps.
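
To illustrate how the two timestamps and the delta could fit together, here is a
minimal sketch. It is not the code from this PR: `Snapshot`,
`missingExecutors` and the `missingPodDetectDeltaMs` parameter are hypothetical
stand-ins, and the sketch assumes that `fullSnapshotTs == 0` means no full
snapshot has been polled yet and that all times are in milliseconds.

```scala
// Hypothetical, simplified model; these are not the actual Spark classes.
object MissingPodDetectionSketch {

  // Stand-in for ExecutorPodsSnapshot: the executor IDs present in the snapshot
  // and the time of the last full poll (assumption: 0 means "no full poll yet").
  final case class Snapshot(executorIds: Set[Long], fullSnapshotTs: Long)

  // Returns the executor IDs that may be reported as missing.
  def missingExecutors(
      registrationTs: Map[Long, Long], // executor ID -> registration time (ms)
      snapshot: Snapshot,
      missingPodDetectDeltaMs: Long): Set[Long] = {
    if (snapshot.fullSnapshotTs == 0L) {
      // Only single-POD watch updates were seen so far: nothing can be concluded.
      Set.empty
    } else {
      registrationTs.collect {
        // Missing only if absent from the full snapshot AND registered long
        // enough before that snapshot was taken.
        case (id, regTs)
            if !snapshot.executorIds.contains(id) &&
              regTs + missingPodDetectDeltaMs < snapshot.fullSnapshotTs => id
      }.toSet
    }
  }
}
```

In this sketch, an executor registered shortly before (or after) the last full
poll is never flagged, which is the situation the accepted delta is meant to
cover.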
### Why are the changes needed?
Watching a POD (`ExecutorPodsWatchSnapshotSource`) only informs about single
POD changes. This could lead the executor POD lifecycle manager to wrongly
detect missing PODs (PODs known by the scheduler backend but missing from the
POD snapshots).
A key indicator of this error is seeing this log message:
> "The executor with ID [some_id] was not found in the cluster but we didn't
get a reason why. Marking the executor as failed. The executor may have been
deleted but the driver missed the deletion event."
So one problem is that the missing POD detection check runs even when only a
single POD has changed, without a full, consistent snapshot of all the
PODs (see `ExecutorPodsPollingSnapshotSource`).
The other problem could be a race between the executor POD lifecycle
manager and the scheduler backend: even with a full snapshot, the
registration at the scheduler backend could precede the snapshot polling
(and the processing of those polled snapshots).
### Does this PR introduce _any_ user-facing change?
Yes. When a POD is missing, the reason message explaining the
executor's exit is extended with both timestamps (the polling time and the
executor registration time), and the new config is mentioned as well.
### How was this patch tested?
The existing unit tests are extended.