pan3793 commented on PR #53840:
URL: https://github.com/apache/spark/pull/53840#issuecomment-3857666340
> I've added a new api in the `ExecutorPodsLifecycleManager` to call
`failureTracker.registerExecutorFailure` and pass the
`ExecutorPodsLifecycleManager` to the `ExecutorPodsAllocator`. This should now
get propagated as an executor failure.
@parthchandra, yeah, I think this is sufficient to fix your problem.
> Adds a retry for executor pod creation ...
This does not help in your case or, more generally, with any permanent error.
I would rather not add such logic, because:
- `ExecutorPodsAllocator` will continue to request new pods as long as the
number of pods has not reached the requested number, so a few transient pod
creation errors do not matter.
- I think `ExecutorFailureTracker` is designed to capture all kinds of
executor failures, e.g.
  - executor (pod on K8s, container on YARN) launch failures,
  - executor bootstrap failures, e.g., due to a wrong setup of env, network,
  or config,
  - executor running failures, e.g., due to OOM,
  - etc.
Without pod creation retry logic:
1) for permanent errors (your case), it fails fast;
2) for rare transient errors, it won't reach
`spark.executor.maxNumFailures`;
3) frequent transient errors usually indicate that your cluster is
overloaded or some services are unstable. In that case, the user should
either increase `spark.executor.maxNumFailures`, or let the app fail to
expose those potential issues.
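
To make the three cases concrete, here is a rough sketch of how a single failure threshold behaves without any retry layer. `SimpleFailureTracker` and `FailureScenarios` are hypothetical names used only for illustration; they are not Spark's actual `ExecutorFailureTracker`, which (as far as I recall) also expires old failures after a validity interval, and whose exact threshold comparison may differ from the `>=` assumed here.

```scala
// Hypothetical sketch, NOT Spark's real ExecutorFailureTracker.
// Assumption: the app fails once the failure count reaches maxNumFailures.
final class SimpleFailureTracker(maxNumFailures: Int) {
  private var failures = 0

  def registerExecutorFailure(): Unit = failures += 1

  def numFailedExecutors: Int = failures

  def shouldFailApp: Boolean = failures >= maxNumFailures
}

object FailureScenarios {
  // Simulates up to `attempts` pod creation attempts. `fails(i)` decides
  // whether attempt i fails. Returns (appFailed, attemptsMade).
  def run(maxNumFailures: Int, attempts: Int)(fails: Int => Boolean): (Boolean, Int) = {
    val tracker = new SimpleFailureTracker(maxNumFailures)
    var i = 0
    while (i < attempts && !tracker.shouldFailApp) {
      if (fails(i)) tracker.registerExecutorFailure()
      i += 1
    }
    (tracker.shouldFailApp, i)
  }
}
```

With a threshold of 3, `FailureScenarios.run(3, 100)(_ => true)` (a permanent error) stops after the third attempt, while two isolated failures in 100 attempts never trip the threshold. A failure on every fifth attempt does trip it, which is the "overloaded cluster" signal described in case 3.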
If you would really like to have separate configurations for pod creation
errors, maybe you could enhance `ExecutorFailureTracker` to accept a `kind`
parameter on `registerExecutorFailure`?
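
Something along these lines is what I have in mind for a `kind`-aware tracker. This is only a sketch: `FailureKind`, `KindAwareFailureTracker`, `defaultMax`, and `perKindMax` are all hypothetical names, and nothing like them exists in Spark today. A real change would also need to keep the existing behavior of `ExecutorFailureTracker` and wire the limits to configuration.

```scala
import scala.collection.mutable

// Hypothetical failure categories, for illustration only.
sealed trait FailureKind
object FailureKind {
  case object PodCreation extends FailureKind
  case object Bootstrap extends FailureKind
  case object Runtime extends FailureKind
}

// Hypothetical tracker: one overall limit plus optional per-kind limits.
final class KindAwareFailureTracker(
    defaultMax: Int,
    perKindMax: Map[FailureKind, Int] = Map.empty) {

  // Failure counts per kind; kinds not seen yet count as 0.
  private val counts =
    mutable.Map.empty[FailureKind, Int].withDefaultValue(0)

  def registerExecutorFailure(kind: FailureKind): Unit =
    counts(kind) += 1

  def numFailures(kind: FailureKind): Int = counts(kind)

  def totalFailures: Int = counts.values.sum

  // Fail when the overall limit or any per-kind limit is reached.
  def shouldFailApp: Boolean =
    totalFailures >= defaultMax ||
      perKindMax.exists { case (kind, max) => counts(kind) >= max }
}
```

For example, with `defaultMax = 10` and a pod creation limit of 2, one runtime failure leaves the app running, but two pod creation failures fail it regardless of the overall count.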
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]