pan3793 commented on PR #53840:
URL: https://github.com/apache/spark/pull/53840#issuecomment-3857666340

   > I've added a new api in the `ExecutorPodsLifecycleManager` to call 
`failureTracker.registerExecutorFailure` and pass the 
`ExecutorPodsLifecycleManager` to the `ExecutorPodsAllocator`. This should now 
get propagated as an executor failure.
   
   @parthchandra, yeah, I think this is sufficient to fix your problem.
   
   > Adds a retry for executor pod creation ...
   
   This does not help in your case or, more generally, with any permanent 
error. I would rather not add such logic, because:
   
   - `ExecutorPodsAllocator` will continue to request new pods until the 
number of pods reaches the requested number, so a few transient pod creation 
errors do not matter.
   
   - I think `ExecutorFailureTracker` is designed to capture all kinds of 
executor failures, e.g.
     - executor (pod on K8s, container on YARN) launch failures,
     - executor bootstrap failures, e.g., due to wrong setup of env, network, 
or config,
     - executor running failures, e.g., due to OOM,
     - etc.
     
     Without pod creation retry logic:
       1) for permanent errors (your case), it fails fast;
       2) for rare transient errors, the failure count won't reach 
`spark.executor.maxNumFailures`;
       3) for frequent transient errors, which usually indicate that your 
cluster is overloaded or some services are unstable, the user should 
either increase `spark.executor.maxNumFailures` or let the app fail to 
expose those potential issues.
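
   To make the three cases above concrete, here is a minimal sketch of counting failures against a threshold. `SimpleFailureTracker` and `FailureScenarios` are hypothetical, simplified stand-ins, not Spark's actual `ExecutorFailureTracker`; the sketch assumes the app is failed once the count exceeds the limit and ignores any failure validity interval.

```scala
// Simplified, hypothetical model of failure counting.
// This is NOT Spark's actual ExecutorFailureTracker.
final class SimpleFailureTracker(maxNumFailures: Int) {
  private var failures = 0

  // Record one executor failure of any kind (launch, bootstrap, or runtime).
  def registerExecutorFailure(): Unit = failures += 1

  def numFailedExecutors: Int = failures

  // Assumption: the app is failed once the count exceeds the threshold.
  def shouldFailApp: Boolean = failures > maxNumFailures
}

object FailureScenarios {
  // Simulate `attempts` pod creation attempts; `fails(i)` says whether attempt i fails.
  // Returns the attempt at which the app would be failed, or None if it survives.
  def run(maxNumFailures: Int, attempts: Int)(fails: Int => Boolean): Option[Int] = {
    val tracker = new SimpleFailureTracker(maxNumFailures)
    (1 to attempts).find { i =>
      if (fails(i)) tracker.registerExecutorFailure()
      tracker.shouldFailApp
    }
  }

  // 1) Permanent error: every attempt fails, so the app fails fast.
  val permanent: Option[Int] = run(maxNumFailures = 3, attempts = 100)(_ => true)

  // 2) Rare transient error: one failure per 50 attempts never crosses the threshold.
  val rareTransient: Option[Int] = run(maxNumFailures = 3, attempts = 100)(_ % 50 == 0)

  // 3) Frequent transient error: every other attempt fails, so the threshold is crossed.
  val frequentTransient: Option[Int] = run(maxNumFailures = 3, attempts = 100)(_ % 2 == 0)
}
```

   With a limit of 3, the permanent case stops at the 4th attempt, the rare transient case never stops, and the frequent transient case stops at the 8th attempt, all without any retry logic in the allocator.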
   
   If you really want separate configurations for pod creation errors, 
maybe you can enhance the `ExecutorFailureTracker` to accept a `kind` argument 
on `registerExecutorFailure`?
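
   A rough sketch of what a kind-aware tracker could look like follows. Everything here is hypothetical: `FailureKind`, its cases, `KindAwareFailureTracker`, and the per-kind limit map are illustrative names, not existing Spark APIs or configurations.

```scala
import scala.collection.mutable

// Hypothetical failure kinds; the names are illustrative only.
sealed trait FailureKind
case object PodCreation extends FailureKind
case object Bootstrap extends FailureKind
case object Runtime extends FailureKind

// Hypothetical kind-aware tracker; NOT Spark's actual ExecutorFailureTracker.
final class KindAwareFailureTracker(
    defaultMax: Int,
    perKindMax: Map[FailureKind, Int] = Map.empty[FailureKind, Int]) {

  private val counts = mutable.Map.empty[FailureKind, Int].withDefaultValue(0)

  // Record one failure, tagged with its kind.
  def registerExecutorFailure(kind: FailureKind): Unit = counts(kind) += 1

  def numFailures(kind: FailureKind): Int = counts(kind)

  def totalFailures: Int = counts.values.sum

  // Fail if a kind with a dedicated limit exceeds it, or if the kinds
  // without a dedicated limit together exceed the default limit.
  def shouldFailApp: Boolean = {
    val (limited, shared) = counts.toMap.partition { case (k, _) => perKindMax.contains(k) }
    limited.exists { case (k, n) => n > perKindMax(k) } || shared.values.sum > defaultMax
  }
}

object KindDemo {
  // Pod creation failures get their own, more lenient limit.
  val tracker = new KindAwareFailureTracker(
    defaultMax = 2,
    perKindMax = Map[FailureKind, Int](PodCreation -> 5))

  (1 to 5).foreach(_ => tracker.registerExecutorFailure(PodCreation))
  val afterPodFailures: Boolean = tracker.shouldFailApp // 5 is not above the limit of 5

  (1 to 3).foreach(_ => tracker.registerExecutorFailure(Runtime))
  val afterRuntimeFailures: Boolean = tracker.shouldFailApp // 3 exceeds the default of 2
}
```

   Keeping a single tracker with a `kind` tag means all failure sources still flow through one place, while a per-kind limit remains optional for users who want it.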


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

