tgravescs commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1810429020

   > ERROR cluster.YarnScheduler: Lost executor x on x.163.org: Container container_x on host:x.163.org was preempted.
   
   Preemption on YARN shouldn't be counting against the number of failed executors. If it is, then something has changed and we should fix that.
   
   ```
   case ContainerExitStatus.PREEMPTED =>
     // Preemption is not the fault of the running tasks, since YARN preempts containers
     // merely to do resource sharing, and tasks that fail due to preempted executors could
     // just as easily finish on any other executor.
   ```
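   
   Roughly, as an illustrative sketch (simplified, hypothetical names, not the actual Spark allocator code), the exit cause is what should gate whether the failure counter increments at all:
   
   ```scala
   import scala.collection.mutable.ArrayBuffer
   
   // Simplified failure tracker: only exits caused by the application count
   // toward failing the app; preempted/externally-killed executors are skipped.
   class SimpleFailureTracker(maxFailures: Int) {
     private val failureTimestamps = ArrayBuffer.empty[Long]
   
     def registerExecutorExit(exitCausedByApp: Boolean): Unit = {
       if (exitCausedByApp) {
         failureTimestamps += System.currentTimeMillis()
       }
     }
   
     def shouldFailApp: Boolean = failureTimestamps.size > maxFailures
   }
   ```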
   
   > K8s environment with Horizontal Pod Scheduler case
   
   Can you be more specific here? Why is this going to cause failures that aren't similar to YARN dynamic allocation getting more executors? Is it scaling down and marking the containers as failed, versus YARN marking them as preempted and not counting them against failures? Is there any way to know on K8s that this happened so we could avoid counting them? If this is really an issue, it seems like the feature should be off by default on K8s.
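   
   If K8s does expose something like the pod termination reason to the scheduler backend, then conceptually it would just be a check like this (purely a hypothetical sketch; the reason strings and where they would come from are assumptions, not something Spark surfaces today):
   
   ```scala
   // Hypothetical: decide whether a pod termination should count as an executor
   // "failure" for the purposes of the failure tracker.
   def countsAsFailure(terminationReason: String, exitCode: Int): Boolean = {
     terminationReason match {
       case "Evicted" | "Preempted" | "ScaledDown" => false // not the app's fault
       case _ => exitCode != 0                              // only real crashes count
     }
   }
   ```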
   
   
   > Let's take Spark Thrift Server and Spark Connect as examples, ADD JAR is 
an end-user user operation and the artifacts can be changed by themselves. 
Starting and maintaining the Server is for system admins. If the jar issue 
occurs here, shall we give enough time for admins to detect the issue and then 
gracefully reboot it to reduce the impact on other concurrent users?
   
   This is a consequence of using a shared environment. Ideally Spark would isolate users from one another so that other users wouldn't be affected, but that unfortunately isn't the case. I'm not sure about your environment, but ideally users test things before running them in a production environment and breaking things.
   
   If this feature doesn't really work or has issues on K8s, then there should be a way to disable it, which seems more like what you want here, right? You are essentially saying you don't want it to fail the application, and that you would rather do your own monitoring to catch issues.
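   
   For reference, something along these lines is what I'd expect for turning it off or loosening it, assuming the config names this feature introduces (spark.executor.maxNumFailures / spark.executor.failuresValidityInterval); please double-check against the docs in this PR:
   
   ```scala
   import org.apache.spark.SparkConf
   
   // Config names assumed from this feature; verify against the merged documentation.
   val conf = new SparkConf()
     .setAppName("shared-thrift-server")
     // A very large threshold effectively disables failing the app on executor failures.
     .set("spark.executor.maxNumFailures", Int.MaxValue.toString)
     // Only failures within this window are counted, so old failures age out.
     .set("spark.executor.failuresValidityInterval", "1h")
   ```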
   
   Note, the documentation on this feature is missing; I made some comments here: https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb Can you please fix those?
   

