tgravescs commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1810429020
> ERROR cluster.YarnScheduler: Lost executor x on x.163.org: Container container_x on host:x.163.org was preempted.

Preemption on YARN shouldn't count against the number of failed executors. If it does, then something has changed and we should fix that.
```
case ContainerExitStatus.PREEMPTED =>
  // Preemption is not the fault of the running tasks, since YARN preempts containers
  // merely to do resource sharing, and tasks that fail due to preempted executors could
  // just as easily finish on any other executor.
```
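To make the policy in that snippet concrete, here is a minimal, hypothetical Scala sketch. The type names (`ExitReason`, `FailureTracker`, `countsAsFailure`) are my own illustrations, not Spark's actual allocator code: the point is only that a preempted container is classified as a non-failure, while a genuine crash counts toward a max-executor-failures style limit.

```scala
// Hypothetical sketch, not Spark's real allocator code: illustrates the
// policy that preempted containers must not count toward the
// failed-executor limit.
sealed trait ExitReason
case object Succeeded extends ExitReason // container exited cleanly
case object Preempted extends ExitReason // YARN reclaimed it for resource sharing
case object Crashed   extends ExitReason // genuine failure (OOM, non-zero exit, ...)

object FailureTracker {
  // Only genuine crashes count against the failure limit.
  def countsAsFailure(reason: ExitReason): Boolean = reason match {
    case Succeeded | Preempted => false
    case Crashed               => true
  }

  def main(args: Array[String]): Unit = {
    val exits = Seq(Preempted, Crashed, Preempted, Succeeded)
    val failures = exits.count(countsAsFailure)
    // Only the single Crashed container is counted.
    println(s"failed executors: $failures")
  }
}
```

Under this classification, an application that loses many executors to preemption never approaches the failure limit, which is the behavior the comment says YARN already has and K8s would need an equivalent of.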
> K8s environment with Horizontal Pod Scheduler case
Can you be more specific here? Why would this cause failures that aren't similar to YARN dynamic allocation acquiring more executors? Is it that scale-down marks the containers as failed, whereas YARN marks them as preempted and doesn't count them against failures? Is there any way to know on K8s that this happened, so we could avoid counting them? If this really is an issue, it seems like the feature should be off by default on K8s.
> Let's take Spark Thrift Server and Spark Connect as examples, ADD JAR is
an end-user user operation and the artifacts can be changed by themselves.
Starting and maintaining the Server is for system admins. If the jar issue
occurs here, shall we give enough time for admins to detect the issue and then
gracefully reboot it to reduce the impact on other concurrent users?
This is a consequence of using a shared environment. Ideally Spark would isolate users from each other so that other users weren't affected, but unfortunately that isn't the case. I don't know your environment, but ideally users test things before running them in a production environment and breaking things.
If this feature doesn't really work or has issues on K8s, then there should be a way to disable it, which seems like more what you want here, right? You are essentially saying you don't want it to fail the application, and that you would rather do your own monitoring to catch issues.
Note: the documentation for this feature is missing. I made some comments here:
https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb
Can you please fix those?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]