tgravescs commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1812838085
> > Preemption on YARN shouldn't be going against the number of failed executors. If it is then something has changed and we should fix that.
>
> Yes, you are right
What do you mean by this? Are you saying the Spark on YARN handling of
preempted containers is not working properly? Meaning that if a container is
preempted it should not show up as an executor failure, but you are seeing
those preempted containers show up as failed?
Or are you saying that, yes, Spark on YARN does not mark preempted containers as failed?
> What does 'this feature' point to?
Sorry, I misunderstood your environment here. I thought you were running on
k8s, but it looks like you are running on YARN. By "feature" I mean the
spark.yarn.max.executor.failures/spark.executor.maxNumFailures config and its
functionality.
So unless YARN preemption handling is broken (please answer the question above),
you gave one very specific use case where a user added a bad JAR. In that use
case it seems like you just don't want spark.executor.maxNumFailures enabled at
all. You said you don't want the app to fail, so admins can come fix things up
without it affecting other users. If that is the case, then Spark should
allow users to turn spark.executor.maxNumFailures off, or I assume you could do
the same thing by setting it to Int.MaxValue.
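For example, a minimal sketch of what I mean by setting it to Int.MaxValue (2147483647) at submit time; the rest of the command is a placeholder, not part of your actual setup:

```shell
# Effectively disable the executor-failure limit by raising the
# threshold so high it can never trip. Illustrative only.
spark-submit \
  --master yarn \
  --conf spark.yarn.max.executor.failures=2147483647 \
  ...
```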
As implemented this seems very arbitrary, and I would think it would be hard for
a normal user to set and use this feature. You have it as a ratio, which
normally I'm in favor of, but that really only works if you have max executors
set, so it is really just a hardcoded number. That number seems arbitrary, as it
just depends on whether you get lucky and happen to have that many before some
user pushes a bad JAR. I don't understand why this isn't the same as the minimum
number of executors, as that seems more in line: saying you need some minimum
number for this application to run, and by the way it's OK to keep running even
while launching new executors is failing.
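To illustrate the point about the ratio, here is a hypothetical sketch (the names `failure_ratio` and `max_executors` are illustrative, not actual Spark configs): once max executors is fixed, the ratio just collapses to a hardcoded count, and without a max it has nothing to scale against.

```python
def failure_threshold(failure_ratio, max_executors=None):
    """Convert a failure ratio into an absolute executor-failure count.

    With no upper bound on executors, the ratio has nothing to scale
    against, so there is no effective limit at all.
    """
    if max_executors is None:
        return float("inf")  # unbounded: the ratio never trips
    return int(failure_ratio * max_executors)

# With max executors fixed, the ratio is just a hardcoded number:
print(failure_threshold(0.5, max_executors=100))  # → 50
```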
If there are other issues with Spark Connect and adding JARs, maybe that is
a different conversation about isolation
(https://issues.apache.org/jira/browse/SPARK-44146). Or maybe it needs to
better prevent users from adding jars with the same name.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]