Github user ryan-williams commented on a diff in the pull request:
https://github.com/apache/spark/pull/9147#discussion_r42271853
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -62,10 +62,23 @@ private[spark] class ApplicationMaster(
.asInstanceOf[YarnConfiguration]
private val isClusterMode = args.userClass != null
- // Default to numExecutors * 2, with minimum of 3
- private val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures",
- sparkConf.getInt("spark.yarn.max.worker.failures",
- math.max(sparkConf.getInt("spark.executor.instances", 0) * 2, 3)))
+ // Default to numExecutors * 2 (maxExecutors in the case that we are
+ // dynamically allocating executors), with minimum of 3.
+ private val maxNumExecutorFailures =
+ sparkConf.getInt("spark.yarn.max.executor.failures",
+ sparkConf.getInt("spark.yarn.max.worker.failures",
+ math.max(
+ 3,
+ 2 * sparkConf.getInt(
+ if (Utils.isDynamicAllocationEnabled(sparkConf))
+ "spark.dynamicAllocation.maxExecutors"
--- End diff --
To be clear, this change does not require a user to set `maxExecutors` in
order to get sane dynamic allocation (DA) default behavior.
It merely removes one "gotcha" that caused me some trouble this week: with
only the standard DA params set, the `val maxNumExecutorFailures` here
defaults to `3`, which does not seem sensible for apps that scale up to many
hundreds of executors.
It seems to me that the existing
`math.max(sparkConf.getInt("spark.executor.instances", 0) * 2, 3)` expression
does not _intentionally_ limit DA apps to `3` failures; it simply doesn't
account for the fact that `spark.executor.instances` is not set in DA mode.
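
To make the difference concrete, here is a minimal sketch of the old and new default computations. It uses a plain `Map` in place of `SparkConf`, and the names `MaxFailuresSketch`, `getInt`, `oldDefault`, and `newDefault` are hypothetical helpers for illustration only. The DA check is a simplified assumption (the real `Utils.isDynamicAllocationEnabled` may test more), and the fallback of `0` for a missing `maxExecutors` is also an assumption, since the diff above is truncated before that argument.

```scala
object MaxFailuresSketch {
  // Hypothetical stand-in for SparkConf.getInt, backed by a plain Map.
  def getInt(conf: Map[String, String], key: String, default: Int): Int =
    conf.get(key).map(_.toInt).getOrElse(default)

  // Simplified assumption: DA is on iff spark.dynamicAllocation.enabled=true.
  def isDynamicAllocationEnabled(conf: Map[String, String]): Boolean =
    conf.get("spark.dynamicAllocation.enabled").exists(_.toBoolean)

  // Old default: always keyed on spark.executor.instances.
  def oldDefault(conf: Map[String, String]): Int =
    math.max(getInt(conf, "spark.executor.instances", 0) * 2, 3)

  // New default: keyed on maxExecutors when DA is enabled.
  // The fallback of 0 is an assumption, not taken from the diff.
  def newDefault(conf: Map[String, String]): Int = {
    val key =
      if (isDynamicAllocationEnabled(conf)) "spark.dynamicAllocation.maxExecutors"
      else "spark.executor.instances"
    math.max(3, 2 * getInt(conf, key, 0))
  }
}
```

Under these assumptions, a DA app with `spark.dynamicAllocation.maxExecutors=500` gets a limit of `3` from the old expression and `1000` from the new one, while a DA app that leaves `maxExecutors` unset still gets `3` from both.
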
It's true that we could also "resolve" this by declaring
`spark.yarn.max.worker.failures` to be yet another configuration param that
must be set to a non-default value in order to get sane DA behavior.
Off the top of my head, there is already one param
(`spark.shuffle.service.enabled=true`) whose name does not suggest that DA
apps need to set it, and we could make `spark.yarn.max.worker.failures` a
second.
My belief is that it would be better not to require yet another parameter for
sane DA behavior (especially one whose name gives no hint that it matters for
keeping DA apps from failing in unexpected ways), and instead to just fix the
default value here, which was clearly missed inadvertently.