[ https://issues.apache.org/jira/browse/SPARK-34519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18033975#comment-18033975 ]
Oleksiy Dyagilev commented on SPARK-34519:
------------------------------------------
In our production deployment, we run multiple Spark applications simultaneously
on Kubernetes. I would like to implement a backoff mechanism to protect the
control plane from being overloaded with continuous pod allocation requests
when pods fail to start (e.g. due to Istio sidecar problems or other
initialization issues). We are aware that Spark 3.2 introduced
"spark.kubernetes.allocation.maxPendingPods", which can partially help by
limiting the number of concurrently pending pods. However, the downside of that
approach is that it increases startup time even when cluster conditions are
normal.
[~dongjoon], could you please review the draft proposal below and let me know if
you have better suggestions?
-----------------
Introduce a dynamic *maxPendingLimit* that automatically adjusts based on
system health.
Introduce three states (a rough code sketch of this state machine is included after the diagram below):
*1. Normal State*
Initial state. The system operates with standard behavior.
* *Transition to Backoff*: If *failureThreshold* failures occur within the
*failureInterval* (both configurable), transition to Backoff state.
*2. Backoff State*
* Set *maxPendingLimit* to *initialMaxPending* (configurable, default: 1)
* Track executor launch attempts with this reduced limit:
** *On failure*: Apply exponential backoff before the next attempt
*** Exponential backoff sequence: *initialDelay* × 2^n, capped at
*maxBackoffDelay*. E.g. 5s → 10s → 20s → 40s → ... → 5min → 5min
(*initialDelay*=5s, *maxBackoffDelay*=5min)
*** Continue attempting with the maximum delay until successful
** *On success*: Transition to Recovery state
*3. Recovery State*
* Gradually increase *maxPendingLimit* linearly using *maxPendingIncreaseStep*
(configurable), respecting the upper limit defined by the existing
'spark.kubernetes.allocation.maxPendingPods'.
* Increase *maxPendingLimit* only after all executors at the current limit
succeed
* *On any failure*: Immediately revert to Backoff state
* *Transition to Normal*: When the target number of executors is reached,
i.e. all requested executors are allocated (any better ideas here?)
!spark_backoff.png|width=553,height=195!
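For illustration, here is a minimal, self-contained Scala sketch of the state machine described above. It is not based on the actual ExecutorPodsAllocator code; the class and method names (BackoffPolicy, onExecutorFailure, onBatchSuccess) are made up for this sketch, and the constructor parameters simply mirror the draft configs listed in the table below.
{code:scala}
import scala.concurrent.duration._

// States of the proposed allocation state machine.
sealed trait AllocState
case object Normal extends AllocState
case object Backoff extends AllocState
case object Recovery extends AllocState

// Sketch only: names and structure are hypothetical, not existing Spark code.
class BackoffPolicy(
    failureThreshold: Int = 2,
    failureInterval: FiniteDuration = 10.minutes,
    initialDelay: FiniteDuration = 5.seconds,
    maxBackoffDelay: FiniteDuration = 5.minutes,
    initialMaxPending: Int = 1,
    maxPendingIncreaseStep: Int = 2,
    maxPendingPods: Int = Int.MaxValue) {

  private var state: AllocState = Normal
  private var recentFailures: List[Long] = Nil  // failure timestamps in ms
  private var backoffAttempt: Int = 0           // n in initialDelay * 2^n

  // Effective pending-pod limit the allocator would respect.
  var maxPendingLimit: Int = maxPendingPods

  // Exponential backoff delay: initialDelay * 2^n, capped at maxBackoffDelay.
  def nextDelay: FiniteDuration = {
    val factor = math.pow(2, math.min(backoffAttempt, 20)).toLong
    val candidate = initialDelay * factor
    if (candidate > maxBackoffDelay) maxBackoffDelay else candidate
  }

  def onExecutorFailure(nowMs: Long): Unit = state match {
    case Normal =>
      // Count failures inside the sliding failureInterval window.
      recentFailures =
        nowMs :: recentFailures.filter(nowMs - _ <= failureInterval.toMillis)
      if (recentFailures.size >= failureThreshold) {
        state = Backoff
        maxPendingLimit = initialMaxPending
        backoffAttempt = 0
      }
    case Backoff =>
      backoffAttempt += 1          // next attempt waits nextDelay
    case Recovery =>
      state = Backoff              // any failure reverts to Backoff
      maxPendingLimit = initialMaxPending
      backoffAttempt = 0
  }

  // Called when all executors requested under the current limit have started.
  def onBatchSuccess(targetReached: Boolean): Unit = state match {
    case Backoff =>
      state = Recovery
    case Recovery if targetReached =>
      state = Normal               // all requested executors are allocated
      maxPendingLimit = maxPendingPods
      recentFailures = Nil
    case Recovery =>
      // Grow the limit linearly, never above maxPendingPods.
      maxPendingLimit =
        math.min(maxPendingLimit + maxPendingIncreaseStep, maxPendingPods)
    case Normal => ()
  }
}
{code}
The intent is that a single probing pod plus a capped delay keeps pressure on the API server roughly constant while the cluster is unhealthy, instead of growing with the target executor count.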
*New configs:*
|*Config*|*Description (draft version)*|*Type*|*Default value*|
|spark.kubernetes.allocation.executor.backoff.enabled|Enable the backoff mechanism|Boolean|false|
|spark.kubernetes.allocation.executor.backoff.failureThreshold|Number of executor failures that triggers the Backoff state|Integer|2|
|spark.kubernetes.allocation.executor.backoff.failureInterval|Time window for counting failures|Duration|10m|
|spark.kubernetes.allocation.executor.backoff.initialDelay|Starting delay for exponential backoff|Duration|5s|
|spark.kubernetes.allocation.executor.backoff.maxBackoffDelay|Maximum exponential backoff delay cap|Duration|5m|
|spark.kubernetes.allocation.executor.backoff.initialMaxPending|Initial maximum number of pending executors in the Backoff state|Integer|1|
|spark.kubernetes.allocation.executor.backoff.maxPendingIncreaseStep|Linear increment step for maxPendingLimit in the Recovery state|Integer|2|
> ExecutorPodsAllocator should use an exponential backoff strategy when executor
> pod requests fail
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-34519
> URL: https://issues.apache.org/jira/browse/SPARK-34519
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 3.0.1
> Environment: spark 3.0.1
> kubernetes 1.18.8
> Reporter: Fengyu Cao
> Priority: Minor
> Attachments: spark_backoff.png
>
>
> # create a resource quota: `kubectl create quota test --hard=cpu=20,memory=60G`
> # submit an application requesting more than the quota: `spark-submit
> --executor-cores 5 --executor-memory 10G --num-executors 10 <spark-pi.py>`
> # `ExecutorPodsAllocator: Going to request 5 executors from Kubernetes.` seems
> to be printed every second
> The default of `spark.kubernetes.allocation.batch.delay` is 1s, which is good
> enough when allocation succeeds, but exponential backoff may be a better choice
> when allocation fails.