This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.2 by this push:
new 1d8b556 [SPARK-38023][CORE] `ExecutorMonitor.onExecutorRemoved`
should handle `ExecutorDecommission` as finished
1d8b556 is described below
commit 1d8b556bd600baaed23fe700b60bddb2bf7cfc5f
Author: Dongjoon Hyun <[email protected]>
AuthorDate: Tue Jan 25 14:34:56 2022 -0800
[SPARK-38023][CORE] `ExecutorMonitor.onExecutorRemoved` should handle
`ExecutorDecommission` as finished
Although SPARK-36614 (https://github.com/apache/spark/pull/33868) fixed the
UI issue, it introduced a regression that broke the `K8s integration test` and
shows wrong metrics and messages to users. After `Finished decommissioning`,
the executor removal is still counted as `unfinished`. This PR aims to fix
this bug.
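To make the miscount concrete, here is a minimal, self-contained Scala sketch
of the reason check this commit relaxes. The object and method names, and the
second message string, are illustrative placeholders rather than the actual
Spark source; only "Finished decommissioning" is taken from the driver log
below.
```
// Illustrative sketch, not the real ExecutorMonitor code.
object DecommissionAccounting {

  // Removal reasons that should count as a graceful ("finished") decommission.
  // The first string matches the driver log shown below; the second is an
  // assumed placeholder for the generic decommission message.
  private val gracefulReasons = Set(
    "Finished decommissioning",
    "Executor decommission."
  )

  // Returns the metric bucket a removed, decommissioning executor falls into.
  def bucket(reason: String): String =
    if (gracefulReasons.contains(reason)) "gracefully decommissioned"
    else "decommission unfinished"

  def main(args: Array[String]): Unit = {
    // Before the fix, only the first reason was treated as graceful, so the
    // second one was miscounted as unfinished.
    println(bucket("Finished decommissioning")) // gracefully decommissioned
    println(bucket("Executor decommission."))   // graceful only after this fix
  }
}
```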
**BEFORE**
```
22/01/25 13:05:16 DEBUG
KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:
Asked to remove executor 1 with reason Finished decommissioning
...
22/01/25 13:05:16 INFO ExecutorMonitor: Executor 1 is removed.
Remove reason statistics: (gracefully decommissioned: 0, decommision
unfinished: 1, driver killed: 0, unexpectedly exited: 0).
```
**AFTER**
```
Remove reason statistics: (gracefully decommissioned: 1, decommision
unfinished: 0, driver killed: 0, unexpectedly exited: 0).
```
```
$ build/sbt -Pkubernetes -Pkubernetes-integration-tests
-Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
-Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
```
**BEFORE**
The corresponding test case hangs and fails.
```
[info] KubernetesSuite:
...
[info] *** Test still running after 2 minutes, 13 seconds: suite name:
KubernetesSuite, test name: Test decommissioning with dynamic allocation &
shuffle cleanups.
// Eventually fails
...
```
**AFTER**
```
[info] KubernetesSuite:
...
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2
minutes, 41 seconds)
...
```
Yes, this is a regression bug fix.
Tested manually, because this should be verified via the K8s integration test:
```
$ build/sbt -Pkubernetes -Pkubernetes-integration-tests
-Dspark.kubernetes.test.dockerFile=resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile.java17
-Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
```
Closes #35321 from dongjoon-hyun/SPARK-38023.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 9887d0f7f55157da1b9f55d7053cc6c78ea3cdc5)
Signed-off-by: Dongjoon Hyun <[email protected]>
---
.../scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git
a/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
b/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
index acdfaa5..8ce24aa 100644
---
a/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
+++
b/core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
@@ -355,7 +355,8 @@ private[spark] class ExecutorMonitor(
if (removed != null) {
decrementExecResourceProfileCount(removed.resourceProfileId)
if (removed.decommissioning) {
- if (event.reason == ExecutorLossMessage.decommissionFinished) {
+ if (event.reason == ExecutorLossMessage.decommissionFinished ||
+ event.reason == ExecutorDecommission().message) {
metrics.gracefullyDecommissioned.inc()
} else {
metrics.decommissionUnfinished.inc()
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]