attilapiros commented on code in PR #37821:
URL: https://github.com/apache/spark/pull/37821#discussion_r985016320
##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala:
##########
@@ -324,11 +324,24 @@ private[spark] class KubernetesClusterSchedulerBackend(
super.receiveAndReply(context)))
override def onDisconnected(rpcAddress: RpcAddress): Unit = {
Review Comment:
So this is tested by "Test decommissioning with dynamic allocation & shuffle
cleanups", right?
But that test is not even a decommissioning test:
https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala#L183
I am a bit confused but I try to understand where and how this is tested.
Could you please give me some pointers?
##########
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala:
##########
@@ -324,11 +324,24 @@ private[spark] class KubernetesClusterSchedulerBackend(
super.receiveAndReply(context)))
override def onDisconnected(rpcAddress: RpcAddress): Unit = {
- // Don't do anything besides disabling the executor - allow the
Kubernetes API events to
- // drive the rest of the lifecycle decisions
- // TODO what if we disconnect from a networking issue? Probably want to
mark the executor
- // to be deleted eventually.
- addressToExecutorId.get(rpcAddress).foreach(disableExecutor)
+ val execId = addressToExecutorId.get(rpcAddress)
+ execId match {
+ case Some(id) =>
+ executorsPendingDecommission.get(id) match {
+ case Some(host) =>
+ // We don't pass through the host because by convention the
+ // host is only populated if the entire host is going away
+ // and we don't know if that's the case or just one container.
Review Comment:
Can we mention this distinction at the `ExecutorDecommissionInfo` too?
There I found the followings:
```
* @param workerHost When workerHost is defined, it means the host (aka the
`node` or `worker`
* in other places) has been decommissioned too. Used to
infer if the
* shuffle data might be lost even if the external shuffle
service is enabled.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]