abhishekd0907 commented on pull request #35683:
URL: https://github.com/apache/spark/pull/35683#issuecomment-1083126883
> I am not sure I follow. In k8s, where there is no shuffle service - it
makes sense to move shuffle blocks when executor is getting decommissioned. In
yarn, particularly with dynamic resource allocation (but even without it), it
is very common for executors to exit - while shuffle service continues to serve
the shuffle blocks generated on that node, which might have no active executors
for application on them. You can have quite wild swings in executor allocations
(ramp up to 1000s of executors in 10s of seconds, and ramp down to very low
number equally fast).
>
> We should not be moving shuffle blocks when an executor is exiting - only
when the node is actually getting decommissioned (they might be the same for
k8s, not for yarn).
@mridulm
I have followed the approach discussed in this [Design
DOC](https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit?usp=sharing)
for this PR, and it does not mention any risks in marking executors that are
running on decommissioning nodes as decommissioning, or in migrating their
shuffle blocks to executors on other nodes. Please let me know if I am missing
something. Below is the relevant discussion for YARN:
> application master receives the pre-emption/decommission message first and
> needs to send the message to the executor. However the general principle and
> mechanism remains the same. The DecommissionSelf message has been added to
> support this. For performance, we can, when receiving the message from the
> application master, immediately add the associated executor(s) to the
> decommissioning hashmap.
>
> Spark's application master can receive a message from YARN indicating that
> one of its executors is being preempted/decommissioned or otherwise shut down.
> After receiving this message the application master can send the
> DecommissionSelf message to the executor that is going to be shut down. From
> there the rest of the message flow will look like that of standalone/k8s.
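To make the distinction concrete, here is a minimal sketch of that flow. All class, field, and method names here are illustrative, not Spark's actual internals: the application master learns from YARN that a node is going away, records the affected executors in a decommissioning map, and sends DecommissionSelf only to executors on that node. A plain executor exit, by contrast, triggers no shuffle-block migration, since the node and its shuffle service remain healthy.

```java
import java.util.*;

// Hypothetical sketch of the decommission flow discussed above.
// Names (DecommissionTracker, onNodeDecommissioning, etc.) are
// illustrative and do not correspond to Spark's real classes.
class DecommissionTracker {
    // executorId -> host the executor is running on
    private final Map<String, String> executorHost = new HashMap<>();
    // executors currently marked as decommissioning (the "hashmap"
    // mentioned in the design doc excerpt above)
    private final Set<String> decommissioning = new HashSet<>();
    // messages "sent" to executors, recorded here for illustration
    final List<String> sentMessages = new ArrayList<>();

    void registerExecutor(String executorId, String host) {
        executorHost.put(executorId, host);
    }

    // Called when YARN tells the AM that a node is being
    // preempted/decommissioned: mark every executor on that node
    // and send it DecommissionSelf. From here each executor follows
    // the same flow as standalone/k8s.
    void onNodeDecommissioning(String host) {
        for (Map.Entry<String, String> e : executorHost.entrySet()) {
            if (e.getValue().equals(host)) {
                decommissioning.add(e.getKey());
                sentMessages.add("DecommissionSelf -> " + e.getKey());
            }
        }
    }

    // A normal executor exit: the executor is simply forgotten, with
    // no block migration, because the node's shuffle service can keep
    // serving the blocks it generated.
    void onExecutorExit(String executorId) {
        executorHost.remove(executorId);
    }

    boolean isDecommissioning(String executorId) {
        return decommissioning.contains(executorId);
    }
}
```

The point of the sketch is that the trigger is the node-level event, not the executor-level one, which is exactly the separation requested for YARN.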
I believe there are use cases for gracefully decommissioning executors when
they are running on a node which has been marked as decommissioning. In many
public cloud environments, node loss (as with AWS Spot interruptions or GCP
preemptible VMs) is a planned and informed activity: the cloud provider
notifies the YARN cluster manager about the upcoming loss of the node in
advance.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]