GitHub user JoshRosen commented on the issue:
https://github.com/apache/spark/pull/14931
LGTM.
There's a slight change of behavior here for the corner case where the
worker (not the executor) dies and is then immediately recovered: prior to this
patch, I believe that the old shuffle files would continue to be served by the
restarted worker's shuffle service, but after this patch the MapOutputTracker
entries will have been invalidated, so the driver won't ask that worker for
shuffle files.
In terms of default / common-case behavior, I prefer the behavior
implemented in this patch: when a worker disappears, it seems reasonable to
treat its map outputs as missing, and if the worker happens to come back later,
it would make more sense to explicitly re-register those outputs. Even if
a worker is eventually recovered, it might take a long time for that to
happen, leading to long hangs.
If we decide that it's important to re-register map outputs after worker
recovery, then I think we can add that explicitly in a separate patch.
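
To make the behavior described above concrete, here is a minimal toy model
in Python. It is only a sketch and not Spark's actual MapOutputTracker: the
class ToyMapOutputTracker, its method names, and the host names are all
hypothetical. It shows outputs being invalidated when a worker's host is
lost, and becoming visible again only through an explicit re-registration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyMapOutputTracker:
    """Toy driver-side registry of map outputs (hypothetical, not Spark's API)."""

    # shuffle_id -> {map_id: host that holds the map output}
    outputs: Dict[int, Dict[int, str]] = field(default_factory=dict)

    def register(self, shuffle_id: int, map_id: int, host: str) -> None:
        """Record that `host` holds the output of map task `map_id`."""
        self.outputs.setdefault(shuffle_id, {})[map_id] = host

    def remove_outputs_on_host(self, host: str) -> None:
        """Invalidate every map output recorded for a lost worker's host."""
        for by_map in self.outputs.values():
            lost = [m for m, h in by_map.items() if h == host]
            for map_id in lost:
                del by_map[map_id]

    def missing(self, shuffle_id: int, num_maps: int) -> List[int]:
        """Return map ids with no registered output (these need recomputing)."""
        known = self.outputs.get(shuffle_id, {})
        return [m for m in range(num_maps) if m not in known]


tracker = ToyMapOutputTracker()
tracker.register(0, 0, "worker-a")
tracker.register(0, 1, "worker-b")

# The worker on "worker-a" disappears: its outputs are treated as missing,
# even if the shuffle files still exist on disk.
tracker.remove_outputs_on_host("worker-a")
missing_after_loss = tracker.missing(0, 2)

# If the worker comes back, its outputs count again only after an explicit
# re-registration.
tracker.register(0, 0, "worker-a")
missing_after_reregister = tracker.missing(0, 2)
```

In this sketch the driver never waits for a lost host to return; it simply
reports the affected map ids as missing so they can be recomputed elsewhere.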
I'm going to merge this to master and will evaluate backporting to
branch-2.0.