Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3785#issuecomment-68292416
On the surface, this seems like an okay change. I wonder whether this
retry logic could have unexpected consequences. Let me try to reason it out:
- `askTracker` is only called with `GetMapOutputStatuses`.
- In the master actor, it calls `getSerializedMapOutputStatuses`. This
method never throws exceptions: if a shuffle is missing, then it just stores an
empty array and serializes it.
- It's possible that the serialized map statuses could exceed the Akka
frame size (although extremely unlikely and perhaps impossible with the new
output status compression techniques). In this case, though, the master would
throw an exception and fail to send a reply back to the asker. In this case,
with this patch we'd end up performing a bunch of retries for an operation that
will ultimately fail, so we'll take longer to detect a failure.
In the common cases, though, this seems fine, even if the map output
statuses are missing (since it won't introduce a bunch of futile retries).
Therefore, I think we should pull this in; I don't know if this fixes an actual
bug, but it seems like it could make things more robust.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]