Github user shubhamchopra commented on the issue:
https://github.com/apache/spark/pull/17325
Elaborating a little more on how replication happens and what the code
change here does:
Spark executors cache a list of peers that is refreshed every 60s by
default. When replicating a block, the replication logic looks at this list,
and tries to create replicas on randomly chosen executors.
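This caching-plus-random-choice behavior can be sketched as follows. This is an illustrative simulation, not Spark's actual `BlockManager` code; `PeerCache`, `fetch_peers`, and `pick_replica_target` are hypothetical names:

```python
import random
import time

class PeerCache:
    """Caches the list of peer executors and refreshes it after a TTL
    (Spark's default refresh interval is 60s)."""

    def __init__(self, fetch_peers, ttl_seconds=60):
        self._fetch_peers = fetch_peers   # callback that asks the driver for peers
        self._ttl = ttl_seconds
        self._peers = []
        self._last_refresh = float("-inf")

    def get_peers(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self._last_refresh > self._ttl:
            self._peers = self._fetch_peers()
            self._last_refresh = now
        return list(self._peers)

def pick_replica_target(peer_cache, exclude=()):
    """Randomly choose a peer to host a new replica, as the
    replication logic does with its cached list."""
    candidates = [p for p in peer_cache.get_peers() if p not in exclude]
    return random.choice(candidates) if candidates else None
```

The key point is that the cached list can go stale for up to the TTL, which is what makes replication attempts against dead executors possible in the first place.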
When an executor with some cached blocks fails, if proactive replenishment
is enabled, we try to replenish the lost blocks. We find other executors that
might still hold a copy, and get them to re-replicate it using the normal
replication process.
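The replenishment step might be sketched like this (a hedged simulation; `block_locations` and the names below are hypothetical stand-ins for the block manager master's state, not Spark's API):

```python
def replenish_lost_blocks(lost_executor, block_locations, target_replication=2):
    """When an executor dies, find each block it held, locate a surviving
    executor that still has a copy, and ask it to re-replicate.

    block_locations maps block_id -> set of executor ids holding the block.
    Yields (block_id, executor_to_ask) pairs.
    """
    for block_id, holders in block_locations.items():
        if lost_executor not in holders:
            continue
        holders.discard(lost_executor)
        survivors = sorted(holders)
        if survivors and len(holders) < target_replication:
            # Any surviving holder can run the normal replication process
            # to bring the block back up to the desired replica count.
            yield (block_id, survivors[0])
```

Note that the surviving executor then runs the same random-peer replication path described below, which is why a stale peer list matters here.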
The replication process on an executor looks at the list of peers, randomly
chooses one, and tries to replicate to it. Now, if this happens to contact the
lost executor, it will see a failed replication attempt; the failure is
handled by refreshing the list of peers and trying the process again. The
replication process is designed to handle such a failure. In that sense,
pre-fetching the list of peers as suggested here eliminates the possibility of
trying to replicate to a failed executor.
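The refresh-and-retry behavior described above might be simulated as below. This is illustrative only; `try_replicate`, `get_fresh_peers`, and the retry cap are assumptions, not Spark's actual signatures:

```python
import random

def replicate_with_retry(block_id, get_fresh_peers, try_replicate,
                         cached_peers, max_attempts=3):
    """Try to replicate to a randomly chosen cached peer; on a failed
    attempt, refresh the peer list (which drops dead executors) and retry.

    Returns the executor that accepted the replica, or None.
    """
    peers = list(cached_peers)
    for _ in range(max_attempts):
        if not peers:
            peers = get_fresh_peers()
        if not peers:
            return None
        target = random.choice(peers)
        if try_replicate(block_id, target):
            return target
        # Failed attempt: the target may be a lost executor, so refresh
        # the peer list before trying again.
        peers = get_fresh_peers()
    return None
```

Pre-fetching corresponds to passing an already-refreshed list as `cached_peers`, so the first attempt never lands on a dead executor.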
While this change alone should eliminate most of the "flakiness" seen in
the test, change #2 in the unit test just makes the unit test more
representative of what we would likely see in the real world.
Hope that answers some of the questions. I will make a note on the JIRA
about this as well.