Github user shubhamchopra commented on the issue:

    https://github.com/apache/spark/pull/17325
  
    Elaborating a little more on how replication happens and what the code 
change here does:
    Spark executors cache a list of peers that is refreshed every 60s by 
default. When replicating a block, the replication logic looks at this list, 
and tries to create replicas on randomly chosen executors. 
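    As a rough illustration of that cached peer list (a minimal sketch — `PeerCache` and `BlockManagerId` here are simplified stand-ins, not Spark's actual internals):

    ```scala
    // Minimal sketch: PeerCache and BlockManagerId are simplified stand-ins,
    // not Spark's actual classes.
    case class BlockManagerId(executorId: String, host: String, port: Int)

    class PeerCache(fetchPeers: () => Seq[BlockManagerId],
                    ttlMs: Long = 60L * 1000) { // refreshed every 60s by default
      private var cachedPeers: Seq[BlockManagerId] = Seq.empty
      private var lastFetchMs: Long = 0L

      // Return the cached peer list, refreshing it once the TTL expires or
      // when a caller forces a refresh (e.g. after a failed replication).
      def getPeers(forceFetch: Boolean = false): Seq[BlockManagerId] = synchronized {
        val now = System.currentTimeMillis()
        if (forceFetch || cachedPeers.isEmpty || now - lastFetchMs > ttlMs) {
          cachedPeers = fetchPeers()
          lastFetchMs = now
        }
        cachedPeers
      }
    }
    ```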
    When an executor with cached blocks fails and proactive replenishment is enabled, we try to replenish the lost blocks: we find other executors that still hold a replica of each block and get them to replicate it using the same replication process, as sketched below.
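    The replenishment step could be sketched like this (hypothetical names, reusing the `BlockManagerId` stand-in above; `askToReplicate` stands in for the message sent to a surviving holder):

    ```scala
    // Hypothetical sketch: on executor loss, for every block the lost executor
    // held, pick a surviving holder and ask it to re-run the normal
    // replication path.
    def replenishLostBlocks(
        lostExecutor: BlockManagerId,
        blockLocations: Map[String, Set[BlockManagerId]],
        askToReplicate: (BlockManagerId, String) => Unit): Unit = {
      for ((blockId, holders) <- blockLocations) {
        if (holders.contains(lostExecutor)) {
          val survivors = holders - lostExecutor
          // Any surviving holder can replicate the block using the same
          // logic as the original replication.
          survivors.headOption.foreach(askToReplicate(_, blockId))
        }
      }
    }
    ```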
    The replication process on an executor looks at the cached list of peers, randomly chooses one, and tries to replicate to it. If it happens to contact the lost executor, the replication attempt fails; the failure is handled by refreshing the list of peers and retrying. The replication process is designed to handle exactly this kind of failure. In that sense, pre-fetching the list of peers, as suggested here, eliminates the chance of picking an already-failed executor in the first place.
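    The retry behavior described above might look roughly like the following sketch (`tryReplicateTo` and `maxAttempts` are illustrative assumptions, not Spark's real API):

    ```scala
    import scala.util.Random

    // Sketch of the retry loop: pick a random peer, attempt replication, and
    // on failure refresh the peer list (dropping peers that already failed)
    // before trying again.
    def replicateWithRetry(
        blockId: String,
        getPeers: Boolean => Seq[BlockManagerId],
        tryReplicateTo: (BlockManagerId, String) => Boolean,
        maxAttempts: Int = 3): Boolean = {
      var failed = Set.empty[BlockManagerId]
      var attempts = 0
      while (attempts < maxAttempts) {
        // Force a peer-list refresh after any failure so dead peers drop out.
        val candidates = getPeers(failed.nonEmpty).filterNot(failed.contains)
        if (candidates.isEmpty) return false
        val peer = candidates(Random.nextInt(candidates.size))
        if (tryReplicateTo(peer, blockId)) return true
        failed += peer
        attempts += 1
      }
      false
    }
    ```

    With the pre-fetch in this change, the peer list is fresh before the first attempt, so the failed executor is not even a candidate.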
    While this change alone should eliminate most of the "flakiness" seen in the test, change #2 in the unit test just makes the test more representative of what we would likely see in the real world.
    Hope that answers some of the questions. I will make a note on the JIRA 
about this as well.

