GitHub user nezihyigitbasi commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11241#discussion_r53530728
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
    @@ -575,6 +582,18 @@ private[spark] class BlockManager(
                 // This location failed, so we retry fetch from a different one by returning null here
                 logWarning(s"Failed to fetch remote block $blockId " +
                   s"from $loc (failed attempt $numFetchFailures)", e)
    +
    +            // if dynamic alloc. is enabled and if there is a large number of executors
    +            // then locations list can contain a large number of stale entries causing
    +            // a large number of retries that may take a significant amount of time
    +            // To get rid of these stale entries we refresh the block locations
    +            // after a certain number of fetch failures
    +            if (dynamicAllocationEnabled && numFetchFailures >= maxFailuresBeforeLocationRefresh) {
    --- End diff --
    
    The loop should eventually terminate: because the locations are refreshed 
repeatedly, it should eventually hit a live replica. I thought about setting a 
global threshold, but it is tricky. We could keep refreshing locations and keep 
track of the total number of fetch failures, but we might hit the threshold 
before hitting a live replica, even if one exists in the location list. That is 
one of the reasons this logic just keeps going until it hits a live replica. I 
have seen that when a lot of executors are coming up and going away, we still 
get multiple stale locations with every location refresh, and depending on the 
threshold value we may not even reach a live executor in the list.
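
To make the trade-off concrete, here is a minimal, self-contained sketch of the idea. It is not the code in this PR: `FetchWithRefresh`, `fetch`, `getLocations`, `isLive` and `maxTotalFailures` are hypothetical names used only for illustration, the `dynamicAllocationEnabled` check is left out, and the behaviour when the location list is exhausted is an assumption. The optional `maxTotalFailures` parameter stands in for the global threshold discussed above.

```scala
object FetchWithRefresh {
  // Hypothetical stand-in for a block location (e.g. an executor id).
  type Location = String

  /**
   * Tries each location in turn. After `maxFailuresBeforeRefresh` failures
   * since the last refresh, asks for a fresh location list and starts over.
   * Gives up when the current list is exhausted or, if `maxTotalFailures`
   * is set, when that global cap is reached.
   * Returns the location that served the block (if any) and the total
   * number of failed attempts.
   */
  def fetch(
      getLocations: () => Seq[Location],
      isLive: Location => Boolean,
      maxFailuresBeforeRefresh: Int,
      maxTotalFailures: Option[Int] = None): (Option[Location], Int) = {
    var pending = getLocations().iterator
    var failuresSinceRefresh = 0
    var totalFailures = 0
    while (pending.hasNext) {
      val loc = pending.next()
      if (isLive(loc)) return (Some(loc), totalFailures)
      failuresSinceRefresh += 1
      totalFailures += 1
      // Global cap (assumption): stop even though a live replica may exist.
      if (maxTotalFailures.exists(totalFailures >= _)) return (None, totalFailures)
      if (failuresSinceRefresh >= maxFailuresBeforeRefresh) {
        // Drop the remaining, possibly stale, entries and re-fetch the list.
        pending = getLocations().iterator
        failuresSinceRefresh = 0
      }
    }
    (None, totalFailures)
  }
}
```

With a location source that returns `Seq("s1", "s2", "s3", "s4", "live")`, then `Seq("s3", "s4", "live")`, then `Seq("live")`, and a refresh threshold of 2, the uncapped call reaches `"live"` after 4 failures. Adding a global cap of 3 makes the same call give up after 3 failures, even though a live replica is present in every list, which is the situation described above.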


