Github user andrewor14 commented on a diff in the pull request:
https://github.com/apache/spark/pull/11241#discussion_r53529137
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
---
@@ -575,6 +582,18 @@ private[spark] class BlockManager(
           // This location failed, so we retry fetch from a different one by returning null here
           logWarning(s"Failed to fetch remote block $blockId " +
             s"from $loc (failed attempt $numFetchFailures)", e)
+
+          // If dynamic allocation is enabled and there is a large number of executors,
+          // the locations list can contain many stale entries, causing a large number
+          // of retries that may take a significant amount of time. To get rid of these
+          // stale entries we refresh the block locations after a certain number of
+          // fetch failures.
+          if (dynamicAllocationEnabled &&
+              numFetchFailures >= maxFailuresBeforeLocationRefresh) {
--- End diff --
Also, is there a danger that this will never terminate? What if
`maxFailuresBeforeLocationRefresh` was set to 10 when the number of executors
is, say, 1000? Then we'll literally never throw the `BlockFetchException`. Maybe
we need to keep track of a global `numFetchFailures` and throw that exception
once it crosses some threshold.
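The reviewer's concern and proposed fix can be sketched as follows. This is a hypothetical, self-contained Scala model of the retry loop, not Spark's actual `BlockManager` code; the names `fetchWithBoundedRetries`, `fetch`, `refreshLocations`, and `maxGlobalFetchFailures` are illustrative assumptions. The point it demonstrates: if a refresh resets iteration over a stale location list, only a counter that persists across refreshes guarantees termination.

```scala
object RetryBoundSketch {
  // Hypothetical model of the retry loop under review. `fetch` returns
  // Some(data) on success and None on a failed attempt; `refreshLocations`
  // stands in for re-asking the driver for fresh block locations.
  def fetchWithBoundedRetries(
      locations: Seq[String],
      fetch: String => Option[String],
      refreshLocations: () => Seq[String],
      maxFailuresBeforeLocationRefresh: Int,
      maxGlobalFetchFailures: Int): Option[String] = {
    var locs = locations
    var localFailures = 0   // failures since the last location refresh
    var globalFailures = 0  // failures across ALL attempts (the proposed fix)
    var i = 0
    while (i < locs.length) {
      fetch(locs(i)) match {
        case Some(data) => return Some(data)
        case None =>
          localFailures += 1
          globalFailures += 1
          if (globalFailures >= maxGlobalFetchFailures) {
            // Without this global bound, a refresh that keeps returning
            // stale locations would reset `i` forever and never terminate.
            throw new RuntimeException(
              s"giving up after $globalFailures fetch failures")
          }
          if (localFailures >= maxFailuresBeforeLocationRefresh) {
            locs = refreshLocations() // drop stale entries
            localFailures = 0
            i = 0                     // restart with the refreshed list
          } else {
            i += 1
          }
      }
    }
    None // every known location failed, but thresholds were never reached
  }
}
```

With the global bound, a fetch that always fails gives up after exactly `maxGlobalFetchFailures` attempts even when every refresh hands back the same stale list; without it, the `i = 0` reset would loop indefinitely.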