Github user GraceH commented on a diff in the pull request:
https://github.com/apache/spark/pull/7927#discussion_r36372781
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
---
@@ -590,10 +590,21 @@ private[spark] class BlockManager(
private def doGetRemote(blockId: BlockId, asBlockResult: Boolean):
Option[Any] = {
require(blockId != null, "BlockId is null")
val locations = Random.shuffle(master.getLocations(blockId))
+ var failTimes = 0
for (loc <- locations) {
logDebug(s"Getting remote block $blockId from $loc")
- val data = blockTransferService.fetchBlockSync(
- loc.host, loc.port, loc.executorId,
blockId.toString).nioByteBuffer()
+ val data = try {
+ blockTransferService.fetchBlockSync(
+ loc.host, loc.port, loc.executorId,
blockId.toString).nioByteBuffer()
+ } catch {
+ case e: IOException if failTimes < locations.size - 1 =>
+ // Return null when IOException throw, so we can fetch block
+ // from another location if there still have locations
+ failTimes += 1
+ logWarning(s"Try ${failTimes} times getting remote block
$blockId from $loc failed:", e)
--- End diff --
@squito We'd better to catch up the exception to avoid working flow to be
broken. The single fetch attempt should not break the entire code path. No
matter what the exception type is.
Regarding the place to catch that exception. For the debugging or
diagnosis, we'd better to mark down how many remote fetch failures there and
why. And log out all the fetches are failed or not. It seems to be more
reasonable to catch the exception in ```getRemote```. ```getRemote``` itself
can tolerant certain fetch failures from parts of the remotes. It only
requires single success. That is the design for ```getRemote```.
The ```fetchBlockSync``` is a function to tell if fetch fail or success. If
success, return back the data. If not, throw out the exception to indicate why.
To swallow the exception there seems not so that reasonable. The upper level
function can decide if the exception or fetch failure is acceptable or not.
What do you think?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]