Github user GraceH commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7927#discussion_r36372781
  
    --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala ---
    @@ -590,10 +590,21 @@ private[spark] class BlockManager(
       private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
         require(blockId != null, "BlockId is null")
         val locations = Random.shuffle(master.getLocations(blockId))
    +    var failTimes = 0
         for (loc <- locations) {
           logDebug(s"Getting remote block $blockId from $loc")
    -      val data = blockTransferService.fetchBlockSync(
    -        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
    +      val data = try {
    +        blockTransferService.fetchBlockSync(
    +          loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
    +      } catch {
    +        case e: IOException if failTimes < locations.size - 1 =>
    +          // Return null when an IOException is thrown, so that the block
    +          // can be fetched from another location if any remain
    +          failTimes += 1
    +          logWarning(s"Attempt $failTimes to get remote block $blockId from $loc failed", e)
    --- End diff --
    
    @squito We'd better catch the exception so that the working flow is not broken: a single fetch attempt should not break the entire code path, no matter what the exception type is.
    
    Regarding where to catch that exception: for debugging and diagnosis, we should record how many remote fetches failed and why, and log whether all of the fetches failed. It seems more reasonable to catch the exception in ```getRemote```. ```getRemote``` itself can tolerate fetch failures from some of the remote locations; it only requires a single success. That is the design of ```getRemote```, as sketched below.
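    To make that concrete, here is a minimal, hypothetical sketch of the shape I have in mind. It is not the actual patch: it assumes the existing ```BlockManager``` members (```master```, ```blockTransferService```, the logging helpers), simplifies the result handling, and catches ```NonFatal``` since the exception type should not matter:
    
    ```scala
    import scala.util.Random
    import scala.util.control.NonFatal
    
    // Hypothetical sketch, assuming BlockManager context: tolerate per-location
    // fetch failures, succeed on the first good fetch, and report an overall
    // failure only after every location has been tried.
    private def doGetRemote(blockId: BlockId, asBlockResult: Boolean): Option[Any] = {
      require(blockId != null, "BlockId is null")
      val locations = Random.shuffle(master.getLocations(blockId))
      var failTimes = 0
      for (loc <- locations) {
        logDebug(s"Getting remote block $blockId from $loc")
        val data =
          try {
            blockTransferService.fetchBlockSync(
              loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
          } catch {
            // A single failed attempt must not break the whole loop, no matter
            // what the exception type is; count it and try the next location.
            case NonFatal(e) =>
              failTimes += 1
              logWarning(s"Attempt $failTimes to fetch block $blockId from $loc failed", e)
              null
          }
        if (data != null) {
          return Some(data) // result handling simplified for the sketch
        }
      }
      // Only when all locations have failed is the overall fetch a failure.
      logError(s"Failed to fetch block $blockId after $failTimes failed attempts")
      None
    }
    ```
    
    This keeps ```fetchBlockSync``` itself unchanged; the tolerance policy lives entirely in the caller.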
    
    ```fetchBlockSync``` is a function that tells you whether a fetch failed or succeeded: on success it returns the data, and on failure it throws an exception indicating why. Swallowing the exception there does not seem reasonable; the upper-level function can decide whether the exception or fetch failure is acceptable.
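    A hedged example of what that contract looks like from the caller's side (```host```, ```port```, and ```execId``` are placeholder names):
    
    ```scala
    import scala.util.control.NonFatal
    
    // Hypothetical caller: fetchBlockSync either returns the data or throws.
    // The caller, not fetchBlockSync, decides whether the failure is acceptable.
    val maybeData =
      try {
        Some(blockTransferService.fetchBlockSync(
          host, port, execId, blockId.toString).nioByteBuffer())
      } catch {
        case NonFatal(e) =>
          // Here the caller chooses to tolerate the failure and try elsewhere.
          logWarning(s"Fetch of block $blockId from $host:$port failed", e)
          None
      }
    ```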
    
    What do you think?

