GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15186
[SPARK-17485] Prevent failed remote reads of cached blocks from failing
entire job (branch-1.6 backport)
This patch is a branch-1.6 backport of #15037:
## What changes were proposed in this pull request?
In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached
RDD block, then a remote copy, and only fall back to recomputing the block if
no cached copy (local or remote) can be read. This logic works correctly in the
case where no remote copies of the block exist, but if there _are_ remote
copies and reads of those copies fail (due to network issues or internal Spark
bugs) then the BlockManager will throw a `BlockFetchException` that will fail
the task (and which could possibly fail the whole job if the read failures keep
occurring).
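For context, the fallback order looks roughly like this (a minimal sketch with hypothetical names, not Spark's actual internals):

```scala
object CachedReadSketch {
  // Minimal sketch of the read-local / read-remote / recompute order described
  // above. readLocal, readRemote, and compute are hypothetical stand-ins, not
  // real RDD or BlockManager APIs.
  def getOrCompute[T](blockId: String,
                      readLocal: String => Option[T],
                      readRemote: String => Option[T],
                      compute: () => T): T = {
    readLocal(blockId)
      .orElse(readRemote(blockId)) // before this patch, a failed remote fetch threw here and failed the task
      .getOrElse(compute())        // recompute only when no cached copy (local or remote) can be read
  }
}
```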
In the cases of TorrentBroadcast and task result fetching we really do want
to fail the entire job when no remote blocks can be fetched, but this logic
is inappropriate for reads of cached RDD blocks because those can (and should)
be recomputed when cached copies are unavailable.
Therefore, I think that the `BlockManager.getRemoteBytes()` method should
never throw on remote fetch errors and, instead, should handle failures by
returning `None`.
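As a rough illustration of the proposed behavior (hypothetical helper names; this is not the actual `BlockManager` code):

```scala
import scala.util.{Success, Try}

object RemoteFetchSketch {
  // Sketch of "never throw on remote fetch errors": try each known location
  // and return None if every fetch fails (or if no locations exist).
  // fetchFromPeer is a hypothetical helper, not a real BlockManager method.
  def getRemoteBytes(blockId: String,
                     locations: Seq[String],
                     fetchFromPeer: (String, String) => Array[Byte]): Option[Array[Byte]] = {
    locations.iterator
      .map(loc => Try(fetchFromPeer(loc, blockId)))
      .collectFirst { case Success(bytes) => bytes }
  }
}
```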
## How was this patch tested?
Block manager changes should be covered by modified tests in
`BlockManagerSuite`: the old tests expected exceptions to be thrown on failed
remote reads, while the modified tests now expect `None` to be returned from
the `getRemote*` method.
I also manually inspected all usages of `BlockManager.getRemoteValues()`,
`getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on
the result and handle `None`. Note that these `None` branches are already
exercised because the old `getRemoteBytes` returned `None` when no remote
locations for the block could be found (which could occur if an executor died
and its block manager de-registered with the master).
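The caller-side handling that these `None` branches rely on is essentially the following pattern (illustrative code, not a verbatim usage from the codebase):

```scala
object CallerSketch {
  // Sketch of how a caller of the getRemote* methods is expected to
  // pattern-match on the Option and fall back to recomputation; both
  // function arguments here are hypothetical stand-ins.
  def readOrRecompute(blockId: String,
                      getRemoteBytes: String => Option[Array[Byte]],
                      recompute: () => Array[Byte]): Array[Byte] = {
    getRemoteBytes(blockId) match {
      case Some(bytes) => bytes       // a remote cached copy was readable
      case None        => recompute() // no copy found or fetch failed: recompute the block
    }
  }
}
```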
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-17485-branch-1.6-backport
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15186.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15186
----
commit af88ada7548e8988f491997b929f30436353c20a
Author: Josh Rosen <[email protected]>
Date: 2016-09-12T22:43:57Z
[SPARK-17485] Prevent failed remote reads of cached blocks from failing
entire job
In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached
RDD block, then a remote copy, and only fall back to recomputing the block if
no cached copy (local or remote) can be read. This logic works correctly in the
case where no remote copies of the block exist, but if there _are_ remote
copies and reads of those copies fail (due to network issues or internal Spark
bugs) then the BlockManager will throw a `BlockFetchException` that will fail
the task (and which could possibly fail the whole job if the read failures keep
occurring).
In the cases of TorrentBroadcast and task result fetching we really do want
to fail the entire job when no remote blocks can be fetched, but this logic
is inappropriate for reads of cached RDD blocks because those can (and should)
be recomputed when cached copies are unavailable.
Therefore, I think that the `BlockManager.getRemoteBytes()` method should
never throw on remote fetch errors and, instead, should handle failures by
returning `None`.
Block manager changes should be covered by modified tests in
`BlockManagerSuite`: the old tests expected exceptions to be thrown on failed
remote reads, while the modified tests now expect `None` to be returned from
the `getRemote*` method.
I also manually inspected all usages of `BlockManager.getRemoteValues()`,
`getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on
the result and handle `None`. Note that these `None` branches are already
exercised because the old `getRemoteBytes` returned `None` when no remote
locations for the block could be found (which could occur if an executor died
and its block manager de-registered with the master).
Author: Josh Rosen <[email protected]>
Closes #15037 from JoshRosen/SPARK-17485.
----