GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15186
[SPARK-17485] Prevent failed remote reads of cached blocks from failing
entire job (branch-1.6 backport)
This patch is a branch-1.6 backport of #15037:
## What changes were proposed in this pull request?
In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached
RDD block, then a remote copy, and only fall back to recomputing the block if
no cached copy (local or remote) can be read. This logic works correctly in the
case where no remote copies of the block exist, but if there _are_ remote
copies and reads of those copies fail (due to network issues or internal Spark
bugs) then the BlockManager will throw a `BlockFetchException` that will fail
the task (and which could possibly fail the whole job if the read failures keep
occurring).
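For context, the fallback order looks roughly like this (a minimal sketch with hypothetical names, not Spark's actual internals):

```scala
object CachedReadSketch {
  // Minimal sketch of the read-local / read-remote / recompute order described
  // above. readLocal, readRemote, and compute are hypothetical stand-ins, not
  // real RDD or BlockManager APIs.
  def getOrCompute[T](blockId: String,
                      readLocal: String => Option[T],
                      readRemote: String => Option[T],
                      compute: () => T): T = {
    readLocal(blockId)
      .orElse(readRemote(blockId)) // before this patch, a failed remote fetch threw here and failed the task
      .getOrElse(compute())        // recompute only when no cached copy (local or remote) can be read
  }
}
```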
In the cases of TorrentBroadcast and task result fetching we really do want
to fail the entire job when no remote blocks can be fetched, but this logic
is inappropriate for reads of cached RDD blocks because those can (and should)
be recomputed when cached copies are unavailable.
Therefore, I think that the `BlockManager.getRemoteBytes()` method should
never throw on remote fetch errors and, instead, should handle failures by
returning `None`.
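As a rough illustration of the proposed behavior (hypothetical helper names; this is not the actual `BlockManager` code):

```scala
import scala.util.{Success, Try}

object RemoteFetchSketch {
  // Sketch of "never throw on remote fetch errors": try each known location
  // and return None if every fetch fails (or if no locations exist).
  // fetchFromPeer is a hypothetical helper, not a real BlockManager method.
  def getRemoteBytes(blockId: String,
                     locations: Seq[String],
                     fetchFromPeer: (String, String) => Array[Byte]): Option[Array[Byte]] = {
    locations.iterator
      .map(loc => Try(fetchFromPeer(loc, blockId)))
      .collectFirst { case Success(bytes) => bytes }
  }
}
```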
## How was this patch tested?
Block manager changes should be covered by modified tests in
`BlockManagerSuite`: the old tests expected exceptions to be thrown on failed
remote reads, while the modified tests now expect `None` to be returned from
the `getRemote*` method.
I also manually inspected all usages of `BlockManager.getRemoteValues()`,
`getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on
the result and handle `None`. Note that these `None` branches are already
exercised because the old `getRemoteBytes` returned `None` when no remote
locations for the block could be found (which could occur if an executor died
and its block manager de-registered with the master).
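The caller-side handling that these `None` branches rely on is essentially the following pattern (illustrative code, not a verbatim usage from the codebase):

```scala
object CallerSketch {
  // Sketch of how a caller of the getRemote* methods is expected to
  // pattern-match on the Option and fall back to recomputation; both
  // function arguments here are hypothetical stand-ins.
  def readOrRecompute(blockId: String,
                      getRemoteBytes: String => Option[Array[Byte]],
                      recompute: () => Array[Byte]): Array[Byte] = {
    getRemoteBytes(blockId) match {
      case Some(bytes) => bytes       // a remote cached copy was readable
      case None        => recompute() // no copy found or fetch failed: recompute the block
    }
  }
}
```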
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-17485-branch-1.6-backport
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15186.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15186
----
commit af88ada7548e8988f491997b929f30436353c20a
Author: Josh Rosen <[email protected]>
Date: 2016-09-12T22:43:57Z
[SPARK-17485] Prevent failed remote reads of cached blocks from failing
entire job
In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached
RDD block, then a remote copy, and only fall back to recomputing the block if
no cached copy (local or remote) can be read. This logic works correctly in the
case where no remote copies of the block exist, but if there _are_ remote
copies and reads of those copies fail (due to network issues or internal Spark
bugs) then the BlockManager will throw a `BlockFetchException` that will fail
the task (and which could possibly fail the whole job if the read failures keep
occurring).
In the cases of TorrentBroadcast and task result fetching we really do want
to fail the entire job when no remote blocks can be fetched, but this logic
is inappropriate for reads of cached RDD blocks because those can (and should)
be recomputed when cached copies are unavailable.
Therefore, I think that the `BlockManager.getRemoteBytes()` method should
never throw on remote fetch errors and, instead, should handle failures by
returning `None`.
Block manager changes should be covered by modified tests in
`BlockManagerSuite`: the old tests expected exceptions to be thrown on failed
remote reads, while the modified tests now expect `None` to be returned from
the `getRemote*` method.
I also manually inspected all usages of `BlockManager.getRemoteValues()`,
`getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on
the result and handle `None`. Note that these `None` branches are already
exercised because the old `getRemoteBytes` returned `None` when no remote
locations for the block could be found (which could occur if an executor died
and its block manager de-registered with the master).
Author: Josh Rosen <[email protected]>
Closes #15037 from JoshRosen/SPARK-17485.
----