Josh Rosen created SPARK-17485:
----------------------------------
Summary: Failed remote cached block reads can lead to whole job failure
Key: SPARK-17485
URL: https://issues.apache.org/jira/browse/SPARK-17485
Project: Spark
Issue Type: Improvement
Components: Block Manager
Affects Versions: 2.0.0, 1.6.2
Reporter: Josh Rosen
Assignee: Josh Rosen
In Spark's {{RDD.getOrCompute}} we first try to read a local copy of a cached
block, then a remote copy, and only fall back to recomputing the block if no
cached copy (local or remote) can be read. This logic works correctly in the
case where no remote copies of the block exist, but if remote copies _do_
exist and reads of them fail (due to network issues or internal Spark bugs),
then the {{BlockManager}} will throw a {{BlockFetchException}} that fails the
entire job.
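To make the failure mode concrete, here is a minimal sketch of the fallback
flow described above. {{BlockId}}, {{readLocal}}, and {{readRemote}} are
simplified stand-ins for the real {{BlockManager}} reads, not Spark's actual
signatures: because the remote read throws instead of returning {{None}}, the
recompute branch is never reached.
{code:scala}
// Sketch of the read-local / read-remote / recompute flow. These names are
// simplified stand-ins for the real BlockManager APIs, not actual signatures.
case class BlockId(name: String)
class BlockFetchException(msg: String) extends Exception(msg)

object FetchSketch {
  // Stand-in local read: no local copy of the block is cached.
  def readLocal(id: BlockId): Option[Seq[Int]] = None

  // Stand-in remote read: a remote copy exists, but fetching it fails.
  def readRemote(id: BlockId): Option[Seq[Int]] =
    throw new BlockFetchException(s"failed to fetch ${id.name} from remote hosts")

  // Mirrors the current logic: local, then remote, then recompute.
  def getOrCompute(id: BlockId, compute: () => Seq[Int]): Seq[Int] =
    readLocal(id)
      .orElse(readRemote(id)) // throws here, so...
      .getOrElse(compute())   // ...the recompute fallback is never reached

  def main(args: Array[String]): Unit =
    // Propagates BlockFetchException and fails the "job", even though the
    // block could simply have been recomputed.
    getOrCompute(BlockId("rdd_0_0"), () => Seq(1, 2, 3))
}
{code}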
In the case of torrent broadcast we really _do_ want to fail the entire job
if no remote blocks can be fetched, but this logic is inappropriate for
cached blocks, which can (and should) simply be recomputed.
Therefore, I think that this exception should be thrown higher up the call
stack, by the {{BlockManager}}'s client code, and not by the {{BlockManager}}
itself.
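A hedged sketch of what that handling could look like at the call site,
reusing the stubs from the sketch above (one possible shape, not the actual
patch): the caller treats a failed remote read of a recomputable cached block
as a cache miss, while a caller like torrent broadcast would simply not catch
the exception and let the failure surface.
{code:scala}
// One possible shape of the fix (reuses BlockId, BlockFetchException, and
// FetchSketch from the sketch above; not the actual Spark patch).
object ClientSketch {
  def getOrComputeSafely(id: BlockId, compute: () => Seq[Int]): Seq[Int] = {
    val cached =
      try FetchSketch.readLocal(id).orElse(FetchSketch.readRemote(id))
      catch {
        // Cached RDD blocks are recomputable, so a failed remote read is
        // downgraded to a cache miss instead of failing the whole job.
        case _: BlockFetchException => None
      }
    cached.getOrElse(compute())
  }
}
{code}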