[
https://issues.apache.org/jira/browse/SPARK-15736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh Rosen updated SPARK-15736:
-------------------------------
Description:
If an RDD partition is cached on disk and the DiskStore file is lost, then
reads of that cached partition will fail and the missing partition is supposed
to be recomputed by a new task attempt. In the current BlockManager
implementation, however, the missing file does not trigger any metadata update
or cache invalidation, so subsequent task attempts will be scheduled on the
same executor and the doomed read will be repeatedly retried, leading to
repeated task failures and eventually a total job failure.
In order to fix this problem, the executor with the missing file needs to
properly mark the corresponding block as missing so that it stops advertising
itself as a cache location for that block.
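The intended fix can be illustrated with a minimal sketch. This is not the actual BlockManager code; `BlockLocationRegistry`, `register`, `reportBlockLost`, and `anyLocation` are hypothetical names standing in for the real block-tracking metadata. The point is the invariant: once an executor detects its on-disk file is gone, it must drop its own entry so lookups return no location and the scheduler falls back to recomputation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the proposed behavior: when a disk read finds the
// backing file missing, the executor removes itself as a location for that
// block, so the block stops being advertised and the partition is recomputed.
class BlockLocationRegistry {
    // blockId -> executors currently advertising a cached copy of the block
    private final Map<String, Set<String>> locations = new HashMap<>();

    void register(String blockId, String executorId) {
        locations.computeIfAbsent(blockId, k -> new HashSet<>()).add(executorId);
    }

    // Called by an executor whose DiskStore file for blockId has been lost.
    void reportBlockLost(String blockId, String executorId) {
        Set<String> execs = locations.get(blockId);
        if (execs != null) {
            execs.remove(executorId);
            if (execs.isEmpty()) {
                locations.remove(blockId); // no cached copies remain anywhere
            }
        }
    }

    // Scheduler-side lookup: an empty result means "recompute the partition"
    // rather than retrying the read on a dead location.
    Optional<String> anyLocation(String blockId) {
        Set<String> execs = locations.get(blockId);
        return (execs == null || execs.isEmpty())
                ? Optional.empty()
                : Optional.of(execs.iterator().next());
    }
}
```

Without the `reportBlockLost` step, the lookup keeps returning the same executor and every retry fails against the same missing file, which is exactly the repeated-failure loop described above.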
was:
If an RDD partition is cached on disk and the on-disk file is lost, then reads
of that cached partition will fail and the missing partition is supposed to be
recomputed by a new task attempt. However, the current behavior is to
repeatedly re-attempt the read on the same machine without performing any
recomputation, which leads to a complete job failure.
In order to fix this problem, the executor with the missing file needs to
properly mark the corresponding block as missing so that it stops advertising
itself as a cache location for that block.
> Gracefully handle loss of DiskStore files
> -----------------------------------------
>
> Key: SPARK-15736
> URL: https://issues.apache.org/jira/browse/SPARK-15736
> Project: Spark
> Issue Type: Bug
> Components: Block Manager
> Reporter: Josh Rosen
> Assignee: Josh Rosen
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]