Daryn Sharp created HDFS-12070:
----------------------------------

             Summary: Failed block recovery leaves files open indefinitely and 
at risk for data loss
                 Key: HDFS-12070
                 URL: https://issues.apache.org/jira/browse/HDFS-12070
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.0.0-alpha
            Reporter: Daryn Sharp


Files will remain open indefinitely if block recovery fails, which creates a 
high risk of data loss.  The replication monitor will not replicate these 
blocks.

The NN provides the primary node a list of candidate nodes for recovery, which 
involves a 2-stage process.  Stage 1: the primary node removes any candidates 
that cannot init replica recovery (essentially, that are alive and know about 
the block) to create a sync list.  Stage 2: it issues updates to the sync list 
– _but unlike the first stage, it fails if any node fails_.  The NN should be 
informed of the nodes that did succeed.
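A minimal sketch of the two stages with the proposed behavior – tolerate individual failures in stage 2 and report the nodes that did succeed back to the NN.  All names here (`RecoveryTarget`, `recover`, etc.) are illustrative, not the actual DataNode/protocol classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of primary-DN block recovery (names are
// illustrative, not the real o.a.h.hdfs classes).
public class RecoverySketch {
  public interface RecoveryTarget {
    // Stage 1 probe: node is alive and knows about the block.
    boolean initReplicaRecovery(long blockId);
    // Stage 2 update to the new generation stamp.
    boolean updateReplica(long blockId, long newGenStamp);
  }

  // Returns the nodes whose replicas were successfully updated; the
  // caller would report exactly this list to the NN (e.g. via
  // commitBlockSynchronization) instead of failing the whole recovery.
  public static List<RecoveryTarget> recover(List<RecoveryTarget> candidates,
                                             long blockId, long newGenStamp) {
    // Stage 1: prune candidates that cannot init replica recovery.
    List<RecoveryTarget> syncList = new ArrayList<>();
    for (RecoveryTarget t : candidates) {
      if (t.initReplicaRecovery(blockId)) {
        syncList.add(t);
      }
    }
    // Stage 2: update every node on the sync list.  A single bad node
    // (e.g. connection refused) is skipped rather than aborting the
    // entire recovery, so the file does not stay open indefinitely.
    List<RecoveryTarget> succeeded = new ArrayList<>();
    for (RecoveryTarget t : syncList) {
      try {
        if (t.updateReplica(blockId, newGenStamp)) {
          succeeded.add(t);
        }
      } catch (RuntimeException e) {
        // skip this node; NN will later invalidate its replica
      }
    }
    return succeeded;
  }
}
```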

Manual recovery will also fail until the problematic node is temporarily 
stopped, so that a connection refused induces the bad node to be pruned from 
the candidates.  Recovery then succeeds, the lease is released, the 
under-replication is fixed, and the block is invalidated on the bad node.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
