[ https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363210#comment-16363210 ]
Kihwal Lee commented on HDFS-12070:
-----------------------------------

{noformat}
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
[INFO] Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 108.315 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
[INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 80.609 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits
[INFO] Running org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure
[WARNING] Tests run: 18, Failures: 0, Errors: 0, Skipped: 10, Time elapsed: 124.02 s - in org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure
[INFO] Running org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts
[INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.039 s - in org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts
[INFO] Running org.apache.hadoop.hdfs.TestMaintenanceState
[INFO] Tests run: 25, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 353.62 s - in org.apache.hadoop.hdfs.TestMaintenanceState
[INFO]
[INFO] Results:
[INFO]
[WARNING] Tests run: 87, Failures: 0, Errors: 0, Skipped: 10
{noformat}

> Failed block recovery leaves files open indefinitely and at risk for data loss
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-12070
>                 URL: https://issues.apache.org/jira/browse/HDFS-12070
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Daryn Sharp
>            Assignee: Kihwal Lee
>            Priority: Major
>        Attachments: HDFS-12070.0.patch, lease.patch
>
> Files will remain open indefinitely if block recovery fails, which creates a
> high risk of data loss. The replication monitor will not replicate these
> blocks.
> The NN provides the primary node a list of candidate nodes for a recovery
> that involves a 2-stage process. In stage 1, the primary node removes any
> candidates that cannot init replica recovery (essentially: alive and aware
> of the block) to create a sync list. Stage 2 issues updates to the sync
> list -- _but fails if any node fails_, unlike the first stage. The NN
> should instead be informed of the nodes that did succeed.
>
> Manual recovery will also fail until the problematic node is temporarily
> stopped, so that a connection refused causes the bad node to be pruned from
> the candidates. Recovery then succeeds, the lease is released, the
> under-replication is fixed, and the block is invalidated on the bad node.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
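The stage-2 behavior described in the issue can be sketched as follows. This is a hypothetical illustration of the proposed fix, not actual HDFS code: the SyncNode interface, updateReplica method, and node names are invented for the example. The idea is to tolerate per-node failures, prune the failing node (as stage 1 already does), and report only the nodes that succeeded back to the NN, failing the recovery only when no node at all succeeds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a failure-tolerant stage-2 sync for block recovery.
// SyncNode and updateReplica are illustrative names, not real HDFS APIs.
public class BlockRecoverySketch {

  interface SyncNode {
    String id();
    void updateReplica(long newGenStamp) throws Exception;
  }

  // Returns the nodes whose replica update succeeded. Throws only when
  // every node in the sync list failed, i.e. recovery cannot make progress.
  static List<String> syncReplicas(List<SyncNode> syncList, long genStamp)
      throws Exception {
    List<String> succeeded = new ArrayList<>();
    for (SyncNode n : syncList) {
      try {
        n.updateReplica(genStamp);
        succeeded.add(n.id());
      } catch (Exception e) {
        // Prune the bad node instead of aborting the whole recovery,
        // mirroring the pruning that stage 1 already performs.
      }
    }
    if (succeeded.isEmpty()) {
      throw new Exception("block recovery failed on all candidate nodes");
    }
    return succeeded; // the NN would be informed of exactly these nodes
  }
}
```

With this shape, a single unreachable datanode no longer forces the manual workaround (stopping the bad node so a connection refused prunes it): the loop prunes it automatically and the lease can still be released.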