[
https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363210#comment-16363210
]
Kihwal Lee commented on HDFS-12070:
-----------------------------------
{noformat}
[INFO] -------------------------------------------------------
[INFO] T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
[INFO] Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 108.315 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
[INFO] Running org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits
[INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 80.609 s - in org.apache.hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits
[INFO] Running org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure
[WARNING] Tests run: 18, Failures: 0, Errors: 0, Skipped: 10, Time elapsed: 124.02 s - in org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure
[INFO] Running org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts
[INFO] Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.039 s - in org.apache.hadoop.hdfs.web.TestWebHdfsTimeouts
[INFO] Running org.apache.hadoop.hdfs.TestMaintenanceState
[INFO] Tests run: 25, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 353.62 s - in org.apache.hadoop.hdfs.TestMaintenanceState
[INFO]
[INFO] Results:
[INFO]
[WARNING] Tests run: 87, Failures: 0, Errors: 0, Skipped: 10
{noformat}
> Failed block recovery leaves files open indefinitely and at risk for data loss
> ------------------------------------------------------------------------------
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.0.0-alpha
> Reporter: Daryn Sharp
> Assignee: Kihwal Lee
> Priority: Major
> Attachments: HDFS-12070.0.patch, lease.patch
>
>
> Files will remain open indefinitely if block recovery fails, which creates
> a high risk of data loss. The replication monitor will not replicate these
> blocks.
> The NN provides the primary node a list of candidate nodes for recovery,
> which is a 2-stage process. In stage 1, the primary node removes any
> candidates that cannot init replica recovery (which essentially requires
> that the node is alive and knows about the block) to create a sync list.
> Stage 2 issues updates to the sync list, _but fails if any node fails_,
> unlike the first stage. The NN should be informed of nodes that did succeed.
> Manual recovery will also fail until the problematic node is temporarily
> stopped, so that a connection-refused error causes the bad node to be
> pruned from the candidates. Recovery then succeeds, the lease is released,
> the under-replication is fixed, and the block is invalidated on the bad
> node.
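
For readers unfamiliar with the 2-stage recovery described above, here is a minimal sketch of the behavior the description asks for: stage 1 prunes candidates into a sync list, and stage 2 records per-node failures instead of aborting, so the nodes that did succeed can still be reported to the NN. This is not the Hadoop implementation or the attached patch. The class, method, and node names (BlockRecoverySketch, buildSyncList, updateReplicas, dn1..dn4) are hypothetical, and the node failures are simulated with plain sets rather than real RPCs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class BlockRecoverySketch {

    /** Outcome of stage 2: which sync-list nodes accepted the update. */
    static final class SyncResult {
        final List<String> succeeded = new ArrayList<>();
        final List<String> failed = new ArrayList<>();
    }

    /** Stage 1: drop candidates that cannot init replica recovery. */
    static List<String> buildSyncList(List<String> candidates,
                                      Set<String> cannotInit) {
        List<String> syncList = new ArrayList<>();
        for (String node : candidates) {
            if (!cannotInit.contains(node)) {
                syncList.add(node);
            }
        }
        return syncList;
    }

    /**
     * Stage 2, tolerant variant: a failing node is recorded rather than
     * failing the whole recovery (the assumption being illustrated).
     */
    static SyncResult updateReplicas(List<String> syncList,
                                     Set<String> failsOnUpdate) {
        SyncResult result = new SyncResult();
        for (String node : syncList) {
            if (failsOnUpdate.contains(node)) {
                result.failed.add(node);
            } else {
                result.succeeded.add(node);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of("dn1", "dn2", "dn3", "dn4");
        // dn4 is simulated as unable to init recovery; dn2 fails in stage 2.
        List<String> syncList = buildSyncList(candidates, Set.of("dn4"));
        SyncResult result = updateReplicas(syncList, Set.of("dn2"));
        System.out.println("sync list: " + syncList);
        System.out.println("report to NN as recovered: " + result.succeeded);
        System.out.println("failed in stage 2: " + result.failed);
    }
}
```

In the failure mode described in the issue, a single entry in the failed list would fail the whole recovery and leave the file open; the sketch instead keeps the succeeded list so it could be reported.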