[
https://issues.apache.org/jira/browse/HDFS-9358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002872#comment-15002872
]
Wei-Chiu Chuang commented on HDFS-9358:
---------------------------------------
[~iwasakims] Thanks for the patch.
I looked at the patch, and what it does is as follows:
after the NN detects that the DN is down, the test waits until the excess
replica is invalidated before restarting the stopped DN; after the DN is
restarted, it verifies that the excess replica is detected again.
So the process is deterministic and, barring a timeout, always goes:
{noformat}
(live, excess): (3, 1) -> (3, 0) -> (2, 1)
{noformat}
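For anyone reading along, the key change is essentially a poll-until-condition step between stopping and restarting the DN, so the test no longer races the invalidation. Below is a minimal sketch of that pattern; the {{waitFor}} helper and the usage comments are my own illustration, not the exact code in the patch (the real test goes through {{BlockManager#countNodes}} and its existing {{checkTimeout}} helper).
{code:java}
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/**
 * Sketch of the wait inserted between "DN stopped" and "DN restarted":
 * poll a condition (e.g. "the excess replica has been invalidated") until
 * it holds or a deadline passes. Names here are illustrative only.
 */
final class WaitSketch {
  static void waitFor(BooleanSupplier condition, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() > deadline) {
        throw new TimeoutException(
            "condition not met within " + timeoutMs + " msec");
      }
      Thread.sleep(100); // re-check every 100 ms
    }
  }

  // Conceptually, the test then does:
  //   waitFor(() -> excessReplicaCount(block) == 0, 20000); // (3, 1) -> (3, 0)
  //   restartStoppedDataNode();
  //   waitFor(() -> excessReplicaCount(block) == 1, 20000); // (3, 0) -> (2, 1)
  // where excessReplicaCount/restartStoppedDataNode stand in for the test's
  // actual BlockManager#countNodes and MiniDFSCluster restart calls.
}
{code}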
I am not a committer, but the patch looks good to me. I ran the patched test
and it did not fail in 100 runs.
> TestNodeCount#testNodeCount timed out
> -------------------------------------
>
> Key: HDFS-9358
> URL: https://issues.apache.org/jira/browse/HDFS-9358
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Wei-Chiu Chuang
> Assignee: Masatake Iwasaki
> Attachments: HDFS-9358.001.patch
>
>
> I have seen this test failure occurred a few times in trunk:
> Error Message
> Timeout: excess replica count not equal to 2 for block blk_1073741825_1001
> after 20000 msec. Last counts: live = 2, excess = 0, corrupt = 0
> Stacktrace
> java.util.concurrent.TimeoutException: Timeout: excess replica count not
> equal to 2 for block blk_1073741825_1001 after 20000 msec. Last counts: live
> = 2, excess = 0, corrupt = 0
> at
> org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.checkTimeout(TestNodeCount.java:152)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.checkTimeout(TestNodeCount.java:146)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.__CLR4_0_39bdgm666uf(TestNodeCount.java:130)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.TestNodeCount.testNodeCount(TestNodeCount.java:54)