[ 
https://issues.apache.org/jira/browse/HDFS-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15023654#comment-15023654
 ] 

Walter Su commented on HDFS-6101:
---------------------------------

Thanks for the update.
bq. We should call cluster.setDataNodeDead(..) to remove it from cluster map.
1. Actually, that was wrong; my mistake. This line is unnecessary.

2. Suggestion: you can enable logging to make debugging easier.
{code}
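import org.apache.commons.logging.LogFactory;
import org.apache.commons.logging.impl.Log4JLogger;
import org.apache.log4j.Level;
// In the test class: log all block placement decisions.
static {
  ((Log4JLogger)LogFactory.getLog(BlockPlacementPolicy.class))
      .getLogger().setLevel(Level.ALL);
}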
{code}

3. 
bq. sleep 3 seconds instead of 1 seconds
That's not exactly the intention of the old logic. I tried sleep(1) and got:
{noformat}
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy$NotEnoughReplicasException:
 [
Node /rack1/127.0.0.1:44392 [
  Datanode 127.0.0.1:44392 is not chosen since no good storage to place the 
block .
]
{noformat}
That's because the first block report has not finished yet, so 
DatanodeDescriptor#storageMap is still empty. I tried 
{{cluster.waitFirstBRCompleted();}} but there is a race condition with the 
{{slowwriters}}.

So I think we can:
1. Start 5 writers, and sleep briefly so that they have all started.
2. Start 2 new DNs, call waitFirstBRCompleted, and stop an old DN. (We don't 
need to call cluster.setDataNodeDead().)
3. Start 5 new writers.

As the comment says:
{noformat}
      // Let slow writers write something.
      // Some of them are too slow and will be not yet started. 
{noformat}
In this way, we don't change the logic of the test.
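
For illustration only, the ordering of the three steps above might be sketched as follows. The {{Cluster}} and {{Writers}} interfaces, their method names, the 100 ms pause, and the choice of stopping datanode 0 are all hypothetical stand-ins; this is not the MiniDFSCluster API and not the actual patch. It only records the order in which the steps would be issued.

```java
import java.util.ArrayList;
import java.util.List;

public class ReorderedTestSketch {
  /** Hypothetical stand-in for the cluster calls; NOT the MiniDFSCluster API. */
  interface Cluster {
    void startDataNodes(int count);
    void waitFirstBlockReports();
    void stopDataNode(int index);
  }

  /** Hypothetical stand-in for launching slow writer threads. */
  interface Writers {
    void start(int count);
  }

  /** Short pause so the first batch of writers can get going (duration is an assumption). */
  static void sleepBriefly() {
    try {
      Thread.sleep(100);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Issues the three proposed steps in order. */
  static void runSteps(Cluster cluster, Writers writers) {
    // Step 1: start 5 writers and let them all start.
    writers.start(5);
    sleepBriefly();
    // Step 2: add 2 new DNs, wait for their first block reports, stop an old DN.
    cluster.startDataNodes(2);
    cluster.waitFirstBlockReports();
    cluster.stopDataNode(0); // which old DN to stop is an assumption
    // Step 3: start 5 more writers.
    writers.start(5);
  }

  /** Runs the steps against an in-memory fake that records each call. */
  static List<String> recordSequence() {
    final List<String> events = new ArrayList<>();
    Cluster cluster = new Cluster() {
      @Override public void startDataNodes(int count) {
        events.add("start " + count + " datanodes");
      }
      @Override public void waitFirstBlockReports() {
        events.add("wait for first block reports");
      }
      @Override public void stopDataNode(int index) {
        events.add("stop datanode " + index);
      }
    };
    runSteps(cluster, count -> events.add("start " + count + " writers"));
    return events;
  }

  public static void main(String[] args) {
    for (String event : recordSequence()) {
      System.out.println(event);
    }
  }
}
```

The point of the ordering is that the new DNs finish their first block report before the old DN is stopped, so the second batch of writers never sees an empty storageMap.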

4. This line is not needed.
{code}
final BlockManager bm =
{code}

> TestReplaceDatanodeOnFailure fails occasionally
> -----------------------------------------------
>
>                 Key: HDFS-6101
>                 URL: https://issues.apache.org/jira/browse/HDFS-6101
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Arpit Agarwal
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-6101.001.patch, HDFS-6101.002.patch, 
> HDFS-6101.003.patch, HDFS-6101.004.patch, HDFS-6101.005.patch, 
> HDFS-6101.006.patch, TestReplaceDatanodeOnFailure.log
>
>
> Exception details in a comment below.
> The failure repros on both OS X and Linux if I run the test ~10 times in a 
> loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
