[ 
https://issues.apache.org/jira/browse/HADOOP-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681107#action_12681107
 ] 

Hairong Kuang edited comment on HADOOP-5465 at 3/12/09 2:37 PM:
----------------------------------------------------------------

Two bugs in DFS contributed to the problem:
(1). The DataNode does not synchronize modifications to the counter 
"xmitsInProgress", which keeps track of the number of replications in 
progress. When two threads update the counter concurrently, a race condition 
may occur: the counter can end up with a non-zero value even when no 
replication is in progress.
(2). Each DN is configured to allow at most 2 replications in progress. When a 
DN notifies the NN that it has 1 replication in progress, the NN should be 
able to send one block replication request to that DN. But the NN wrongly 
interprets the counter as the number of targets: when it sees that the block 
is scheduled to 2 targets but the DN can only take 1, it sends an empty 
replication request to the DN, blocking all replications from that DataNode. 
If the DataNode is the only source of an under-replicated block, the block 
will never get replicated.
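The capacity arithmetic behind bug (2) can be sketched as follows. This is an illustrative standalone sketch, not actual NameNode code; the method names are hypothetical, and only the limit of 2, the in-progress count of 1, and the 2 scheduled targets come from the description above.

```java
public class ReplicationCapacitySketch {
    // Correct interpretation: the DN can accept (limit - xmitsInProgress)
    // NEW replication requests, independent of how many targets the block
    // happens to be scheduled to.
    static int correctRequests(int limit, int xmitsInProgress) {
        return Math.max(0, limit - xmitsInProgress);
    }

    // Buggy interpretation described above: the NN compares the block's
    // target count to the remaining capacity and, when it does not fit,
    // sends an empty request instead of a partial one.
    static int buggyRequests(int limit, int xmitsInProgress, int targets) {
        int capacity = limit - xmitsInProgress;
        return targets > capacity ? 0 : targets;
    }

    public static void main(String[] args) {
        // DN limit 2, 1 transfer in progress, block scheduled to 2 targets:
        System.out.println(correctRequests(2, 1)); // 1 request should go out
        System.out.println(buggyRequests(2, 1, 2)); // 0: the empty request
    }
}
```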

Fixing either (1) or (2) would fix the problem. I think (1) is more 
fundamental, so I will fix (1) in this jira and file a separate jira to fix (2).
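A minimal sketch of the fix for (1) follows. The field name xmitsInProgress is from the comment above, but the surrounding class is hypothetical, not the actual DataNode code; the point is that making the counter's read-modify-write updates atomic prevents the lost updates that leave it non-zero when no transfer is running.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the DataNode's transfer bookkeeping. A plain
// "int xmitsInProgress" allows a lost update when two threads perform the
// increment or decrement concurrently; AtomicInteger does not.
class TransferCounter {
    private final AtomicInteger xmitsInProgress = new AtomicInteger(0);

    void transferStarted()  { xmitsInProgress.incrementAndGet(); }
    void transferFinished() { xmitsInProgress.decrementAndGet(); }
    int  inProgress()       { return xmitsInProgress.get(); }
}

public class XmitsSyncSketch {
    public static void main(String[] args) throws InterruptedException {
        TransferCounter counter = new TransferCounter();
        // Two threads each start and finish 100000 transfers. With an
        // unsynchronized int the final value could be non-zero even though
        // no transfer is in progress -- exactly the symptom described above.
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter.transferStarted();
                counter.transferFinished();
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println("xmitsInProgress = " + counter.inProgress());
    }
}
```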

> Blocks remain under-replicated
> ------------------------------
>
>                 Key: HADOOP-5465
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5465
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.18.3
>            Reporter: Hairong Kuang
>            Assignee: Hairong Kuang
>            Priority: Blocker
>             Fix For: 0.18.4, 0.19.2, 0.20.0, 0.21.0
>
>         Attachments: xmitsSync1.patch
>
>
> Occasionally we see some blocks remain under-replicated in our 
> production clusters. This is what we observed:
> 1. Sometimes when increasing the replication factor of a file, some blocks 
> belonging to this file do not get replicated up to the new replication factor.
> 2. When taking a meta save on two different days, the same blocks remain in 
> the under-replication queue. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
