[ 
https://issues.apache.org/jira/browse/HDFS-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-5438:
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.23.10
                   2.3.0
                   3.0.0
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Committed to trunk, branch-2 and branch-0.23. Thanks for the reviews, Daryn.

> Flaws in block report processing can cause data loss
> ----------------------------------------------------
>
>                 Key: HDFS-5438
>                 URL: https://issues.apache.org/jira/browse/HDFS-5438
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 0.23.9, 2.2.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>             Fix For: 3.0.0, 2.3.0, 0.23.10
>
>         Attachments: HDFS-5438-1.trunk.patch, HDFS-5438-2.trunk.patch, 
> HDFS-5438-3.trunk.patch, HDFS-5438-4.branch-0.23.patch, 
> HDFS-5438-4.trunk.patch, HDFS-5438.trunk.patch
>
>
> The incremental block reports from data nodes and block commits are 
> asynchronous. This becomes troublesome when the gen stamp for a block is 
> changed during a write pipeline recovery.
> * If an incremental block report is delayed from a node but NN had enough 
> replicas already, a report with the old gen stamp may be received after block 
> completion. This replica will be correctly marked corrupt. But if the node 
> had participated in the pipeline recovery, a new (delayed) report with the 
> correct gen stamp will come soon. However, this report won't have any effect 
> on the corrupt state of the replica.
> * If block reports are received while the block is still under construction 
> (i.e. client's call to make block committed has not been received by NN), 
> they are blindly accepted regardless of the gen stamp. If a failed node 
> reports in with the old gen stamp while pipeline recovery is on-going, it 
> will be accepted and counted as valid during commit of the block.
> Due to the above two problems, correct replicas can be marked corrupt and 
> corrupt replicas can be accepted during commit.  So far we have observed two 
> cases in production.
> * The client hangs forever to close a file. All replicas are marked corrupt.
> * After the successful close of a file, read fails. Corrupt replicas are 
> accepted during commit and valid replicas are marked corrupt afterward.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to