[
https://issues.apache.org/jira/browse/HDFS-7065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kihwal Lee reassigned HDFS-7065:
--------------------------------
Assignee: Kihwal Lee
> Pipeline close recovery race can cause block corruption
> -------------------------------------------------------
>
> Key: HDFS-7065
> URL: https://issues.apache.org/jira/browse/HDFS-7065
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.5.0
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Critical
> Attachments: HDFS-7065.patch
>
>
> If multiple pipeline close recoveries are performed against the same block,
> the replica may become corrupt. Here is one case I have observed:
> The client tried to close a block, but the ACK timed out. It excluded the
> first DN and tried pipeline recovery (recoverClose). That too failed, and
> another recovery was attempted with only one DN. This took longer than usual,
> but the client eventually got an ACK and the file was closed successfully.
> Later on, the one and only replica was found to be corrupt.
> It turned out the DN was experiencing a transient slow disk I/O issue at that
> time. The first recovery was stuck until the second recovery was attempted 30
> seconds later. After a few seconds, both threads started running. The
> second recovery finished first, and then the first recovery, with an older gen
> stamp, finished, moving the gen stamp backward.
> There is a sanity check in {{recoverCheck()}}, but since the check and the
> modification are not synchronized as one atomic step, {{recoverClose()}} is
> not thread-safe.
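
The check-then-modify race described above can be illustrated with a minimal Java sketch. This is not the HDFS code or the attached patch: the class ReplicaSketch, its fields, and the rejection rule (refuse any gen stamp older than the current one) are hypothetical, chosen only to show why the check and the update need to happen under the same lock.

```java
// Hypothetical stand-in for a DataNode replica; NOT the actual HDFS classes.
final class ReplicaSketch {
    private long genStamp;

    ReplicaSketch(long initialGenStamp) {
        this.genStamp = initialGenStamp;
    }

    synchronized long getGenStamp() {
        return genStamp;
    }

    /**
     * Checks and updates the gen stamp under one lock, so a delayed recovery
     * carrying an older gen stamp cannot overwrite a newer one.
     * The "reject if older" rule is an assumption made for illustration.
     *
     * @return true if the recovery was applied, false if it was stale
     */
    synchronized boolean recoverClose(long newGenStamp) {
        if (newGenStamp < genStamp) {
            return false; // stale recovery: would move the gen stamp backward
        }
        genStamp = newGenStamp;
        return true;
    }

    public static void main(String[] args) {
        ReplicaSketch replica = new ReplicaSketch(1000L);
        // The second (newer) recovery finishes first ...
        System.out.println(replica.recoverClose(1002L)); // prints "true"
        // ... then the stuck first recovery arrives with an older gen stamp.
        System.out.println(replica.recoverClose(1001L)); // prints "false"
        System.out.println(replica.getGenStamp());       // prints "1002"
    }
}
```

If the comparison and the assignment were in separately locked (or unlocked) steps, both recoveries could pass the check before either one wrote, and the later writer would win regardless of its gen stamp, which matches the backward-moving gen stamp observed in this report.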