[ https://issues.apache.org/jira/browse/HDFS-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eli Collins resolved HDFS-2742.
-------------------------------
Resolution: Fixed
Hadoop Flags: Reviewed
I've committed this.
> HA: observed dataloss in replication stress test
> ------------------------------------------------
>
> Key: HDFS-2742
> URL: https://issues.apache.org/jira/browse/HDFS-2742
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: data-node, ha, name-node
> Affects Versions: HA branch (HDFS-1623)
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Blocker
> Attachments: hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt,
> hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt, hdfs-2742.txt,
> log-colorized.txt
>
>
> The replication stress test case failed over the weekend since one of the
> replicas went missing. Still diagnosing the issue, but it seems like the
> chain of events was something like:
> - a block report was generated on one of the nodes while the block was being
> written - thus the block report listed the block as RBW
> - the standby replayed this queued message only after the file was marked
> complete, so it marked this replica as corrupt
> - the standby asked the DN holding the corrupt replica to delete it and, I
> think, removed it from the block map at this time
> - that DN then sent another block report before receiving the deletion. This
> caused the replica to be re-added to the block map, since it was now
> "FINALIZED"
> - Replication was lowered on the file, and it counted the above replica as
> non-corrupt, and asked for the other replicas to be deleted.
> - All replicas were lost.
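
The suspected chain of events above can be replayed in a toy model. This is
only an illustrative sketch, not HDFS code: ToyNameNode, process_report,
set_replication and simulate are hypothetical names, the datanodes are called
dn1 to dn3 for convenience, and the rule for which replica survives a
replication decrease (first in sorted order) is an assumption chosen so that
the stale replica is the one kept.

```python
from enum import Enum


class ReplicaState(Enum):
    RBW = "RBW"              # replica still being written
    FINALIZED = "FINALIZED"  # replica fully written


class ToyNameNode:
    """Hypothetical, heavily simplified block map for a single block."""

    def __init__(self):
        self.file_complete = False
        self.block_map = set()     # DNs believed to hold a good replica
        self.pending_deletes = []  # deletions issued but not yet delivered

    def process_report(self, dn, state):
        if self.file_complete and state is ReplicaState.RBW:
            # An RBW replica of a completed file looks corrupt:
            # drop it from the map and ask the DN to delete it.
            self.block_map.discard(dn)
            self.pending_deletes.append(dn)
        else:
            # Naive behaviour: re-add the replica even though a
            # deletion for this DN may still be pending.
            self.block_map.add(dn)

    def set_replication(self, target):
        # Assumption: keep the first `target` DNs in sorted order.
        keep = sorted(self.block_map)[:target]
        for dn in sorted(self.block_map):
            if dn not in keep:
                self.block_map.discard(dn)
                self.pending_deletes.append(dn)


def simulate():
    nn = ToyNameNode()
    on_disk = {"dn1", "dn2", "dn3"}       # DNs that really hold the block
    queued = ("dn1", ReplicaState.RBW)    # report generated mid-write

    nn.process_report("dn2", ReplicaState.FINALIZED)
    nn.process_report("dn3", ReplicaState.FINALIZED)
    nn.file_complete = True

    nn.process_report(*queued)                        # dn1 marked corrupt
    nn.process_report("dn1", ReplicaState.FINALIZED)  # dn1 re-added
    nn.set_replication(1)                             # dn2, dn3 deleted

    # All queued deletions, including the stale one for dn1, now arrive.
    for dn in nn.pending_deletes:
        on_disk.discard(dn)
    return on_disk, nn.block_map


if __name__ == "__main__":
    on_disk, block_map = simulate()
    print("replicas on disk:", on_disk)
    print("block map:", block_map)
```

In this model the block map still lists dn1 at the end, while no datanode
actually holds the block. The problem it illustrates is the re-add of a
replica whose deletion is still pending.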