[ https://issues.apache.org/jira/browse/HDFS-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon resolved HDFS-1260.
-------------------------------
Resolution: Fixed
Fix Version/s: 0.20.205.0 (was: 0.20-append)
This was committed to 0.20.205; resolving this JIRA.
> 0.20: Block lost when multiple DNs trying to recover it to different genstamps
> ------------------------------------------------------------------------------
>
> Key: HDFS-1260
> URL: https://issues.apache.org/jira/browse/HDFS-1260
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.20-append
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
> Fix For: 0.20.205.0
>
> Attachments: HDFS-1260-20S.3.patch, hdfs-1260.txt, hdfs-1260.txt,
> simultaneous-recoveries.txt
>
>
> Saw this issue on a cluster where some ops people were doing network changes
> without shutting down DNs first. So, recovery ended up getting started at
> multiple different DNs at the same time, and some race condition occurred
> that caused a block to get permanently stuck in recovery mode. What seems to
> have happened is the following:
> - FSDataset.tryUpdateBlock called with old genstamp 7091, new genstamp 7094,
> while the block in the volumeMap (and on filesystem) was genstamp 7093
> - we find the block file and meta file based on block ID only, without
> comparing gen stamp
> - we rename the meta file to the new genstamp _7094
> - in updateBlockMap, we do comparison in the volumeMap by oldblock *without*
> wildcard GS, so it does *not* update volumeMap
> - validateBlockMetadata now fails with "blk_7739687463244048122_7094 does not
> exist in blocks map"
> After this point, all future recovery attempts to that node fail in
> getBlockMetaDataInfo, since it finds the _7094 gen stamp in getStoredBlock
> (since the meta file got renamed above) and then fails since _7094 isn't in
> volumeMap in validateBlockMetadata
> Making a unit test for this is probably going to be difficult, but doable.
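
The sequence in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the real FSDataset code and not the committed patch: the class GenstampRaceSketch, its fields, and the tryUpdateBlockChecked guard are hypothetical names introduced for illustration. The model keeps a set of block names (standing in for volumeMap) and a map from block ID to the genstamp in the meta file name (standing in for the filesystem), then replays an update with a stale old genstamp.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the sequence described above; not the real FSDataset code. */
final class GenstampRaceSketch {
    // Stand-in for volumeMap: block names that include the genstamp.
    private final Set<String> volumeMap = new HashSet<>();
    // Stand-in for the disk: block ID -> genstamp in the meta file name.
    private final Map<Long, Long> metaOnDisk = new HashMap<>();

    static String name(long blockId, long genStamp) {
        return "blk_" + blockId + "_" + genStamp;
    }

    void addBlock(long blockId, long genStamp) {
        volumeMap.add(name(blockId, genStamp));
        metaOnDisk.put(blockId, genStamp);
    }

    /** Models the problematic path: files are located by block ID only. */
    void tryUpdateBlock(long blockId, long oldGs, long newGs) throws IOException {
        if (!metaOnDisk.containsKey(blockId)) {
            throw new IOException("no meta file for block " + blockId);
        }
        // "Rename" the meta file to the new genstamp without checking the old one.
        metaOnDisk.put(blockId, newGs);
        // Exact (non-wildcard) match on oldGs: a stale oldGs leaves the map untouched.
        if (volumeMap.remove(name(blockId, oldGs))) {
            volumeMap.add(name(blockId, newGs));
        }
    }

    /** Hypothetical guard: refuse the update when the caller's old genstamp is stale. */
    void tryUpdateBlockChecked(long blockId, long oldGs, long newGs) throws IOException {
        Long onDisk = metaOnDisk.get(blockId);
        if (onDisk == null || onDisk != oldGs) {
            throw new IOException("stale genstamp " + oldGs + " for block " + blockId);
        }
        tryUpdateBlock(blockId, oldGs, newGs);
    }

    /** Models the later check: genstamp comes from disk, lookup goes to the map. */
    void validateBlockMetadata(long blockId) throws IOException {
        Long onDisk = metaOnDisk.get(blockId);
        if (onDisk == null) {
            throw new IOException("no meta file for block " + blockId);
        }
        String stored = name(blockId, onDisk);
        if (!volumeMap.contains(stored)) {
            throw new IOException(stored + " does not exist in blocks map");
        }
    }

    public static void main(String[] args) {
        long id = 7739687463244048122L;
        GenstampRaceSketch ds = new GenstampRaceSketch();
        ds.addBlock(id, 7093L);
        try {
            ds.tryUpdateBlock(id, 7091L, 7094L); // stale old genstamp
            ds.validateBlockMetadata(id);
            System.out.println("block is consistent");
        } catch (IOException e) {
            System.out.println("recovery failed: " + e.getMessage());
        }
    }
}
```

In this model, once the unchecked update has run, disk and map disagree for good, so every later validateBlockMetadata call fails the same way. The checked variant rejects the stale request before touching the meta file, which is one way such a guard could look; the attached patches may take a different approach.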