[
https://issues.apache.org/jira/browse/HBASE-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487013#comment-14487013
]
Matteo Bertozzi commented on HBASE-13430:
-----------------------------------------
Looks good, but the HFileLinkCleaner change should be the other way around.
You must check temp first and then the final dir, because the file moves in
that order: from temp to final.
The failure that you were seeing was probably related to this, or at least it
was the one that showed up every once in a while for me when running your test.
{code}
- hfilePath = HFileLink.getHFileFromBackReference(getConf(), filePath);
+ // Also check if the HFile is in the HBASE_TEMP_DIRECTORY; this is where the referenced
+ // file gets created when cloning a snapshot.
+ hfilePath = HFileLink.getHFileFromBackReference(
+     new Path(FSUtils.getRootDir(getConf()), HConstants.HBASE_TEMP_DIRECTORY), filePath);
+ if (fs.exists(hfilePath)) {
+   return false;
+ }
+ hfilePath = HFileLink.getHFileFromBackReference(FSUtils.getRootDir(getConf()), filePath);
  return !fs.exists(hfilePath);
{code}
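To illustrate why the order of the two checks matters, here is a minimal, self-contained sketch of the decision logic. It is not the HBase implementation: the class name, the string-based paths, the {{TEMP_DIR}} constant and the {{exists}} predicate are all hypothetical stand-ins for {{HFileLinkCleaner}}, {{Path}}, {{HConstants.HBASE_TEMP_DIRECTORY}} and {{fs.exists}}.

```java
import java.util.function.Predicate;

/** Simplified model of the cleaner's decision; NOT the actual HBase class. */
final class LinkCleanerSketch {
    // Hypothetical stand-in for HConstants.HBASE_TEMP_DIRECTORY.
    static final String TEMP_DIR = ".tmp";

    /**
     * Returns true only if the file referenced by the back reference exists
     * neither under the temp directory nor under its final location.
     *
     * @param exists      hypothetical stand-in for fs.exists(path)
     * @param rootDir     root directory, e.g. "/hbase"
     * @param linkRelPath path of the referenced file relative to the root
     */
    static boolean isDeletable(Predicate<String> exists, String rootDir, String linkRelPath) {
        // Check temp FIRST: the file moves from temp to final, so if it is not
        // in temp at this point it is either already in final or absent.
        String inTemp = rootDir + "/" + TEMP_DIR + "/" + linkRelPath;
        if (exists.test(inTemp)) {
            return false;
        }
        // Then check the final location.
        String inFinal = rootDir + "/" + linkRelPath;
        return !exists.test(inFinal);
    }
}
```

With the checks in the opposite order, a file that is moved from temp to final between the two calls would be missed by both of them and wrongly reported as deletable.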
> HFiles that are in use by a table cloned from a snapshot may be deleted when
> that snapshot is deleted
> -----------------------------------------------------------------------------------------------------
>
> Key: HBASE-13430
> URL: https://issues.apache.org/jira/browse/HBASE-13430
> Project: HBase
> Issue Type: Bug
> Components: hbase
> Reporter: Tobi Vollebregt
> Priority: Critical
> Labels: data-integrity, master
> Fix For: 2.0.0, 1.0.1, 1.1.0, 0.98.13
>
> Attachments: hbase-13430-attempted-fix.patch, hbase-13430-test.patch
>
>
> We recently had a production issue in which HFiles that were still in use by
> a table were deleted. This appears to have been caused by race conditions in
> the order in which HFileLinks are created, combined with the fact that only
> files younger than {{hbase.master.hfilecleaner.ttl}} are kept alive.
> This is how to reproduce:
> * Clone a large snapshot into a new table. The clone operation must take
> longer than {{hbase.master.hfilecleaner.ttl}} to guarantee data loss.
> * Ensure that no other table or snapshot is referencing the HFiles used by
> the new table.
> * Delete the snapshot. This breaks the table.
> The main cause is this:
> * Cloning a snapshot creates the table in the {{HBASE_TEMP_DIRECTORY}}.
> * However, it immediately creates, in the archive directory, the back
> references to the HFileLinks that it creates for the table.
> * HFileLinkCleaner does not check the {{HBASE_TEMP_DIRECTORY}}, so it
> considers all those back references deletable.
> * The only thing that keeps them alive is the TimeToLiveHFileCleaner, but
> only for 5 minutes.
> * So if cloning the snapshot takes more than 5 minutes, and the HFiles
> aren't referenced by anything else, data loss is guaranteed.
> I have a unit test reproducing the issue and I tried to fix this, but didn't
> completely succeed. I will attach the patch shortly.
> Workarounds:
> * Don't delete any snapshots that you cloned into a table. (We used this
> successfully; after the data loss happened, we restored the deleted snapshot
> from backup using ExportSnapshot, which reversed the data loss.)
> * Manually check the back references and create any missing ones after
> cloning a snapshot.
> * Increase {{hbase.master.hfilecleaner.ttl}}. (untested)
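The chain of reasoning in the issue description (the TTL cleaner is the only thing protecting the files while the clone is still in the temp directory) can be modelled with a small sketch. This is a simplified model under stated assumptions, not HBase code: it assumes a file is removed only when every cleaner considers it deletable, it uses the 5 minute TTL mentioned in the description, and all names are hypothetical.

```java
/** Toy model of the cleaner chain described in the issue; NOT HBase code. */
final class CleanerChainSketch {
    // Assumed TTL: the 5 minutes mentioned in the issue description.
    static final long TTL_MS = 5 * 60 * 1000L;

    /** TTL cleaner: a file older than the TTL is considered deletable. */
    static boolean ttlSaysDeletable(long fileAgeMs) {
        return fileAgeMs > TTL_MS;
    }

    /**
     * Link cleaner: a back reference is deletable if the link it points to
     * cannot be found. checksTempDir models whether the temp directory is
     * inspected (false = behaviour described in the issue).
     */
    static boolean linkSaysDeletable(boolean linkInFinalDir, boolean linkInTempDir,
                                     boolean checksTempDir) {
        if (checksTempDir && linkInTempDir) {
            return false;
        }
        return !linkInFinalDir;
    }

    /** Assumption: the file is deleted only if ALL cleaners agree. */
    static boolean wouldDelete(long fileAgeMs, boolean linkInFinalDir,
                               boolean linkInTempDir, boolean checksTempDir) {
        return ttlSaysDeletable(fileAgeMs)
            && linkSaysDeletable(linkInFinalDir, linkInTempDir, checksTempDir);
    }
}
```

In this model, a clone that has been running for six minutes with its links still in the temp directory loses its files when the temp directory is not checked, and keeps them when it is.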