[
https://issues.apache.org/jira/browse/HBASE-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493091#comment-14493091
]
Matteo Bertozzi commented on HBASE-13430:
-----------------------------------------
I think the solution Tobi provided is OK; we already do the same thing with
HFileLink.
The main problem is files that move around. If we want to change the fs
layout again, we should avoid having files move around (see HBASE-7806).
So, to me, the only fix we should make here is adding the temp dir (as
HFileLink already does).
Otherwise we end up with much more code and possible "incompatibilities" when
you do a rolling upgrade of the masters.
> HFiles that are in use by a table cloned from a snapshot may be deleted when
> that snapshot is deleted
> -----------------------------------------------------------------------------------------------------
>
> Key: HBASE-13430
> URL: https://issues.apache.org/jira/browse/HBASE-13430
> Project: HBase
> Issue Type: Bug
> Components: hbase
> Reporter: Tobi Vollebregt
> Priority: Critical
> Labels: data-integrity, master
> Fix For: 2.0.0, 1.1.0, 0.98.13, 1.0.2
>
> Attachments: HBASE-13430-master-v1.patch,
> HBASE-13430-master-v2.patch, hbase-13430-attempted-fix.patch,
> hbase-13430-test.patch
>
>
> We recently had a production issue in which HFiles that were still in use by
> a table were deleted. This appears to have been caused by race conditions in
> the order in which HFileLinks are created, combined with the fact that only
> files younger than {{hbase.master.hfilecleaner.ttl}} are kept alive.
> This is how to reproduce:
> * Clone a large snapshot into a new table. The clone operation must take
> longer than {{hbase.master.hfilecleaner.ttl}} to guarantee data loss.
> * Ensure that no other table or snapshot is referencing the HFiles used by
> the new table.
> * Delete the snapshot. This breaks the table.
> The main cause is this:
> * Cloning a snapshot creates the table in the {{HBASE_TEMP_DIRECTORY}}.
> * However, it immediately creates back references to the HFileLinks that it
> creates for the table in the archive directory.
> * HFileLinkCleaner does not check the {{HBASE_TEMP_DIRECTORY}}, so it
> considers all those back references deletable.
> * The only thing that keeps them alive is the TimeToLiveHFileCleaner, but
> only for {{hbase.master.hfilecleaner.ttl}} (5 minutes by default).
> * So if cloning the snapshot takes longer than that, and the HFiles
> aren't referenced by anything else, data loss is guaranteed.
> I have a unit test reproducing the issue and I tried to fix this, but didn't
> completely succeed. I will attach the patch shortly.
> Workarounds:
> * Don't delete any snapshots that you cloned into a table. (We used this
> successfully; after the data loss happened we restored the deleted snapshot
> from backup using ExportSnapshot, which reversed the data loss.)
> * Manually check the back references and create any missing ones after
> cloning a snapshot.
> * Increase {{hbase.master.hfilecleaner.ttl}}. (untested)
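
The race described in the issue, and the temp dir fix suggested in the comment, can be modelled roughly as follows. This is only a toy sketch, not HBase code: the class, method, and parameter names are all hypothetical, the sets stand in for "which tables hold an HFileLink to this archived file", and it assumes the snapshot has already been deleted so that nothing else protects the file.

```java
import java.util.Collections;
import java.util.Set;

/** Toy model of the cleaner decision. None of these names are HBase APIs. */
public class BackRefCleanerSketch {

    /** Hypothetical stand-in for hbase.master.hfilecleaner.ttl (5 minutes). */
    public static final long TTL_MS = 5 * 60 * 1000L;

    /**
     * Decides whether an archived HFile may be deleted.
     *
     * @param liveLinks tables in the live table directory that link to the file
     * @param tempLinks tables still in the temp directory that link to the file
     * @param ageMs     age of the archived file in milliseconds
     * @param checkTemp false models the reported behaviour, true the suggested fix
     */
    public static boolean isDeletable(Set<String> liveLinks, Set<String> tempLinks,
                                      long ageMs, boolean checkTemp) {
        // A link from the temp directory only counts if the cleaner looks there.
        boolean referenced = !liveLinks.isEmpty()
                || (checkTemp && !tempLinks.isEmpty());
        // The TTL cleaner protects the file only while it is young enough.
        boolean withinTtl = ageMs <= TTL_MS;
        return !referenced && !withinTtl;
    }

    public static void main(String[] args) {
        Set<String> live = Collections.emptySet();               // clone not yet moved out of temp
        Set<String> temp = Collections.singleton("cloned_table"); // clone still in progress
        long sixMinutes = 6 * 60 * 1000L;

        System.out.println(isDeletable(live, temp, sixMinutes, false)); // true: temp dir ignored
        System.out.println(isDeletable(live, temp, sixMinutes, true));  // false: temp dir checked
        System.out.println(isDeletable(live, temp, 60_000L, false));    // false: still within TTL
    }
}
```

In this model the only change needed to keep the file alive during a slow clone is to count links from the temp directory as references, which is why a larger TTL merely narrows the window rather than closing it.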