[
https://issues.apache.org/jira/browse/HBASE-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501309#comment-14501309
]
Hudson commented on HBASE-13430:
--------------------------------
FAILURE: Integrated in HBase-1.2 #6 (See
[https://builds.apache.org/job/HBase-1.2/6/])
HBASE-13430 HFiles that are in use by a table cloned from a snapshot may be
deleted when that snapshot is deleted (matteo.bertozzi: rev
10254b74ae69a43b31e2e85d25ba72b88cbd635e)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/cleaner/HFileLinkCleaner.java
* hbase-server/src/test/java/org/apache/hadoop/hbase/client/TestSnapshotCloneIndependence.java
> HFiles that are in use by a table cloned from a snapshot may be deleted when
> that snapshot is deleted
> -----------------------------------------------------------------------------------------------------
>
> Key: HBASE-13430
> URL: https://issues.apache.org/jira/browse/HBASE-13430
> Project: HBase
> Issue Type: Bug
> Components: hbase
> Reporter: Tobi Vollebregt
> Assignee: Tobi Vollebregt
> Priority: Critical
> Labels: data-integrity, master
> Fix For: 2.0.0, 1.1.0, 0.98.13, 1.0.2
>
> Attachments: HBASE-13430-master-v1.patch,
> HBASE-13430-master-v2.patch, hbase-13430-attempted-fix.patch,
> hbase-13430-test.patch
>
>
> We recently had a production issue in which HFiles that were still in use by
> a table were deleted. This appears to have been caused by a race in the
> order in which HFileLinks are created, combined with the fact that archived
> files are kept alive only while they are younger than
> {{hbase.master.hfilecleaner.ttl}}.
> Steps to reproduce:
> * Clone a large snapshot into a new table. The clone operation must take
> more than {{hbase.master.hfilecleaner.ttl}} time to guarantee data loss.
> * Ensure that no other table or snapshot is referencing the HFiles used by
> the new table.
> * Delete the snapshot. This breaks the table.
> The main cause is this:
> * Cloning a snapshot creates the table in the {{HBASE_TEMP_DIRECTORY}}.
> * However, it immediately creates, in the archive directory, back references
> to the HFileLinks that it creates for the table.
> * HFileLinkCleaner does not check the {{HBASE_TEMP_DIRECTORY}}, so it
> considers all those back references deletable.
> * The only thing that keeps them alive is the TimeToLiveHFileCleaner, but
> only for 5 minutes (the default {{hbase.master.hfilecleaner.ttl}}).
> * So if cloning the snapshot takes more than 5 minutes, and the HFiles
> aren't referenced by anything else, data loss is guaranteed.
> I have a unit test reproducing the issue and I tried to fix this, but didn't
> completely succeed. I will attach the patch shortly.
> Workarounds:
> * Don't delete any snapshots that you cloned into a table. (We used this
> workaround successfully; after the data loss happened we restored the
> deleted snapshot from a backup using ExportSnapshot, which reversed the
> data loss.)
> * Manually check the back references and create any missing ones after
> cloning a snapshot.
> * Increase {{hbase.master.hfilecleaner.ttl}}. (untested)
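The cleaner interaction described in the issue can be sketched with a small self-contained toy model. This is plain Java, not the real HBase cleaner API: the class, record, and method names below are illustrative, and {{TTL_MS}} stands in for the default {{hbase.master.hfilecleaner.ttl}}. It only models the decision logic: the chain deletes a file when every delegate calls it deletable, the link cleaner ignores back references under the temp directory, and the TTL cleaner keeps only young files.

```java
// Toy model of the HFile cleaner chain race (illustrative, not the HBase API).
public class CleanerRaceSketch {
    static final long TTL_MS = 5 * 60 * 1000; // stands in for hbase.master.hfilecleaner.ttl

    // A candidate file in the archive: its age, and whether its back
    // reference currently lives under the temp directory (clone still in
    // progress) or under a final table directory (clone finished).
    record ArchivedFile(String name, long ageMs,
                        boolean backRefInTempDir, boolean backRefInTableDir) {}

    // Sketch of the HFileLinkCleaner behavior described above: a file is
    // deletable unless a *visible* back reference points at it. The bug is
    // that back references under the temp directory are not visible here.
    static boolean hfileLinkCleanerDeletable(ArchivedFile f) {
        return !f.backRefInTableDir; // temp-dir back refs are ignored
    }

    // Sketch of TimeToLiveHFileCleaner: keep files younger than the TTL.
    static boolean ttlCleanerDeletable(ArchivedFile f) {
        return f.ageMs > TTL_MS;
    }

    // The chain deletes a file only if every delegate says it is deletable.
    static boolean chainDeletes(ArchivedFile f) {
        return hfileLinkCleanerDeletable(f) && ttlCleanerDeletable(f);
    }

    public static void main(String[] args) {
        // Slow clone: still mid-clone (back ref only in temp dir) but already
        // older than the TTL -> both delegates say deletable -> data loss.
        ArchivedFile slowClone = new ArchivedFile("hfile-a", TTL_MS + 1, true, false);
        // Fast clone: only the TTL cleaner is protecting the file.
        ArchivedFile fastClone = new ArchivedFile("hfile-b", TTL_MS - 1, true, false);
        // Finished clone: the back reference is visible, so the file is safe.
        ArchivedFile done = new ArchivedFile("hfile-c", TTL_MS + 1, false, true);

        System.out.println(chainDeletes(slowClone)); // true  -> file deleted
        System.out.println(chainDeletes(fastClone)); // false -> saved by TTL only
        System.out.println(chainDeletes(done));      // false -> saved by back ref
    }
}
```

The model makes the failure window concrete: while the clone is in progress the file's survival rests solely on the TTL cleaner, so any clone that outlasts the TTL loses the race.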
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)