[ https://issues.apache.org/jira/browse/HBASE-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490498#comment-14490498 ]

Tobi Vollebregt commented on HBASE-13430:
-----------------------------------------

Okay, so based on running the tests many times in the background I'm pretty sure that my changes to {{HFileLink}} cause:
{code}
java.lang.AssertionError:
Expected :17576
Actual   :14046
{code}
If I undo that change, but keep my changes to {{HFileCleaner}}, then I'm not seeing that exception in 50 test runs, but if I keep the changes to {{HFileLink}} then it fails roughly 1 out of 10 times with the above {{AssertionError}} in the *existing tests*.

I will run some more tests to make sure that the changes to {{HFileCleaner}} are sufficient to fix the issue, and then I'll submit a smaller patch that does not modify {{HFileLink}}.

> HFiles that are in use by a table cloned from a snapshot may be deleted when
> that snapshot is deleted
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-13430
>                 URL: https://issues.apache.org/jira/browse/HBASE-13430
>             Project: HBase
>          Issue Type: Bug
>          Components: hbase
>            Reporter: Tobi Vollebregt
>            Priority: Critical
>              Labels: data-integrity, master
>             Fix For: 2.0.0, 1.1.0, 0.98.13, 1.0.2
>
>         Attachments: HBASE-13430-master-v1.patch,
> hbase-13430-attempted-fix.patch, hbase-13430-test.patch
>
>
> We recently had a production issue in which HFiles that were still in use by
> a table were deleted. This appears to have been caused by race conditions in
> the order in which HFileLinks are created, combined with the fact that only
> files younger than {{hbase.master.hfilecleaner.ttl}} are kept alive.
> This is how to reproduce:
> * Clone a large snapshot into a new table. The clone operation must take
> more than {{hbase.master.hfilecleaner.ttl}} time to guarantee data loss.
> * Ensure that no other table or snapshot is referencing the HFiles used by
> the new table.
> * Delete the snapshot. This breaks the table.
> The main cause is this:
> * Cloning a snapshot creates the table in the {{HBASE_TEMP_DIRECTORY}}.
> * However, it immediately creates back references to the HFileLinks that it
> creates for the table in the archive directory.
> * HFileLinkCleaner does not check the {{HBASE_TEMP_DIRECTORY}}, so it
> considers all those back references deletable.
> * The only thing that keeps them alive is the TimeToLiveHFileCleaner, but
> only for 5 minutes.
> * So if cloning the snapshot takes more than 5 minutes, and the HFiles
> aren't referenced by anything else, data loss is guaranteed.
> I have a unit test reproducing the issue and I tried to fix this, but didn't
> completely succeed. I will attach the patch shortly.
> Workarounds:
> * Don't delete any snapshots that you cloned into a table (we used this
> successfully-- we actually restored the deleted snapshot from backup using
> ExportSnapshot after the data loss happened, which successfully reversed the
> data loss).
> * Manually check the back references and create any missing ones after
> cloning a snapshot.
> * Increase {{hbase.master.hfilecleaner.ttl}}. (untested)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
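
The cleaner interaction listed under "The main cause" in the quoted description can be illustrated with a small standalone model. This is only a sketch and not HBase code: the class {{CleanerRaceSketch}}, its nested types, and the directory labels "tmp" and "data" are hypothetical names made up for illustration. It assumes an archived file is deleted only when every cleaner agrees, that the snapshot and all other references are already gone, and it uses the 5 minute TTL mentioned in the description.

```java
import java.util.List;
import java.util.Set;

public class CleanerRaceSketch {

    /** Hypothetical model of an archived HFile. */
    static final class ArchivedFile {
        final long modifiedMillis;        // last modification time of the file
        final Set<String> linkLocations;  // directories that hold links to it

        ArchivedFile(long modifiedMillis, Set<String> linkLocations) {
            this.modifiedMillis = modifiedMillis;
            this.linkLocations = linkLocations;
        }
    }

    /** One cleaner's vote on whether a file may be deleted. */
    interface CleanerDelegate {
        boolean isDeletable(ArchivedFile file, long nowMillis);
    }

    /** Votes "deletable" only once the file is older than the TTL. */
    static CleanerDelegate ttlCleaner(long ttlMillis) {
        return (file, now) -> now - file.modifiedMillis > ttlMillis;
    }

    /** Votes "deletable" unless a link exists in a directory it scans. */
    static CleanerDelegate linkCleaner(Set<String> scannedDirs) {
        return (file, now) ->
                file.linkLocations.stream().noneMatch(scannedDirs::contains);
    }

    /** Assumption of this sketch: a file is deleted only if all cleaners agree. */
    static boolean isDeletable(ArchivedFile file, long now, List<CleanerDelegate> chain) {
        return chain.stream().allMatch(c -> c.isDeletable(file, now));
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        // The link cleaner scans only "data", never "tmp".
        List<CleanerDelegate> chain =
                List.of(ttlCleaner(5 * minute), linkCleaner(Set.of("data")));

        // While the clone is running, the only link lives in the temp directory.
        ArchivedFile cloning = new ArchivedFile(0L, Set.of("tmp"));
        System.out.println(isDeletable(cloning, 4 * minute, chain)); // false: TTL protects it
        System.out.println(isDeletable(cloning, 6 * minute, chain)); // true: nothing protects it

        // Once the table has moved out of the temp directory, the link is seen.
        ArchivedFile finished = new ArchivedFile(0L, Set.of("data"));
        System.out.println(isDeletable(finished, 6 * minute, chain)); // false
    }
}
```

In this model, raising the TTL only widens the protected window, and a clone that runs longer than the TTL is still exposed. Making the link cleaner also scan "tmp" closes the window regardless of how long the clone takes.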