Junegunn Choi created HBASE-29111:
-------------------------------------

             Summary: Data loss in table cloned from a snapshot
                 Key: HBASE-29111
                 URL: https://issues.apache.org/jira/browse/HBASE-29111
             Project: HBase
          Issue Type: Bug
            Reporter: Junegunn Choi


We experienced permanent data loss in a table cloned from a snapshot.

Here's what we found.
 * If you clone a table from a snapshot that contains split regions and 
reference files, and then immediately delete the snapshot, HBase can prematurely 
delete the original HFiles, causing data loss.

h2. How to reproduce

To reproduce the issue quickly, shorten the cleaner and janitor intervals and 
set the HFile TTL to zero. Also disable compaction so that reference files are not 
compacted away. Alternatively, write enough data that compaction cannot finish 
during the test.
{code:xml}
  <property>
    <name>hbase.master.cleaner.interval</name>
    <value>1000</value>
  </property>

  <property>
    <name>hbase.catalogjanitor.interval</name>
    <value>1000</value>
  </property>

  <property>
    <name>hbase.master.hfilecleaner.ttl</name>
    <value>0</value>
  </property>

  <property>
    <name>hbase.regionserver.compaction.enabled</name>
    <value>false</value>
  </property>
{code}
Then run the following in the HBase shell.
{code:java}
# Create test table and write some data
create 't', 'd'
10.times do |i|
  put 't', i, 'd:foo', '_' * 1024
end

# Split in the middle and take the snapshot
split 't', '5'
snapshot 't', 's'

# Drop the table and clone it from the snapshot
disable 't'
drop 't'
clone_snapshot 's', 't'

# Immediately delete the snapshot
delete_snapshot 's'

# Try disabling and re-enabling the table
sleep 2
disable 't'
enable 't'

# java.io.FileNotFoundException: HFileLink locations=[...]
{code}
h2. What actually happens

The user clones a table from a snapshot containing split regions and reference 
files.
{noformat}
snapshot.RestoreSnapshotHelper: clone region=a23be88470c13611f6f24f20e0cf00ed 
as a23be88470c13611f6f24f20e0cf00ed in snapshot s

...

regionserver.HRegion: creating {ENCODED => a23be88470c13611f6f24f20e0cf00ed, 
NAME => 't,40000000,1738562472443.a23be88470c13611f6f24f20e0cf00ed.', STARTKEY 
=> '40000000', ENDKEY => '80000000', OFFLINE => true, SPLIT => true}, 
tableDescriptor='t', {TABLE_ATTRIBUTES => {METADATA => 
{'hbase.store.file-tracker.impl' => 'DEFAULT'}}}, {NAME => 'd', 
INDEX_BLOCK_ENCODING => 'NONE', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', 
DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', 
REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', 
COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536 B (64KB)'}, 
regionDir=file:/Users/jg/github/hbase/tmp/hbase

...

snapshot.RestoreSnapshotHelper: finishing restore table regions using 
snapshot=name: "s"
{noformat}
The user then deletes the snapshot.
{noformat}
snapshot.SnapshotManager: Deleting snapshot: s
{noformat}
After a while, CatalogJanitor garbage-collects the split parents because it sees no 
daughter information in the meta table.
{noformat}
janitor.CatalogJanitor: Cleaning parent region {ENCODED => 
a23be88470c13611f6f24f20e0cf00ed, NAME => 
't,40000000,1738562472443.a23be88470c13611f6f24f20e0cf00ed.', STARTKEY => 
'40000000', ENDKEY => '80000000', OFFLINE => true, SPLIT => true}
janitor.CatalogJanitor: Deleting region a23be88470c13611f6f24f20e0cf00ed 
because daughters -- null, null -- no longer hold references
{noformat}
(Note the {{null, null}} part.)
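
The effect of the missing daughter information can be illustrated with a small model. This is only a sketch under assumptions, not the HBase source: {{Region}}, {{holds_references?}} and {{can_clean_parent?}} are hypothetical names, and the sketch assumes a daughter that is absent from the meta table is treated as holding no references, which is what the log line above suggests.

```ruby
# Simplified model of the janitor's decision; NOT the HBase source code.
# Region and both method names are hypothetical.
Region = Struct.new(:name, :reference_files)

# A daughter that is missing from meta (nil) has no reference files to
# inspect, so this model treats it as holding no references.
def holds_references?(daughter)
  !daughter.nil? && !daughter.reference_files.empty?
end

# The parent is considered safe to clean when neither daughter holds references.
def can_clean_parent?(daughter_a, daughter_b)
  !holds_references?(daughter_a) && !holds_references?(daughter_b)
end

daughter = Region.new('daughter-1', ['ref-to-parent-hfile'])

puts can_clean_parent?(daughter, nil)  # false: a known daughter still references the parent
puts can_clean_parent?(nil, nil)       # true: with no daughter info, nothing blocks the cleanup
```

In this model the {{nil, nil}} case passes trivially, even though the real daughter regions may still depend on the parent's files.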

This causes the HFileLinks to be archived.
{noformat}
backup.HFileArchiver: Archived from FileablePath, 
file:/Users/jg/github/hbase/tmp/hbase/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/t=a23be88470c13611f6f24f20e0cf00ed-ecdd6aa22a6146599467839c56767522
 to 
file:/Users/jg/github/hbase/tmp/hbase/archive/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/t=a23be88470c13611f6f24f20e0cf00ed-ecdd6aa22a6146599467839c56767522
{noformat}
The cleaners then unanimously agree to delete the original HFile.
 * We already deleted the snapshot, so SnapshotHFileCleaner does not object.
 * Because the HFileLink has been archived, HFileLinkCleaner does not object.
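
The reasoning above can be sketched as a chain of checks in which a file is deleted only if every check agrees. This is a simplified model under assumptions, not the actual cleaner implementation: all method and parameter names are hypothetical, and the TTL check is included only because the reproduction sets the HFile TTL to zero.

```ruby
# Simplified model of a cleaner chain; NOT the HBase source code.
# All names below are hypothetical.

# Each check returns true when that cleaner has no objection to deletion.
def snapshot_check(hfile, snapshot_files)
  !snapshot_files.include?(hfile)   # no existing snapshot references the file
end

def link_check(hfile, live_links)
  !live_links.include?(hfile)       # no non-archived link points at the file
end

def ttl_check(age_ms, ttl_ms)
  age_ms >= ttl_ms                  # the file has outlived its TTL
end

# The file is deletable only if every check agrees.
def deletable?(hfile, snapshot_files:, live_links:, age_ms:, ttl_ms:)
  snapshot_check(hfile, snapshot_files) &&
    link_check(hfile, live_links) &&
    ttl_check(age_ms, ttl_ms)
end

# While the snapshot and the link still exist, the file is protected.
puts deletable?('hfile-1', snapshot_files: ['hfile-1'], live_links: ['hfile-1'],
                age_ms: 0, ttl_ms: 0)   # false

# After the snapshot is deleted and the link is archived, nothing objects.
puts deletable?('hfile-1', snapshot_files: [], live_links: [],
                age_ms: 0, ttl_ms: 0)   # true
```

Each check is reasonable in isolation. The problem is that none of them knows the daughter regions still need the file.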

The HFile is then deleted before the daughter regions can rewrite its data 
through compaction.
{noformat}
cleaner.HFileCleaner: Removing 
file:/Users/jg/github/hbase/tmp/hbase/archive/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522
{noformat}
The result is data loss.
{noformat}
regionserver.CompactSplit: Compaction selection failed 
region=t,6,1738562566034.8ad3785a3afe89e59b72db5d5d3a1bf5., 
storeName=8ad3785a3afe89e59b72db5d5d3a1bf5/d, priority=14, 
startTime=1738562622689
java.io.FileNotFoundException: HFileLink locations=[
  
file:/Users/jg/github/hbase/tmp/hbase/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522,
  
file:/Users/jg/github/hbase/tmp/hbase/archive/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522,
  
file:/Users/jg/github/hbase/tmp/hbase/.tmp/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522,
  
file:/Users/jg/github/hbase/tmp/hbase/mobdir/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522
]
{noformat}
h2. Fix

Write the split information to the meta table when cloning a table. 
The information was already available; we just didn't use it.
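
As a rough illustration of the idea, the following sketch builds meta rows for a cloned table with and without the daughter information. It is a hypothetical model, not the actual patch: {{SnapshotRegion}}, {{meta_rows_for_clone}} and the {{keep_split_info}} flag are invented for illustration, and the sketch assumes the parent-to-daughter mapping is available at clone time.

```ruby
# Hypothetical model of meta rows written during a clone; NOT HBase code.
SnapshotRegion = Struct.new(:name, :split, :daughters)

# Builds one meta row per region. When keep_split_info is true, split
# parents also record their daughters, so later checks can find them.
def meta_rows_for_clone(regions, keep_split_info:)
  regions.map do |r|
    row = { name: r.name, split: r.split }
    row[:daughters] = r.daughters if keep_split_info && r.split
    row
  end
end

regions = [
  SnapshotRegion.new('parent', true, ['daughter-a', 'daughter-b']),
  SnapshotRegion.new('daughter-a', false, nil),
  SnapshotRegion.new('daughter-b', false, nil)
]

before = meta_rows_for_clone(regions, keep_split_info: false)
after  = meta_rows_for_clone(regions, keep_split_info: true)

p before.first[:daughters]  # nil
p after.first[:daughters]   # ["daughter-a", "daughter-b"]
```

With the daughters recorded, a check that looks them up would no longer see {{null, null}} and could verify whether they still hold references before the parent is cleaned.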


I will open pull requests against both master and branch-2.


