Junegunn Choi created HBASE-29111:
-------------------------------------

             Summary: Data loss in table cloned from a snapshot
                 Key: HBASE-29111
                 URL: https://issues.apache.org/jira/browse/HBASE-29111
             Project: HBase
          Issue Type: Bug
            Reporter: Junegunn Choi

We experienced permanent data loss in a table cloned from a snapshot. Here's what we found.

* If you clone a table from a snapshot that contains split regions and reference files, and immediately delete the snapshot, HBase can prematurely delete the original HFiles, causing data loss.

h2. How to reproduce

To quickly reproduce the issue, shorten the cleaner and janitor intervals and set the HFile TTL to zero. Also disable compaction so that reference files are not compacted away. Alternatively, load enough data that compaction doesn't finish during the test.

{code:xml}
<property>
  <name>hbase.master.cleaner.interval</name>
  <value>1000</value>
</property>
<property>
  <name>hbase.catalogjanitor.interval</name>
  <value>1000</value>
</property>
<property>
  <name>hbase.master.hfilecleaner.ttl</name>
  <value>0</value>
</property>
<property>
  <name>hbase.regionserver.compaction.enabled</name>
  <value>false</value>
</property>
{code}

Then run this code in the HBase shell.

{code:ruby}
# Create test table and write some data
create 't', 'd'
10.times do |i|
  put 't', i, 'd:foo', '_' * 1024
end

# Split in the middle and take the snapshot
split 't', '5'
snapshot 't', 's'

# Drop the table and clone it from the snapshot
disable 't'
drop 't'
clone_snapshot 's', 't'

# Immediately delete the snapshot
delete_snapshot 's'

# Try disabling and re-enabling the table
sleep 2
disable 't'
enable 't'
# java.io.FileNotFoundException: HFileLink locations=[...]
{code}

h2. What actually happens

The user clones a table from a snapshot containing split regions and reference files.

{noformat}
snapshot.RestoreSnapshotHelper: clone region=a23be88470c13611f6f24f20e0cf00ed as a23be88470c13611f6f24f20e0cf00ed in snapshot s
...
regionserver.HRegion: creating {ENCODED => a23be88470c13611f6f24f20e0cf00ed, NAME => 't,40000000,1738562472443.a23be88470c13611f6f24f20e0cf00ed.', STARTKEY => '40000000', ENDKEY => '80000000', OFFLINE => true, SPLIT => true}, tableDescriptor='t', {TABLE_ATTRIBUTES => {METADATA => {'hbase.store.file-tracker.impl' => 'DEFAULT'}}}, {NAME => 'd', INDEX_BLOCK_ENCODING => 'NONE', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', COMPRESSION => 'NONE', BLOCKCACHE => 'true', BLOCKSIZE => '65536 B (64KB)'}, regionDir=file:/Users/jg/github/hbase/tmp/hbase
...
snapshot.RestoreSnapshotHelper: finishing restore table regions using snapshot=name: "s"
{noformat}

The user then deletes the snapshot.

{noformat}
snapshot.SnapshotManager: Deleting snapshot: s
{noformat}

After a while, CatalogJanitor garbage-collects the split parents because it sees no daughter information in the meta table.

{noformat}
janitor.CatalogJanitor: Cleaning parent region {ENCODED => a23be88470c13611f6f24f20e0cf00ed, NAME => 't,40000000,1738562472443.a23be88470c13611f6f24f20e0cf00ed.', STARTKEY => '40000000', ENDKEY => '80000000', OFFLINE => true, SPLIT => true}
janitor.CatalogJanitor: Deleting region a23be88470c13611f6f24f20e0cf00ed because daughters -- null, null -- no longer hold references
{noformat}

(Note the {{null, null}} part.)

This causes the HFileLinks to be archived.

{noformat}
backup.HFileArchiver: Archived from FileablePath, file:/Users/jg/github/hbase/tmp/hbase/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/t=a23be88470c13611f6f24f20e0cf00ed-ecdd6aa22a6146599467839c56767522 to file:/Users/jg/github/hbase/tmp/hbase/archive/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/t=a23be88470c13611f6f24f20e0cf00ed-ecdd6aa22a6146599467839c56767522
{noformat}

And the cleaners unanimously agree to delete the original HFile.
* We already deleted the snapshot, so SnapshotHFileCleaner won't complain.
* Because the HFileLink has been archived, HFileLinkCleaner won't complain.

So the HFile is deleted before the daughter regions have had a chance to rewrite its data through compaction.

{noformat}
cleaner.HFileCleaner: Removing file:/Users/jg/github/hbase/tmp/hbase/archive/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522
{noformat}

The result is data loss.

{noformat}
regionserver.CompactSplit: Compaction selection failed region=t,6,1738562566034.8ad3785a3afe89e59b72db5d5d3a1bf5., storeName=8ad3785a3afe89e59b72db5d5d3a1bf5/d, priority=14, startTime=1738562622689
java.io.FileNotFoundException: HFileLink locations=[
  file:/Users/jg/github/hbase/tmp/hbase/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522,
  file:/Users/jg/github/hbase/tmp/hbase/archive/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522,
  file:/Users/jg/github/hbase/tmp/hbase/.tmp/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522,
  file:/Users/jg/github/hbase/tmp/hbase/mobdir/data/default/t/a23be88470c13611f6f24f20e0cf00ed/d/ecdd6aa22a6146599467839c56767522
]
{noformat}

h2. Fix

Make sure to put the split information into the meta table when cloning a table. The information was already there; we just didn't use it.

Let me open pull requests on both master and branch-2.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
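
To make the chain of events easier to follow, here is a toy model in plain Ruby. It is not HBase code: {{parent_cleanable?}}, {{hfile_deletable?}}, {{simulate}}, and the region names are all hypothetical, and the two rules are simplified readings of the CatalogJanitor and cleaner behavior described in this report. It only illustrates why missing daughter information in meta, combined with a deleted snapshot, leaves nothing protecting the original HFile.

```ruby
require 'set'

# Toy model only. All names are hypothetical; none of this is HBase API.

# Simplified janitor rule: a split parent may be cleaned once none of the
# daughters recorded in meta still holds a reference file to it.
# With no daughters recorded ("null, null"), the check passes vacuously.
def parent_cleanable?(daughters_in_meta, regions_with_references)
  daughters_in_meta.none? { |d| regions_with_references.include?(d) }
end

# Simplified cleaner rule: an archived HFile may be removed when neither a
# snapshot nor a live HFileLink still points at it.
def hfile_deletable?(snapshot_exists:, link_alive:)
  !snapshot_exists && !link_alive
end

# Runs the chain: janitor decision, then link archival, then cleaner decision.
def simulate(daughters_in_meta:, snapshot_exists:)
  # In the cloned table both daughters still rely on reference files.
  regions_with_references = Set['daughter_a', 'daughter_b']

  parent_cleaned = parent_cleanable?(daughters_in_meta, regions_with_references)
  # Cleaning the parent archives its HFileLink, so the link no longer protects the file.
  link_alive = !parent_cleaned

  if hfile_deletable?(snapshot_exists: snapshot_exists, link_alive: link_alive)
    :hfile_deleted
  else
    :hfile_kept
  end
end

# Reported scenario: no daughters in meta, snapshot already deleted.
p simulate(daughters_in_meta: [], snapshot_exists: false)

# Proposed fix: daughters recorded in meta when the table is cloned.
p simulate(daughters_in_meta: %w[daughter_a daughter_b], snapshot_exists: false)
```

In this model the first call yields {{:hfile_deleted}} and the second {{:hfile_kept}}. Keeping the snapshot around also yields {{:hfile_kept}}, which matches the observation that the problem only appears once the snapshot is deleted.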