[
https://issues.apache.org/jira/browse/HBASE-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Manning resolved HBASE-26722.
-----------------------------------
Resolution: Duplicate
> Snapshot is corrupted due to interaction between move, warmupRegion,
> compaction, and HFileArchiver
> --------------------------------------------------------------------------------------------------
>
> Key: HBASE-26722
> URL: https://issues.apache.org/jira/browse/HBASE-26722
> Project: HBase
> Issue Type: Bug
> Components: Compaction, mover, snapshots
> Affects Versions: 2.0.0, 1.3.5
> Reporter: David Manning
> Priority: Critical
>
> There is an interesting sequence of events that leads to split-brain,
> double-assignment-like behavior in the management of store files.
> The scenario is this:
> # Take snapshot
> # RegionX of snapshotted table is hosted on RegionServer1.
> # Stop RegionServer1, using {{region_mover}}, gracefully moving all regions
> to other regionservers using {{move}} RPCs.
> # RegionX is now opened on RegionServer2.
> # RegionServer2 compacts RegionX after opening.
> # RegionServer1 starts and uses {{region_mover}} to {{move}} all previously
> owned regions back to itself.
> # The HMaster RPC to {{move}} calls {{warmupRegion}} on RegionServer1.
> # As part of {{warmupRegion}}, RegionServer1 opens all store files of
> RegionX. The {{CompactedHFilesDischarger}} chore has not yet archived the
> pre-compacted store file. RegionServer1 finds both the pre-compacted store
> file and post-compacted store file. It logs a warning and archives the
> pre-compacted file.
> # RegionServer1 has warmed up the region, so now HMaster resumes the {{move}}
> and sends {{close}} RegionX to RegionServer2.
> # RegionServer2 closes its store files. As part of this, it archives any
> compacted files which have not yet been archived by the
> {{CompactedHFilesDischarger}} chore.
> # Since RegionServer1 already archived the file, RegionServer2's
> {{HFileArchiver}} finds the destination archive file already exists. (code
> link)
> # RegionServer2 renames the archived file, to free up the desired destination
> filename.
> # With the archived file renamed, RegionServer2 attempts to archive the file as
> planned. But the source file doesn't exist because RegionServer1 already
> moved it... to the location RegionServer2 expected to use!
> # RegionServer2 silently ignores this archival failure. (code link)
> # HMaster {{HFileCleaner}} chore later deletes the renamed archive file,
> because there is no active reference to it. (The snapshot reference is to the
> original named file, not the "backup" timestamped version.) The snapshot data
> is irretrievably lost.
> HBASE-26718 tracks a potential, specific change to the archival process to
> avoid this specific issue.
> However, there is a more fundamental problem here: a region opened by
> {{warmupRegion}} can operate on that region's store files while the region is
> open elsewhere, which must not be allowed.
> This was seen on branch-1, and is a combination of HBASE-22330 and not having
> the fix for HBASE-22163.
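
The archival race in the numbered scenario above can be modeled with a small local-filesystem simulation. This is only a sketch, not HBase code: the functions `archive`, `clean_archive`, and `simulate`, and the file names `hfile_old` and `hfile_new`, are hypothetical stand-ins. It assumes the behavior described in the report (rename an existing archive destination to a timestamped backup, then silently ignore a missing source).

```python
import os
import tempfile
import time


def archive(src, archive_dir, log):
    """Toy model of the archival step described in the report (not HBase code)."""
    name = os.path.basename(src)
    dest = os.path.join(archive_dir, name)
    if os.path.exists(dest):
        # Destination already exists: move it aside under a timestamped name.
        backup = f"{dest}.{time.time_ns()}"
        os.rename(dest, backup)
        log.append(f"renamed existing archive {name} to a timestamped backup")
    try:
        os.rename(src, dest)
        log.append(f"archived {name}")
    except FileNotFoundError:
        # Mirrors the silently ignored archival failure in the scenario.
        log.append(f"source {name} missing; failure ignored")


def clean_archive(archive_dir, referenced):
    """Toy cleaner: delete archive files that nothing references by name."""
    for name in os.listdir(archive_dir):
        if name not in referenced:
            os.remove(os.path.join(archive_dir, name))


def simulate():
    log = []
    with tempfile.TemporaryDirectory() as root:
        store = os.path.join(root, "store")
        archive_dir = os.path.join(root, "archive")
        os.makedirs(store)
        os.makedirs(archive_dir)

        # Pre-compaction file, referenced by the snapshot under its original name.
        old = os.path.join(store, "hfile_old")
        with open(old, "w") as f:
            f.write("snapshot data")
        # Post-compaction file written by the second server.
        with open(os.path.join(store, "hfile_new"), "w") as f:
            f.write("compacted data")

        archive(old, archive_dir, log)  # first server, during warmup
        archive(old, archive_dir, log)  # second server, during region close

        before = sorted(os.listdir(archive_dir))
        clean_archive(archive_dir, referenced={"hfile_old"})
        after = sorted(os.listdir(archive_dir))
    return before, after, log


if __name__ == "__main__":
    before, after, log = simulate()
    print("\n".join(log))
    print("archive before cleaner:", before)
    print("archive after cleaner:", after)
```

In this model the only surviving copy ends up under the timestamped backup name, so a cleaner that checks references by the original name removes it, matching the data loss described in the final step of the scenario.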
--
This message was sent by Atlassian Jira
(v8.20.1#820001)