[
https://issues.apache.org/jira/browse/HBASE-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Beaudreault updated HBASE-27579:
--------------------------------------
Fix Version/s: 2.4.16, 2.5.3
> CatalogJanitor can cause data loss due to errors during cleanMergeRegion
> ------------------------------------------------------------------------
>
> Key: HBASE-27579
> URL: https://issues.apache.org/jira/browse/HBASE-27579
> Project: HBase
> Issue Type: Bug
> Reporter: Bryan Beaudreault
> Priority: Blocker
> Fix For: 2.4.16, 2.5.3
>
>
> In CatalogJanitor.cleanMergeRegion, there is the following check:
> {code:java}
> HRegionFileSystem regionFs = null;
> try {
>   regionFs = HRegionFileSystem.openRegionFromFileSystem(
>     this.services.getConfiguration(), fs, tabledir, mergedRegion, true);
> } catch (IOException e) {
>   LOG.warn("Merged region does not exist: " + mergedRegion.getEncodedName());
> }
> if (regionFs == null || !regionFs.hasReferences(htd)) {
>   .. do the cleanup ..
> }
> {code}
>
> I think the assumption here is that an IOException would only be thrown if
> the region doesn't exist. We had a very poorly timed NameNode failover during
> a CatalogJanitor run, after a merge. The failover caused the
> openRegionFromFileSystem call to fail, which logged:
> {code:java}
> WARN org.apache.hadoop.hbase.master.janitor.CatalogJanitor: Merged region does not exist: 32c71224852c5a4b94a3ba271b4fcb15
> {code}
> This region did in fact exist and had not fully compacted, so there were
> still some lingering reference files.
> The cleanup process moves the parent regions to the archive directory, but
> the default TTL for those files in the archive directory is only 5 minutes.
> After that they are cleaned up and the data is now unrecoverable.
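
The failure mode described above comes from treating every IOException as "region is gone". One way to avoid it, sketched below, is to let only a definite "not found" signal permit cleanup and to skip cleanup on any other I/O error, so the janitor retries on its next run. This is a minimal illustration and not the actual patch for this issue: `RegionView`, `RegionOpener`, and `safeToCleanParents` are hypothetical stand-ins that are not HBase types, and using `FileNotFoundException` as the "does not exist" signal is an assumption.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

final class MergeCleanupSketch {

    /** Hypothetical stand-in for the merged region's on-disk view (not an HBase type). */
    interface RegionView {
        boolean hasReferences() throws IOException;
    }

    /** Hypothetical stand-in for the call that opens the merged region. */
    interface RegionOpener {
        RegionView open() throws IOException;
    }

    /**
     * Returns true only when it is known that the parents are no longer needed:
     * either the merged region is definitely absent, or it was opened and
     * holds no reference files. Any other I/O failure yields false.
     */
    static boolean safeToCleanParents(RegionOpener opener) {
        RegionView region;
        try {
            region = opener.open();
        } catch (FileNotFoundException e) {
            // Assumption: this exception type is the "region does not exist" signal.
            return true;
        } catch (IOException e) {
            // Transient failure (for example a NameNode failover): state is unknown,
            // so do not clean up; a later run can try again.
            return false;
        }
        try {
            return !region.hasReferences();
        } catch (IOException e) {
            // Could not check for references, so be conservative.
            return false;
        }
    }
}
```

With this shape, a failed open during a failover results in the cleanup being skipped for that run; the parents are not archived, so the archive TTL never starts counting against data that is still referenced.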