[ 
https://issues.apache.org/jira/browse/HBASE-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault reassigned HBASE-27579:
-----------------------------------------

    Assignee: Bryan Beaudreault

> CatalogJanitor can cause data loss due to errors during cleanMergeRegion
> ------------------------------------------------------------------------
>
>                 Key: HBASE-27579
>                 URL: https://issues.apache.org/jira/browse/HBASE-27579
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Blocker
>             Fix For: 2.4.16, 2.5.3
>
>
> In CatalogJanitor.cleanMergeRegion, there is the following check:
> {code:java}
> HRegionFileSystem regionFs = null;
> try {
>   regionFs = HRegionFileSystem.openRegionFromFileSystem(
>     this.services.getConfiguration(), fs, tabledir, mergedRegion, true);
> } catch (IOException e) {
>   LOG.warn("Merged region does not exist: " + mergedRegion.getEncodedName());
> }
> if (regionFs == null || !regionFs.hasReferences(htd)) {
>   // .. do the cleanup ..
> } {code}
>  
> I think the assumption here is that an IOException would only be thrown if 
> the region doesn't exist. We had a very poorly timed NameNode failover 
> during a CatalogJanitor run, shortly after a merge. The failover caused the 
> openRegionFromFileSystem call to fail, which logged:
> {code:java}
> WARN org.apache.hadoop.hbase.master.janitor.CatalogJanitor: Merged region does not exist: 32c71224852c5a4b94a3ba271b4fcb15 {code}
> This region did in fact exist and had not been fully compacted, so it still 
> had some lingering reference files.
> The cleanup process moves the parent regions to the archive directory, but 
> the default TTL for files in the archive directory is only 5 minutes. 
> After that they are deleted and the data is unrecoverable.
> This resulted in FileNotFoundExceptions when trying to read or open this 
> region. Our only course of action was to move the lingering reference files 
> aside; the data they referred to could not be recovered.
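
To make the failure mode above concrete, here is a minimal, self-contained sketch of the general idea: run the cleanup only when the state of the merged region is positively known, and treat any other I/O error as "unknown, retry on a later run". This is not the actual HBASE-27579 patch and does not use HBase classes. MergedRegionProbe, safeToCleanParents and probe are hypothetical names introduced only for illustration.

```java
import java.io.IOException;

public class MergeCleanupSketch {

  /** Hypothetical stand-in for the filesystem view of a merged region (not an HBase API). */
  interface MergedRegionProbe {
    boolean exists() throws IOException;
    boolean hasReferences() throws IOException;
  }

  /**
   * Returns true only when it is positively known that the parent regions are
   * no longer needed: the merged region is confirmed absent, or it is present
   * and confirmed to hold no reference files.
   */
  static boolean safeToCleanParents(MergedRegionProbe probe) {
    try {
      if (!probe.exists()) {
        return true; // confirmed absent, same outcome as the original check
      }
      return !probe.hasReferences();
    } catch (IOException e) {
      // State is unknown (for example during a NameNode failover):
      // skip the cleanup and let a later janitor run retry.
      return false;
    }
  }

  /** Hypothetical helper: builds a probe with fixed answers, or one that always fails. */
  static MergedRegionProbe probe(boolean exists, boolean hasRefs, IOException failure) {
    return new MergedRegionProbe() {
      @Override
      public boolean exists() throws IOException {
        if (failure != null) {
          throw failure;
        }
        return exists;
      }

      @Override
      public boolean hasReferences() throws IOException {
        if (failure != null) {
          throw failure;
        }
        return hasRefs;
      }
    };
  }

  public static void main(String[] args) {
    System.out.println("absent region:       " + safeToCleanParents(probe(false, false, null)));
    System.out.println("present, with refs:  " + safeToCleanParents(probe(true, true, null)));
    System.out.println("present, no refs:    " + safeToCleanParents(probe(true, false, null)));
    System.out.println("transient I/O error: "
        + safeToCleanParents(probe(true, true, new IOException("simulated failover"))));
  }
}
```

The design choice in this sketch is that a failed check is never read as "region does not exist". A delayed cleanup only costs disk space until the next run, whereas a premature one starts the archive TTL on files that are still referenced.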



--
This message was sent by Atlassian Jira
(v8.20.10#820010)