[ 
https://issues.apache.org/jira/browse/HBASE-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault updated HBASE-27579:
--------------------------------------
    Status: Patch Available  (was: Open)

I decided not to touch openRegionFromFileSystem. Instead I reused the existing, 
robust checkDaughterInFs method, which is used for cleaning up splits. I made 
it a little more generic and updated cleanMergeRegion to use it. This way the 
two cleaners share the same reference-checking logic, which should be easier 
to maintain.

I was thinking about the FileNotFoundException idea. Since this code is highly 
destructive, I'd rather not assume that a FileNotFoundException is for the 
region we're checking. Today, openRegionFromFileSystem is a simple method and 
we can be reasonably certain that an FNFE would be related to the region dir. 
But that may change in the future if some other call is added that can also 
throw FNFE. So I'd rather be defensive here and treat any exception as "let's 
wait and try again later".
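
To illustrate the defensive approach described above, here is a minimal, 
self-contained sketch. It is not the actual patch: RegionState, 
ReferenceChecker and canCleanParents are hypothetical names, and the assumption 
is that the shared check reports "missing" as an explicit result rather than 
through an exception.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

class SafeMergeCleanup {

    /** Hypothetical outcome of inspecting the merged region's directory. */
    enum RegionState { MISSING, HAS_REFERENCES, NO_REFERENCES }

    /** Hypothetical stand-in for the shared reference check; any filesystem call may fail. */
    interface ReferenceChecker {
        RegionState check(String encodedRegionName) throws IOException;
    }

    /**
     * Returns true only when the check completed and positively reported that
     * the merged region is gone or holds no references to its parents.
     */
    static boolean canCleanParents(ReferenceChecker checker, String mergedRegion) {
        RegionState state;
        try {
            state = checker.check(mergedRegion);
        } catch (IOException e) {
            // Covers FileNotFoundException too: we cannot be sure which path it
            // refers to, so skip the destructive cleanup and retry on the next run.
            System.out.println("Could not check " + mergedRegion + ", will retry later: " + e);
            return false;
        }
        return state != RegionState.HAS_REFERENCES;
    }

    public static void main(String[] args) {
        String region = "example-region"; // illustrative name only

        // The check succeeded and found lingering reference files: keep the parents.
        System.out.println(canCleanParents(r -> RegionState.HAS_REFERENCES, region));

        // The check succeeded and found no references: safe to clean.
        System.out.println(canCleanParents(r -> RegionState.NO_REFERENCES, region));

        // The check failed (for example during a NameNode failover): do nothing.
        System.out.println(canCleanParents(r -> {
            throw new FileNotFoundException("simulated failure");
        }, region));
    }
}
```

The point of the sketch is that "the region does not exist" is something the 
check has to say explicitly; a failure to check is never read as permission to 
clean up.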

> CatalogJanitor can cause data loss due to errors during cleanMergeRegion
> ------------------------------------------------------------------------
>
>                 Key: HBASE-27579
>                 URL: https://issues.apache.org/jira/browse/HBASE-27579
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Blocker
>             Fix For: 2.4.16, 2.5.3
>
>
> In CatalogJanitor.cleanMergeRegion, there is the following check:
> {code:java}
> HRegionFileSystem regionFs = null;
> try {
>   regionFs = HRegionFileSystem.openRegionFromFileSystem(
>     this.services.getConfiguration(), fs, tabledir, mergedRegion, true);
> } catch (IOException e) {
>   LOG.warn("Merged region does not exist: " + mergedRegion.getEncodedName());
> }
> if (regionFs == null || !regionFs.hasReferences(htd)) {
>   .. do the cleanup ..
> }
> {code}
>  
> I think the assumption here is that an IOException would only be thrown if 
> the region doesn't exist? We had a very poorly timed NameNode failover during 
> a CatalogJanitor run, after a merge. The NameNode failover caused the 
> openRegionFromFileSystem call to fail, which logged:
> {code:java}
> WARN org.apache.hadoop.hbase.master.janitor.CatalogJanitor: Merged region does not exist: 32c71224852c5a4b94a3ba271b4fcb15
> {code}
> This region did in fact exist and had not fully compacted, so there were 
> still some lingering reference files.
> The cleanup process moves the parent regions to the archive directory, but 
> the default TTL for those files in the archive directory is only 5 minutes. 
> After that they are cleaned up and the data is now unrecoverable.
> This resulted in FileNotFoundExceptions when trying to read or open this 
> region. Our only course of action was to move the lingering reference files 
> aside, accepting that the data is unrecoverable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
