[jira] [Updated] (HBASE-27579) CatalogJanitor can cause data loss due to errors during cleanMergeRegion

Bryan Beaudreault (Jira) Wed, 18 Jan 2023 20:30:05 -0800


     [ 
https://issues.apache.org/jira/browse/HBASE-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bryan Beaudreault updated HBASE-27579:
--------------------------------------
    Description: 
In CatalogJanitor.cleanMergeRegion, there is the following check:
{code:java}
HRegionFileSystem regionFs = null;
try {
  regionFs = HRegionFileSystem.openRegionFromFileSystem(this.services.getConfiguration(), fs,
    tabledir, mergedRegion, true);
} catch (IOException e) {
  LOG.warn("Merged region does not exist: " + mergedRegion.getEncodedName());
}

if (regionFs == null || !regionFs.hasReferences(htd)) {
  // ... do the cleanup ...
} {code}
 

I think the assumption here is that an IOException is thrown only when the 
region does not exist. We had a very poorly timed NameNode failover during a 
CatalogJanitor run, shortly after a merge. The failover caused the 
openRegionFromFileSystem call to fail, which logged:
{code:java}
WARN org.apache.hadoop.hbase.master.janitor.CatalogJanitor: Merged region does not exist: 32c71224852c5a4b94a3ba271b4fcb15 {code}
This region did in fact exist and had not fully compacted, so there were still 
some lingering reference files.

The cleanup process moves the parent regions to the archive directory, but the 
default TTL for those files in the archive directory is only 5 minutes. After 
that they are cleaned up and the data is now unrecoverable.

This resulted in FileNotFoundExceptions when trying to read or open the merged 
region. Our only course of action was to move the lingering reference files 
aside, so the data is unrecoverable.
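
To illustrate the failure mode, here is a minimal, self-contained sketch of one way the check could keep a transient filesystem error separate from a confirmed missing region. It is not the HBase code and not necessarily how this issue should be fixed: MergedRegionStore, safeToCleanParents, tryClean and the stub helpers are hypothetical names invented for this example, standing in for the HRegionFileSystem calls quoted above.

```java
import java.io.IOException;

// Hypothetical sketch: none of these types are HBase classes.
class MergedRegionCleanupSketch {

  /** Hypothetical view of the merged region's directory on the filesystem. */
  interface MergedRegionStore {
    boolean exists() throws IOException;        // is the region directory present?
    boolean hasReferences() throws IOException; // are reference files to the parents left?
  }

  /**
   * Decides whether the parent regions may be archived. Any IOException
   * (for example during a NameNode failover) propagates to the caller
   * instead of being read as "the merged region does not exist".
   */
  static boolean safeToCleanParents(MergedRegionStore store) throws IOException {
    if (!store.exists()) {
      return true; // confirmed missing: mirrors the original regionFs == null branch
    }
    return !store.hasReferences();
  }

  /** One janitor pass: on an I/O error, skip cleanup and leave it for a later run. */
  static boolean tryClean(MergedRegionStore store) {
    try {
      return safeToCleanParents(store);
    } catch (IOException e) {
      System.err.println("Skipping cleanup, could not inspect merged region: " + e.getMessage());
      return false;
    }
  }

  /** Stub with fixed answers, for demonstration only. */
  static MergedRegionStore fixed(boolean present, boolean refs) {
    return new MergedRegionStore() {
      @Override public boolean exists() { return present; }
      @Override public boolean hasReferences() { return refs; }
    };
  }

  /** Stub that fails every call, simulating a filesystem outage. */
  static MergedRegionStore failing() {
    return new MergedRegionStore() {
      @Override public boolean exists() throws IOException {
        throw new IOException("simulated NameNode failover");
      }
      @Override public boolean hasReferences() throws IOException {
        throw new IOException("simulated NameNode failover");
      }
    };
  }

  public static void main(String[] args) {
    System.out.println(tryClean(fixed(true, true)));   // references remain
    System.out.println(tryClean(fixed(true, false)));  // fully compacted
    System.out.println(tryClean(fixed(false, false))); // region confirmed missing
    System.out.println(tryClean(failing()));           // transient error
  }
}
```

In this sketch the parents are archived only when the merged region is confirmed missing or confirmed free of references. An error while checking leaves everything in place, so a later janitor run can retry once the filesystem is healthy again.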


> CatalogJanitor can cause data loss due to errors during cleanMergeRegion
> ------------------------------------------------------------------------
>
>                 Key: HBASE-27579
>                 URL: https://issues.apache.org/jira/browse/HBASE-27579
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Priority: Blocker
>             Fix For: 2.4.16, 2.5.3
>


