[jira] [Commented] (HBASE-27579) CatalogJanitor can cause data loss due to errors during cleanMergeRegion

Duo Zhang (Jira) Wed, 18 Jan 2023 23:07:05 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-27579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17678513#comment-17678513
 ]


Duo Zhang commented on HBASE-27579:
-----------------------------------

It was like this when we first implemented online merge...

https://github.com/apache/hbase/commit/c7309e82efb7d6ff90d8bb891f0cd9657bae518b#diff-4b94386c60fff83c8da469ac22c81dbaf7edb3431ec9e256c3d2f82f10ac2649

In HRegionFileSystem we have this

{code}
  public static HRegionFileSystem openRegionFromFileSystem(final Configuration 
conf,
      final FileSystem fs, final Path tableDir, final HRegionInfo regionInfo, 
boolean readOnly)
      throws IOException {
    HRegionFileSystem regionFs = new HRegionFileSystem(conf, fs, tableDir, 
regionInfo);
    Path regionDir = regionFs.getRegionDir();

    if (!fs.exists(regionDir)) {
      LOG.warn("Trying to open a region that do not exists on disk: " + 
regionDir);
      throw new IOException("The specified region do not exists on disk: " + 
regionDir);
    }

    if (readOnly) {
      // Cleanup temporary directories
      regionFs.cleanupTempDir();
      regionFs.cleanupSplitsDir();
      regionFs.cleanupMergesDir();

      // if it doesn't exists, Write HRI to a file, in case we need to recover 
.META.
      regionFs.checkRegionInfoOnFilesystem();
    }

    return regionFs;
  }
{code}

I think the intention here is to check the IOException thrown manually, but the 
problem is other file system operations could also throw IOException...

I think a proper is to throw FileNotFoundException in the above code, and in 
the upper layer we check for FileNotFound exception only, for other errors, we 
just fail the cleanup operation, or at least, skip cleaning for this region.

WDYT?

Thanks.

> CatalogJanitor can cause data loss due to errors during cleanMergeRegion
> ------------------------------------------------------------------------
>
>                 Key: HBASE-27579
>                 URL: https://issues.apache.org/jira/browse/HBASE-27579
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Priority: Blocker
>             Fix For: 2.4.16, 2.5.3
>
>
> In CatalogJanitor.cleanMergeRegion, there is the following check:
> {code:java}
> HRegionFileSystem regionFs = null;
> try {
>   regionFs = 
> HRegionFileSystem.openRegionFromFileSystem(this.services.getConfiguration(), 
> fs,
>     tabledir, mergedRegion, true);
> } catch (IOException e) {
>   LOG.warn("Merged region does not exist: " + mergedRegion.getEncodedName());
> }
> if (regionFs == null || !regionFs.hasReferences(htd)) {
>  .. do the cleanup ..
> } {code}
>  
> I think the assumption here is that an IOException would only be thrown if a 
> region doesn't exist? We had a very poorly timed NameNode failover, during 
> CatalogJanitor run, after a merge. The NameNode failover caused the 
> openRegionFromFileSystem call to fail, which logged:
> {code:java}
> WARN org.apache.hadoop.hbase.master.janitor.CatalogJanitor: Merged region 
> does not exist: 32c71224852c5a4b94a3ba271b4fcb15 {code}
> This region did in fact exist and had not fully compacted, so there were 
> still some lingering reference files.
> The cleanup process moves the parent regions to the archive directory, but 
> the default TTL for those files in the archive directory is only 5 minutes. 
> After that they are cleaned up and the data is now unrecoverable.
> This resulted in FileNotFoundExceptions trying to read or open this region. 
> Our only course of action was to move the lingering reference files aside, so 
> the data is unrecoverable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (HBASE-27579) CatalogJanitor can cause data loss due to errors during cleanMergeRegion

Reply via email to