[
https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241586#comment-13241586
]
Cosmin Lehene commented on HBASE-5665:
--------------------------------------
Indeed, it seems to be a problem with forced splits. I'm not sure, though, whether
natural splits are safe - they seem to be, but I need to test that too.
RegionSplitPolicy.getSplitPoint() calls Store.getSplitPoint(), and
Store.getSplitPoint() seems to do the reference check:
{code}
for (StoreFile sf : storefiles) {
  if (sf.isReference()) {
    // Should already be enforced since we return false in this case
    assert false : "getSplitPoint() called on a region that can't split!";
    return null;
  }
  // ...
}
{code}
BTW, we also have Store.hasReferences()
{code}
private boolean hasReferences(Collection<StoreFile> files) {
  if (files != null && files.size() > 0) {
    for (StoreFile hsf: files) {
      if (hsf.isReference()) {
        return true;
      }
    }
  }
  return false;
}
{code}
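Just to spell out what that check catches - a standalone illustration, not HBase code: for this discussion a reference store file is simply one named <hfile>.<parent encoded region>, like the 7237.94e3 example further down, so a '.' in the name is roughly what isReference()/hasReferences() end up flagging.
{code}
import java.util.Arrays;
import java.util.List;

// Standalone illustration (not HBase code): a reference store file is named
// <hfile>.<parent encoded region name>, so spotting a '.' in the name is
// roughly what isReference()/hasReferences() boil down to here.
public class HasReferencesSketch {
  static boolean hasReferences(List<String> storeFileNames) {
    for (String name : storeFileNames) {
      if (name.contains(".")) { // e.g. "7237.94e3": half-file reference to parent 94e3
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(hasReferences(Arrays.asList("7237")));      // false: plain HFile
    System.out.println(hasReferences(Arrays.asList("7237.94e3"))); // true: still a reference
  }
}
{code}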
However, here's the code in HRegion.checkSplit(): if there's an explicit split
point, it never gets to the reference check.
{code}
public byte[] checkSplit() {
  // Can't split META
  if (getRegionInfo().isMetaRegion()) {
    if (shouldForceSplit()) {
      LOG.warn("Cannot split meta regions in HBase 0.20 and above");
    }
    return null;
  }
  if (this.explicitSplitPoint != null) {
    return this.explicitSplitPoint;
  }
  if (!splitPolicy.shouldSplit()) {
    return null;
  }
  byte[] ret = splitPolicy.getSplitPoint();
  if (ret != null) {
    try {
      checkRow(ret, "calculated split");
    } catch (IOException e) {
      LOG.error("Ignoring invalid split", e);
      return null;
    }
  }
  return ret;
}
{code}
Multiple return points + a ret variable - this could use some polishing too :)
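For what it's worth, here's roughly the shape I'd aim for - a sketch only, not a patch: it assumes the region's stores map is reachable here and that Store exposes a hasReferences()-style check to its callers, which may not match the current visibility.
{code}
// Sketch: do the reference check before honoring an explicit split point.
// Assumes access to the region's stores and a hasReferences()-style accessor
// on Store; treat both as illustration, not the existing API.
public byte[] checkSplit() {
  // Can't split META
  if (getRegionInfo().isMetaRegion()) {
    if (shouldForceSplit()) {
      LOG.warn("Cannot split meta regions in HBase 0.20 and above");
    }
    return null;
  }
  // Never hand out a split point while any store still holds references
  // to a parent region - this covers forced/explicit splits as well.
  for (Store store : stores.values()) {
    if (store.hasReferences()) {
      return null;
    }
  }
  byte[] ret = (this.explicitSplitPoint != null) ? this.explicitSplitPoint
      : (splitPolicy.shouldSplit() ? splitPolicy.getSplitPoint() : null);
  if (ret != null) {
    try {
      checkRow(ret, "calculated split");
    } catch (IOException e) {
      LOG.error("Ignoring invalid split", e);
      ret = null;
    }
  }
  return ret;
}
{code}
That way a forced split with an explicit split point hits the same reference guard as a natural one.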
I'm a bit puzzled about the natural split, because I've seen the problem with
a forced split from the UI, where I don't think we provide an explicit split point.
Cosmin
> Repeated split causes HRegionServer failures and breaks table
> --------------------------------------------------------------
>
> Key: HBASE-5665
> URL: https://issues.apache.org/jira/browse/HBASE-5665
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.92.0, 0.92.1, 0.94.0, 0.96.0, 0.94.1
> Reporter: Cosmin Lehene
> Assignee: Cosmin Lehene
> Priority: Blocker
> Attachments: HBASE-5665-0.92.patch
>
>
> Repeated splits on large tables (2 consecutive would suffice) will
> essentially "break" the table (and the cluster), unrecoverably.
> The regionserver doing the split dies and the master will get into an
> infinite loop trying to assign regions that seem to have the files missing
> from HDFS.
> The table can be disabled once; upon trying to re-enable it, it will remain
> in an intermediate state forever.
> I was able to reproduce this on a smaller table consistently.
> {code}
> hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
> hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
> {code}
> Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"... )
> will reproduce the issue almost instantly and consistently.
> {code}
> 2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
> Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in
> META
> 2012-03-28 10:57:16,321 DEBUG
> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for
> t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..
> compaction_queue=(0:1), split_queue=10
> 2012-03-28 10:57:16,343 INFO
> org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
> of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.;
> Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
> java.io.IOException: Failed
> ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
> at
> org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
> at
> org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
> at
> org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.FileNotFoundException: File does not exist:
> /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
> at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
> at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
> at
> org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
> at
> org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
> at
> org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
> at
> org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
> at
> org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
> at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
> at
> org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
> at
> org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
> at
> org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
> ... 1 more
> 2012-03-28 10:57:16,345 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
> {code}
> http://hastebin.com/diqinibajo.avrasm
> later edit:
> (I'm using the last 4 characters from each string)
> Region 94e3 has storefile 7237
> Region 94e3 gets split into daughters a: ffa1 and b: eee1
> Daughter region ffa1 gets split into daughters a: 3124 and b: dc77
> ffa1 has a reference, 7237.94e3, for its store file
> when ffa1 gets split, it will create another reference: 7237.94e3.ffa1
> when SplitTransaction executes (openDaughters above), it will try to open that
> reference and will match it from left to right as [storefile].[region]
> {code}
> "^([0-9a-f]+)(?:\\.(.+))?$"
> {code}
> and will attempt to go to /hbase/t1/[region], which resolves to
> /hbase/t1/94e3.ffa1/f1/7237 - a path that obviously doesn't exist, so it fails.
> This seems like a design problem: we should either stop splitting if the
> path is a reference, or be able to recursively resolve reference paths (e.g.
> parse right to left: 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open
> /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/f1/7237).
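> A rough sketch of that right-to-left resolution (standalone illustration only -
> the helper and the path layout here are assumptions for this example, not the
> StoreFile/Reference API):
> {code}
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> // Standalone illustration (not HBase code): resolve a chained reference name
> // right to left until a plain store file name is reached.
> public class ReferenceResolveSketch {
>   static Path resolve(Path tableDir, String region, String family, String fileName) {
>     int dot = fileName.lastIndexOf('.');
>     if (dot < 0) {
>       // Plain store file: <table>/<region>/<family>/<file>
>       return tableDir.resolve(region).resolve(family).resolve(fileName);
>     }
>     // Reference: the part after the last dot names the parent region, the part
>     // before it is the file inside that parent (possibly itself a reference).
>     return resolve(tableDir, fileName.substring(dot + 1), family,
>         fileName.substring(0, dot));
>   }
>
>   public static void main(String[] args) {
>     // 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> [7237].94e3 -> /hbase/t1/94e3/f1/7237
>     System.out.println(resolve(Paths.get("/hbase/t1"), "3124", "f1", "7237.94e3.ffa1"));
>   }
> }
> {code}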