[jira] [Created] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Cosmin Lehene (Created) (JIRA) Wed, 28 Mar 2012 11:47:52 -0700

Repeated split causes HRegionServer failures and breaks table 
--------------------------------------------------------------


                 Key: HBASE-5665
                 URL: https://issues.apache.org/jira/browse/HBASE-5665
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.92.1, 0.92.0
            Reporter: Cosmin Lehene
            Priority: Blocker


Repeated splits on large tables (two consecutive splits suffice) essentially 
"break" the table (and the cluster) beyond recovery.
The regionserver doing the split dies, and the master enters an infinite 
loop trying to assign regions whose files appear to be missing from HDFS.

The table can be disabled once. Upon trying to re-enable it, it remains in 
an intermediate state forever.

I was able to reproduce this on a smaller table consistently.

{code}
hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
{code}

Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"... ) will 
reproduce the issue almost instantly and consistently. 
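
A minimal sketch of how the parallel variant might be driven from a single 
shell session. The helper names (overlapping_split_keys, each_key_in_parallel) 
are hypothetical and not part of HBase, and it is an assumption that the 
shell's split command can be called from several threads; running the loops 
in separate shell sessions would be an equivalent setup.

```ruby
# Hypothetical helper (not from the report): builds one key list per offset,
# following the "#{x*10+1}", "#{x*10+2}" pattern described above.
def overlapping_split_keys(count, offsets = [1, 2])
  offsets.map { |off| (0..count).map { |x| (x * 10 + off).to_s } }
end

# Hypothetical helper: walks each key list in its own thread and hands every
# key to the caller's block, so the lists are processed concurrently.
def each_key_in_parallel(key_lists, &action)
  threads = key_lists.map do |keys|
    Thread.new { keys.each { |key| action.call(key) } }
  end
  threads.each(&:join)
end

# Assumed usage inside the HBase shell, against a scratch cluster only:
#   each_key_in_parallel(overlapping_split_keys(1000)) { |key| split 't1', key }
```

Generating the keys separately from issuing the splits keeps the key pattern 
easy to inspect before anything is sent to the cluster.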

{code}
2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
        at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
        at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
        at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
        at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
        at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
        at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
        at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
        ... 1 more
2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
{code}


http://hastebin.com/diqinibajo.avrasm

