[ 
https://issues.apache.org/jira/browse/HBASE-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2244:
-------------------------

    Attachment: 2244-v5.patch

Can't add unit test for this explicit case.  The actors are the BaseScanner up 
in Master and a failing split over in a RegionServer.  The split region needs 
to be completely coherent; that is, after split it has to have references in 
place else fixup won't cut in (that there are references and there is no entry 
for a daughter is how we identify this pathology).  Then we need to have 
regionserver fail at an explicit location in a cluster test, after offlining 
parent but but somewhere in the
middle of adding the daughters to meta.  Our current testing vocabulary doesn't 
allow us do this kind of detailed fail (Needs to be fixed but not as part of 
this patch).

So, I can see the repair cutting in if I doctor the RegionServer with the 
following patch to CompactSplitThread:

{code}
Index: src/java/org/apache/hadoop/hbase/regionserver/CompactSplitThread.java
===================================================================
--- src/java/org/apache/hadoop/hbase/regionserver/CompactSplitThread.java       
(revision 915869)
+++ src/java/org/apache/hadoop/hbase/regionserver/CompactSplitThread.java       
(working copy)
@@ -199,6 +199,13 @@
         Writables.getBytes(newRegions[1].getRegionInfo()));
     t.put(put);
     
+    // If we crash here, then the daughters will not be added and we'll have
+    // and offlined parent but no daughters to take up the slack.  hbase-2244
+    // adds fixup to the metascanners.
+
+    // REMOVE
+    if (true) throw new IOException("Fail to add other regions and to send 
split to master");
+    
     // Add new regions to META
     for (int i = 0; i < newRegions.length; i++) {
       put = new Put(newRegions[i].getRegionName());

{code}

Notice the throw IOE after parent has been offlined.  This makes it so meta 
does not get daughter entries and the master does not get split message.

In TestForceSplit, you'll see the following:

{code}
2010-02-24 19:45:16,685 ERROR [RegionServer:0.compactor] 
regionserver.CompactSplitThread(104): Compaction/Split failed for region 
test,,1267069500020
java.io.IOException: Fail to add other regions and to send split to master
        at 
org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:207)
        at 
org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:95)
2010-02-24 19:45:17,101 DEBUG [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(72): Cycle waiting on split
2010-02-24 19:45:18,105 DEBUG [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(72): Cycle waiting on split
2010-02-24 19:45:19,111 DEBUG [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(72): Cycle waiting on split
2010-02-24 19:45:20,117 DEBUG [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(72): Cycle waiting on split
2010-02-24 19:45:21,122 DEBUG [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(72): Cycle waiting on split
2010-02-24 19:45:22,127 DEBUG [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(72): Cycle waiting on split
2010-02-24 19:45:23,133 DEBUG [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(72): Cycle waiting on split
2010-02-24 19:45:24,140 DEBUG [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(72): Cycle waiting on split
2010-02-24 19:45:24,655 INFO  [Thread-103] 
master.ServerManager$ServerMonitor(129): 1 region servers, 0 dead, average load 
2.0
2010-02-24 19:45:24,920 INFO  [RegionManager.metaScanner] 
master.BaseScanner(163): RegionManager.metaScanner scanning meta region 
{server: 127.0.0.1:59090, regionname: .META.,,1, startKey: <>}
2010-02-24 19:45:24,921 INFO  [RegionManager.rootScanner] 
master.BaseScanner(163): RegionManager.rootScanner scanning meta region 
{server: 127.0.0.1:59090, regionname: -ROOT-,,0, startKey: <>}
2010-02-24 19:45:24,941 INFO  [RegionManager.rootScanner] 
master.BaseScanner(242): RegionManager.rootScanner scan of 1 row(s) of meta 
region {server: 127.0.0.1:59090, regionname: -ROOT-,,0, startKey: <>} complete
2010-02-24 19:45:24,942 WARN  [RegionManager.metaScanner] 
master.BaseScanner(379): Fixup broke split: Add missing split daughter to meta, 
daughter=REGION => {NAME => 'test,,1267069516539', STARTKEY => '', ENDKEY => 
'lqg', ENCODED => 904299994, TABLE => {{NAME => 'test', FAMILIES => [{NAME => 
'a', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => 
'65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}, parent=REGION => {NAME 
=> 'test,,1267069500020', STARTKEY => '', ENDKEY => '', ENCODED => 970247455, 
OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'test', FAMILIES => [{NAME 
=> 'a', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE 
=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
2010-02-24 19:45:24,947 WARN  [RegionManager.metaScanner] 
master.BaseScanner(379): Fixup broke split: Add missing split daughter to meta, 
daughter=REGION => {NAME => 'test,lqg,1267069516539', STARTKEY => 'lqg', ENDKEY 
=> '', ENCODED => 1813617937, TABLE => {{NAME => 'test', FAMILIES => [{NAME => 
'a', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => 
'65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}, parent=REGION => {NAME 
=> 'test,,1267069500020', STARTKEY => '', ENDKEY => '', ENCODED => 970247455, 
OFFLINE => true, SPLIT => true, TABLE => {{NAME => 'test', FAMILIES => [{NAME 
=> 'a', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE 
=> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
2010-02-24 19:45:24,948 INFO  [RegionManager.metaScanner] 
master.BaseScanner(242): RegionManager.metaScanner scan of 1 row(s) of meta 
region {server: 127.0.0.1:59090, regionname: .META.,,1, startKey: <>} complete
2010-02-24 19:45:24,949 INFO  [RegionManager.metaScanner] 
master.MetaScanner(132): All 1 .META. region(s) scanned
2010-02-24 19:45:25,145 DEBUG [WaitOnSplit] 
client.HConnectionManager$TableServers(780): Cache hit for row <> in tableName 
.META.: location server 127.0.0.1:59090, location region name .META.,,1
2010-02-24 19:45:25,149 DEBUG [WaitOnSplit] client.HTable$ClientScanner(1832): 
Creating scanner over .META. starting at key ''
2010-02-24 19:45:25,151 DEBUG [WaitOnSplit] client.HTable$ClientScanner(1928): 
Advancing internal scanner to startKey at ''
2010-02-24 19:45:25,151 DEBUG [WaitOnSplit] 
client.HConnectionManager$TableServers(780): Cache hit for row <> in tableName 
.META.: location server 127.0.0.1:59090, location region name .META.,,1
2010-02-24 19:45:25,157 INFO  [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(79): 
keyvalues={test,,1267069500020/info:regioninfo/1267069516682/Put/vlen=254, 
test,,1267069500020/info:server/1267069501165/Put/vlen=15, 
test,,1267069500020/info:serverstartcode/1267069501165/Put/vlen=8, 
test,,1267069500020/info:splitA/1267069516682/Put/vlen=257, 
test,,1267069500020/info:splitA_checked/1267069524944/Put/vlen=1, 
test,,1267069500020/info:splitB/1267069516682/Put/vlen=260, 
test,,1267069500020/info:splitB_checked/1267069524948/Put/vlen=1}
2010-02-24 19:45:25,158 INFO  [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(79): 
keyvalues={test,,1267069516539/info:regioninfo/1267069524943/Put/vlen=257}
2010-02-24 19:45:25,158 INFO  [WaitOnSplit] 
client.TestForceSplit$WaitOnSplit(79): 
keyvalues={test,lqg,1267069516539/info:regioninfo/1267069524948/Put/vlen=260}
2010-02-24 19:45:25,159 DEBUG [WaitOnSplit] client.HTable$ClientScanner(1915): 
Finished with scanning at REGION => {NAME => '.META.,,1', STARTKEY => '', 
ENDKEY => '', ENCODED => 1028785192, TABLE => {{NAME => '.META.', IS_META => 
'true', MEMSTORE_FLUSHSIZE => '16384', FAMILIES => [{NAME => 'historian', 
VERSIONS => '2147483647', COMPRESSION => 'NONE', TTL => '604800', BLOCKSIZE => 
'8192', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'info', VERSIONS 
=> '10', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '8192', 
IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
2010-02-24 19:45:25,160 ERROR [IPC Server handler 4 on 59090] 
regionserver.HRegionServer(862): 
org.apache.hadoop.hbase.NotServingRegionException: test,,1267069500020 is 
closing=true or closed=true
        at 
org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1787)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1907)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
        at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2010-02-24 19:45:25,175 DEBUG [main] client.HTable$ClientScanner(1928): 
Advancing internal scanner to startKey at 'aaa'
2010-02-24 19:45:25,177 ERROR [IPC Server handler 2 on 59090] 
regionserver.HRegionServer(864): Failed openScanner
org.apache.hadoop.hbase.NotServingRegionException: test,,1267069500020
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2279)
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1858)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
        at 
org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2010-02-24 19:45:30,186 DEBUG [main] 
client.HConnectionManager$TableServers(857): Removed test,,1267069500020 for 
tableName=test from cache because of aaa
2010-02-24 19:45:30,188 DEBUG [main] 
client.HConnectionManager$TableServers(734): locateRegionInMeta attempt 0 of 3 
failed; retrying after sleep of 5000 because: No server address listed in 
.META. for region test,,1267069516539
2010-02-24 19:45:34,656 INFO  [Thread-103] 
master.ServerManager$ServerMonitor(129): 1 region servers, 0 dead, average load 
2.0
2010-02-24 19:45:34,920 INFO  [RegionManager.rootScanner] 
master.BaseScanner(163): RegionManager.rootScanner scanning meta region 
{server: 127.0.0.1:59090, regionname: -ROOT-,,0, startKey: <>}
2010-02-24 19:45:34,920 INFO  [RegionManager.metaScanner] 
master.BaseScanner(163): RegionManager.metaScanner scanning meta region 
{server: 127.0.0.1:59090, regionname: .META.,,1, startKey: <>}
2010-02-24 19:45:34,932 DEBUG [RegionManager.metaScanner] 
master.BaseScanner(572): Current assignment of test,,1267069516539 is not 
valid;  serverAddress=, startCode=0 unknown.
2010-02-24 19:45:34,934 DEBUG [RegionManager.metaScanner] 
master.BaseScanner(572): Current assignment of test,lqg,1267069516539 is not 
valid;  serverAddress=, startCode=0 unknown.
2010-02-24 19:45:34,940 INFO  [RegionManager.rootScanner] 
master.BaseScanner(242): RegionManager.rootScanner scan of 1 row(s) of meta 
region {server: 127.0.0.1:59090, regionname: -ROOT-,,0, startKey: <>} complete
2010-02-24 19:45:34,941 INFO  [RegionManager.metaScanner] 
master.BaseScanner(242): RegionManager.metaScanner scan of 3 row(s) of meta 
region {server: 127.0.0.1:59090, regionname: .META.,,1, startKey: <>} complete
2010-02-24 19:45:34,941 INFO  [RegionManager.metaScanner] 
master.MetaScanner(132): All 1 .META. region(s) scanned
2010-02-24 19:45:35,189 DEBUG [main] 
client.HConnectionManager$TableServers(734): locateRegionInMeta attempt 1 of 3 
failed; retrying after sleep of 5000 because: No server address listed in 
.META. for region test,,1267069516539
2010-02-24 19:45:35,315 INFO  [IPC Server handler 4 on 60000] 
master.RegionManager(337): Assigning region test,,1267069516539 to 
localhost,59090,1267069494948
2010-02-24 19:45:35,315 INFO  [IPC Server handler 4 on 60000] 
master.RegionManager(337): Assigning region test,lqg,1267069516539 to 
localhost,59090,1267069494948
2010-02-24 19:45:35,316 INFO  [RegionServer:0] regionserver.HRegionServer(499): 
MSG_REGION_OPEN: test,,1267069516539
2010-02-24 19:45:35,317 INFO  [RegionServer:0] regionserver.HRegionServer(499): 
MSG_REGION_OPEN: test,lqg,1267069516539
{code}

2010-02-24 19:26:50,203 and 2010-02-24 19:26:50,209 are the new stuff.  See how 
subsequently the daughter rows added by the fixup are assigned and the 
verification scan -- the test is to prove scans can ride across splits -- 
completes and the unit tests fails.

I'm going to commit the patch unless objection.

This patch adds some load on .META.: two new Gets after every split.  Shouldn't 
be too bad.  Splits are rare enough.

> META gets inconsistent in a number of crash scenarios
> -----------------------------------------------------
>
>                 Key: HBASE-2244
>                 URL: https://issues.apache.org/jira/browse/HBASE-2244
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.20.4
>
>         Attachments: 2244-v3.patch, 2244-v5.patch, 2244.patch
>
>
> (Forking this issue off from HBASE-2235).
> During load testing, in a number of failure scenarios (unexpected region 
> server deaths) etc., we notice that META can get inconsistent. This primarily 
> happens for regions which are in the process of being split. Manually running 
> add_table.rb seems to fix the tables meta data just fine. 
> But it would be good to do automatic cleansing (as part of META scanners 
> work) and/or avoid these inconsistent states altogether.
> For example, for a particular startkey, I see all these entries:
> {code}
> test1,1204765,1266569946560 column=info:regioninfo, timestamp=1266581302018, 
> value=REGION => {NAME => 'test1,
>                              1204765,1266569946560', STARTKEY => '1204765', 
> ENDKEY => '1441091', ENCODED => 18
>                              19368969, OFFLINE => true, SPLIT => true, TABLE 
> => {{NAME => 'test1', FAMILIES =>
>                               [{NAME => 'actions', VERSIONS => '3', 
> COMPRESSION => 'NONE', TTL => '2147483647'
>                              , BLOCKSIZE => '65536', IN_MEMORY => 'false', 
> BLOCKCACHE => 'true'}]}}
>  test1,1204765,1266569946560 column=info:server, timestamp=1266570029133, 
> value=10.129.68.212:60020
>  test1,1204765,1266569946560 column=info:serverstartcode, 
> timestamp=1266570029133, value=1266562597546
>  test1,1204765,1266569946560 column=info:splitB, timestamp=1266581302018, 
> value=\x00\x071441091\x00\x00\x00\x0
>                              
> 1\x26\xE6\x1F\xDF\x27\x1Btest1,1290703,1266581233447\x00\x071290703\x00\x00\x00\x
>                              
> 05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x
>                              
> 00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00
>                              
> \x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSI
>                              
> ON\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TT
>                              
> L\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00
>                              
> \x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04t
>                              rueh\x0FQ\xCF
>  test1,1204765,1266581233447 column=info:regioninfo, timestamp=1266609172177, 
> value=REGION => {NAME => 'test1,
>                              1204765,1266581233447', STARTKEY => '1204765', 
> ENDKEY => '1290703', ENCODED => 13
>                              73493090, OFFLINE => true, SPLIT => true, TABLE 
> => {{NAME => 'test1', FAMILIES =>
>                               [{NAME => 'actions', VERSIONS => '3', 
> COMPRESSION => 'NONE', TTL => '2147483647'
>                              , BLOCKSIZE => '65536', IN_MEMORY => 'false', 
> BLOCKCACHE => 'true'}]}}
>  test1,1204765,1266581233447 column=info:server, timestamp=1266604768670, 
> value=10.129.68.213:60020
>  test1,1204765,1266581233447 column=info:serverstartcode, 
> timestamp=1266604768670, value=1266562597511
>  test1,1204765,1266581233447 column=info:splitA, timestamp=1266609172177, 
> value=\x00\x071226169\x00\x00\x00\x0
>                              
> 1\x26\xE7\xCA,\x7D\x1Btest1,1204765,1266609171581\x00\x071204765\x00\x00\x00\x05\
>                              
> x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
>                              
> x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
>                              
> 0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
>                              
> x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
>                              
> 00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
>                              
> 0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>                              \xB9\xBD\xFEO
>  test1,1204765,1266581233447 column=info:splitB, timestamp=1266609172177, 
> value=\x00\x071290703\x00\x00\x00\x0
>                              
> 1\x26\xE7\xCA,\x7D\x1Btest1,1226169,1266609171581\x00\x071226169\x00\x00\x00\x05\
>                              
> x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
>                              
> x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
>                              
> 0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
>                              
> x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
>                              
> 00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
>                              
> 0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>                              \xE1\xDF\xF8p
>  test1,1204765,1266609171581 column=info:regioninfo, timestamp=1266609172212, 
> value=REGION => {NAME => 'test1,
>                              1204765,1266609171581', STARTKEY => '1204765', 
> ENDKEY => '1226169', ENCODED => 21
>                              34878372, TABLE => {{NAME => 'test1', FAMILIES 
> => [{NAME => 'actions', VERSIONS =
>                              > '3', COMPRESSION => 'NONE', TTL => 
> '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                              Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to