[ 
https://issues.apache.org/jira/browse/HBASE-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836422#action_12836422
 ] 

Kannan Muthukkaruppan commented on HBASE-2244:
----------------------------------------------

Stack wrote: <<<< In the .META. listing posted above, there are some 
interesting issues. We still have a reference to a daughter, splitB, in the 
first offlined (row) region, yet the next row is a daughter that has been 
offlined itself. There may be a race in here if we're splitting fast. Let me 
check it out and see if a fix.>>>

Yes, I have seen several cases where nested splits happen before the offlined 
parent row has been reaped. But perhaps that in itself isn't an issue. For 
example, corresponding to my first .META. snippet in this JIRA:

The split of test1,1204765,1266569946560 was announced @4:08:

{code}
2010-02-19 04:08:23,764 INFO org.apache.hadoop.hbase.master.ServerManager: 
Processing MSG_REPORT_SPLIT: test1,1204765,1266569946560: Daughters; 
test1,1204765,1266581233447, test1,1290703,1266581233447 from 
test013.abcxyz.com,60020,1266562597546; 1 of 3
{code}

But reclaiming the offlined parent row from .META. took time. First, we 
detected that one of the daughters no longer referenced it at about 11:53:
{code}
2010-02-19 11:53:46,673 DEBUG org.apache.hadoop.hbase.master.BaseScanner: 
test1,1204765,1266581233447/1373493090 no longer has references to 
test1,1204765,1266569946560
{code}

The second daughter dropped its references at about 14:01. Only at that point 
did we delete the offlined parent row:
{code}
2010-02-19 14:01:48,283 DEBUG org.apache.hadoop.hbase.master.BaseScanner: 
test1,1290703,1266581233447/580635726 no longer has references to 
test1,1204765,1266569946560
2010-02-19 14:01:48,299 INFO org.apache.hadoop.hbase.master.BaseScanner: 
Deleting region test1,1204765,1266569946560 (encoded=1819368969) because 
daughter splits no longer hold references
{code}
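
The behaviour these log lines suggest can be written down as a rough sketch. 
This is not the BaseScanner code: the names ParentRow, 
daughter_dropped_reference and can_delete_parent are hypothetical, and the 
only assumption is the one the messages themselves state, namely that the 
parent row is deleted once neither daughter holds references to it.

```python
from dataclasses import dataclass, field


@dataclass
class ParentRow:
    """Hypothetical model of an offlined split parent row in .META."""
    name: str
    # Daughters that still hold references to the parent.
    referencing_daughters: set = field(default_factory=set)


def daughter_dropped_reference(parent: ParentRow, daughter: str) -> None:
    """Record that a daughter no longer references the parent."""
    parent.referencing_daughters.discard(daughter)


def can_delete_parent(parent: ParentRow) -> bool:
    """The parent row is reclaimable only when no daughter references it."""
    return not parent.referencing_daughters


parent = ParentRow(
    name="test1,1204765,1266569946560",
    referencing_daughters={
        "test1,1204765,1266581233447",
        "test1,1290703,1266581233447",
    },
)

# 11:53: the first daughter drops its references; the parent must stay.
daughter_dropped_reference(parent, "test1,1204765,1266581233447")
after_first = can_delete_parent(parent)

# 14:01: the second daughter drops its references; the parent can go.
daughter_dropped_reference(parent, "test1,1290703,1266581233447")
after_second = can_delete_parent(parent)
```

Under this model the parent row stays in .META. for as long as the slower of 
the two daughters keeps its references, which matches the 11:53 and 14:01 
messages above.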

Naturally, given this wide window (nearly ten hours in the example above), it 
is not uncommon to see rows corresponding to nested splits in .META. In most 
of these cases, .META. eventually seems to fix itself. But it still seems odd 
to me that it takes so much time. 
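
To put a number on that window, here is a small sketch that takes the four 
timestamps from the master log lines quoted above and computes the elapsed 
times. The format string and variable names are my own; only the timestamps 
come from the logs.

```python
from datetime import datetime

# Timestamp layout used in the master log lines quoted above.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_ts(ts: str) -> datetime:
    """Parse a log4j-style timestamp such as '2010-02-19 04:08:23,764'."""
    return datetime.strptime(ts, LOG_TS_FORMAT)


split_reported = parse_log_ts("2010-02-19 04:08:23,764")
first_daughter_clear = parse_log_ts("2010-02-19 11:53:46,673")
second_daughter_clear = parse_log_ts("2010-02-19 14:01:48,283")
parent_deleted = parse_log_ts("2010-02-19 14:01:48,299")

# Time until each daughter stopped referencing the parent.
first_wait = first_daughter_clear - split_reported
second_wait = second_daughter_clear - split_reported

# Total time the offlined parent row lingered in .META.
window = parent_deleted - split_reported

print("first daughter clear after ", first_wait)
print("second daughter clear after", second_wait)
print("parent row lingered for    ", window)
```

For this region the parent row lingered for roughly 9 hours 53 minutes after 
the split was reported, while the delete itself followed the second 
daughter's message within the same second.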

During one of these situations, I saw the client get errors of the form:

10/02/19 09:09:37 INFO tests.MultiThreadedWriter: [22] Users = 1052116, mails = 
1M, time = 10:10:53
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact 
region server 10.129.68.212:60020 for region\
 test1,1204765,1266581233447, row '1232785', but failed after 10 attempts.

I initially assumed that this was related to .META. being in a weird state 
(i.e. the offlined parent not being deleted). But looking at the logs, these 
client errors happened during a much shorter period (8:49 to 9:09) and were 
likely due to other load issues on that particular region server. I will post 
any findings from that RS's logs shortly.






> META gets inconsistent in a number of crash scenarios
> -----------------------------------------------------
>
>                 Key: HBASE-2244
>                 URL: https://issues.apache.org/jira/browse/HBASE-2244
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: Kannan Muthukkaruppan
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.20.4
>
>
> (Forking this issue off from HBASE-2235).
> During load testing, in a number of failure scenarios (unexpected region 
> server deaths) etc., we notice that META can get inconsistent. This primarily 
> happens for regions which are in the process of being split. Manually running 
> add_table.rb seems to fix the tables meta data just fine. 
> But it would be good to do automatic cleansing (as part of META scanners 
> work) and/or avoid these inconsistent states altogether.
> For example, for a particular startkey, I see all these entries:
> {code}
> test1,1204765,1266569946560 column=info:regioninfo, timestamp=1266581302018, 
> value=REGION => {NAME => 'test1,
>                              1204765,1266569946560', STARTKEY => '1204765', 
> ENDKEY => '1441091', ENCODED => 18
>                              19368969, OFFLINE => true, SPLIT => true, TABLE 
> => {{NAME => 'test1', FAMILIES =>
>                               [{NAME => 'actions', VERSIONS => '3', 
> COMPRESSION => 'NONE', TTL => '2147483647'
>                              , BLOCKSIZE => '65536', IN_MEMORY => 'false', 
> BLOCKCACHE => 'true'}]}}
>  test1,1204765,1266569946560 column=info:server, timestamp=1266570029133, 
> value=10.129.68.212:60020
>  test1,1204765,1266569946560 column=info:serverstartcode, 
> timestamp=1266570029133, value=1266562597546
>  test1,1204765,1266569946560 column=info:splitB, timestamp=1266581302018, 
> value=\x00\x071441091\x00\x00\x00\x0
>                              
> 1\x26\xE6\x1F\xDF\x27\x1Btest1,1290703,1266581233447\x00\x071290703\x00\x00\x00\x
>                              
> 05\x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x
>                              
> 00\x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00
>                              
> \x00\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSI
>                              
> ON\x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TT
>                              
> L\x00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00
>                              
> \x00\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04t
>                              rueh\x0FQ\xCF
>  test1,1204765,1266581233447 column=info:regioninfo, timestamp=1266609172177, 
> value=REGION => {NAME => 'test1,
>                              1204765,1266581233447', STARTKEY => '1204765', 
> ENDKEY => '1290703', ENCODED => 13
>                              73493090, OFFLINE => true, SPLIT => true, TABLE 
> => {{NAME => 'test1', FAMILIES =>
>                               [{NAME => 'actions', VERSIONS => '3', 
> COMPRESSION => 'NONE', TTL => '2147483647'
>                              , BLOCKSIZE => '65536', IN_MEMORY => 'false', 
> BLOCKCACHE => 'true'}]}}
>  test1,1204765,1266581233447 column=info:server, timestamp=1266604768670, 
> value=10.129.68.213:60020
>  test1,1204765,1266581233447 column=info:serverstartcode, 
> timestamp=1266604768670, value=1266562597511
>  test1,1204765,1266581233447 column=info:splitA, timestamp=1266609172177, 
> value=\x00\x071226169\x00\x00\x00\x0
>                              
> 1\x26\xE7\xCA,\x7D\x1Btest1,1204765,1266609171581\x00\x071204765\x00\x00\x00\x05\
>                              
> x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
>                              
> x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
>                              
> 0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
>                              
> x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
>                              
> 00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
>                              
> 0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>                              \xB9\xBD\xFEO
>  test1,1204765,1266581233447 column=info:splitB, timestamp=1266609172177, 
> value=\x00\x071290703\x00\x00\x00\x0
>                              
> 1\x26\xE7\xCA,\x7D\x1Btest1,1226169,1266609171581\x00\x071226169\x00\x00\x00\x05\
>                              
> x05test1\x00\x00\x00\x00\x00\x02\x00\x00\x00\x07IS_ROOT\x00\x00\x00\x05false\x00\
>                              
> x00\x00\x07IS_META\x00\x00\x00\x05false\x00\x00\x00\x01\x07\x07actions\x00\x00\x0
>                              
> 0\x07\x00\x00\x00\x0BBLOOMFILTER\x00\x00\x00\x05false\x00\x00\x00\x0BCOMPRESSION\
>                              
> x00\x00\x00\x04NONE\x00\x00\x00\x08VERSIONS\x00\x00\x00\x013\x00\x00\x00\x03TTL\x
>                              
> 00\x00\x00\x0A2147483647\x00\x00\x00\x09BLOCKSIZE\x00\x00\x00\x0565536\x00\x00\x0
>                              
> 0\x09IN_MEMORY\x00\x00\x00\x05false\x00\x00\x00\x0ABLOCKCACHE\x00\x00\x00\x04true
>                              \xE1\xDF\xF8p
>  test1,1204765,1266609171581 column=info:regioninfo, timestamp=1266609172212, 
> value=REGION => {NAME => 'test1,
>                              1204765,1266609171581', STARTKEY => '1204765', 
> ENDKEY => '1226169', ENCODED => 21
>                              34878372, TABLE => {{NAME => 'test1', FAMILIES 
> => [{NAME => 'actions', VERSIONS =
>                              > '3', COMPRESSION => 'NONE', TTL => 
> '2147483647', BLOCKSIZE => '65536', IN_MEMOR
>                              Y => 'false', BLOCKCACHE => 'true'}]}}
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
