[
https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100367#comment-13100367
]
ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------
@J-D
bq.You could also try doing a worst case cold startup by killing -9 all HBase
components at the same time (more or less) and then restarting them all (also
after data was added). Finally you could try setting a super low timeout
setting, like 5 seconds, to trigger RIT timeouts by the hundreds.
I ran the tests again, specifically with a 5-second timeout: killed the
cluster, restarted it, randomly killed RSs, and also invoked the balancer
command. All 4003 regions came back across the 3 RSs. The TimeoutMonitor
reported these timed-out region counts across the runs:
{noformat}
***** The number of timed out regions **** 938
***** The number of timed out regions **** 270
***** The number of timed out regions **** 673
***** The number of timed out regions **** 269
***** The number of timed out regions **** 941
***** The number of timed out regions **** 942
***** The number of timed out regions **** 941
{noformat}
The hbck result was also positive:
{noformat}
Summary:
-ROOT- is okay.
Number of regions: 1
Deployed on: HOST-10-18-52-253,60020,1315480076091
.META. is okay.
Number of regions: 1
Deployed on: HOST-10-18-52-253,60020,1315480076091
testram2 is okay.
Number of regions: 4001
Deployed on: HOST-10-18-52-108,60020,1315480229321
HOST-10-18-52-253,60020,1315480076091
0 inconsistencies detected.
Status: OK
{noformat}
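For anyone reproducing this run, here is a minimal sketch of dropping the
assignment timeout to 5 seconds, assuming the 0.90-era property name
hbase.master.assignment.timeoutmonitor.timeout; the class name and the
standalone main are illustration only, not the actual test code:
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LowAssignmentTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Assumption: the property the master's TimeoutMonitor reads in 0.90.x
    // (default 180000 ms); 5000 ms makes in-transition regions time out
    // almost immediately, triggering RIT timeouts by the hundreds.
    conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 5000);
    System.out.println("timeout = "
        + conf.getInt("hbase.master.assignment.timeoutmonitor.timeout", 180000));
  }
}
{noformat}
In a real run this configuration would be in hbase-site.xml (or passed to a
mini-cluster) before starting the master.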
> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
> Key: HBASE-4015
> URL: https://issues.apache.org/jira/browse/HBASE-4015
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: HBASE-4015_1_trunk.patch, HBASE-4015_2_trunk.patch,
> HBASE-4015_reprepared_trunk_2.patch, Timeoutmonitor with state diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition
> generator, mostly making things worse rather than better. It does its own
> thing for a while without caring about what's happening in the rest of the
> master.
> The first thing that needs to happen is that the regions should not be
> processed in one big batch, because that can sometimes take minutes; in the
> meantime a region that timed out opening may have already opened, yet the
> TimeoutMonitor will still reassign it, generating the never-ending
> PENDING_OPEN situation.
> Those operations should also be done more atomically, although I'm not sure
> how to do that in a scalable way in this case.
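To make the race concrete, here is a minimal sketch with hypothetical names
(this is not the actual AssignmentManager code): the monitor re-checks each
region's state and timestamp immediately before acting, so a region that
finished opening during the sweep is not blindly reassigned.
{noformat}
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a per-region, state-rechecking timeout monitor.
class TimeoutMonitorSketch {
  enum State { PENDING_OPEN, OPENING, OPEN }

  static class RegionInTransition {
    final String regionName;
    volatile State state;
    volatile long stamp; // time of the last state change

    RegionInTransition(String regionName, State state, long stamp) {
      this.regionName = regionName;
      this.state = state;
      this.stamp = stamp;
    }
  }

  final ConcurrentHashMap<String, RegionInTransition> regionsInTransition =
      new ConcurrentHashMap<String, RegionInTransition>();
  final long timeoutMillis = 5000; // e.g. the 5s used in the test above

  void chore() {
    long now = System.currentTimeMillis();
    for (RegionInTransition rit : regionsInTransition.values()) {
      // Re-check state and stamp per region, instead of acting on a
      // snapshot taken minutes earlier for one big batch.
      if (rit.state == State.PENDING_OPEN && now - rit.stamp > timeoutMillis) {
        reassign(rit); // only reassign if it is *still* timed out
      }
    }
  }

  void reassign(RegionInTransition rit) {
    // Placeholder: the real master would re-invoke the assignment path.
    rit.stamp = System.currentTimeMillis();
  }
}
{noformat}
Acting per region rather than on one stale batch also shortens the window in
which a concurrent open can invalidate the monitor's decision.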