[
https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13100367#comment-13100367
]
ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------
@J-D
bq.You could also try doing a worst case cold startup by killing -9 all HBase
components at the same time (more or less) and then restarting them all (also
after data was added). Finally you could try setting a super low timeout
setting, like 5 seconds, to trigger RIT timeouts by the hundreds.
I ran the tests again, specifically with a 5-second timeout: killed the
cluster, restarted it, randomly killed RSs, and also invoked the balancer
command. All 4003 regions came back across the 3 RSs. The TimeoutMonitor
reported these timed-out region counts across the runs:
{noformat}
***** The number of timed out regions **** 938
***** The number of timed out regions **** 270
***** The number of timed out regions **** 673
***** The number of timed out regions **** 269
***** The number of timed out regions **** 941
***** The number of timed out regions **** 942
***** The number of timed out regions **** 941
{noformat}
The hbck result was also positive:
{noformat}
Summary:
-ROOT- is okay.
Number of regions: 1
Deployed on: HOST-10-18-52-253,60020,1315480076091
.META. is okay.
Number of regions: 1
Deployed on: HOST-10-18-52-253,60020,1315480076091
testram2 is okay.
Number of regions: 4001
Deployed on: HOST-10-18-52-108,60020,1315480229321
HOST-10-18-52-253,60020,1315480076091
0 inconsistencies detected.
Status: OK
{noformat}
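For anyone reproducing this run, here is a minimal sketch of dropping the
assignment timeout to 5 seconds, assuming the 0.90-era property name
hbase.master.assignment.timeoutmonitor.timeout; the class name and the
standalone main are illustration only, not the actual test code:
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LowAssignmentTimeout {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Assumption: the property the master's TimeoutMonitor reads in 0.90.x
    // (default 180000 ms); 5000 ms makes in-transition regions time out
    // almost immediately, triggering RIT timeouts by the hundreds.
    conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 5000);
    System.out.println("timeout = "
        + conf.getInt("hbase.master.assignment.timeoutmonitor.timeout", 180000));
  }
}
{noformat}
In a real run this configuration would be in hbase-site.xml (or passed to a
mini-cluster) before starting the master.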
> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
> Key: HBASE-4015
> URL: https://issues.apache.org/jira/browse/HBASE-4015
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: HBASE-4015_1_trunk.patch, HBASE-4015_2_trunk.patch,
> HBASE-4015_reprepared_trunk_2.patch, Timeoutmonitor with state diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition
> generator, mostly making things worse rather than better. It does its own
> thing for a while without caring about what's happening in the rest of the
> master.
> The first thing that needs to happen is that the regions should not be
> processed in one big batch, because that can sometimes take minutes; in the
> meantime a region that timed out opening may have already opened, yet the
> TimeoutMonitor will still reassign it, generating the never-ending
> PENDING_OPEN situation.
> Those operations should also be done more atomically, although I'm not sure
> how to do that in a scalable way in this case.
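To make the race concrete, here is a minimal sketch with hypothetical names
(this is not the actual AssignmentManager code): the monitor re-checks each
region's state and timestamp immediately before acting, so a region that
finished opening during the sweep is not blindly reassigned.
{noformat}
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a per-region, state-rechecking timeout monitor.
class TimeoutMonitorSketch {
  enum State { PENDING_OPEN, OPENING, OPEN }

  static class RegionInTransition {
    final String regionName;
    volatile State state;
    volatile long stamp; // time of the last state change

    RegionInTransition(String regionName, State state, long stamp) {
      this.regionName = regionName;
      this.state = state;
      this.stamp = stamp;
    }
  }

  final ConcurrentHashMap<String, RegionInTransition> regionsInTransition =
      new ConcurrentHashMap<String, RegionInTransition>();
  final long timeoutMillis = 5000; // e.g. the 5s used in the test above

  void chore() {
    long now = System.currentTimeMillis();
    for (RegionInTransition rit : regionsInTransition.values()) {
      // Re-check state and stamp per region, instead of acting on a
      // snapshot taken minutes earlier for one big batch.
      if (rit.state == State.PENDING_OPEN && now - rit.stamp > timeoutMillis) {
        reassign(rit); // only reassign if it is *still* timed out
      }
    }
  }

  void reassign(RegionInTransition rit) {
    // Placeholder: the real master would re-invoke the assignment path.
    rit.stamp = System.currentTimeMillis();
  }
}
{noformat}
Acting per region rather than on one stale batch also shortens the window in
which a concurrent open can invalidate the monitor's decision.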