[
https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084007#comment-13084007
]
ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------
bq. You are working on TRUNK Ram?
Yes, Stack.
bq. Won't your code have to check for both REALLOCATE and OFFLINE, and the
presence of either means it's OK to proceed to OPENING (and then aren't
REALLOCATE and OFFLINE the 'same' state, because the presence of either will
mean proceed to OPENING?).
Yes, this is what my patch does. But why do we do the same operation for both
states?
Previously, if the state had changed to anything other than OFFLINE while
moving to OPENING, we aborted. Now there is an additional state which says it
is OK to go to OPENING if you find me in RE_ALLOCATE and the server name in me
is the same as your RS address. This avoids the problem of a region being
hijacked unnecessarily even though the RS was doing its work correctly.
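
To make the rule above concrete, here is a minimal sketch of the check. It is
not the code from the patch: RegionState, UnassignedNode, OpeningCheck and
mayTransitionToOpening are hypothetical names used only for illustration.

```java
import java.util.Objects;

// Hypothetical names throughout; this is not the HBASE-4015 patch itself.
enum RegionState { OFFLINE, RE_ALLOCATE, OPENING, OPENED, CLOSING }

final class UnassignedNode {
    final RegionState state;
    final String serverName; // server name recorded in the node

    UnassignedNode(RegionState state, String serverName) {
        this.state = state;
        this.serverName = serverName;
    }
}

final class OpeningCheck {
    /** Returns true if the RS at rsAddress may move the node to OPENING. */
    static boolean mayTransitionToOpening(UnassignedNode node, String rsAddress) {
        switch (node.state) {
            case OFFLINE:
                // The original rule: OFFLINE always allows OPENING.
                return true;
            case RE_ALLOCATE:
                // The additional rule: allowed only if the node names this RS.
                return Objects.equals(node.serverName, rsAddress);
            default:
                // Any other state: abort, as before.
                return false;
        }
    }
}
```

With this shape, an RS that finds RE_ALLOCATE carrying its own address
continues to OPENING, while an RS that finds another server's name aborts.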
bq. So, why not just add machine name to OFFLINE? Then we don't need REALLOCATE
state?
As you have already pointed out, currently no version is passed from the
master to the RS. That is why a new state is needed. If passing a version had
been possible, OFFLINE with the version passed by the master would have been
sufficient.
bq. So, figuring how to deal with timeout of regions in PENDING_OPEN is one
aspect of this issue, right? The verification of state over in timeout monitor
before acting is another aspect?
Yes, Stack. We have covered both of these aspects and also the points raised
by JD: taking action on a timeout immediately, and a mechanism for both the
master and the RS to know what happened as part of the timeout, so that
whoever wins the race succeeds.
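
The "whoever wins the race succeeds" idea can be sketched with a conditional
update: both sides attempt a transition from the same expected state, only one
attempt succeeds, and the loser learns it from the return value. The sketch
below uses an in-memory AtomicReference as a stand-in for the shared node, and
NodeState, RaceSketch and its methods are hypothetical names. It does not
model the server-name check for RE_ALLOCATE.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names; an in-memory stand-in for the shared state node.
enum NodeState { OFFLINE, RE_ALLOCATE, OPENING }

final class RaceSketch {
    private final AtomicReference<NodeState> state =
        new AtomicReference<>(NodeState.OFFLINE);

    /** RS side: tries OFFLINE -> OPENING; false means the master got there first. */
    boolean rsTryOpening() {
        return state.compareAndSet(NodeState.OFFLINE, NodeState.OPENING);
    }

    /** Master side on timeout: tries OFFLINE -> RE_ALLOCATE; false means the RS got there first. */
    boolean masterTryReallocate() {
        return state.compareAndSet(NodeState.OFFLINE, NodeState.RE_ALLOCATE);
    }

    NodeState current() {
        return state.get();
    }
}
```

Because each side checks the expected state and writes in one atomic step,
neither side can overwrite the other's transition without noticing.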
bq. (I believe it acts a little differently from 0.90 because of recent work
done in here).
Regarding the timeout monitor, the one major change is that the CLOSING state
node is now created by the master itself, whereas in 0.90 it was created by
the RS. Apart from this I have not found any big difference so far. As part of
HBASE-4083 we introduced return types from OpenRegionHandler which take care
of scenarios where a race occurs between the master changing the state to
RE_ALLOCATE and the RS having already moved to OPENED.
> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
> Key: HBASE-4015
> URL: https://issues.apache.org/jira/browse/HBASE-4015
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: HBASE-4015_1_trunk.patch, Timeoutmonitor with state
> diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition
> generator, mostly making things worse rather than better. It does its own
> thing for a while without caring for what's happening in the rest of the
> master.
> The first thing that needs to happen is that the regions should not be
> processed in one big batch, because that sometimes can take minutes to
> process (meanwhile a region that timed out opening might have opened, then
> what happens is it will be reassigned by the TimeoutMonitor generating the
> never ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure
> how to do it in a scalable way in this case.
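
One possible way to avoid acting on a stale batch, sketched under assumptions:
handle timed-out regions one at a time and re-check each region's state in the
same atomic step that acts on it. PerRegionTimeoutSketch, regionsInTransition
and handleTimeouts are hypothetical names, the map is an in-memory stand-in
for the master's view of regions in transition, and resetting the state to
OFFLINE stands in for whatever reassignment actually does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PerRegionTimeoutSketch {
    // Hypothetical view of regions in transition: region name -> current state.
    final Map<String, String> regionsInTransition = new ConcurrentHashMap<>();

    /**
     * Handles timed-out regions one at a time. A region is only acted on if
     * it is still PENDING_OPEN at the moment of the update, so a region that
     * opened in the meantime is skipped. Returns the regions acted on.
     */
    List<String> handleTimeouts(List<String> timedOut) {
        List<String> reassigned = new ArrayList<>();
        for (String region : timedOut) {
            // replace(key, expected, updated) is atomic per key and succeeds
            // only if the current value equals the expected one.
            if (regionsInTransition.replace(region, "PENDING_OPEN", "OFFLINE")) {
                reassigned.add(region);
            }
        }
        return reassigned;
    }
}
```

A region that moved to OPENED while the list was being collected fails the
conditional replace and is left alone instead of being reassigned.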