[jira] [Commented] (HBASE-4015) Refactor the TimeoutMonitor to make it less racy

ramkrishna.s.vasudevan (JIRA) Fri, 12 Aug 2011 01:53:09 -0700

    [ https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084007#comment-13084007 ]


ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------

bq. You are working on TRUNK, Ram?
Yes, Stack.

bq. Won't your code have to check for both REALLOCATE and OFFLINE, and the 
presence of either means it's OK to proceed to OPENING (and then aren't 
REALLOCATE and OFFLINE the 'same' state, because the presence of either will 
mean proceed to OPENING?).

Yes, this is what my patch does. But why do we perform the same operation for 
both states?
Previously, if the state had changed to anything other than OFFLINE while 
moving to OPENING, we aborted. RE_ALLOCATE is an additional state that says it 
is OK to go to OPENING if you find the node in RE_ALLOCATE and the server name 
in it is the same as your RS address. This avoids the problem of a region being 
hijacked unnecessarily even though the RS was doing its work correctly.
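
A minimal sketch of the check described above, in Java. This is not the patch 
itself: the class OpeningTransitionSketch, the NodeData holder, the method 
canMoveToOpening and the server addresses are all hypothetical names used only 
for illustration; the state names are taken from this discussion.

```java
import java.util.Objects;

public class OpeningTransitionSketch {
    // State names follow the discussion; this is not the actual HBase enum.
    enum State { OFFLINE, RE_ALLOCATE, OPENING, OPENED, CLOSING }

    // Hypothetical stand-in for the data stored in the region's node.
    static final class NodeData {
        final State state;
        final String serverName;

        NodeData(State state, String serverName) {
            this.state = state;
            this.serverName = serverName;
        }
    }

    // Returns true if the RS identified by rsAddress may move the node to OPENING.
    static boolean canMoveToOpening(NodeData node, String rsAddress) {
        if (node.state == State.OFFLINE) {
            return true; // the case that was already allowed
        }
        if (node.state == State.RE_ALLOCATE) {
            // Only the RS whose name is recorded in the node may proceed.
            return Objects.equals(node.serverName, rsAddress);
        }
        return false; // any other state: abort, as before
    }

    public static void main(String[] args) {
        NodeData reallocated = new NodeData(State.RE_ALLOCATE, "rs1:60020");
        System.out.println(canMoveToOpening(reallocated, "rs1:60020"));
        System.out.println(canMoveToOpening(reallocated, "rs2:60020"));
    }
}
```

With these assumptions, an RS that finds its own name in a RE_ALLOCATE node 
proceeds to OPENING, while any other RS aborts.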

bq. So, why not just add machine name to OFFLINE? Then we don't need REALLOCATE 
state? 
As you have already pointed out, currently no version is passed from the 
master to the RS. That's why a new state is needed. If that had been possible, 
OFFLINE with a version passed by the master would have been sufficient.

bq. So, figuring out how to deal with timeout of regions in PENDING_OPEN is one 
aspect of this issue, right? The verification of state over in timeout monitor 
before acting is another aspect?
Yes, Stack. We have covered both of these aspects and also the points raised by 
JD: taking action on timeout immediately, and a mechanism for both master and 
RS to know what happened as part of the timeout, so that whoever wins the race 
succeeds.
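
The "whoever wins the race succeeds" idea can be sketched with a 
compare-and-set on the node state. This is only an in-memory model under 
simplifying assumptions: the real node lives in ZooKeeper, the class 
TimeoutRaceSketch and its method names are hypothetical, and the sketch ignores 
the server name stored in the node.

```java
import java.util.concurrent.atomic.AtomicReference;

public class TimeoutRaceSketch {
    // State names follow the discussion; this is not the actual HBase enum.
    enum State { OFFLINE, RE_ALLOCATE, OPENING }

    // In-memory stand-in for the region's node.
    private final AtomicReference<State> node =
            new AtomicReference<>(State.OFFLINE);

    // Master side: on timeout, try to mark the region for reallocation.
    // Succeeds only if the node is still OFFLINE.
    boolean masterOnTimeout() {
        return node.compareAndSet(State.OFFLINE, State.RE_ALLOCATE);
    }

    // RS side: try to start opening the region.
    // Succeeds only if the node is still OFFLINE.
    boolean regionServerStartOpening() {
        return node.compareAndSet(State.OFFLINE, State.OPENING);
    }

    State current() {
        return node.get();
    }

    public static void main(String[] args) {
        TimeoutRaceSketch region = new TimeoutRaceSketch();
        boolean rsWon = region.regionServerStartOpening();
        boolean masterWon = region.masterOnTimeout();
        System.out.println("rsWon=" + rsWon + " masterWon=" + masterWon
                + " state=" + region.current());
    }
}
```

Because each side's update is conditional on the state it expects, exactly one 
of the two calls succeeds, and the loser learns from the return value that the 
other side acted first.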

bq.(I believe it acts a little differently from 0.90 because of recent work 
done in here).

Regarding the timeout monitor, the one major change is that the CLOSING state 
node is now created by the master itself, whereas it was created by the RS in 
0.90. Apart from this I have not found any big difference so far. As part of 
HBASE-4083 we introduced return types from OpenRegionHandler, which take care 
of scenarios where a race occurs because the master changes the state to 
RE_ALLOCATE by the time the RS has moved to OPENED.



> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
>                 Key: HBASE-4015
>                 URL: https://issues.apache.org/jira/browse/HBASE-4015
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.90.3
>            Reporter: Jean-Daniel Cryans
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.92.0
>
>         Attachments: HBASE-4015_1_trunk.patch, Timeoutmonitor with state 
> diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition 
> generator, mostly making things worse rather than better. It does its own 
> thing for a while without caring for what's happening in the rest of the 
> master.
> The first thing that needs to happen is that the regions should not be 
> processed in one big batch, because that sometimes can take minutes to 
> process (meanwhile a region that timed out opening might have opened, then 
> what happens is it will be reassigned by the TimeoutMonitor generating the 
> never ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure 
> how to do it in a scalable way in this case.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
