[ https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082242#comment-13082242 ]
ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------
@Stack,
{noformat}
Do we need new RS_ALLOCATE state? Could we just have OFFLINE plus your
suggestion of adding RS name so its OFFLINE+RS_TO_OPEN_REGION_ON?
{noformat}
Yes, this may also be possible. But we thought of introducing a new state so
that there is a clear distinction between whether or not reallocation has
happened. Handling a new state may also be cleaner than changing the
behaviour of the existing state.
{noformat}
What happens if we assign the region back to RS1 (it can happen).
{noformat}
Yes, we have considered this scenario as well. If the region is reallocated
to the same RS, there are two flows:
-> If the state is OPENING in zk but the region has not yet been added to the
online regions list in the RS, then any subsequent call from the MASTER to the
RS with the RE_ALLOCATE state will succeed, but the previous processing from
OPENING to OPEN will fail.
-> In the second case, if the region has already been added to the online
regions list, the RS will respond with ALREADY_OPENED. Before removing the
region from RIT in the master, we will check whether the node has been
deleted; if not, the region will not be removed from RIT. Hence the state will
remain PENDING_OPEN, and a subsequent timeout monitor call will handle it.
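To make the two flows easier to review, here is a rough, self-contained model
of them. It is only a sketch and not code from the patch: every class, method
and field name below is hypothetical, plain maps stand in for ZooKeeper, the
RS online regions list and the master's RIT, and it assumes the stale
OPENING -> OPEN transition fails because of a znode version check.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative model only; none of these names are HBase APIs.
final class ReallocationSketch {
    enum OpenResponse { ACCEPTED, ALREADY_OPENED }

    // Stand-in for an unassigned znode: a state plus a version counter.
    static final class ZNode { String state; int version; }

    final Map<String, ZNode> znodes = new HashMap<>();
    // Stand-in for the region server's online regions list.
    final Set<String> onlineRegions = new HashSet<>();
    // Stand-in for the master's regions-in-transition (RIT) map.
    final Map<String, String> regionsInTransition = new HashMap<>();

    // Unconditional write (e.g. the master forcing a state); returns the new version.
    int setData(String region, String state) {
        ZNode n = znodes.computeIfAbsent(region, k -> new ZNode());
        n.state = state;
        return ++n.version;
    }

    // Versioned write: fails if the znode is gone or has changed since expectedVersion.
    boolean transition(String region, int expectedVersion, String next) {
        ZNode n = znodes.get(region);
        if (n == null || n.version != expectedVersion) {
            return false;
        }
        n.state = next;
        n.version++;
        return true;
    }

    // Region server side: handle an open request from the master.
    OpenResponse openRegion(String region) {
        if (onlineRegions.contains(region)) {
            return OpenResponse.ALREADY_OPENED;          // second flow
        }
        ZNode n = znodes.get(region);
        if (n != null) {
            transition(region, n.version, "OPENING");    // first flow
        }
        return OpenResponse.ACCEPTED;
    }

    // Master side: drop the region from RIT only if its znode is already deleted.
    void onOpenResponse(String region, OpenResponse response) {
        if (response == OpenResponse.ALREADY_OPENED && !znodes.containsKey(region)) {
            regionsInTransition.remove(region);
        }
        // Otherwise the region stays PENDING_OPEN for the next TimeoutMonitor run.
    }

    public static void main(String[] args) {
        ReallocationSketch s = new ReallocationSketch();

        // First flow: an earlier open is still in OPENING when the master reallocates.
        int staleVersion = s.setData("r1", "OPENING");
        s.setData("r1", "RE_ALLOCATE");
        System.out.println("new request: " + s.openRegion("r1"));
        System.out.println("stale OPENING->OPEN succeeds: "
                + s.transition("r1", staleVersion, "OPEN"));

        // Second flow: the region is already online but its znode still exists.
        s.onlineRegions.add("r2");
        s.setData("r2", "OPENED");
        s.regionsInTransition.put("r2", "PENDING_OPEN");
        s.onOpenResponse("r2", s.openRegion("r2"));
        System.out.println("r2 still in RIT: " + s.regionsInTransition.containsKey("r2"));
    }
}
```

In this model the region leaves RIT only once the znode is gone, so a region
that reports ALREADY_OPENED while its znode still exists is left for the
timeout monitor to pick up again.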
Please provide your suggestions.
> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
> Key: HBASE-4015
> URL: https://issues.apache.org/jira/browse/HBASE-4015
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: Timeoutmonitor with state diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition
> generator, mostly making things worse rather than better. It does its own
> thing for a while without caring for what's happening in the rest of the
> master.
> The first thing that needs to happen is that the regions should not be
> processed in one big batch, because that can sometimes take minutes to
> process. Meanwhile, a region that timed out opening might have opened, and
> it will then be reassigned by the TimeoutMonitor, generating the
> never-ending PENDING_OPEN situation.
> Those operations should also be done more atomically, although I'm not sure
> how to do it in a scalable way in this case.