[
https://issues.apache.org/jira/browse/HBASE-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084060#comment-13084060
]
ramkrishna.s.vasudevan commented on HBASE-4015:
-----------------------------------------------
@Stack,
I looked into the possibility of reusing the OFFLINE state instead of adding a
new one. A few thoughts:
-> We need to change the behaviour in all the TimeoutMonitor cases so that it
preempts the node to OFFLINE with the RS name.
-> Before changing to OFFLINE, check the state in the RS. If it is still
OFFLINE/OPENING, change it to OFFLINE + server name address.
-> After changing it to OFFLINE, get the latest version and pass it from the
Master to the RS, which in turn passes it to the OpenRegionHandler.
-> This version will be needed when we transition from OFFLINE to OPENING, to
determine whether the current OFFLINE to OPENING transition is for the timeout
call or whether the previous OFFLINE to OPENING transition did not happen.
-> The server name is also necessary so that an RS that is no longer the owner
of the znode does not process the transition.
-> Even in the normal assign flow, we need to add the server name of the RS
that will process the unassigned node along with OFFLINE.
These are the high-level changes we need to make in the current patch if we
want to avoid the new state.
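
To make the version and server name checks above concrete, here is a rough
sketch. It is not HBase or ZooKeeper code: Znode, NodeData, preemptToOffline
and openRegion are hypothetical names, and an in-memory AtomicReference stands
in for the real znode and its version. It only shows how an RS that is no
longer the owner, or that holds a stale version, would fail the OFFLINE to
OPENING transition.

```java
import java.util.concurrent.atomic.AtomicReference;

public class OfflineOwnerSketch {
    enum State { OFFLINE, OPENING, OPENED }

    /** Immutable snapshot of the unassigned node: state, owning RS, version. */
    static final class NodeData {
        final State state;
        final String owner;
        final int version;

        NodeData(State state, String owner, int version) {
            this.state = state;
            this.owner = owner;
            this.version = version;
        }
    }

    /** Hypothetical in-memory stand-in for the znode; each write bumps the version. */
    static final class Znode {
        private final AtomicReference<NodeData> data;

        Znode(State state, String owner) {
            data = new AtomicReference<>(new NodeData(state, owner, 0));
        }

        NodeData read() {
            return data.get();
        }

        /** Writes only if the node is still exactly the snapshot the caller read. */
        boolean write(NodeData seen, State next, String nextOwner) {
            return data.compareAndSet(seen, new NodeData(next, nextOwner, seen.version + 1));
        }
    }

    /**
     * Master side (timeout path): if the node is still OFFLINE/OPENING, preempt
     * it to OFFLINE + newOwner. Returns the new version to hand to the RS, or -1.
     */
    static int preemptToOffline(Znode node, String newOwner) {
        NodeData seen = node.read();
        if (seen.state != State.OFFLINE && seen.state != State.OPENING) {
            return -1;
        }
        return node.write(seen, State.OFFLINE, newOwner) ? seen.version + 1 : -1;
    }

    /** RS side: move OFFLINE to OPENING only if this RS owns the node at that version. */
    static boolean openRegion(Znode node, String me, int versionFromMaster) {
        NodeData seen = node.read();
        boolean current = seen.state == State.OFFLINE
                && seen.owner.equals(me)
                && seen.version == versionFromMaster;
        return current && node.write(seen, State.OPENING, me);
    }

    public static void main(String[] args) {
        Znode node = new Znode(State.OFFLINE, "rs1");     // normal assign: OFFLINE + rs1
        int v = preemptToOffline(node, "rs2");            // timeout: node handed to rs2
        System.out.println(openRegion(node, "rs1", 0));   // false: rs1 is no longer the owner
        System.out.println(openRegion(node, "rs2", v));   // true: owner and version match
    }
}
```

In the real patch the version would presumably be the znode version that
ZooKeeper checks on setData; the counter here is only an assumption made to
keep the sketch self-contained.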
> Refactor the TimeoutMonitor to make it less racy
> ------------------------------------------------
>
> Key: HBASE-4015
> URL: https://issues.apache.org/jira/browse/HBASE-4015
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 0.90.3
> Reporter: Jean-Daniel Cryans
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: HBASE-4015_1_trunk.patch, Timeoutmonitor with state
> diagrams.pdf
>
>
> The current implementation of the TimeoutMonitor acts like a race condition
> generator, mostly making things worse rather than better. It does its own
> thing for a while without caring for what's happening in the rest of the
> master.
> The first thing that needs to happen is that the regions should not be
> processed in one big batch, because that sometimes can take minutes to
> process (meanwhile a region that timed out opening might have opened, and
> it will then be reassigned by the TimeoutMonitor, generating the
> never-ending PENDING_OPEN situation).
> Those operations should also be done more atomically, although I'm not sure
> how to do it in a scalable way in this case.