RE: HBASE-4060 - TimeOutMonitor refactoring

Ramkrishna S Vasudevan Thu, 04 Aug 2011 07:38:30 -0700

Hi 

I was able to identify why PENDING_OPEN happens and under what scenario.
The defect HBASE-3937 is the one where we can identify why there are race
conditions in TimeoutMonitor.



The following is the scenario.  Some people may already be aware of it, but
I would like to highlight a few things in it:

1) The master asks for Region A to be opened on RS1.  While RS1 was about to
open it, the NN went down, so instantiating Region A failed.  The state of
the znode is now OPENING.  (JD was interested to know why the open did not
happen in the defect HBASE-3937.)
Reaching this point took some time because RS1 was busy, so the master
decided that the region had spent too long in the PENDING_OPEN state and
added it to the assign list.
2) By the time the znode state is changed to OPENING, the master gets a
callback and the current state in the master is OPENING.  But we don't do
much about it.
3) The list of regions in the assign list populated in step 1 is now taken
up and Region A is allocated to RS2.  Before this, the state in master
memory is updated to PENDING_OPEN.
4) RS2 tries to open the region but is not able to, as Region A is already
hijacked by RS1.
5) The PENDING_OPEN timeout happens again and the master again tries to
assign Region A to RS1.
6) Here again a version mismatch occurs and the state continues to be
PENDING_OPEN.  The existing code handles version mismatches, but the
version is created by the RS and only the RS is aware of it.
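
To make the version check in steps 4 and 6 concrete, here is a minimal Java
sketch.  It is only an in-memory stand-in for a versioned znode, not HBase
or ZooKeeper code: the names VersionedNodeSketch, Snapshot, read and
transition are made up for illustration, and the real code path may differ.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory stand-in for a versioned znode (not the HBase or ZooKeeper API). */
final class VersionedNodeSketch {
    enum State { OFFLINE, OPENING, OPENED }

    /** Immutable (state, version) pair that is swapped atomically. */
    static final class Snapshot {
        final State state;
        final int version;
        Snapshot(State state, int version) {
            this.state = state;
            this.version = version;
        }
    }

    private final AtomicReference<Snapshot> current =
        new AtomicReference<>(new Snapshot(State.OFFLINE, 0));

    Snapshot read() {
        return current.get();
    }

    /** Succeeds only if the node is still in state 'from' at 'expectedVersion'. */
    boolean transition(State from, State to, int expectedVersion) {
        Snapshot seen = current.get();
        if (seen.state != from || seen.version != expectedVersion) {
            // Assumption: this is where a caller would log "Region already hijacked?"
            return false;
        }
        // Every successful update bumps the version, invalidating older readers.
        return current.compareAndSet(seen, new Snapshot(to, seen.version + 1));
    }

    public static void main(String[] args) {
        VersionedNodeSketch node = new VersionedNodeSketch();
        int seenByRs1 = node.read().version;
        int seenByRs2 = node.read().version;
        // RS1 moves the node first; RS2 still holds the old version.
        System.out.println("RS1: " + node.transition(State.OFFLINE, State.OPENING, seenByRs1));
        System.out.println("RS2: " + node.transition(State.OFFLINE, State.OPENING, seenByRs2));
    }
}
```

In this sketch the second caller is rejected because the version it read is
stale, which is the kind of mismatch that leaves the region stuck in
PENDING_OPEN in the scenario above.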

Points to be noted:
==================
-> Assignment is done in batch.
-> Though the master's in-memory state for Region A is updated to OPENING,
we are not able to make use of it because the assign list has already been
populated.
-> Other than that, we do handle other scenarios, such as a timeout
happening while the region is in the OPENING state.  In that case we try to
set the znode state to OFFLINE so that a fresh allocation can happen, and we
also check the current state before handling it.

Our soln 1: 
======== 
-> Do not add the region to the assign list.  Instead, invoke the future
task then and there when we detect that a timeout has happened for a new
assignment.
-> Add one more state, RE_ALLOCATE, set whenever the master detects that the
previous assignment has timed out.
-> Before changing to RE_ALLOCATE, check whether the state has been altered
by the RS; if not, change it to RE_ALLOCATE.
-> A similar change is to be done in the RS, so that before it changes the
node through OFFLINE->OPENING->OPENED it checks whether the state is
RE_ALLOCATE.  If so, the RS knows for sure that the master has taken control
of the node because the RS was too slow in processing the region assignment.
-> If the master finds, before changing the state to RE_ALLOCATE, that the
state has changed, it means the RS has done its job correctly, so the master
does not change the state to RE_ALLOCATE.
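
A rough Java sketch of the RE_ALLOCATE idea, modelled as a compare-and-set
on the node state.  This only illustrates the proposal under simplifying
assumptions: it ignores znode versions, and the names ReallocateSketch,
masterMarkReallocate and rsTransition are hypothetical, not existing HBase
methods.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the proposed RE_ALLOCATE state (not existing HBase code). */
final class ReallocateSketch {
    enum State { OFFLINE, OPENING, OPENED, RE_ALLOCATE }

    private final AtomicReference<State> node = new AtomicReference<>(State.OFFLINE);

    State current() {
        return node.get();
    }

    /**
     * Master side, on timeout: claim the node only if the RS has not altered
     * the state the master last saw.  Returns false if the RS got there first.
     */
    boolean masterMarkReallocate(State lastSeenByMaster) {
        return node.compareAndSet(lastSeenByMaster, State.RE_ALLOCATE);
    }

    /**
     * RS side: advance only if the node is still in the state the RS expects.
     * If the master has set RE_ALLOCATE, this fails and the RS should back off.
     */
    boolean rsTransition(State from, State to) {
        return node.compareAndSet(from, to);
    }

    public static void main(String[] args) {
        // Case 1: the RS is slow, so the master claims the node first.
        ReallocateSketch slowRs = new ReallocateSketch();
        System.out.println("master claims: " + slowRs.masterMarkReallocate(State.OFFLINE));
        System.out.println("RS proceeds:   " + slowRs.rsTransition(State.OFFLINE, State.OPENING));

        // Case 2: the RS is on time, so the master's claim is rejected.
        ReallocateSketch fastRs = new ReallocateSketch();
        System.out.println("RS proceeds:   " + fastRs.rsTransition(State.OFFLINE, State.OPENING));
        System.out.println("master claims: " + fastRs.masterMarkReallocate(State.OFFLINE));
    }
}
```

The point of the sketch is that whichever side updates the node first wins,
and the other side learns about it from the failed update instead of
carrying on with a stale view.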

This new state RE_ALLOCATE will help both the master and the RS to know
about the state of the assignment.
This is a first cut solution.
Reviews and suggestions are welcome.  If you find any problems in this
solution, please point them out.

If you have any other solution, please feel free to share it.


Regards
Ram

-----Original Message-----
From: Ted Yu [mailto:[email protected]] 
Sent: Thursday, August 04, 2011 7:49 AM
To: [email protected]
Subject: Re: HBASE-4060 - TimeOutMonitor refactoring

Bringing the following discussion to the public list.
HBASE-4015 is in the critical path of 0.92.

Cheers

On Wed, Aug 3, 2011 at 8:12 AM, Ramkrishna S Vasudevan <
[email protected]> wrote:

> Hi JD
>
> I was working on finalising a strategy to avoid the TimeoutMonitor race
> condition.  I have a few queries from when I tried reproducing the issue
> and while going through the code.
> The scenario mentioned in the defect is one where the region is left in
> PENDING_OPEN state when RS1, which at first was not opening the region,
> moved the state from OFFLINE to OPENING when RS2 started opening the same
> region.
>
>
> When I tried to reproduce it and went through the code, I found that when
> an RS makes the state changes OFFLINE->OPENING->OPENED, we always check
> the version of the znode before proceeding with the state update.
> So for the above mentioned scenario I get a log saying
> "Region already hijacked? "
>
> Please correct me if I am wrong.  Could you brief me more on the problem
> that causes this race condition?
>
> We are working on a strategy so that every RS is made aware of whether it
> should take up the assignment or not, by implementing some STATEs which
> are visible to both master and RS.
>
> Once I am clear on the real root cause I will upload our idea for
> overcoming the race condition.
>
> Thanks & Regards
> Ram
>
>
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Jean-Daniel Cryans
> Sent: Tuesday, August 02, 2011 3:52 AM
> To: [email protected]; [email protected]
> Cc: stack; Ted Yu
> Subject: Re: HBASE-4060 - TimeOutMonitor refactoring
>
> I've not started working on this yet, happy to review your ideas/code Ram.
>
> Thanks,
>
> J-D
>
> On Fri, Jul 29, 2011 at 7:54 AM, Ted Yu <[email protected]> wrote:
> > Copying J-D.
> >
> > On Fri, Jul 29, 2011 at 7:38 AM, Ramkrishna S Vasudevan
> > <[email protected]> wrote:
> >>
> >> Hi Ted/Stack,
> >>
> >>
> >>
> >> We analyzed and found similar issues are occurring even in our cluster
> >> a couple of times.
> >>
> >>
> >>
> >> So we are very much interested in taking it up though we have not yet
> >> analyzed/started the ground work on it.  I would also like to know if
> >> any one is currently working on it.  Particularly JD was very much keen
> >> on this issue.
> >>
> >>
> >>
> >> Even if you guys have a plan or solution for that I would like to take
> >> part in it or even ready to implement few things as part of it.
> >>
> >>
> >>
> >> I would like to know your comments and suggestions on this.
> >>
> >>
> >>
> >> Regards
> >>
> >> Ram
> >>
> >>
> >>
> >>
> >>
> >> P.S: Plz do reply to the id in CC also as i will be in travel over the
> >> weekend.
> >>
> >>
> >>
> >>
> >
>
>
