Hi,

I was able to identify why regions get stuck in PENDING_OPEN and under what scenario. HBASE-3937 is the defect that shows why there are race conditions in TimeoutMonitor.
The scenario is as follows. Some people may already be aware of it, but I would like to highlight a few things in it.

1) The master asks RS1 to open Region A. While RS1 was about to open it, the NN went down, so instantiating Region A failed. The state of the znode is now OPENING. (JD wanted to know in HBASE-3937 why the open did not happen.) Getting this far took some time because RS1 was busy, so the master concluded that the region had spent too long in PENDING_OPEN and added it to the assign list.
2) When the znode state changes to OPENING, the master gets a callback and the current state in the master becomes OPENING. But we don't do much about it.
3) The assign list populated in step 1 is now processed and Region A is allocated to RS2. Before this, the state in master memory is updated to PENDING_OPEN.
4) RS2 tries to open the region but cannot, because Region A has already been hijacked by RS1.
5) The PENDING_OPEN timeout happens again, and the master again tries to assign Region A to RS1.
6) Here again a version mismatch occurs and the state stays PENDING_OPEN. The existing code handles version mismatches, but the version is created by the RS and only the RS is aware of it.

Points to be noted:
==================
-> Assignment is done in batches.
-> Although the master's in-memory state for Region A is updated to OPENING, we cannot make use of it because the assign list has already been populated.
-> Apart from that, other scenarios are already handled, such as a timeout while the region is in OPENING state. In that case we try to set the znode state to OFFLINE so that a fresh allocation can happen. We also check the current state before handling it.

Our soln 1:
========
-> Do not add the region to the assign list. Instead, invoke a future task for the new assignment as soon as we detect that the timeout has happened.
-> Add one more state, RE_ALLOCATE, set whenever the master detects that the previous assignment has timed out.
-> Before changing the state to RE_ALLOCATE, the master checks whether the RS has altered the state; only if it has not does the master change it to RE_ALLOCATE.
-> A similar change is needed in the RS: before it moves the node through OFFLINE->OPENING->OPENED, it checks whether the state is RE_ALLOCATE. If so, the RS knows for sure that the master has taken control of the node because the RS was too slow in processing the region assignment.
-> If the master finds, before changing the state to RE_ALLOCATE, that the state has already changed, it means the RS has done its job correctly, so the master does not change the state to RE_ALLOCATE.

This new state RE_ALLOCATE will let both the master and the RS know about the state of the assignment.

This is a first-cut solution. Reviews and suggestions are welcome. If you find any problems in this solution, please point them out. If you have any other solution, please feel free to share it.

Regards
Ram

-----Original Message-----
From: Ted Yu [mailto:[email protected]]
Sent: Thursday, August 04, 2011 7:49 AM
To: [email protected]
Subject: Re: HBASE-4060 - TimeOutMonitor refactoring

Bring the following discussion to public.
HBASE-4015 is in the critical path of 0.92

Cheers

On Wed, Aug 3, 2011 at 8:12 AM, Ramkrishna S Vasudevan <[email protected]> wrote:

> Hi JD
>
> I was working on finalising a strategy to avoid Timeoutmonitor race
> condition. I have few queries when i tried reproducing the issue and while
> going through the code.
> The scenario that is mentioned in the defect where the region is left in
> PENDING_OPEN state when RS1 who was first not opening the region, moved the
> state from OFFLINE to OPENING when the RS2 started opening the same region.
>
> When i tried to reproduce and went thro the code if the RS that tries to
> make the state changes from OFFLINE->OPENING->OPENED we always check for the
> version of the znode before proceeding with the state updation.
> So for the above mentioned scenario I get a log saying
> "Region already hijacked? "
>
> Pls correct me if am wrong? Could you brief me more on the problem that
> causes this race condition.
>
> We are working on a strategy so that every RS is made aware whether it
> should take up the assignment or not by implementing some STATEs which is
> visible to both master and RS.
>
> Once am clear with the real root cause i will upload our idea of overcoming
> the race condition.
>
> Thanks & Regards
> Ram
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> Sent: Tuesday, August 02, 2011 3:52 AM
> To: [email protected]; [email protected]
> Cc: stack; Ted Yu
> Subject: Re: HBASE-4060 - TimeOutMonitor refactoring
>
> I've not started working on this yet, happy to review your ideas/code Ram.
>
> Thanks,
>
> J-D
>
> On Fri, Jul 29, 2011 at 7:54 AM, Ted Yu <[email protected]> wrote:
> > Copying J-D.
> >
> > On Fri, Jul 29, 2011 at 7:38 AM, Ramkrishna S Vasudevan
> > <[email protected]> wrote:
> >>
> >> Hi Ted/Stack,
> >>
> >> We analyzed and found similar issues are occurring even in our cluster a
> >> couple of times.
> >>
> >> So we are very much interested in taking it up though we have not yet
> >> analyzed/started the ground work on it. I would also like to know if any
> >> one is currently working on it. Particularly JD was very much keen on this
> >> issue.
> >>
> >> Even if you guys have a plan or solution for that I would like to take
> >> part in it or even ready to implement few things as part of it.
> >>
> >> I would like to know your comments and suggestions on this.
> >>
> >> Regards
> >>
> >> Ram
> >>
> >> P.S: Plz do reply to the id in CC also as i will be in travel over the
> >> weekend.
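
P.S. To make the RE_ALLOCATE proposal at the top of this mail easier to review, here is a rough, self-contained sketch of the handshake. It is only an illustration under assumptions: the Node class is an in-memory stand-in for a znode with a version that is bumped on every write, and all names (ReAllocateSketch, Node, masterReclaim, rsTransition) are made up for this mail and are not HBase or ZooKeeper APIs.

```java
/** Hypothetical sketch of the proposed RE_ALLOCATE handshake; not HBase code. */
class ReAllocateSketch {
    enum State { OFFLINE, OPENING, OPENED, RE_ALLOCATE }

    /** In-memory stand-in for a znode: a state plus a version bumped on every write. */
    static final class Node {
        private State state = State.OFFLINE;
        private int version = 0;

        synchronized State state() { return state; }
        synchronized int version() { return version; }

        /** Versioned write: succeeds only if nobody has written since expectedVersion. */
        synchronized boolean compareAndSet(int expectedVersion, State next) {
            if (version != expectedVersion) {
                return false;
            }
            state = next;
            version++;
            return true;
        }
    }

    /**
     * Master side, on timeout: move the node to RE_ALLOCATE only if the RS has
     * not altered it since the assignment was handed out.
     */
    static boolean masterReclaim(Node node, int versionAtAssign) {
        return node.compareAndSet(versionAtAssign, State.RE_ALLOCATE);
    }

    /**
     * RS side, before each OFFLINE->OPENING->OPENED step: back off if the
     * master has already reclaimed the node or the state is not the expected one.
     */
    static boolean rsTransition(Node node, State expected, State next) {
        synchronized (node) {
            State current = node.state();
            if (current == State.RE_ALLOCATE || current != expected) {
                return false;
            }
            return node.compareAndSet(node.version(), next);
        }
    }

    public static void main(String[] args) {
        // Case 1: the RS is slow, so the master times out and reclaims first.
        Node a = new Node();
        int va = a.version();
        System.out.println(masterReclaim(a, va));                          // true
        System.out.println(rsTransition(a, State.OFFLINE, State.OPENING)); // false

        // Case 2: the RS moves the node first, so the master must not reclaim.
        Node b = new Node();
        int vb = b.version();
        System.out.println(rsTransition(b, State.OFFLINE, State.OPENING)); // true
        System.out.println(masterReclaim(b, vb));                          // false
    }
}
```

In this sketch the master's check is just the version comparison, so whichever side writes first wins and the other side sees it immediately. Whether the real change should rely on the znode version or on reading the state explicitly is open for discussion.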
