Re: Review Request: Cleanup of RIT timeouts and server shutdown handling

Jonathan Gray Mon, 01 Nov 2010 19:02:35 -0700


> On 2010-11-01 14:48:51, stack wrote:
> > Commit after reviewing the below.  This patch makes a HUGE difference.  My 
> > trashing rolling restart of a 10 node cluster of 3k regions now works 
> > (mostly).  Seems like an errant region the odd time I test but thats for 
> > hbck to figure now....  Great work JGray.


Okay, will fix up minor stuff below and commit.  Thanks for review and all the 
testing!


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java, line 
> > 501
> > <http://review.cloudera.org/r/1143/diff/5/?file=16357#file16357line501>
> >
> >     How did this ever work?  Why didn't typing catch this?

This code never did work right.  This is all new master code.  Also, the HSI 
returning method I think in MetaReader is new from after I did fixes related to 
TestMasterFailover.

It would only be an issue if an RS died and came back up, and the new instance 
of it was assigned regions before the shutdown handling of the previous 
instance even started.

That's the killer part of the test you've been doing and this is definitely a 
key part of this fix.


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 111
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line111>
> >
> >     We doing this or not?  Seems like we are.  Should we change this back 
> > to plain HashMap... else when concurrent map it gives impression that its 
> > ok to meddle in regionPlan w/o first synchronizing?

we mostly do but not always.  let's open another jira for this because there's 
this AssignmentManager.updateTimers() call that "tickles" stuff that maybe 
should not hold a lock?

Opened HBASE-3188 (did not target to 0.90 but we can certainly do now)


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 644
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line644>
> >
> >     want to doc this new param? (Not important -- its self-describing)

adding javadoc


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 841
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line841>
> >
> >     We don't have to do the remove anymore?  Passing force new plan does 
> > this for us?

Yeah.  Javadoc states that if true and plan exists, will create a new one.


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 935
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line935>
> >
> >     Should this test go into a method of its own?  Is it repeated elsewhere?

No, that's only place.  If method, would take three args so probably not much 
simpler.


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 998
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line998>
> >
> >     What happens if state.isClosing now? (You dropped it from else if here)

We don't force out of isClosing state this way anymore.  In new timeout 
handling, we will only send another close rpc if the RS never transitions to 
CLOSING.  Once the RS transitions to CLOSING (state.isClosing()) it does one of 
three things: it finishes (goes to CLOSED), dies (RS expire), or fails (deletes 
node).  We handle those accordingly.


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 1012
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1012>
> >
> >     Are all this.regions synchronized?  Same for this.regionPlans?  You 
> > might want to give it all a review before commit?

regions looked fine when i did a review of it and RIT.  regionPlans needs 
review.  HBASE-3188


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 1027
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1027>
> >
> >     I think spell it out rather than NSRE in the message.

done


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 1047
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1047>
> >
> >     Is this ok?  Calling assign inside a synchronize on RIT?  Is that done 
> > everywhere?  Or, this is the assign that takes a RegionState.. and expects 
> > the synchronize to be in place?

It seems okay except looking at it more, we end up doing a bit of work under 
RIT lock that we don't need to.  And all locking needed is done within 
assign().  Walked through it three times, we should move assign outside RIT 
lock.

Will do that.  Need to keep RIT lock around forceRegionStateToOffline() though


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 1549
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1549>
> >
> >     Nothing to do here?  Job for hbck?

It's hard to imagine how it would happen, so must be hbck I guess.  hbck repair 
would wipe all ZK state, ensure region is closed everywhere, then do assign w/ 
set zk offline=true


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, 
> > line 1569
> > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1569>
> >
> >     What happens here?  I suppose we used shutdown server?

Changing to "Node did not exist, can't time this out".  I guess this is hbck.

I'm trying not to capture every odd-ball case that "should be" impossible (or 
when possible, they should be ignored).

So we could see these things happening, when master gets really overloaded and 
backed up, but when this is the case, we should ignore these states as events 
will be triggered down the road to handle things accordingly.


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java, line 
> > 584
> > <http://review.cloudera.org/r/1143/diff/5/?file=16360#file16360line584>
> >
> >     Thats an awkward sentence for a debug message...

rewritten as english.


> On 2010-11-01 14:48:51, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java,
> >  line 66
> > <http://review.cloudera.org/r/1143/diff/5/?file=16361#file16361line66>
> >
> >     We were not using this?  Probably for the best.  Probably stale by time 
> > we went to act on it?

it's implied now.  this is historical from first implementation of the handler 
stuff.  we used to have one handler for multiple states, but now CLOSED has 
it's own handler so we don't need the state anymore.


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1143/#review1755
-----------------------------------------------------------


On 2010-11-01 11:47:35, Jonathan Gray wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> http://review.cloudera.org/r/1143/
> -----------------------------------------------------------
> 
> (Updated 2010-11-01 11:47:35)
> 
> 
> Review request for hbase and stack.
> 
> 
> Summary
> -------
> 
> Does cleanup of RIT timeouts according to document in progress.  Still 
> finishing document but I'd like to get this patch tested before finalizing it.
> 
> Also found some strange stuff in server shutdown handling that could have 
> easily led to some double assignment issues that stack was seeing.
> 
> 
> This addresses bug HBASE-3181.
>     http://issues.apache.org/jira/browse/HBASE-3181
> 
> 
> Diffs
> -----
> 
>   trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1029789 
>   trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 
> 1029789 
>   trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1029789 
>   trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 
> 1029789 
>   
> trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java
>  1029789 
>   
> trunk/src/main/java/org/apache/hadoop/hbase/master/handler/EnableTableHandler.java
>  1029789 
>   
> trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java
>  1029789 
>   
> trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
>  1029789 
>   trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1029789 
>   trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 
> 1029789 
>   trunk/src/test/java/org/apache/hadoop/hbase/master/TestRollingRestart.java 
> 1029789 
> 
> Diff: http://review.cloudera.org/r/1143/diff
> 
> 
> Testing
> -------
> 
> Working on tests now.  This definitely changes some behavior that is tested 
> in the new TestMasterFailover so need to figure if the test should change or 
> whether we need to handle things like CLOSING.  Maybe let it timeout a few 
> times?
> 
> 
> Thanks,
> 
> Jonathan
> 
>

Re: Review Request: Cleanup of RIT timeouts and server shutdown handling

Reply via email to