[ 
https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207340#comment-13207340
 ] 

stack commented on HBASE-5200:
------------------------------

Attached unit test stands up an AssignmentManager and then manufactures the 
condition that Ram describes.  The test gets stuck and times out after five 
seconds because the znode is not cleared on master failover (as per Ram's 
description).

Ram, your patch no longer seems to apply to TRUNK.

Why do you make a hash w/ a preset size of 1?

{code}
+  private Set<String> regionsProcessed = new HashSet<String>(1);
{code}
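
For reference, a minimal sketch (the class and method names are made up for illustration, not from the patch) showing that the int passed to the HashSet constructor is only an initial capacity.  It is a sizing hint, not a limit, so the set still grows as entries are added:

```java
import java.util.HashSet;
import java.util.Set;

class InitialCapacityDemo {
    /** Adds n distinct names to a set created with initial capacity 1. */
    static Set<String> fill(int n) {
        // The argument is an initial capacity (a sizing hint), not a maximum
        // size; the backing table is resized as more entries arrive.
        Set<String> regionsProcessed = new HashSet<String>(1);
        for (int i = 0; i < n; i++) {
            regionsProcessed.add("region-" + i);
        }
        return regionsProcessed;
    }

    public static void main(String[] args) {
        System.out.println(fill(100).size());
    }
}
```

So a preset size of 1 is harmless for correctness; it only matters if we expect more than a handful of entries, in which case the default constructor is likely the simpler choice.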

Is this the right name for this hash?  Should it be 
regionsProcessedJoiningCluster or some such?

The regionsProcessed hash holds Strings.  I see in 
handleRegionWhileFailOverInProgress that we always get the regioninfo from 
meta.  Isn't it possible that in processRegionInTransition we may have done 
this already?  That it may be non-null?  If so, shouldn't we keep it around so 
we don't have to go to the .META. every time but only for those cases where 
regioninfo is indeed null?  Would that mean changing regionsProcessed to be a 
Map of String to HRI?
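
A rough sketch of what I mean, under loud assumptions: RegionInfoStub stands in for HRegionInfo, metaLookup stands in for whatever call reads .META., and RegionInfoCache and its method names are invented here for illustration.  None of this is the actual AssignmentManager code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for HRegionInfo; carries only the encoded name. */
final class RegionInfoStub {
    final String encodedName;
    RegionInfoStub(String encodedName) { this.encodedName = encodedName; }
}

class RegionInfoCache {
    // Encoded region name -> region info we have already seen.
    private final Map<String, RegionInfoStub> regionsProcessed = new HashMap<>();
    // Assumed hook for the (expensive) .META. read.
    private final Function<String, RegionInfoStub> metaLookup;
    // Counts how often we actually went to meta; here only for the demo.
    int metaLookups = 0;

    RegionInfoCache(Function<String, RegionInfoStub> metaLookup) {
        this.metaLookup = metaLookup;
    }

    /** Record a region info obtained elsewhere, e.g. while processing RIT. */
    synchronized void remember(String encodedName, RegionInfoStub hri) {
        if (hri != null) {
            regionsProcessed.put(encodedName, hri);
        }
    }

    /** Single place that goes to meta, and only when nothing is cached. */
    synchronized RegionInfoStub getRegionInfo(String encodedName) {
        RegionInfoStub hri = regionsProcessed.get(encodedName);
        if (hri == null) {
            metaLookups++;
            hri = metaLookup.apply(encodedName);
            if (hri != null) {
                regionsProcessed.put(encodedName, hri);
            }
        }
        return hri;
    }
}
```

With something of this shape, both call sites would share one lookup method and the meta read happens at most once per region.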

Isn't getHRegionInfo repeating code from earlier up in 
processRegionInTransition?

If so, change it so that there is only one place where we go to meta... have 
both places call your new getRegionInfo method.

Why do this:

{code}
+      hri = p.getFirst();
+      return hri;
{code}

Why not just do return p.getFirst();?

Is everything shifted right because of this test?

{code}
+      if (regionState == null
+          && !regionsProcessed.contains(encodedRegionName)) {

{code}

If so, shouldn't we just take the opposite of the above and return immediately 
if regionState is non-null or the region is already in regionsProcessed, as in:

{code}
if (regionState != null || regionsProcessed.contains(encodedRegionName)) 
return;
{code}

This would make your change less substantial.
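
To spell out the early-return shape: by De Morgan's law the exact negation of the patch's condition joins the two tests with ||.  Below is a small self-contained sketch (GuardClauseDemo and its method names are made up; the boolean result just stands in for "the failover handling ran") that puts the nested form and the guard-clause form side by side:

```java
import java.util.HashSet;
import java.util.Set;

class GuardClauseDemo {
    /** Nested form, as in the patch: the work sits inside the if. */
    static boolean nested(Object regionState, Set<String> regionsProcessed,
                          String encodedRegionName) {
        boolean handled = false;
        if (regionState == null
                && !regionsProcessed.contains(encodedRegionName)) {
            handled = true; // placeholder for the failover handling
        }
        return handled;
    }

    /** Guard-clause form: bail out first, so the body is not indented. */
    static boolean earlyReturn(Object regionState, Set<String> regionsProcessed,
                               String encodedRegionName) {
        // Negation of (a == null && !b) is (a != null || b).
        if (regionState != null
                || regionsProcessed.contains(encodedRegionName)) {
            return false;
        }
        return true; // placeholder for the failover handling
    }

    public static void main(String[] args) {
        Set<String> processed = new HashSet<String>();
        processed.add("seen");
        System.out.println(nested(null, processed, "new")
            == earlyReturn(null, processed, "new"));
    }
}
```

The two forms agree for all four combinations of null/non-null state and processed/unprocessed region, and the second one leaves the rest of the method at its original indentation.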

It seems wrong that we are putting stuff into RIT in two places: in 
processRegionsInTransition and in handleRegion if we happen to be fielding a 
callback before failover has had a chance to run.

Would the fb trick of NOT processing callbacks during master failover help 
here?  At least for the scope of the AM.joinCluster?
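
One way to read "not processing callbacks during failover" is to park them and replay them once joinCluster is done.  The sketch below is only that reading, not a description of the fb patch; DeferredCallbacks, onCallback and failoverFinished are hypothetical names, and events are plain Strings to keep it self-contained:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class DeferredCallbacks {
    // Callbacks that arrived while failover processing was still running.
    private final Queue<String> pending = new ArrayDeque<>();
    // Events that have actually been handled, in order; for the demo only.
    private final List<String> handled = new ArrayList<>();
    private boolean failoverInProgress = true;

    /** Called for each incoming callback (e.g. a znode change). */
    synchronized void onCallback(String event) {
        if (failoverInProgress) {
            pending.add(event);   // park it; RIT is not fully built yet
        } else {
            handled.add(event);   // normal path
        }
    }

    /** Called once failover processing (joinCluster) has completed. */
    synchronized void failoverFinished() {
        failoverInProgress = false;
        String event;
        while ((event = pending.poll()) != null) {
            handled.add(event);   // replay in arrival order
        }
    }

    synchronized List<String> handled() {
        return new ArrayList<>(handled);
    }
}
```

If something like this held for the scope of joinCluster, only the failover path would populate RIT, and the callback path would always see a fully built RIT.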

Is handleRegionWhileFailOverInProgress a good name for this method?  Should it 
be checkFailover or some such?

The test I attached only checks the CLOSING state.  Should we extend it to 
cover the other states (OPENING, etc.)?

I can help with this.

Also, how did you figure out this bug?  It must have taken a bunch of head 
banging to figure out that this was indeed what was going on.  Good stuff Ram.

> AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the 
> region assignment inconsistent
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5200
>                 URL: https://issues.apache.org/jira/browse/HBASE-5200
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.5
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.94.0, 0.90.7, 0.92.1
>
>         Attachments: 5200-test.txt, 5200-v2.txt, HBASE-5200.patch, 
> HBASE-5200_1.patch, 
> TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml, 
> hbase-5200_90_latest.patch
>
>
> This is the scenario:
> Consider a case where the balancer is running and is trying to close regions 
> on a RS.
> Before the close completes, a master switch happens.
> On master switch, the set of nodes that are in RIT is collected; we first 
> get the data and start watching each node.
> After that, the node data is added into RIT.
> If, in the meantime (before the node is added to RIT), the RS on which close 
> was called does a transition, AM.handleRegion() misses the handling, saying 
> the RIT state was null.
> {code}
> 2012-01-13 10:50:46,358 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> a66d281d231dfcaea97c270698b26b6f from server 
> HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> c12e53bfd48ddc5eec507d66821c4d23 from server 
> HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> 59ae13de8c1eb325a0dd51f4902d2052 from server 
> HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> f45bc9614d7575f35244849af85aa078 from server 
> HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> cc3ecd7054fe6cd4a1159ed92fd62641 from server 
> HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> 3af40478a17fee96b4a192b22c90d5a2 from server 
> HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> e6096a8466e730463e10d3d61f809b92 from server 
> HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> 4806781a1a23066f7baed22b4d237e24 from server 
> HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
> d69e104131accaefe21dcc01fddc7629 from server 
> HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
> not in expected PENDING_CLOSE or CLOSING states
> {code}
> In branch the CLOSING node is created by RS thus leading to more 
> inconsistency.
