[ https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207785#comment-13207785 ]
ramkrishna.s.vasudevan commented on HBASE-5200:
-----------------------------------------------

@Stack
First of all, thanks for the test case.
{code}
Would the fb trick of NOT processing callbacks during master failover help here? At least for the scope of the AM.joinCluster?
{code}
I have not gone through this part yet, as I did not find time.
{code}
Isn't possible that in processRegionInTransition we may have done this already?
{code}
The check runs either in handleRegion or in processRegionInTransition, never in both. It will be done in only one place.
{code}
It seems wrong that we are putting stuff into RIT in two places; in processRegionsInTransition and in handleRegion if we happen to be fielding a call back before failover has had a chance to run.
{code}
Though the code populates RIT in two places, only one of processRegionInTransition or handleRegion will execute for a given region, so the RIT population is needed in each place to let the current flow proceed.
{code}
applied to TRUNK. However, TestAssignmentManager#testBalanceOnMasterFailover fails with or without the patch.
{code}
The test case had a few problems:
-> The region was not transitioned after the CLOSED transition got a callback for assigning it, so there was no RS to process the assign.
-> The gate variable was not getting reset.
-> We get a callback only after we call ZKAssign.getDataAndWatch, but in the test case we were getting a callback just after am.joinCluster.
So I have made some modifications.
Once again, thanks for the test case, which helped to verify the scenarios. Please provide your suggestions.
I need some time for the FB approach if we have to check it and implement it here.
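To illustrate what "done in only one place" means for the RIT population, here is a minimal sketch. It is a toy model and not the code in the attached patches: the class RitRegistry, its State enum and the method registerIfAbsent are hypothetical names, and the use of putIfAbsent is an assumption about one way to get the exclusivity described above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model, not HBase code: every name in this class is hypothetical.
class RitRegistry {
    enum State { PENDING_CLOSE, CLOSING, CLOSED }

    // Stand-in for the master's regions-in-transition (RIT) map.
    private final ConcurrentMap<String, State> regionsInTransition =
        new ConcurrentHashMap<>();

    // Both the failover path and the callback path may call this.
    // putIfAbsent is atomic, so only the first caller registers the region
    // and gets true; the other caller gets false and leaves the entry alone.
    boolean registerIfAbsent(String encodedRegionName, State state) {
        return regionsInTransition.putIfAbsent(encodedRegionName, state) == null;
    }

    // Current RIT state for the region, or null if it was never registered.
    State stateOf(String encodedRegionName) {
        return regionsInTransition.get(encodedRegionName);
    }
}
```

In this model, whichever path reaches the region first populates RIT and the second path sees that the work is already done, so the entry is never written twice.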
> AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5200
>                 URL: https://issues.apache.org/jira/browse/HBASE-5200
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.5
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.94.0, 0.90.7, 0.92.1
>
>         Attachments: 5200-test.txt, 5200-v2.txt, HBASE-5200.patch, HBASE-5200_1.patch, HBASE-5200_trunk_latest_with_test_2.patch, TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml, hbase-5200_90_latest.patch
>
>
> This is the scenario
> Consider a case where the balancer is going on thus trying to close regions in a RS.
> Before we could close a master switch happens.
> On Master switch the set of nodes that are in RIT is collected and we first get Data and start watching the node
> After that the node data is added into RIT.
> Now by this time (before adding to RIT) if the RS to which close was called does a transition in AM.handleRegion() we miss the handling saying RIT state was null.
> {code}
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region a66d281d231dfcaea97c270698b26b6f from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region c12e53bfd48ddc5eec507d66821c4d23 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 59ae13de8c1eb325a0dd51f4902d2052 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region f45bc9614d7575f35244849af85aa078 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region cc3ecd7054fe6cd4a1159ed92fd62641 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 3af40478a17fee96b4a192b22c90d5a2 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region e6096a8466e730463e10d3d61f809b92 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4806781a1a23066f7baed22b4d237e24 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region d69e104131accaefe21dcc01fddc7629 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states
> {code}
> In branch the CLOSING node is created by RS thus leading to more inconsistency.
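The ordering problem in the scenario from the issue description can be modeled with a few lines of Java. This is only a sketch under stated assumptions: FailoverRaceModel, addToRit and onClosed are hypothetical names, the states are plain strings, and the real AssignmentManager and ZooKeeper watches are not involved. It shows how a CLOSED callback that arrives after the watch is set, but before the node data is added to RIT, finds a null state.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the ordering problem; hypothetical names, not HBase code.
class FailoverRaceModel {
    // Stand-in for the master's regions-in-transition (RIT) map.
    private final Map<String, String> regionsInTransition = new HashMap<>();

    // Failover path, second step: record the node data in RIT.
    // The first step (get data and set the watch) is assumed to have happened.
    void addToRit(String region, String state) {
        regionsInTransition.put(region, state);
    }

    // Callback path: returns a warning message when the region is not in
    // PENDING_CLOSE or CLOSING, and null when the CLOSED event was handled.
    String onClosed(String region, String server) {
        String state = regionsInTransition.get(region);
        boolean expected = "PENDING_CLOSE".equals(state) || "CLOSING".equals(state);
        if (!expected) {
            return "Received CLOSED for region " + region + " from server " + server
                + " but region was in the state " + state
                + " and not in expected PENDING_CLOSE or CLOSING states";
        }
        regionsInTransition.put(region, "CLOSED");
        return null;
    }
}
```

Calling onClosed before addToRit reproduces the shape of the warning in the log above, while calling it after addToRit with a CLOSING state handles the event normally.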