[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185369#comment-13185369 ]
Jean-Daniel Cryans commented on HBASE-5120: ------------------------------------------- Testing the patch with a low timeout, I can answer the question in the code that asks "We don't abort if the delete node returns false. Is there any such corner case?" and yes here it is: {noformat} 2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326415997208, server=null 2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. 2012-01-13 00:53:39,053 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining) 2012-01-13 00:53:39,254 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 2012-01-13 00:53:39,255 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326416019254, server=null the timestamp. 2012-01-13 00:53:39,256 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r12s38,62023,1326415651391 returned org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:CLOSE for the region:TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. ,which we are already trying to CLOSE. 
for 0784c045e00205949461cb21b8f4cd6a 2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. state=PENDING_CLOSE, ts=1326416019256, server=null 2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. 2012-01-13 00:54:09,051 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining) 2012-01-13 00:54:09,126 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 2012-01-13 00:54:09,127 INFO org.apache.hadoop.hbase.master.AssignmentManager: While trying to recover the table TestTable to DISABLED state the region {NAME => 'TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.', STARTKEY => '0006605550', ENDKEY => '0006616035', ENCODED => 0784c045e00205949461cb21b8f4cd6a,} was offlined but the table was in DISABLING state 2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for 0784c045e00205949461cb21b8f4cd6a that is in expected state M_ZK_REGION_CLOSING 2012-01-13 00:54:09,128 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Attempting to delete unassigned node 0784c045e00205949461cb21b8f4cd6a in M_ZK_REGION_CLOSING state but node is in RS_ZK_REGION_CLOSED state 2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for 0784c045e00205949461cb21b8f4cd6a that is in expected state 
RS_ZK_REGION_CLOSED 2012-01-13 00:54:09,140 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r12s38,62023,1326415651391, region=0784c045e00205949461cb21b8f4cd6a 2012-01-13 00:54:09,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 0784c045e00205949461cb21b8f4cd6a from server sv4r12s38,62023,1326415651391 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 00:54:09,148 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region 0784c045e00205949461cb21b8f4cd6a in expected state RS_ZK_REGION_CLOSED 2012-01-13 00:54:09,148 ERROR org.apache.hadoop.hbase.master.AssignmentManager: The deletion of the CLOSED node for the region 0784c045e00205949461cb21b8f4cd6a returned true 2012-01-13 00:54:09,148 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r12s38,62023,1326415651391 returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. but we are not serving it for 0784c045e00205949461cb21b8f4cd6a {noformat} It turned out OK even though I had 7 regions that did that. I also got the "CLOSING/CLOSED node for the region x already deleted" message; here's the context: {noformat} 2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326415995575, server=null 2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. 
2012-01-13 00:53:39,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining) 2012-01-13 00:53:39,152 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 2012-01-13 00:53:39,169 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: update TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326416019152, server=null the timestamp. 2012-01-13 00:53:39,172 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r27s44,62023,1326415651133 returned org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:CLOSE for the region:TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. ,which we are already trying to CLOSE. for f6c862f9c16b15b9227c2ed46865fb48 2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. state=PENDING_CLOSE, ts=1326416019172, server=null 2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. 2012-01-13 00:54:09,070 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining) 2012-01-13 00:54:09,076 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempting to unassign region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. 
which is already PENDING_CLOSE but forcing to send a CLOSE RPC again 2012-01-13 00:54:09,091 INFO org.apache.hadoop.hbase.master.AssignmentManager: While trying to recover the table TestTable to DISABLED state the region {NAME => 'TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.', STARTKEY => '0004613400', ENDKEY => '0004623885', ENCODED => f6c862f9c16b15b9227c2ed46865fb48,} was offlined but the table was in DISABLING state 2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r27s44,62023,1326415651133, region=f6c862f9c16b15b9227c2ed46865fb48 2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f6c862f9c16b15b9227c2ed46865fb48 2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. 
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for f6c862f9c16b15b9227c2ed46865fb48 that is in expected state RS_ZK_REGION_CLOSED 2012-01-13 00:54:09,115 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region f6c862f9c16b15b9227c2ed46865fb48 in expected state RS_ZK_REGION_CLOSED 2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db06d62 Deleting existing unassigned node for f6c862f9c16b15b9227c2ed46865fb48 that is in expected state M_ZK_REGION_CLOSING 2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: CLOSING/CLOSED node for the region f6c862f9c16b15b9227c2ed46865fb48 already deleted 2012-01-13 00:54:09,128 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server sv4r27s44,62023,1326415651133 returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. but we are not serving it for f6c862f9c16b15b9227c2ed46865fb48 {noformat} Again it turned out fine. I'm +1 after fixing the comment about the corner case. 
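The two corner cases in the logs above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the patch itself: a plain dict stands in for ZooKeeper, and the function names (`delete_node_if_state`, `cleanup_after_forced_unassign`) are hypothetical and not HBase or ZKAssign APIs. Only the two state names are taken from the logs.

```python
# Toy model only: a dict stands in for ZooKeeper; none of these names are HBase APIs.
CLOSING = "M_ZK_REGION_CLOSING"
CLOSED = "RS_ZK_REGION_CLOSED"


def delete_node_if_state(znodes, region, expected_state):
    """Delete the region's node only if it is in expected_state.

    Returns True on delete, False on a state mismatch, and raises KeyError
    when the node no longer exists.
    """
    actual = znodes[region]  # KeyError models a missing znode
    if actual != expected_state:
        return False
    del znodes[region]
    return True


def cleanup_after_forced_unassign(znodes, region):
    """Try to delete a CLOSING node, then fall back to CLOSED, tolerating a missing node."""
    try:
        if delete_node_if_state(znodes, region, CLOSING):
            return "deleted CLOSING node"
        # First log: the delete expecting CLOSING returned false because the
        # region server had already moved the node to CLOSED.
        if delete_node_if_state(znodes, region, CLOSED):
            return "deleted CLOSED node"
        return "node in unexpected state"
    except KeyError:
        # Second log: ClosedRegionHandler deleted the node first.
        return "node already deleted"
```

In this model the first log corresponds to calling `cleanup_after_forced_unassign` on a node that is already in `RS_ZK_REGION_CLOSED`, and the second to calling it after the node is gone. Neither case needs an abort, which matches what the logs show.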
> Timeout monitor races with table disable handler > ------------------------------------------------ > > Key: HBASE-5120 > URL: https://issues.apache.org/jira/browse/HBASE-5120 > Project: HBase > Issue Type: Bug > Affects Versions: 0.92.0 > Reporter: Zhihong Yu > Assignee: ramkrishna.s.vasudevan > Priority: Blocker > Fix For: 0.94.0, 0.92.1 > > Attachments: HBASE-5120.patch, HBASE-5120_1.patch, > HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch > > > Here is what J-D described here: > https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176 > I think I will retract from my statement that it "used to be extremely racy > and caused more troubles than it fixed", on my first test I got a stuck > region in transition instead of being able to recover. The timeout was set to > 2 minutes to be sure I hit it. > First the region gets closed > {quote} > 2012-01-04 00:16:25,811 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > sv4r5s38,62023,1325635980913 for region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > {quote} > 2 minutes later it times out: > {quote} > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636185810, server=null > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,027 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining) > {quote} > 100ms later the master finally gets the event: > {quote} > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling > transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, > region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED > event for 1a4b111bcc228043e89f59c4c3f6a791 > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so > deleting ZK node and removing from regions in transition, skipping assignment > of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Deleting existing unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED > 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Successfully deleted unassigned node for > region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED > {quote} > At this point everything is fine, the region was processed as closed. But > wait, remember that line where it said it was going to force an unassign? > {quote} > 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Creating unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state > 2012-01-04 00:18:30,328 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Server null returned > java.lang.NullPointerException: Passed server is null for > 1a4b111bcc228043e89f59c4c3f6a791 > {quote} > Now the master is confused, it recreated the RIT znode but the region doesn't > even exist anymore. It even tries to shut it down but is blocked by NPEs. Now > this is what's going on. 
> The late ZK notification that the znode was deleted (but it got recreated > after): > {quote} > 2012-01-04 00:19:33,285 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: The znode of region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been > deleted. > {quote} > Then it prints this, and much later tries to unassign it again: > {quote} > 2012-01-04 00:19:46,607 DEBUG > org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region > to clear regions in transition; > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636310328, server=null > ... > 2012-01-04 00:20:39,623 DEBUG > org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region > to clear regions in transition; > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636310328, server=null > 2012-01-04 00:20:39,864 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636310328, server=null > 2012-01-04 00:20:39,864 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:20:39,865 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > (offlining) > 2012-01-04 00:20:39,865 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it > is not currently assigned anywhere > {quote} > And this is still ongoing. -- This message is automatically generated by JIRA. 