[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185369#comment-13185369
 ] 

Jean-Daniel Cryans commented on HBASE-5120:
-------------------------------------------

Testing the patch with a low timeout, I can answer the question in the code 
that asks "We don't abort if the delete node returns false. Is there any such 
corner case?" and yes here it is:

{noformat}
2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. 
state=PENDING_CLOSE, ts=1326415997208, server=null
2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been PENDING_CLOSE for too long, running forced unassign again on 
region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
2012-01-13 00:53:39,053 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Starting unassignment of region 
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining)
2012-01-13 00:53:39,254 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Attempting to unassign region 
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is 
already PENDING_CLOSE but forcing to send a CLOSE RPC again 
2012-01-13 00:53:39,255 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
update TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. 
state=PENDING_CLOSE, ts=1326416019254, server=null the timestamp.
2012-01-13 00:53:39,256 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Server sv4r12s38,62023,1326415651391 returned 
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: 
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: 
Received:CLOSE for the 
region:TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. 
,which we are already trying to CLOSE. for 0784c045e00205949461cb21b8f4cd6a
2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. 
state=PENDING_CLOSE, ts=1326416019256, server=null
2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been PENDING_CLOSE for too long, running forced unassign again on 
region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
2012-01-13 00:54:09,051 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Starting unassignment of region 
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining)
2012-01-13 00:54:09,126 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Attempting to unassign region 
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is 
already PENDING_CLOSE but forcing to send a CLOSE RPC again 
2012-01-13 00:54:09,127 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
While trying to recover the table TestTable to DISABLED state the region {NAME 
=> 'TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.', 
STARTKEY => '0006605550', ENDKEY => '0006616035', ENCODED => 
0784c045e00205949461cb21b8f4cd6a,} was offlined but the table was in DISABLING 
state
2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:62003-0x134589d3db06d62 Deleting existing unassigned node for 
0784c045e00205949461cb21b8f4cd6a that is in expected state M_ZK_REGION_CLOSING
2012-01-13 00:54:09,128 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:62003-0x134589d3db06d62 Attempting to delete unassigned node 
0784c045e00205949461cb21b8f4cd6a in M_ZK_REGION_CLOSING state but node is in 
RS_ZK_REGION_CLOSED state
2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:62003-0x134589d3db06d62 Deleting existing unassigned node for 
0784c045e00205949461cb21b8f4cd6a that is in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,140 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_CLOSED, server=sv4r12s38,62023,1326415651391, 
region=0784c045e00205949461cb21b8f4cd6a
2012-01-13 00:54:09,140 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received CLOSED for region 0784c045e00205949461cb21b8f4cd6a from server 
sv4r12s38,62023,1326415651391 but region was in  the state null and not in 
expected PENDING_CLOSE or CLOSING states
2012-01-13 00:54:09,148 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region 
0784c045e00205949461cb21b8f4cd6a in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,148 ERROR org.apache.hadoop.hbase.master.AssignmentManager: 
The deletion of the CLOSED node for the region 0784c045e00205949461cb21b8f4cd6a 
returned true
2012-01-13 00:54:09,148 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Server sv4r12s38,62023,1326415651391 returned 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Received close for 
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. but we are 
not serving it for 0784c045e00205949461cb21b8f4cd6a
{noformat}

I turned out ok even if I had 7 regions that did that.

I also got the "CLOSING/CLOSED node for the region x already deleted" message, 
here's the context:

{noformat}
2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. 
state=PENDING_CLOSE, ts=1326415995575, server=null
2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been PENDING_CLOSE for too long, running forced unassign again on 
region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:53:39,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Starting unassignment of region 
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining)
2012-01-13 00:53:39,152 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Attempting to unassign region 
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is 
already PENDING_CLOSE but forcing to send a CLOSE RPC again 
2012-01-13 00:53:39,169 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
update TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. 
state=PENDING_CLOSE, ts=1326416019152, server=null the timestamp.
2012-01-13 00:53:39,172 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Server sv4r27s44,62023,1326415651133 returned 
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: 
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: 
Received:CLOSE for the 
region:TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. 
,which we are already trying to CLOSE. for f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Regions in transition timed out:  
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. 
state=PENDING_CLOSE, ts=1326416019172, server=null
2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Region has been PENDING_CLOSE for too long, running forced unassign again on 
region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:54:09,070 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Starting unassignment of region 
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining)
2012-01-13 00:54:09,076 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Attempting to unassign region 
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is 
already PENDING_CLOSE but forcing to send a CLOSE RPC again 
2012-01-13 00:54:09,091 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
While trying to recover the table TestTable to DISABLED state the region {NAME 
=> 'TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.', 
STARTKEY => '0004613400', ENDKEY => '0004623885', ENCODED => 
f6c862f9c16b15b9227c2ed46865fb48,} was offlined but the table was in DISABLING 
state
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Handling transition=RS_ZK_REGION_CLOSED, server=sv4r27s44,62023,1326415651133, 
region=f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,106 DEBUG 
org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
event for f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
Table being disabled so deleting ZK node and removing from regions in 
transition, skipping assignment of region 
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:62003-0x134589d3db06d62 Deleting existing unassigned node for 
f6c862f9c16b15b9227c2ed46865fb48 that is in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,115 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region 
f6c862f9c16b15b9227c2ed46865fb48 in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
master:62003-0x134589d3db06d62 Deleting existing unassigned node for 
f6c862f9c16b15b9227c2ed46865fb48 that is in expected state M_ZK_REGION_CLOSING
2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: 
CLOSING/CLOSED node for the region f6c862f9c16b15b9227c2ed46865fb48 already 
deleted
2012-01-13 00:54:09,128 INFO org.apache.hadoop.hbase.master.AssignmentManager: 
Server sv4r27s44,62023,1326415651133 returned 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Received close for 
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. but we are 
not serving it for f6c862f9c16b15b9227c2ed46865fb48
{noformat}

Again it turned out fine.

I'm +1 after fixing the comment about the corner case.
                
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
>                 Key: HBASE-5120
>                 URL: https://issues.apache.org/jira/browse/HBASE-5120
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: Zhihong Yu
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Blocker
>             Fix For: 0.94.0, 0.92.1
>
>         Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Creating unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't 
> even exist anymore. It even tries to shut it down but is blocked by NPEs. Now 
> this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated 
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
> deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG 
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
> to clear regions in transition; 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG 
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
> to clear regions in transition; 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> 2012-01-04 00:20:39,865 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it 
> is not currently assigned anywhere
> {quote}
> And this is still ongoing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to