[
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185369#comment-13185369
]
Jean-Daniel Cryans commented on HBASE-5120:
-------------------------------------------
Testing the patch with a low timeout, I can answer the question in the code
that asks "We don't abort if the delete node returns false. Is there any such
corner case?" and yes here it is:
{noformat}
2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
state=PENDING_CLOSE, ts=1326415997208, server=null
2012-01-13 00:53:39,053 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_CLOSE for too long, running forced unassign again on
region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
2012-01-13 00:53:39,053 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Starting unassignment of region
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining)
2012-01-13 00:53:39,254 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Attempting to unassign region
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is
already PENDING_CLOSE but forcing to send a CLOSE RPC again
2012-01-13 00:53:39,255 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
update TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
state=PENDING_CLOSE, ts=1326416019254, server=null the timestamp.
2012-01-13 00:53:39,256 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Server sv4r12s38,62023,1326415651391 returned
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException:
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException:
Received:CLOSE for the
region:TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
,which we are already trying to CLOSE. for 0784c045e00205949461cb21b8f4cd6a
2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
state=PENDING_CLOSE, ts=1326416019256, server=null
2012-01-13 00:54:09,051 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_CLOSE for too long, running forced unassign again on
region=TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.
2012-01-13 00:54:09,051 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Starting unassignment of region
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. (offlining)
2012-01-13 00:54:09,126 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Attempting to unassign region
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. which is
already PENDING_CLOSE but forcing to send a CLOSE RPC again
2012-01-13 00:54:09,127 INFO org.apache.hadoop.hbase.master.AssignmentManager:
While trying to recover the table TestTable to DISABLED state the region {NAME
=> 'TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a.',
STARTKEY => '0006605550', ENDKEY => '0006616035', ENCODED =>
0784c045e00205949461cb21b8f4cd6a,} was offlined but the table was in DISABLING
state
2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db06d62 Deleting existing unassigned node for
0784c045e00205949461cb21b8f4cd6a that is in expected state M_ZK_REGION_CLOSING
2012-01-13 00:54:09,128 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db06d62 Attempting to delete unassigned node
0784c045e00205949461cb21b8f4cd6a in M_ZK_REGION_CLOSING state but node is in
RS_ZK_REGION_CLOSED state
2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db06d62 Deleting existing unassigned node for
0784c045e00205949461cb21b8f4cd6a that is in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,140 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling transition=RS_ZK_REGION_CLOSED, server=sv4r12s38,62023,1326415651391,
region=0784c045e00205949461cb21b8f4cd6a
2012-01-13 00:54:09,140 WARN org.apache.hadoop.hbase.master.AssignmentManager:
Received CLOSED for region 0784c045e00205949461cb21b8f4cd6a from server
sv4r12s38,62023,1326415651391 but region was in the state null and not in
expected PENDING_CLOSE or CLOSING states
2012-01-13 00:54:09,148 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region
0784c045e00205949461cb21b8f4cd6a in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,148 ERROR org.apache.hadoop.hbase.master.AssignmentManager:
The deletion of the CLOSED node for the region 0784c045e00205949461cb21b8f4cd6a
returned true
2012-01-13 00:54:09,148 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Server sv4r12s38,62023,1326415651391 returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
TestTable,0006605550,1326415764458.0784c045e00205949461cb21b8f4cd6a. but we are
not serving it for 0784c045e00205949461cb21b8f4cd6a
{noformat}
I turned out ok even if I had 7 regions that did that.
I also got the "CLOSING/CLOSED node for the region x already deleted" message,
here's the context:
{noformat}
2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
state=PENDING_CLOSE, ts=1326415995575, server=null
2012-01-13 00:53:39,144 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_CLOSE for too long, running forced unassign again on
region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:53:39,145 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Starting unassignment of region
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining)
2012-01-13 00:53:39,152 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Attempting to unassign region
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is
already PENDING_CLOSE but forcing to send a CLOSE RPC again
2012-01-13 00:53:39,169 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
update TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
state=PENDING_CLOSE, ts=1326416019152, server=null the timestamp.
2012-01-13 00:53:39,172 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Server sv4r27s44,62023,1326415651133 returned
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException:
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException:
Received:CLOSE for the
region:TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
,which we are already trying to CLOSE. for f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Regions in transition timed out:
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
state=PENDING_CLOSE, ts=1326416019172, server=null
2012-01-13 00:54:09,070 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Region has been PENDING_CLOSE for too long, running forced unassign again on
region=TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:54:09,070 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Starting unassignment of region
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. (offlining)
2012-01-13 00:54:09,076 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Attempting to unassign region
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. which is
already PENDING_CLOSE but forcing to send a CLOSE RPC again
2012-01-13 00:54:09,091 INFO org.apache.hadoop.hbase.master.AssignmentManager:
While trying to recover the table TestTable to DISABLED state the region {NAME
=> 'TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.',
STARTKEY => '0004613400', ENDKEY => '0004623885', ENCODED =>
f6c862f9c16b15b9227c2ed46865fb48,} was offlined but the table was in DISABLING
state
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Handling transition=RS_ZK_REGION_CLOSED, server=sv4r27s44,62023,1326415651133,
region=f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,106 DEBUG
org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED
event for f6c862f9c16b15b9227c2ed46865fb48
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
Table being disabled so deleting ZK node and removing from regions in
transition, skipping assignment of region
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48.
2012-01-13 00:54:09,106 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db06d62 Deleting existing unassigned node for
f6c862f9c16b15b9227c2ed46865fb48 that is in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,115 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db06d62 Successfully deleted unassigned node for region
f6c862f9c16b15b9227c2ed46865fb48 in expected state RS_ZK_REGION_CLOSED
2012-01-13 00:54:09,127 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
master:62003-0x134589d3db06d62 Deleting existing unassigned node for
f6c862f9c16b15b9227c2ed46865fb48 that is in expected state M_ZK_REGION_CLOSING
2012-01-13 00:54:09,128 DEBUG org.apache.hadoop.hbase.master.AssignmentManager:
CLOSING/CLOSED node for the region f6c862f9c16b15b9227c2ed46865fb48 already
deleted
2012-01-13 00:54:09,128 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Server sv4r27s44,62023,1326415651133 returned
org.apache.hadoop.hbase.NotServingRegionException:
org.apache.hadoop.hbase.NotServingRegionException: Received close for
TestTable,0004613400,1326415764450.f6c862f9c16b15b9227c2ed46865fb48. but we are
not serving it for f6c862f9c16b15b9227c2ed46865fb48
{noformat}
Again it turned out fine.
I'm +1 after fixing the comment about the corner case.
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.92.0
> Reporter: Zhihong Yu
> Assignee: ramkrishna.s.vasudevan
> Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch,
> HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy
> and caused more troubles than it fixed", on my first test I got a stuck
> region in transition instead of being able to recover. The timeout was set to
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to
> sv4r5s38,62023,1325635980913 for region
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913,
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so
> deleting ZK node and removing from regions in transition, skipping assignment
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:62003-0x134589d3db03587 Deleting existing unassigned node for
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:62003-0x134589d3db03587 Creating unassigned node for
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned
> java.lang.NullPointerException: Passed server is null for
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't
> even exist anymore. It even tries to shut it down but is blocked by NPEs. Now
> this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been
> deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region
> to clear regions in transition;
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region
> to clear regions in transition;
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> (offlining)
> 2012-01-04 00:20:39,865 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it
> is not currently assigned anywhere
> {quote}
> And this is still ongoing.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira