[
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179938#comment-13179938
]
stack commented on HBASE-5120:
------------------------------
When would it ever make sense for the TimeoutMonitor to ask a regionserver to
RE-close a region? It implies that somehow the regionserver 'forgot' to do the
close or that it dropped the message. If either happened, we have bigger
problems, and having the TM issue a new close will only make it worse (if a RS
'forgot' or 'failed' a close, it's probably on its way out and its ephemeral
node is about to vanish, at which point the master's shutdown handler will do
the close fixups; if it does not work this way, we should make it so?).
TM might make some sense for region openings, but even here I'd think the above
applies; i.e. if a RS can't open a region, it should kill itself. If the region
is bad, it's unlikely any RS will open it unless the error is transient. For
this latter case the TM running every 30 minutes, while conservative, is
probably fine as a default.
I think we should leave the TM at its current 30 minute timeout and downgrade
this issue from critical against 0.92.
There are outstanding race conditions in hbase around the shutdown handler and
master operations that deserve more attention than this, IMO.
> Timeout monitor races with table disable handler
> ------------------------------------------------
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.92.0
> Reporter: Zhihong Yu
> Priority: Blocker
> Fix For: 0.92.0, 0.94.0
>
> Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract my statement that it "used to be extremely racy
> and caused more troubles than it fixed"; on my first test I got a stuck
> region in transition instead of being able to recover. The timeout was set to
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to
> sv4r5s38,62023,1325635980913 for region
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Handling
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913,
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so
> deleting ZK node and removing from regions in transition, skipping assignment
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:62003-0x134589d3db03587 Deleting existing unassigned node for
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> master:62003-0x134589d3db03587 Creating unassigned node for
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned
> java.lang.NullPointerException: Passed server is null for
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused: it recreated the RIT znode, but the region doesn't
> even exist anymore. It even tries to shut the region down but is blocked by
> NPEs. Here is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been
> deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-04 00:19:46,607 DEBUG
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region
> to clear regions in transition;
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> state=PENDING_CLOSE, ts=1325636310328, server=null
> ...
> 2012-01-04 00:20:39,623 DEBUG
> org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region
> to clear regions in transition;
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed
> out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> state=PENDING_CLOSE, ts=1325636310328, server=null
> 2012-01-04 00:20:39,864 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:20:39,865 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> (offlining)
> 2012-01-04 00:20:39,865 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. but it
> is not currently assigned anywhere
> {quote}
> And this is still ongoing.
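The interleaving in the logs above boils down to a classic check-then-act race: the TimeoutMonitor checks the region's state (PENDING_CLOSE), the ClosedRegionHandler clears the region from RIT and deletes its znode in between, and the forced unassign then acts on stale information, recreating the znode for a region that no longer exists. A minimal simulation sketch (all names hypothetical, not actual HBase code; state names loosely modeled on the log lines):

```python
# Hypothetical simulation of the HBASE-5120 race. regions_in_transition
# stands in for the master's RIT map, zk_nodes for the unassigned znodes
# in ZooKeeper. Names and states are illustrative only.

regions_in_transition = {"1a4b111b": "PENDING_CLOSE"}
zk_nodes = {"1a4b111b": "RS_ZK_REGION_CLOSED"}

def timeout_monitor_check(region):
    # Step 1 (the "check"): TimeoutMonitor sees the region stuck in
    # PENDING_CLOSE and decides to force another unassign.
    return regions_in_transition.get(region) == "PENDING_CLOSE"

def closed_region_handler(region):
    # The master finally handles the RS_ZK_REGION_CLOSED event: the table
    # is being disabled, so it deletes the znode and removes the region
    # from regions in transition.
    del zk_nodes[region]
    del regions_in_transition[region]

def forced_unassign(region):
    # Step 2 (the "act"): the forced unassign recreates the znode in a
    # CLOSING state even though the region was already fully processed.
    zk_nodes[region] = "M_ZK_REGION_CLOSING"
    regions_in_transition[region] = "PENDING_CLOSE"

region = "1a4b111b"
if timeout_monitor_check(region):   # check passes...
    closed_region_handler(region)   # ...handler runs in between...
    forced_unassign(region)         # ...act recreates orphan state

# Result: an orphan znode, and a region stuck in RIT with server=null
# that no unassign can ever clear -- matching the logs above.
```

The fix direction this suggests is making the check and the act atomic with respect to the handler (re-verify the region is still in transition, under the same lock the handlers use, immediately before recreating the znode), rather than tuning the timeout.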
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira