[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184205#comment-13184205 ] Zhihong Yu edited comment on HBASE-5120 at 1/11/12 5:17 PM: Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ? {code} +LOG.debug("The deletion of the CLOSED node for the region " ++ region.getEncodedName() + " returned " + deleteNode); {code} was (Author: zhi...@ebaysf.com): Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ? > Timeout monitor races with table disable handler > > > Key: HBASE-5120 > URL: https://issues.apache.org/jira/browse/HBASE-5120 > Project: HBase > Issue Type: Bug >Affects Versions: 0.92.0 >Reporter: Zhihong Yu >Priority: Blocker > Fix For: 0.94.0, 0.92.1 > > Attachments: HBASE-5120.patch, HBASE-5120_1.patch, > HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch > > > Here is what J-D described here: > https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176 > I think I will retract from my statement that it "used to be extremely racy > and caused more troubles than it fixed", on my first test I got a stuck > region in transition instead of being able to recover. The timeout was set to > 2 minutes to be sure I hit it. > First the region gets closed > {quote} > 2012-01-04 00:16:25,811 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > sv4r5s38,62023,1325635980913 for region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > {quote} > 2 minutes later it times out: > {quote} > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636185810, server=null > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,027 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > (offlining) > {quote} > 100ms later the master finally gets the event: > {quote} > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling > transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, > region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED > event for 1a4b111bcc228043e89f59c4c3f6a791 > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so > deleting ZK node and removing from regions in transition, skipping assignment > of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Deleting existing unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED > 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Successfully deleted unassigned node for > region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED > {quote} > At this point everything is fine, the region was processed as closed. But > wait, remember that line where it said it was going to force an unassign? > {quote} > 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Creating unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state > 2012-01-04 00:18:30,328 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Server null returned > java.lang.NullPointerException: Passed server is null for > 1a4b111bcc228043e89f59c4c3f6a791 > {quote} > Now the master is confused, it recreated the RIT znode but the region doesn't > even exist anymore. It even tries to shut it down but is blocked by NPEs. Now > this is what's going on. > The late ZK notification that the znode was deleted (but it got recreated > after): > {quote} > 2012-01-04 00:19:33,285 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: The znode of region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been > deleted. > {quote} > Then it prints this, and much later tries to unassign it again: > {quote} > 2012-01-
[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183380#comment-13183380 ] Zhihong Yu edited comment on HBASE-5120 at 1/10/12 5:07 PM: {code} +LOG.debug("The deletion of the CLOSED node returned " + deleteNode); {code} I think the above should be at ERROR level. Also, please log the region name. was (Author: zhi...@ebaysf.com): {code} +LOG.debug("The deletion of the CLOSED node returned " + deleteNode); {code} I think the above should be at ERROR level. Also, please make the sentence syntactically correct. > Timeout monitor races with table disable handler > > > Key: HBASE-5120 > URL: https://issues.apache.org/jira/browse/HBASE-5120 > Project: HBase > Issue Type: Bug >Affects Versions: 0.92.0 >Reporter: Zhihong Yu >Priority: Blocker > Fix For: 0.94.0, 0.92.1 > > Attachments: HBASE-5120.patch, HBASE-5120_1.patch, > HBASE-5120_2.patch, HBASE-5120_3.patch > > > Here is what J-D described here: > https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176 > I think I will retract from my statement that it "used to be extremely racy > and caused more troubles than it fixed", on my first test I got a stuck > region in transition instead of being able to recover. The timeout was set to > 2 minutes to be sure I hit it. > First the region gets closed > {quote} > 2012-01-04 00:16:25,811 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > sv4r5s38,62023,1325635980913 for region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > {quote} > 2 minutes later it times out: > {quote} > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636185810, server=null > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,027 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > (offlining) > {quote} > 100ms later the master finally gets the event: > {quote} > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling > transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, > region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED > event for 1a4b111bcc228043e89f59c4c3f6a791 > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so > deleting ZK node and removing from regions in transition, skipping assignment > of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Deleting existing unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED > 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Successfully deleted unassigned node for > region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED > {quote} > At this point everything is fine, the region was processed as closed. But > wait, remember that line where it said it was going to force an unassign? > {quote} > 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Creating unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state > 2012-01-04 00:18:30,328 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Server null returned > java.lang.NullPointerException: Passed server is null for > 1a4b111bcc228043e89f59c4c3f6a791 > {quote} > Now the master is confused, it recreated the RIT znode but the region doesn't > even exist anymore. It even tries to shut it down but is blocked by NPEs. Now > this is what's going on. > The late ZK notification that the znode was deleted (but it got recreated > after): > {quote} > 2012-01-04 00:19:33,285 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: The znode of region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been > deleted. > {quote} > Then it prints this, and much later tries to
[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179660#comment-13179660 ] Lars Hofhansl edited comment on HBASE-5120 at 1/4/12 5:17 PM: -- Hey Ram, I know you said this is not done, yet... {code} if (t instanceof NullPointerException) { removeRegionInTransition(region); ... {code} Could we instead add a null check at the relevant point and deal with it there (or maybe throw another exception)? (Dealing with this after the NPE occurred leaves a bad taste... What if we introduce another bug later that also causes an NPE? That would go unnoticed for a while.) was (Author: lhofhansl): Hey Ram, I know you said this is not done, yet... {code} if (t instanceof NullPointerException) { removeRegionInTransition(region); ... {code} Could we instead add a null check at the relevant point and deal with it there (or maybe throw another exception)? (Dealing with this after the NPE occurred leaves a bad taste... What we introduce another bug later that also causes an NPE, that will go unnoticed for a while.) > Timeout monitor races with table disable handler > > > Key: HBASE-5120 > URL: https://issues.apache.org/jira/browse/HBASE-5120 > Project: HBase > Issue Type: Bug >Affects Versions: 0.92.0 >Reporter: Zhihong Yu >Priority: Blocker > Attachments: HBASE-5120.patch > > > Here is what J-D described here: > https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176 > I think I will retract from my statement that it "used to be extremely racy > and caused more troubles than it fixed", on my first test I got a stuck > region in transition instead of being able to recover. The timeout was set to > 2 minutes to be sure I hit it. > First the region gets closed > {quote} > 2012-01-04 00:16:25,811 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > sv4r5s38,62023,1325635980913 for region > test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > {quote} > 2 minutes later it times out: > {quote} > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > state=PENDING_CLOSE, ts=1325636185810, server=null > 2012-01-04 00:18:30,026 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Region has been > PENDING_CLOSE for too long, running forced unassign again on > region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,027 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of > region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > (offlining) > {quote} > 100ms later the master finally gets the event: > {quote} > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling > transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, > region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED > event for 1a4b111bcc228043e89f59c4c3f6a791 > 2012-01-04 00:18:30,129 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so > deleting ZK node and removing from regions in transition, skipping assignment > of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. > 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Deleting existing unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED > 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Successfully deleted unassigned node for > region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED > {quote} > At this point everything is fine, the region was processed as closed. But > wait, remember that line where it said it was going to force an unassign? > {quote} > 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:62003-0x134589d3db03587 Creating unassigned node for > 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state > 2012-01-04 00:18:30,328 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Server null returned > java.lang.NullPointerException: Passed server is null for > 1a4b111bcc228043e89f59c4c3f6a791 > {quote} > Now the master is confused, it recreated the RIT znode but the region doesn't > even exist anymore. It even tries to shut it down but is block