[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-11 Thread Zhihong Yu (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184205#comment-13184205
 ] 

Zhihong Yu edited comment on HBASE-5120 at 1/11/12 5:17 PM:


Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?
{code}
+LOG.debug("The deletion of the CLOSED node for the region "
++ region.getEncodedName() + " returned " + deleteNode);
{code}

  was (Author: zhi...@ebaysf.com):
Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?
  
> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Creating unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't 
> even exist anymore. It even tries to shut it down but is blocked by NPEs. Now 
> this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated 
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
> deleted.
> {quote}
> Then it prints this, and much later tries to unassign it again:
> {quote}
> 2012-01-

[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-10 Thread Zhihong Yu (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183380#comment-13183380
 ] 

Zhihong Yu edited comment on HBASE-5120 at 1/10/12 5:07 PM:


{code}
+LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
{code}
I think the above should be at ERROR level. Also, please log the region name.

  was (Author: zhi...@ebaysf.com):
{code}
+LOG.debug("The deletion of the CLOSED node returned " + deleteNode);
{code}
I think the above should be at ERROR level. Also, please make the sentence 
syntactically correct.
  
> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Fix For: 0.94.0, 0.92.1
>
> Attachments: HBASE-5120.patch, HBASE-5120_1.patch, 
> HBASE-5120_2.patch, HBASE-5120_3.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Creating unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't 
> even exist anymore. It even tries to shut it down but is blocked by NPEs. Now 
> this is what's going on.
> The late ZK notification that the znode was deleted (but it got recreated 
> after):
> {quote}
> 2012-01-04 00:19:33,285 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
> deleted.
> {quote}
> Then it prints this, and much later tries to 

[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-04 Thread Lars Hofhansl (Issue Comment Edited) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179660#comment-13179660
 ] 

Lars Hofhansl edited comment on HBASE-5120 at 1/4/12 5:17 PM:
--

Hey Ram, I know you said this is not done, yet...
{code}
if (t instanceof NullPointerException) {
  removeRegionInTransition(region);
...
{code}

Could we instead add a null check at the relevant point and deal with it there 
(or maybe throw another exception)? (Dealing with this after the NPE occurred 
leaves a bad taste... What if we introduce another bug later that also causes 
an NPE? That would go unnoticed for a while.)


  was (Author: lhofhansl):
Hey Ram, I know you said this is not done, yet...
{code}
if (t instanceof NullPointerException) {
  removeRegionInTransition(region);
...
{code}

Could we instead add a null check at the relevant point and deal with it there 
(or maybe throw another exception)? (Dealing with this after the NPE occurred 
leaves a bad taste... What we introduce another bug later that also causes an 
NPE, that will go unnoticed for a while.)

  
> Timeout monitor races with table disable handler
> 
>
> Key: HBASE-5120
> URL: https://issues.apache.org/jira/browse/HBASE-5120
> Project: HBase
>  Issue Type: Bug
>Affects Versions: 0.92.0
>Reporter: Zhihong Yu
>Priority: Blocker
> Attachments: HBASE-5120.patch
>
>
> Here is what J-D described here:
> https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
> I think I will retract from my statement that it "used to be extremely racy 
> and caused more troubles than it fixed", on my first test I got a stuck 
> region in transition instead of being able to recover. The timeout was set to 
> 2 minutes to be sure I hit it.
> First the region gets closed
> {quote}
> 2012-01-04 00:16:25,811 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
> sv4r5s38,62023,1325635980913 for region 
> test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> {quote}
> 2 minutes later it times out:
> {quote}
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
> out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> state=PENDING_CLOSE, ts=1325636185810, server=null
> 2012-01-04 00:18:30,026 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
> PENDING_CLOSE for too long, running forced unassign again on 
> region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,027 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
> region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
> (offlining)
> {quote}
> 100ms later the master finally gets the event:
> {quote}
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Handling 
> transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
> region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
> event for 1a4b111bcc228043e89f59c4c3f6a791
> 2012-01-04 00:18:30,129 DEBUG 
> org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
> deleting ZK node and removing from regions in transition, skipping assignment 
> of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
> 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Deleting existing unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
> 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
> region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
> {quote}
> At this point everything is fine, the region was processed as closed. But 
> wait, remember that line where it said it was going to force an unassign?
> {quote}
> 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
> master:62003-0x134589d3db03587 Creating unassigned node for 
> 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
> 2012-01-04 00:18:30,328 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
> java.lang.NullPointerException: Passed server is null for 
> 1a4b111bcc228043e89f59c4c3f6a791
> {quote}
> Now the master is confused, it recreated the RIT znode but the region doesn't 
> even exist anymore. It even tries to shut it down but is block