[ 
https://issues.apache.org/jira/browse/HBASE-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108385#comment-13108385
 ] 

Ming Ma commented on HBASE-4446:
--------------------------------

Good point, Todd. Thanks, Ted. Here is why the master didn't handle this. Note 
that part of the log below comes from the new code. The issue is that by the 
time AssignmentManager gets the notification, the RS is no longer online, so 
the processing triggered by the ZK callback is skipped.

2011-09-19 22:04:54,506 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Attempted to handle region transition for server but server is not online: 
miweng_test,1??s$? >,1316493502701.6409ae717931daee3705f3e7d33d85b5.


2011-09-19 22:22:06,561 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
While timing out a region in state OPENING, found ZK node in unexpected state: 
RS_ZK_REGION_FAILED_OPEN region= miweng_test,1\xC8\xFAs$\xB7 
>,1316493502701.6409ae717931daee3705f3e7d33d85b5.


That also means we can fix the issue in a different way. Why does 
AssignmentManager.handleRegion have to enforce the following condition and rely 
on TimeoutMonitor and ServerShutdownHandler to kick in? At least for certain 
states, such as RS_ZK_REGION_FAILED_OPEN and RS_ZK_REGION_CLOSED, 
AssignmentManager.handleRegion could still process the event even though the RS 
is down.

      // Verify this is a known server
      if (!serverManager.isServerOnline(sn) &&
          !this.master.getServerName().equals(sn)) {
        LOG.warn("Attempted to handle region transition for server but " +
          "server is not online: " + Bytes.toString(data.getRegionName()));
        return;
      }
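
To make the question concrete, here is a minimal, self-contained sketch of 
what a relaxed version of that check could look like. It is not the attached 
patch and not HBase code: the class OfflineServerGuard, the method 
shouldProcess, and the local EventType enum are hypothetical stand-ins, and 
the choice of which states are safe to process for an offline RS is exactly 
the assumption under discussion.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch only; none of these names exist in HBase.
class OfflineServerGuard {

    // Local stand-in for the event types mentioned in this thread.
    enum EventType {
        RS_ZK_REGION_OPENING,
        RS_ZK_REGION_OPENED,
        RS_ZK_REGION_FAILED_OPEN,
        RS_ZK_REGION_CLOSED
    }

    // Assumption: these states need no further work from the RS, so the
    // master could act on them even if the RS has already gone away.
    static final Set<EventType> SAFE_WHEN_OFFLINE = EnumSet.of(
        EventType.RS_ZK_REGION_FAILED_OPEN,
        EventType.RS_ZK_REGION_CLOSED);

    // Returns true if the transition should be processed, false if it
    // should be skipped and left to TimeoutMonitor / ServerShutdownHandler.
    static boolean shouldProcess(boolean serverOnline,
                                 boolean serverIsMaster,
                                 EventType type) {
        if (serverOnline || serverIsMaster) {
            return true; // same outcome as the existing check
        }
        return SAFE_WHEN_OFFLINE.contains(type);
    }

    public static void main(String[] args) {
        // An offline RS that left FAILED_OPEN behind is still processed.
        System.out.println(shouldProcess(false, false,
            EventType.RS_ZK_REGION_FAILED_OPEN));
        // An offline RS in OPENING is still skipped, as it is today.
        System.out.println(shouldProcess(false, false,
            EventType.RS_ZK_REGION_OPENING));
    }
}
```

With a guard like this, the early return would only apply to states that 
still depend on the RS; whether that is safe with respect to 
ServerShutdownHandler running concurrently would need to be checked.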



> Rolling restart RSs scenario, regions could stay in OPENING state
> -----------------------------------------------------------------
>
>                 Key: HBASE-4446
>                 URL: https://issues.apache.org/jira/browse/HBASE-4446
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>         Attachments: HBASE-4446-trunk.patch
>
>
> Keep the Master up the whole time and do a rolling restart of the RSs like 
> this: stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, 
> stop RS3, start RS2, wait for 2 seconds, etc. A region can sometimes stay in 
> the OPENING state even after the TimeoutMonitor period has elapsed.
> 2011-09-19 08:10:33,131 WARN 
> org.apache.hadoop.hbase.master.AssignmentManager: While timing out a region 
> in state OPENING, found ZK node in unexpected state: RS_ZK_REGION_FAILED_OPEN
> The issue: the RS was shut down while a region was being opened, so the 
> region was transitioned to RS_ZK_REGION_FAILED_OPEN in ZK. TimeoutMonitor 
> didn't handle RS_ZK_REGION_FAILED_OPEN.
> processOpeningState
> ...
>    else if (dataInZNode.getEventType() != EventType.RS_ZK_REGION_OPENING &&
>         LOG.warn("While timing out a region in state OPENING, "
>             + "found ZK node in unexpected state: "
>             + dataInZNode.getEventType());
>         return;
>       }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira