[jira] Commented: (HBASE-1302) When a new master comes up, regionservers should continue with their region assignments from the last master

Jean-Daniel Cryans (JIRA) Fri, 22 May 2009 12:32:15 -0700

    [ https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712226#action_12712226 ]


Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------

I tried the same thing. I didn't get the "failed to create" exception, but I 
got this instead (it never stops): 

{code}
2009-05-22 14:59:48,126 WARN 
org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master 
for 445473 milliseconds - retrying
2009-05-22 14:59:49,127 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 0 time(s).
2009-05-22 14:59:50,128 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 1 time(s).
2009-05-22 14:59:51,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 2 time(s).
2009-05-22 14:59:52,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 3 time(s).
2009-05-22 14:59:53,130 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 4 time(s).
2009-05-22 14:59:54,131 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 5 time(s).
2009-05-22 14:59:55,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 6 time(s).
2009-05-22 14:59:56,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 7 time(s).
2009-05-22 14:59:57,133 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 8 time(s).
2009-05-22 14:59:58,134 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect 
to server: /192.168.1.81:62000. Already tried 9 time(s).
2009-05-22 14:59:58,135 ERROR 
org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10
{code}

We don't get this forever when the master is restarted on the same node because 
HRS.hbaseMaster still points to the same address. The actual problem is in this 
code:

{code}
public void process(WatchedEvent event) {
    EventType type = event.getType();
    KeeperState state = event.getState();
    LOG.info("Got ZooKeeper event, state: " + state + ", type: " +
              type + ", path: " + event.getPath());

    // Ignore events if we're shutting down.
    if (stopRequested.get()) {
      LOG.debug("Ignoring ZooKeeper event while shutting down");
      return;
    }

    if (state == KeeperState.Expired) {
      LOG.error("ZooKeeper session expired");
      restart();
    } else if (type == EventType.NodeCreated) {
      getMaster();

      // ZooKeeper watches are one time only, so we need to re-register our
      // watch.
      watchMasterAddress();
    }
  }
{code}

I see that the node is deleted, but I never see it being created. We don't set 
a watch after a NodeDeleted, though we should: without one we will never know 
when the master comes back. This should be changed. Instead, we have to set a 
watch for when the master node is deleted and then, once it is, set a watch on 
the folder to see when it's recreated. 
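
A minimal sketch of that change, under stated assumptions: it does not use the 
real ZooKeeper client. The EventType enum is a local stand-in for the two event 
types involved, and getMaster() and watchMasterAddress() are stubs that only 
record that they were called. The class name MasterWatchSketch is hypothetical. 
Whether the new watch goes on the master node or on its parent folder is left 
open here; the point is only that a NodeDeleted must re-arm a watch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the region server's watcher logic (not HBase code). */
class MasterWatchSketch {
    /** Local stand-in for the subset of ZooKeeper event types used here. */
    enum EventType { NodeCreated, NodeDeleted }

    // Records which stubbed operations ran, in order.
    private final List<String> actions = new ArrayList<>();

    void process(EventType type) {
        if (type == EventType.NodeCreated) {
            // A master registered itself: read its address, then re-arm the watch.
            getMaster();
            watchMasterAddress();
        } else if (type == EventType.NodeDeleted) {
            // The master went away. Watches are one time only, so without this
            // call the later NodeCreated event would never be delivered.
            watchMasterAddress();
        }
    }

    // Stub: the real method would read the master address from ZooKeeper.
    private void getMaster() { actions.add("getMaster"); }

    // Stub: the real method would register a ZooKeeper watch.
    private void watchMasterAddress() { actions.add("watchMasterAddress"); }

    List<String> actions() { return actions; }

    public static void main(String[] args) {
        MasterWatchSketch watcher = new MasterWatchSketch();
        watcher.process(EventType.NodeDeleted); // old master dies
        watcher.process(EventType.NodeCreated); // new master comes up
        System.out.println(watcher.actions());
    }
}
```

With the extra branch, the watch is re-armed right after the deletion, so the 
later NodeCreated reaches getMaster() instead of being missed.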

> When a new master comes up, regionservers should continue with their region 
> assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1302
>                 URL: https://issues.apache.org/jira/browse/HBASE-1302
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.20.0
>            Reporter: Nitay Joffe
>            Assignee: Jean-Daniel Cryans
>             Fix For: 0.20.0
>
>         Attachments: hbase-1302-v1.patch, hbase-1302-v2.patch
>
>
> After HBASE-1205, we can now handle a master going down and coming up 
> somewhere else. When this happens, the new master will scan everything and 
> reassign all the regions, which is not ideal. Instead of doing that, we 
> should keep the region assignments from the last master. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
