[
https://issues.apache.org/jira/browse/HBASE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712226#action_12712226
]
Jean-Daniel Cryans commented on HBASE-1302:
-------------------------------------------
I tried the same thing. I didn't get the "failed to create" exception, but I
got this instead (it never stops):
{code}
2009-05-22 14:59:48,126 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 445473 milliseconds - retrying
2009-05-22 14:59:49,127 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 0 time(s).
2009-05-22 14:59:50,128 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 1 time(s).
2009-05-22 14:59:51,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 2 time(s).
2009-05-22 14:59:52,129 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 3 time(s).
2009-05-22 14:59:53,130 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 4 time(s).
2009-05-22 14:59:54,131 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 5 time(s).
2009-05-22 14:59:55,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 6 time(s).
2009-05-22 14:59:56,132 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 7 time(s).
2009-05-22 14:59:57,133 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 8 time(s).
2009-05-22 14:59:58,134 INFO org.apache.hadoop.ipc.HBaseClass: Retrying connect to server: /192.168.1.81:62000. Already tried 9 time(s).
2009-05-22 14:59:58,135 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Exceeded max retries: 10
{code}
We don't see this forever when the master is restarted on the same node because
HRS.hbaseMaster still points at the same address. The actual problem is in this
code:
{code}
public void process(WatchedEvent event) {
  EventType type = event.getType();
  KeeperState state = event.getState();
  LOG.info("Got ZooKeeper event, state: " + state + ", type: " +
      type + ", path: " + event.getPath());
  // Ignore events if we're shutting down.
  if (stopRequested.get()) {
    LOG.debug("Ignoring ZooKeeper event while shutting down");
    return;
  }
  if (state == KeeperState.Expired) {
    LOG.error("ZooKeeper session expired");
    restart();
  } else if (type == EventType.NodeCreated) {
    getMaster();
    // ZooKeeper watches are one time only, so we need to re-register our watch.
    watchMasterAddress();
  }
}
{code}
I see that the node is deleted, but I never see it being created, because we
don't set a watch after a NodeDeleted event. We should, since otherwise we will
never know when the master comes back. This should be changed: we have to set a
watch when the master node is deleted, and then set a watch on the folder to
see when it's recreated.
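To make the one-shot watch problem concrete, here is a rough sketch. It does not use the ZooKeeper API at all: OneShotWatches is a toy in-memory stand-in I made up for illustration, and RegionServerSketch, the rewatchOnDelete flag and the "/hbase/master" path are all assumptions, not the actual HRS code or the patch. It only shows that a watch consumed by NodeDeleted has to be re-registered for the later NodeCreated to be seen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical region server side: mirrors the shape of process() above.
class RegionServerSketch {
    static final String MASTER_PATH = "/hbase/master"; // assumed path

    private final OneShotWatches zk;
    private final boolean rewatchOnDelete;
    final List<String> seen = new ArrayList<>();

    RegionServerSketch(OneShotWatches zk, boolean rewatchOnDelete) {
        this.zk = zk;
        this.rewatchOnDelete = rewatchOnDelete;
        watchMasterAddress();
    }

    void watchMasterAddress() {
        zk.watch(MASTER_PATH, this::process);
    }

    void process(OneShotWatches.EventType type) {
        seen.add(type.name());
        if (type == OneShotWatches.EventType.NodeCreated) {
            // The real code would call getMaster() here.
            watchMasterAddress();
        } else if (type == OneShotWatches.EventType.NodeDeleted && rewatchOnDelete) {
            // Proposed change: the watch was consumed by this event,
            // so register again to hear about the master coming back.
            watchMasterAddress();
        }
    }

    // Master goes away, then a new master registers itself.
    static List<String> simulateMasterFailover(boolean rewatchOnDelete) {
        OneShotWatches zk = new OneShotWatches();
        RegionServerSketch rs = new RegionServerSketch(zk, rewatchOnDelete);
        zk.fire(MASTER_PATH, OneShotWatches.EventType.NodeDeleted);
        zk.fire(MASTER_PATH, OneShotWatches.EventType.NodeCreated);
        return rs.seen;
    }

    public static void main(String[] args) {
        System.out.println("current behaviour:  " + simulateMasterFailover(false));
        System.out.println("re-watch on delete: " + simulateMasterFailover(true));
    }
}

// Toy stand-in for one-shot watches. This is NOT the ZooKeeper API.
class OneShotWatches {
    enum EventType { NodeCreated, NodeDeleted }

    private final Map<String, Consumer<EventType>> watches = new HashMap<>();

    void watch(String path, Consumer<EventType> watcher) {
        watches.put(path, watcher);
    }

    void fire(String path, EventType type) {
        // Remove before delivering: a watch fires at most once.
        Consumer<EventType> watcher = watches.remove(path);
        if (watcher != null) {
            watcher.accept(type);
        }
    }
}
```

With rewatchOnDelete set to false the sketch only ever records the NodeDeleted event, which matches what I'm seeing. Whether the real fix should watch the master node itself or its parent folder is a separate question that this toy doesn't answer.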
> When a new master comes up, regionservers should continue with their region
> assignments from the last master
> ------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-1302
> URL: https://issues.apache.org/jira/browse/HBASE-1302
> Project: Hadoop HBase
> Issue Type: Improvement
> Components: master, regionserver
> Affects Versions: 0.20.0
> Reporter: Nitay Joffe
> Assignee: Jean-Daniel Cryans
> Fix For: 0.20.0
>
> Attachments: hbase-1302-v1.patch, hbase-1302-v2.patch
>
>
> After HBASE-1205, we can now handle a master going down and coming up
> somewhere else. When this happens, the new master will scan everything and
> reassign all the regions, which is not ideal. Instead of doing that, we
> should keep the region assignments from the last master.