[ https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934924#action_12934924 ]
Jonathan Gray commented on HBASE-3266:
--------------------------------------
Yeah, I think that as it currently stands, the HMaster uses the startup/heartbeat
messages to determine which RSs are online. As I commented in the other JIRA,
we should find out why the RSs were not checking in.
We should do some reconciliation between what we find in ZK and what we think
is online based on RPCs, but I'm not sure exactly what course we would take in a
state like this.
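To make the reconciliation idea concrete, here is a rough sketch that compares
the two views of the cluster. It is not HBase code: the class and method names
are made up for illustration, and server names are treated as plain strings.
The sample data is the /hbase/rs listing quoted below, with haus01 as the only
RS that has checked in over RPC.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase: compares the set of servers
// registered in ZK with the set the master knows about from RPC check-ins.
final class OnlineServerReconciler {

    // Servers with a znode under /hbase/rs that have not checked in over RPC.
    static Set<String> inZkOnly(Set<String> zkServers, Set<String> rpcServers) {
        Set<String> diff = new TreeSet<>(zkServers);
        diff.removeAll(rpcServers);
        return diff;
    }

    // Servers that checked in over RPC but have no znode under /hbase/rs.
    static Set<String> inRpcOnly(Set<String> zkServers, Set<String> rpcServers) {
        Set<String> diff = new TreeSet<>(rpcServers);
        diff.removeAll(zkServers);
        return diff;
    }

    public static void main(String[] args) {
        Set<String> zk = Set.of(
            "haus01.sf.cloudera.com,60020,1290500443143",
            "haus02.sf.cloudera.com,60020,1290498411450",
            "haus03.sf.cloudera.com,60020,1290498411518",
            "haus04.sf.cloudera.com,60020,1290498411533",
            "haus05.sf.cloudera.com,60020,1290498411520");
        Set<String> rpc = Set.of("haus01.sf.cloudera.com,60020,1290500443143");

        System.out.println("In ZK but no RPC check-in: " + inZkOnly(zk, rpc));
        System.out.println("RPC check-in but not in ZK: " + inRpcOnly(zk, rpc));
    }
}
```

What the master should do with each difference (wait, treat the server as
online, or expire it) is the open question; the sketch only surfaces the
mismatch.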
> Master does not seem to properly scan ZK for running RS during startup
> ----------------------------------------------------------------------
>
> Key: HBASE-3266
> URL: https://issues.apache.org/jira/browse/HBASE-3266
> Project: HBase
> Issue Type: Bug
> Components: master
> Affects Versions: 0.90.0
> Reporter: Todd Lipcon
> Priority: Critical
>
> I was in the situation described by HBASE-3265, where I had a number of RS
> waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting
> on checkins. To get past this, I restarted one of the region servers. The
> restarted server checked in, and the master began its startup.
> At this point the master started scanning /hbase/.logs for things to split.
> It correctly identified that the RS on haus01 was running (this is the one I
> restarted):
> 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143 belongs to an existing region server
> but then incorrectly decided that the RS on haus02 was down:
> 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450 doesn't belong to a known region server, splitting
> However ZK shows that this RS is up:
> [zk: haus01.sf.cloudera.com:2222(CONNECTED) 3] ls /hbase/rs
> [haus04.sf.cloudera.com,60020,1290498411533,
> haus05.sf.cloudera.com,60020,1290498411520,
> haus03.sf.cloudera.com,60020,1290498411518,
> haus01.sf.cloudera.com,60020,1290500443143,
> haus02.sf.cloudera.com,60020,1290498411450]
> splitLogsAfterStartup seems to check ServerManager.onlineServers, which, as best
> I can tell, is derived from heartbeats and not from ZK (sorry if I got some of
> this wrong, I'm still new to this new codebase).
> Of course, the master went into an infinite splitting loop at this point
> since haus02 is up and renewing its DFS lease on its logs.
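One possible policy for the splitting decision described above, sketched under
the assumption that a log folder should only be split when its server is
absent from both the heartbeat-derived set and the ZK listing. This is not the
actual splitLogsAfterStartup logic; the class and method names are
hypothetical, and the server name is simply taken as the last component of the
log folder path, as in the log lines quoted above.

```java
import java.util.Set;

// Hypothetical sketch, not HBase code: decide whether a .logs folder
// belongs to a dead server by consulting both views of the cluster.
final class LogSplitDecision {

    // The last path component of the log folder is the server name,
    // e.g. "haus02.sf.cloudera.com,60020,1290498411450".
    static String serverNameFromLogDir(String logDir) {
        String trimmed = logDir.endsWith("/")
            ? logDir.substring(0, logDir.length() - 1)
            : logDir;
        return trimmed.substring(trimmed.lastIndexOf('/') + 1);
    }

    // Split only when neither heartbeats nor ZK consider the server alive.
    static boolean shouldSplit(String logDir,
                               Set<String> heartbeatServers,
                               Set<String> zkServers) {
        String server = serverNameFromLogDir(logDir);
        return !heartbeatServers.contains(server) && !zkServers.contains(server);
    }

    public static void main(String[] args) {
        String haus02Logs = "hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/"
            + "haus02.sf.cloudera.com,60020,1290498411450";
        Set<String> heartbeats = Set.of("haus01.sf.cloudera.com,60020,1290500443143");
        Set<String> zk = Set.of(
            "haus01.sf.cloudera.com,60020,1290500443143",
            "haus02.sf.cloudera.com,60020,1290498411450");

        System.out.println("split haus02 logs? " + shouldSplit(haus02Logs, heartbeats, zk));
    }
}
```

Under this policy the haus02 folder would be left alone, because its znode is
still present under /hbase/rs even though the master has seen no heartbeat
from it.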