Master does not seem to properly scan ZK for running RS during startup
----------------------------------------------------------------------
Key: HBASE-3266
URL: https://issues.apache.org/jira/browse/HBASE-3266
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Critical
I was in the situation described by HBASE-3265: a number of RS were waiting
on ROOT, but the master had not seen any RS check-ins and so was itself
waiting for check-ins. To get past this, I restarted one of the region
servers. The restarted server checked in, and the master began its startup.
At this point the master started scanning /hbase/.logs for things to split. It
correctly identified that the RS on haus01 was running (this is the one I
restarted):
2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143 belongs to an existing region server
but then incorrectly decided that the RS on haus02 was down:
2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450 doesn't belong to a known region server, splitting
However ZK shows that this RS is up:
[zk: haus01.sf.cloudera.com:2222(CONNECTED) 3] ls /hbase/rs
[haus04.sf.cloudera.com,60020,1290498411533,
haus05.sf.cloudera.com,60020,1290498411520,
haus03.sf.cloudera.com,60020,1290498411518,
haus01.sf.cloudera.com,60020,1290500443143,
haus02.sf.cloudera.com,60020,1290498411450]
splitLogsAfterStartup seems to check ServerManager.onlineServers, which, as
best I can tell, is derived from heartbeats and not from ZK (sorry if I got
some of this wrong; I'm still new to this codebase).
Of course, the master went into an infinite splitting loop at this point,
since haus02 is still up and keeps renewing the DFS lease on its logs.
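To illustrate the kind of check that would avoid this, here is a minimal sketch in plain Java. It assumes the master can obtain both the heartbeat-derived server set and the children of /hbase/rs as sets of server-name strings. LogSplitCheck, serverNameFromLogDir, and foldersToSplit are hypothetical names made up for this example and are not HBase APIs; the real fix may well look different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of HBase.
final class LogSplitCheck {

    // The last path component of a log folder is the server name
    // (host,port,startcode), the same string used as the znode name
    // under /hbase/rs in the listing above.
    static String serverNameFromLogDir(String logDir) {
        String trimmed = logDir.endsWith("/")
                ? logDir.substring(0, logDir.length() - 1)
                : logDir;
        return trimmed.substring(trimmed.lastIndexOf('/') + 1);
    }

    // A folder is a split candidate only if its server appears in NEITHER
    // the heartbeat-derived set NOR the ZK-derived set.
    static List<String> foldersToSplit(List<String> logDirs,
                                       Set<String> heartbeatServers,
                                       Set<String> zkServers) {
        Set<String> live = new HashSet<>(heartbeatServers);
        live.addAll(zkServers);
        List<String> toSplit = new ArrayList<>();
        for (String dir : logDirs) {
            if (!live.contains(serverNameFromLogDir(dir))) {
                toSplit.add(dir);
            }
        }
        return toSplit;
    }

    public static void main(String[] args) {
        String base = "hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/";
        List<String> logDirs = Arrays.asList(
                base + "haus01.sf.cloudera.com,60020,1290500443143",
                base + "haus02.sf.cloudera.com,60020,1290498411450");

        // Only the restarted server has checked in via heartbeat.
        Set<String> heartbeats = new HashSet<>(Arrays.asList(
                "haus01.sf.cloudera.com,60020,1290500443143"));

        // Children of /hbase/rs as shown in the ZK listing.
        Set<String> zk = new HashSet<>(Arrays.asList(
                "haus04.sf.cloudera.com,60020,1290498411533",
                "haus05.sf.cloudera.com,60020,1290498411520",
                "haus03.sf.cloudera.com,60020,1290498411518",
                "haus01.sf.cloudera.com,60020,1290500443143",
                "haus02.sf.cloudera.com,60020,1290498411450"));

        // Heartbeats alone: the haus02 folder is wrongly selected.
        System.out.println(
                foldersToSplit(logDirs, heartbeats, Collections.<String>emptySet()));
        // Heartbeats plus ZK: nothing is selected for splitting.
        System.out.println(foldersToSplit(logDirs, heartbeats, zk));
    }
}
```

With only the heartbeat set, the haus02 folder is selected for splitting, which matches the behavior in the log above. Once the ZK children are also consulted, neither folder is selected.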