[ 
https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-3266:
-------------------------

    Fix Version/s:     (was: 0.90.0)
                   0.92.0

Chatting with Jon, there is a problem here if zk does not agree with the 
master's list of online servers.  This would happen if an HRS is stuck.  We 
could add a reconcile to the master on startup such that if there is a 
discrepancy, the master could expire the HRS, in effect killing it.  This 
would be good (especially if it could be done in a non-racey way).
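
For what it's worth, a rough sketch of that kind of startup reconcile, 
assuming a hypothetical ServerExpirer hook in place of the real ServerManager 
expire path (this is not the actual 0.90 master API):

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class StartupReconcileSketch {

  /** Hypothetical stand-in for the master-side expire path. */
  interface ServerExpirer {
    void expireServer(String serverName);
  }

  /**
   * Compare the RS znodes under /hbase/rs against the servers the master
   * currently considers online; any server registered in ZK but unknown to
   * the master gets expired, in effect killing a stuck HRS.
   */
  static void reconcile(ZooKeeper zk, Set<String> onlineServers,
      ServerExpirer expirer) throws KeeperException, InterruptedException {
    List<String> zkServers = zk.getChildren("/hbase/rs", false);
    Set<String> unknown = new HashSet<String>(zkServers);
    unknown.removeAll(onlineServers);
    for (String serverName : unknown) {
      // Discrepancy: registered in ZK but never checked in with the master.
      expirer.expireServer(serverName);
    }
  }
}
{code}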

But the thought is that this condition should be extremely rare, especially 
since HBASE-3265 went in, AND given that we'd like to do away with 
heartbeating altogether, let's just punt this out to 0.92 rather than hack up 
some messy reconcile.

Moving out.  Please move back in if you disagree.

> Master does not seem to properly scan ZK for running RS during startup
> ----------------------------------------------------------------------
>
>                 Key: HBASE-3266
>                 URL: https://issues.apache.org/jira/browse/HBASE-3266
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>             Fix For: 0.92.0
>
>
> I was in the situation described by HBASE-3265, where I had a number of RS 
> waiting on ROOT, but the master hadn't seen any RS checkins and so was 
> itself still waiting. To get past this, I restarted one of the region 
> servers. The restarted server checked in, and the master began its startup.
> At this point the master started scanning /hbase/.logs for things to split. 
> It correctly identified that the RS on haus01 was running (this is the one I 
> restarted):
> 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143
>  belongs to an existing region server
> but then incorrectly decided that the RS on haus02 was down:
> 2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
> Log folder 
> hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450
>  doesn't belong to a known region server, splitting
> However ZK shows that this RS is up:
> [zk: haus01.sf.cloudera.com:2222(CONNECTED) 3] ls /hbase/rs
> [haus04.sf.cloudera.com,60020,1290498411533, 
> haus05.sf.cloudera.com,60020,1290498411520, 
> haus03.sf.cloudera.com,60020,1290498411518, 
> haus01.sf.cloudera.com,60020,1290500443143, 
> haus02.sf.cloudera.com,60020,1290498411450]
> splitLogsAfterStartup seems to check ServerManager.onlineServers, which as 
> best I can tell is derived from heartbeats and not from ZK (sorry if I got 
> some of this wrong, I'm still new to this new codebase).
> Of course, the master went into an infinite splitting loop at this point 
> since haus02 is up and renewing its DFS lease on its logs.
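
For illustration, a minimal sketch of the log-folder check described above, 
but keyed off the /hbase/rs znodes rather than the heartbeat-derived 
ServerManager.onlineServers map; the directory walk and splitLog() placeholder 
are assumptions, not the actual MasterFileSystem code:

{code:java}
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.zookeeper.ZooKeeper;

public class SplitCheckSketch {

  /**
   * Walk the .logs directory and split only the log folders whose owning
   * server name (e.g. "haus02.sf.cloudera.com,60020,1290498411450") has no
   * live znode under /hbase/rs.
   */
  static void splitDeadServerLogs(FileSystem fs, Path logsDir, ZooKeeper zk)
      throws Exception {
    Set<String> liveServers =
        new HashSet<String>(zk.getChildren("/hbase/rs", false));
    for (FileStatus status : fs.listStatus(logsDir)) {
      String serverName = status.getPath().getName();
      if (liveServers.contains(serverName)) {
        // Log folder belongs to an existing region server; leave it alone.
        continue;
      }
      // No znode for this server, so it is presumed dead and its logs split.
      splitLog(status.getPath());
    }
  }

  /** Placeholder for the actual log-splitting call. */
  static void splitLog(Path logDir) {
  }
}
{code}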
