I just created https://issues.apache.org/jira/browse/HBASE-2174

We handle addresses in different ways depending on which part of the
code you're in. We should correct that everywhere by implementing a
solution that also solves what you guys are seeing.

J-D

On Fri, Jan 29, 2010 at 8:33 AM, Kannan Muthukkaruppan
<kan...@facebook.com> wrote:
> @Joy: The info stored in .META. for the various regions, as well as in the
> ephemeral nodes for region servers in zookeeper, is already IP address
> based. So it doesn't look like multi-homing or the other flexibilities you
> mention were design goals, as far as I can tell.
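>
> As a quick sanity check on that claim -- an illustrative sketch against the
> 0.20 client API (classes from org.apache.hadoop.hbase.client and
> org.apache.hadoop.hbase.util.Bytes), not part of any patch -- scanning the
> info:server column of .META. prints ip:port values, not hostnames:
>
>   // Print the info:server cell for every region in .META.
>   HTable meta = new HTable(new HBaseConfiguration(), ".META.");
>   Scan scan = new Scan();
>   scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("server"));
>   ResultScanner scanner = meta.getScanner(scan);
>   for (Result r : scanner) {
>     // e.g. "10.129.68.204:60020"
>     System.out.println(Bytes.toString(
>         r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"))));
>   }
>   scanner.close();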
>
> Regarding: <<< doesn't the reverse ip lookup happen just once at RS startup
> time? >>>, what seems to be happening is this:
>
> A region server periodically sends a regionServerReport (an RPC call) to the
> master. An HServerInfo object is passed as an argument; it identifies the
> sending region server in IP address format.
>
> The master, in the ServerManager class, maintains a serversToServerInfo map
> which is hostname based. Every time the master receives a regionServerReport
> it converts the IP address based name to a hostname via the
> info.getServerName() call. Normally this call returns the hostname, but we
> suspect that during the DNS flakiness it returned an IP address based
> string. This caused ServerManager.java to think that it was hearing from a
> new server, which led to:
>
>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>   if (storedInfo == null) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Received report from unknown server -- telling it " +  <<============
>         "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());     <<============
>     }
>
> and bad things down the road.
>
> The above error message in our logs (example below) indeed identified the
> host in IP address syntax, even though the getServerName call normally
> returns the name in hostname format.
>
> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 
> 10.129.68.203,60020,1263605543210
>
> This affected three of our test clusters at the same time!
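>
> This would be consistent with how the JDK behaves: InetAddress.getHostName()
> performs a reverse lookup, and if the name service cannot resolve the
> address it silently falls back to the textual IP. A minimal sketch of the
> suspected behavior (plain JDK, not HBase code):
>
>   InetAddress addr = InetAddress.getByName("10.129.68.203");
>   // With healthy DNS this prints the reverse-mapped hostname; during a
>   // DNS outage it prints "10.129.68.203" -- the same style of string the
>   // master logged for the "unknown server".
>   System.out.println(addr.getHostName());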
>
> Perhaps all we need to do is change ServerManager's internal maps to be IP
> based? That way we avoid the master having to look up the hostname on every
> heartbeat.
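>
> Something along these lines, purely as an illustration (not a tested patch;
> it assumes HServerInfo.getServerAddress() is suitable as a map key):
>
>   // Key the map by the HServerAddress the region server already reports,
>   // so the heartbeat path never needs a reverse DNS lookup.
>   Map<HServerAddress, HServerInfo> serversToServerInfo =
>       new ConcurrentHashMap<HServerAddress, HServerInfo>();
>   ...
>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerAddress());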
>
> regards,
> Kannan
> ________________________________________
> From: Joydeep Sarma [jsensa...@gmail.com]
> Sent: Friday, January 29, 2010 1:20 AM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: Cannot locate root region
>
> hadoop also uses the hostnames. if a host is multi-homed - its
> hostname is a better identifier (which still allows it to use
> different nics/ips for actual traffic). it can help in the case the
> cluster is migrated for example (all the ips change). one could have
> the same hostname resolve to different ips depending on who's doing
> the lookup (this happens in AWS where the same elastic hostname
> resolves to a private or public ip depending on where the peer is. so
> clients can talk from outside AWS via public ips and the master etc.
> can talk over private ips).
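>
> a toy illustration of that last point (the hostname is just a placeholder):
> the same code returns whatever the local resolver hands back, so inside aws
> you'd see the private ip and outside the public one.
>
>   // print every A record the local resolver returns for one name
>   for (InetAddress a : InetAddress.getAllByName("ec2-x.compute.amazonaws.com")) {
>     System.out.println(a.getHostAddress());
>   }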
>
> so lots of reasons i guess. doesn't the reverse ip lookup happen just once
> at RS startup time? (wondering how this reconciles with the DNS being
> flaky after the cluster was up and running).
>
> On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan
> <kranganat...@facebook.com> wrote:
>>
>> We did some more digging into this and here is the theory.
>>
>> 1. The regionservers use their local ip to look up their hostnames and pass
>> that to the HMaster. The HMaster finds the server info by using this
>> hostname as the key in a HashMap.
>>
>> HRegionServer.java, in reinitialize() -
>>   this.serverInfo = new HServerInfo(new HServerAddress(
>>       new InetSocketAddress(address.getBindAddress(),
>>           this.server.getListenerAddress().getPort())),
>>       System.currentTimeMillis(),
>>       this.conf.getInt("hbase.regionserver.info.port", 60030), machineName);
>>
>> In run() -
>> HMsg msgs[] = hbaseMaster.regionServerReport(
>>              serverInfo, outboundArray, getMostLoadedRegions());
>>
>>
>> 2. I have observed in the past that DNS flakiness can cause the IP
>> addresses of machines to be returned as their hostnames. Guessing this is
>> what happened.
>>
>>
>> 3. The HMaster looks in the map for the above IP address (masquerading as
>> the server name) and does not find an entry. So it assumes that this is a
>> new region server and issues a CALL_SERVER_STARTUP.
>>
>>
>> 4. The region server that receives it is in fact already running (under its 
>> real hostname) and enters the "HMaster panic" mode and bad stuff happens.
>>
>> ServerManager.java in regionServerReport() -
>>    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>    if (storedInfo == null) {
>>      // snip...
>>      return new HMsg[] {CALL_SERVER_STARTUP};
>>    }
>>
>>
>> Any reason why we use the hostname instead of the ip address in the map that 
>> stores the regionserver info?
>>
>> Thanks
>> Karthik
>>
>>
>> -----Original Message-----
>> From: Karthik Ranganathan [mailto:kranganat...@facebook.com]
>> Sent: Thursday, January 28, 2010 3:58 PM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Cannot locate root region
>>
>> Hey guys,
>>
>> Ran into some issues while testing and wanted to better understand what
>> happened. Got the following exception when I went to the web UI:
>>
>> Trying to contact region server 10.129.68.204:60020 for region .META.,,1, 
>> row '', but failed after 3 attempts.
>> Exceptions:
>> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
>>        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>
>>
>> From a program that reads from an HBase table:
>> java.lang.reflect.UndeclaredThrowableException
>>        at $Proxy1.getRegionInfo(Unknown Source)
>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
>>        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
>> <snip>
>>
>>
>> Followed up on the HMaster's log:
>>
>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: 
>> RegionManager.metaScanner scan of 1 row(s) of meta region {server: 
>> 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 
>> 1 .META. region(s) scanned
>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
>> Received report from unknown server -- telling it to 
>> MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>> 2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: 
>> Received start message from: 
>> hbasetest004.ash1.facebook.com,60020,1264706494600
>> 2010-01-28 11:21:36,649 DEBUG 
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode 
>> /hbase/rs/1264706494600 with data 10.129.68.203:60020
>> 2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server 
>> handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME 
>> => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', 
>> BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 
>> 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: 
>> test1
>> org.apache.hadoop.hbase.TableExistsException: test1
>>        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
>>        at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
>>        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>
>> From an HRegionServer's logs:
>>
>> 2010-01-28 11:20:22,589 DEBUG 
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: 
>> Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB 
>> (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, 
>> Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>> 2010-01-28 11:21:22,588 DEBUG 
>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: 
>> Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB 
>> (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, 
>> Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>> 2010-01-28 11:22:18,794 INFO 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP
>>
>>
>> The code says the following:
>>              case MSG_CALL_SERVER_STARTUP:
>>                // We get the MSG_CALL_SERVER_STARTUP on startup but we can also
>>                // get it when the master is panicking because for instance
>>                // the HDFS has been yanked out from under it.  Be wary of
>>                // this message.
>>
>> Any ideas on what is going on? The best I can come up with is perhaps a 
>> flaky DNS - would that explain this? This happened on three of our test 
>> clusters at almost the same time. Also, what is the most graceful/simplest 
>> way to recover from this?
>>
>>
>> Thanks
>> Karthik
>>
>>
>
