On Fri, Jan 29, 2010 at 11:29 AM, Joydeep Sarma <jsensa...@gmail.com> wrote:
> i meant even if we were using hostnames for RS registration (which i think has a lot of advantages - not necessarily in our environment though) -

Agreed, we should use hostnames for the advantages listed earlier in this thread...

> the master processing of the heartbeat (or whatever it's processing) shouldn't require a forward lookup. if it needs the ip address - it already has that via the connection object.
>

Agreed. There is an old issue to address the current dumb lookup on each heartbeat, which opens us to damage if DNS starts flapping: https://issues.apache.org/jira/browse/HBASE-1679. It's not yet fixed.

St.Ack

> On Fri, Jan 29, 2010 at 11:19 AM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>> Yup totally - either name or ip would work. Not sure if there is a pro or a con to choosing either one - but thought it better to use the ip as that always remains the same (no resolve required) and is used to open the sockets.
>>
>> @jd-cryans: Saw your JIRA update: "One example of weirdness is when the region server is told which address to use according to the master:"
>>
>> Was meaning to ask about that too :) it's all good now.
>>
>> Thanks
>> Karthik
>>
>>
>> -----Original Message-----
>> From: Joydeep Sarma [mailto:jsensa...@gmail.com]
>> Sent: Friday, January 29, 2010 11:01 AM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: Cannot locate root region
>>
>> hmmm .. if the master doesn't need the RS ip address at this point - seems like it should be able to use the hostname offered by the RS directly?
>>
>> On Fri, Jan 29, 2010 at 10:44 AM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>>> The master does another lookup, independent of the region server, using the hostname given by the region server:
>>>
>>> ServerManager.java, regionServerReport() does:
>>> HServerInfo storedInfo = serversToServerInfo.get(info.getServerName()); // info.getServerName() is hostname
>>>
>>> Which eventually does:
>>> HServerAddress.getHostname()
>>>
>>> HServerAddress' constructor creates the InetSocketAddress from the hostname:port, which involves mapping the hostname to the ip address using a lookup.
>>>
>>> Thanks
>>> Karthik
>>>
>>>
>>> -----Original Message-----
>>> From: Joydeep Sarma [mailto:jsensa...@gmail.com]
>>> Sent: Friday, January 29, 2010 9:46 AM
>>> To: hbase-dev@hadoop.apache.org
>>> Subject: Re: Cannot locate root region
>>>
>>> @Kannan - Karthik's mail said the reverse lookup happens in the RS (not the master). the master simply tried to match the offered hostname.
>>>
>>> i don't know whose reading is right - but if it's the RS - i didn't understand why that wasn't just the reverse lookup done once at bootstrap time (which wouldn't be affected by ongoing DNS badness).
>>>
>>>
>>> On Fri, Jan 29, 2010 at 9:39 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>> I just created https://issues.apache.org/jira/browse/HBASE-2174
>>>>
>>>> We handle addresses in different ways depending on which part of the code you're in. We should correct that everywhere by implementing a solution that also solves what you guys are seeing.
>>>>
>>>> J-D
>>>>
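For reference, the lookup Karthik describes above is the JDK's own behavior: constructing an InetSocketAddress from a hostname string resolves it eagerly, so doing that on every heartbeat repeats the forward DNS lookup. A minimal standalone sketch of the difference (plain JDK only, no HBase classes; "example.org" is just a placeholder):

    import java.net.InetSocketAddress;

    public class LookupDemo {
      public static void main(String[] args) {
        // This constructor performs a forward DNS lookup immediately, so
        // calling it for every report repeats the resolution.
        InetSocketAddress eager = new InetSocketAddress("example.org", 60020);
        System.out.println("resolved? " + !eager.isUnresolved());

        // createUnresolved() defers resolution entirely; alternatively the
        // address could be resolved once at registration time and cached.
        InetSocketAddress lazy =
            InetSocketAddress.createUnresolved("example.org", 60020);
        System.out.println("resolved? " + !lazy.isUnresolved());
      }
    }
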
>>>> On Fri, Jan 29, 2010 at 8:33 AM, Kannan Muthukkaruppan <kan...@facebook.com> wrote:
>>>>> @Joy: The info stored in .META. for the various regions, as well as in the ephemeral nodes for region servers in zookeeper, is already IP address based. So it doesn't look like multi-homing and/or the other flexibilities you mention were a design goal, as far as I can tell.
>>>>>
>>>>> Regarding <<< doesn't the reverse ip lookup happen just once at RS startup time? >>>, what seems to be happening is this:
>>>>>
>>>>> A regionServer periodically sends a regionServerReport (RPC call) to the master. An HServerInfo object is passed as an argument and it identifies the sending region server in IP address format.
>>>>>
>>>>> The master, in the ServerManager class, maintains a serversToServerInfo map which is hostname based. Every time the master receives a regionServerReport it converts the IP address based name to a hostname via the info.getServerName() call. Normally this call returns the hostname, but we suspect that during the DNS flakiness it returned an IP address based string. This caused ServerManager.java to think that it was hearing from a new server, and led to:
>>>>>
>>>>>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>>>>   if (storedInfo == null) {
>>>>>     if (LOG.isDebugEnabled()) {
>>>>>       LOG.debug("Received report from unknown server -- telling it " +    <<============
>>>>>         "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());       <<============
>>>>>     }
>>>>>
>>>>> and bad things down the road.
>>>>>
>>>>> The above error message in our logs (example below) indeed identified the host in IP address syntax, even though normally the getServerName call would return the info in hostname format.
>>>>>
>>>>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>>>>>
>>>>> This affected three of our test clusters at the same time!
>>>>>
>>>>> Perhaps all we need to do is change the ServerManager's internal maps to all be IP based? That way we avoid/bypass the master having to look up the hostname on every heartbeat.
>>>>>
>>>>> regards,
>>>>> Kannan
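To make the key mismatch Kannan describes concrete, here is a toy, self-contained sketch (not HBase code; the map and values are made up, and the host/IP strings are borrowed from the logs in this thread purely for illustration). The registry is keyed by a "name,port,startcode" string that is normally hostname based; if the name degrades to the textual IP, the lookup misses and the server looks unknown:

    import java.util.HashMap;
    import java.util.Map;

    public class ServerKeyMismatchDemo {
      public static void main(String[] args) {
        // Registry keyed by "name,port,startcode", analogous to the
        // master's serversToServerInfo map keyed by getServerName().
        Map<String, String> servers = new HashMap<String, String>();
        servers.put("hbasetest004.ash1.facebook.com,60020,1263605543210", "known region server");

        // Normal case: the report resolves to the same hostname-based key.
        System.out.println(servers.containsKey(
            "hbasetest004.ash1.facebook.com,60020,1263605543210"));   // prints true

        // DNS trouble: the name comes back as the textual IP, the key no
        // longer matches, and the server is treated as unknown.
        System.out.println(servers.containsKey(
            "10.129.68.203,60020,1263605543210"));                    // prints false
      }
    }
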
>>>>> ________________________________________
>>>>> From: Joydeep Sarma [jsensa...@gmail.com]
>>>>> Sent: Friday, January 29, 2010 1:20 AM
>>>>> To: hbase-dev@hadoop.apache.org
>>>>> Subject: Re: Cannot locate root region
>>>>>
>>>>> hadoop also uses the hostnames. if a host is multi-homed - its hostname is a better identifier (which still allows it to use different nics/ips for actual traffic). it can help in the case the cluster is migrated, for example (all the ips change). one could have the same hostname resolve to different ips depending on who's doing the lookup (this happens in AWS, where the same elastic hostname resolves to a private or public ip depending on where the peer is. so clients can talk from outside AWS via public ips and the master etc. can talk over private ips).
>>>>>
>>>>> so lots of reasons i guess. doesn't the reverse ip lookup happen just once at RS startup time? (wondering how this reconciles with the DNS being flaky after the cluster was up and running).
>>>>>
>>>>> On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>>>>>>
>>>>>> We did some more digging into this and here is the theory.
>>>>>>
>>>>>> 1. The regionservers use their local ip to look up their hostnames and pass that to the HMaster. The HMaster finds the server info by using this hostname as the key in the HashMap.
>>>>>>
>>>>>> HRegionServer.java
>>>>>> reinitialize() -
>>>>>>     this.serverInfo = new HServerInfo(new HServerAddress(
>>>>>>       new InetSocketAddress(address.getBindAddress(),
>>>>>>       this.server.getListenerAddress().getPort())),
>>>>>>       System.currentTimeMillis(),
>>>>>>       this.conf.getInt("hbase.regionserver.info.port", 60030),
>>>>>>       machineName);
>>>>>>
>>>>>> In run() -
>>>>>>       HMsg msgs[] = hbaseMaster.regionServerReport(
>>>>>>         serverInfo, outboundArray, getMostLoadedRegions());
>>>>>>
>>>>>>
>>>>>> 2. I have observed in the past that there could be some DNS flakiness which causes the IP address of the machines to be returned as their hostnames. Guessing this is what happened.
>>>>>>
>>>>>>
>>>>>> 3. The HMaster looks in the map for the above IP address (masquerading as the server name). It does a get and does not find the entry in its map. So it assumes that this is a new region server and issues a CALL_SERVER_STARTUP.
>>>>>>
>>>>>>
>>>>>> 4. The region server that receives it is in fact already running (under its real hostname), enters the "HMaster panic" mode, and bad stuff happens.
>>>>>>
>>>>>> ServerManager.java, in regionServerReport() -
>>>>>>       HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>>>>>       if (storedInfo == null) {
>>>>>>         // snip...
>>>>>>         return new HMsg[] {CALL_SERVER_STARTUP};
>>>>>>       }
>>>>>>
>>>>>>
>>>>>> Any reason why we use the hostname instead of the ip address in the map that stores the regionserver info?
>>>>>>
>>>>>> Thanks
>>>>>> Karthik
>>>>>>
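Step 2 of the theory is consistent with documented JDK behavior: when a reverse lookup fails, InetAddress.getHostName()/getCanonicalHostName() fall back to returning the textual IP address, so a machine name obtained this way can silently flip from hostname to IP while DNS is flaky. A small standalone illustration (plain JDK, not the actual HBase code path; the IP is taken from the logs in this thread just as an example):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class ReverseLookupDemo {
      public static void main(String[] args) throws UnknownHostException {
        // An address built from raw bytes carries no name, so asking for
        // its name forces a reverse DNS (PTR) lookup.
        InetAddress addr = InetAddress.getByAddress(
            new byte[] {10, (byte) 129, 68, (byte) 203});

        // If the reverse lookup succeeds this prints a hostname; if DNS is
        // flaky or there is no PTR record, the JDK returns the literal
        // "10.129.68.203" instead - the flip described in step 2 above.
        System.out.println(addr.getCanonicalHostName());
      }
    }
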
>>>>>> -----Original Message-----
>>>>>> From: Karthik Ranganathan [mailto:kranganat...@facebook.com]
>>>>>> Sent: Thursday, January 28, 2010 3:58 PM
>>>>>> To: hbase-dev@hadoop.apache.org
>>>>>> Subject: Cannot locate root region
>>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> Ran into some issues while testing and wanted to understand what has happened better. Got the following exception when I went to the web UI:
>>>>>>
>>>>>> Trying to contact region server 10.129.68.204:60020 for region .META.,,1, row '', but failed after 3 attempts.
>>>>>> Exceptions:
>>>>>> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
>>>>>>         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>
>>>>>> From a program that reads from a HBase table:
>>>>>> java.lang.reflect.UndeclaredThrowableException
>>>>>>         at $Proxy1.getRegionInfo(Unknown Source)
>>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
>>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
>>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
>>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
>>>>>> <snip>
>>>>>>
>>>>>> Followed up on the hmaster's log:
>>>>>>
>>>>>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
>>>>>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>>>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>>>>>> 2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest004.ash1.facebook.com,60020,1264706494600
>>>>>> 2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
>>>>>> 2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
>>>>>> org.apache.hadoop.hbase.TableExistsException: test1
>>>>>>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
>>>>>>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
>>>>>>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>
>>>>>> From a hregionserver's logs:
>>>>>>
>>>>>> 2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>>>>>> 2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>>>>>> 2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP
>>>>>>
>>>>>> The code says the following:
>>>>>>           case MSG_CALL_SERVER_STARTUP:
>>>>>>             // We the MSG_CALL_SERVER_STARTUP on startup but we can also
>>>>>>             // get it when the master is panicking because for instance
>>>>>>             // the HDFS has been yanked out from under it. Be wary of
>>>>>>             // this message.
>>>>>>
>>>>>> Any ideas on what is going on? The best I can come up with is perhaps a flaky DNS - would that explain this? This happened on three of our test clusters at almost the same time. Also, what is the most graceful/simplest way to recover from this?
>>>>>>
>>>>>> Thanks
>>>>>> Karthik