Hadoop also uses the hostnames. If a host is multi-homed, its hostname is a better identifier (while still allowing it to use different NICs/IPs for the actual traffic). It also helps when a cluster is migrated, for example, and all the IPs change. And the same hostname can resolve to different IPs depending on who is doing the lookup: this happens in AWS, where an elastic hostname resolves to the private or the public IP depending on where the peer is, so clients can talk from outside AWS via public IPs while the master etc. talk over private IPs.
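Roughly what that looks like from the JDK side (a minimal standalone sketch, not HBase's actual DNS helper; the class name LookupDemo is made up here): the forward lookup can return several addresses for one name, and the reverse lookup falls back to the bare textual IP when the name can't be resolved, which is exactly the flaky-DNS symptom discussed below.

import java.net.InetAddress;
import java.net.UnknownHostException;

// Minimal standalone sketch (not HBase code): what forward and reverse
// DNS lookups return on the local machine.
public class LookupDemo {
  public static void main(String[] args) throws UnknownHostException {
    InetAddress local = InetAddress.getLocalHost();
    System.out.println("hostname:    " + local.getHostName());

    // Forward lookup: one name can map to several addresses (multi-homed
    // host, or split-horizon DNS answering differently per resolver, as
    // with AWS elastic hostnames).
    for (InetAddress a : InetAddress.getAllByName(local.getHostName())) {
      System.out.println("resolves to  " + a.getHostAddress());
    }

    // Reverse lookup: with healthy DNS this returns the FQDN; if the
    // reverse lookup fails, the JDK falls back to the textual IP address,
    // so the "hostname" ends up being the IP.
    InetAddress byIp = InetAddress.getByName(local.getHostAddress());
    System.out.println("reverse:     " + byIp.getCanonicalHostName());
  }
}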
So lots of reasons, I guess. Doesn't the reverse IP lookup happen just once, at RS startup time? (Wondering how that reconciles with the DNS being flaky after the cluster was up and running.)

On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>
> We did some more digging into this and here is the theory.
>
> 1. The regionservers use their local ip to look up their hostnames and pass that to the HMaster. The HMaster finds the server info by using this hostname as the key in the HashMap.
>
> HRegionServer.java, reinitialize() -
>   this.serverInfo = new HServerInfo(new HServerAddress(
>       new InetSocketAddress(address.getBindAddress(),
>       this.server.getListenerAddress().getPort())), System.currentTimeMillis(),
>       this.conf.getInt("hbase.regionserver.info.port", 60030), machineName);
>
> In run() -
>   HMsg msgs[] = hbaseMaster.regionServerReport(
>       serverInfo, outboundArray, getMostLoadedRegions());
>
> 2. I have observed in the past that there could be some DNS flakiness which causes the IP address of the machines to be returned as their hostnames. Guessing this is what happened.
>
> 3. The HMaster looks in the map for the above IP address (masquerading as the server name). It does a get and does not find the entry in its map. So it assumes that this is a new region server and issues a CALL_SERVER_STARTUP.
>
> 4. The region server that receives it is in fact already running (under its real hostname) and enters the "HMaster panic" mode and bad stuff happens.
>
> ServerManager.java, in regionServerReport() -
>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>   if (storedInfo == null) {
>     // snip...
>     return new HMsg[] {CALL_SERVER_STARTUP};
>   }
>
> Any reason why we use the hostname instead of the ip address in the map that stores the regionserver info?
>
> Thanks
> Karthik
>
>
> -----Original Message-----
> From: Karthik Ranganathan [mailto:kranganat...@facebook.com]
> Sent: Thursday, January 28, 2010 3:58 PM
> To: hbase-dev@hadoop.apache.org
> Subject: Cannot locate root region
>
> Hey guys,
>
> Ran into some issues while testing and wanted to understand what has happened better. Got the following exception when I went to the web UI:
>
> Trying to contact region server 10.129.68.204:60020 for region .META.,,1, row '', but failed after 3 attempts.
> Exceptions:
> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
>   at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
> From a program that reads from a HBase table:
> java.lang.reflect.UndeclaredThrowableException
>   at $Proxy1.getRegionInfo(Unknown Source)
>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
> <snip>
>
> Followed up on the hmaster's log:
>
> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
> 2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest004.ash1.facebook.com,60020,1264706494600
> 2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
> 2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
> org.apache.hadoop.hbase.TableExistsException: test1
>   at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
>   at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
>   at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
> From a hregionserver's logs:
>
> 2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
> 2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
> 2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP
>
> The code says the following:
>   case MSG_CALL_SERVER_STARTUP:
>     // We the MSG_CALL_SERVER_STARTUP on startup but we can also
>     // get it when the master is panicking because for instance
>     // the HDFS has been yanked out from under it. Be wary of
>     // this message.
>
> Any ideas on what is going on? The best I can come up with is perhaps a flaky DNS - would that explain this? This happened on three of our test clusters at almost the same time. Also, what is the most graceful/simplest way to recover from this?
>
> Thanks
> Karthik
>
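Following up on the theory above, a tiny standalone sketch of the map miss (plain java.util.HashMap and made-up strings that loosely echo the log lines, not the real ServerManager/HServerInfo code): the master registered the server under the hostname-based name it announced at startup, so a later report carrying an IP-based name (because the reverse lookup flaked) misses in the map and gets treated as an unknown server.

import java.util.HashMap;
import java.util.Map;

// Standalone illustration of the theory, NOT the real ServerManager code:
// the master keys its server map by the name the regionserver reports.
public class ServerMapSketch {
  public static void main(String[] args) {
    Map<String, String> serversToServerInfo = new HashMap<String, String>();

    // At startup the regionserver reported a hostname-based name.
    String registered = "hbasetest004.ash1.facebook.com,60020,1263605543210";
    serversToServerInfo.put(registered, "stored HServerInfo (placeholder)");

    // Later, a flaky reverse lookup makes the same server report an
    // IP-based name instead.
    String reported = "10.129.68.203,60020,1263605543210";

    if (serversToServerInfo.get(reported) == null) {
      // Mirrors the check in ServerManager.regionServerReport(): no entry
      // found, so the master answers MSG_CALL_SERVER_STARTUP even though
      // the regionserver is already running under its real hostname.
      System.out.println("Received report from unknown server -- telling it to "
          + "MSG_CALL_SERVER_STARTUP: " + reported);
    }
  }
}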