I just created https://issues.apache.org/jira/browse/HBASE-2174

We handle addresses in different ways depending on which part of the code
you're in. We should correct that everywhere by implementing a solution that
also solves what you guys are seeing.
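One possible direction (just a sketch; the class and method names below are
made up, not actual HBase code) would be to compute a single canonical server
key in one helper and use it everywhere, instead of letting each call site
choose between hostname and IP:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Hypothetical helper: compute the server's identity in exactly one place
    // and key every internal map off the same "host,port,startcode" string.
    public final class ServerKey {

      public static String normalize(String ipOrHost, int port, long startCode)
          throws UnknownHostException {
        // Resolve to the textual IP so that master and region server agree on
        // the key even when reverse DNS is flaky.
        String ip = InetAddress.getByName(ipOrHost).getHostAddress();
        return ip + "," + port + "," + startCode;
      }

      private ServerKey() {}
    }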
J-D

On Fri, Jan 29, 2010 at 8:33 AM, Kannan Muthukkaruppan <kan...@facebook.com> wrote:
> @Joy: The info stored in .META. for various regions as well as in the
> ephemeral nodes for region servers in zookeeper are both already IP address
> based. So it doesn't look like multi-homing and/or the other flexibilities
> you mention were a design goal, as far as I can tell.
>
> Regarding: <<< doesn't the reverse ip lookup happen just once at RS startup
> time? >>>, what seems to be happening is this:
>
> A regionserver periodically sends a regionServerReport (RPC call) to the
> master. An HServerInfo object is passed as an argument, and it identifies
> the sending region server in IP address format.
>
> The master, in the ServerManager class, maintains a serversToServerInfo map
> which is hostname based. Every time the master receives a regionServerReport
> it converts the IP address based name to a hostname via the
> info.getServerName() call. Normally this call returns the hostname, but we
> suspect that during the DNS flakiness it returned an IP address based
> string. And so, this caused ServerManager.java to think that it was hearing
> from a new server. And this led to:
>
> HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
> if (storedInfo == null) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Received report from unknown server -- telling it " +   <<============
>       "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());      <<============
>   }
>
> and bad things down the road.
>
> The above error message in our logs (example below) indeed identified the
> host in IP address syntax, even though normally the getServerName call would
> return the info in hostname format.
>
> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager:
> Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP:
> 10.129.68.203,60020,1263605543210
>
> This affected three of our test clusters at the same time!
>
> Perhaps all we need to do is to change the ServerManager's internal maps to
> all be IP based? That way we avoid/bypass the master having to look up the
> hostname on every heartbeat.
>
> regards,
> Kannan
> ________________________________________
> From: Joydeep Sarma [jsensa...@gmail.com]
> Sent: Friday, January 29, 2010 1:20 AM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: Cannot locate root region
>
> hadoop also uses the hostnames. if a host is multi-homed - its hostname is
> a better identifier (which still allows it to use different nics/ips for
> actual traffic). it can help in the case the cluster is migrated, for
> example (all the ips change). one could have the same hostname resolve to
> different ips depending on who's doing the lookup (this happens in AWS,
> where the same elastic hostname resolves to the private or public ip
> depending on where the peer is. so clients can talk from outside AWS via
> public ips and the master etc. can talk over private ips).
>
> so lots of reasons i guess. doesn't the reverse ip lookup happen just once
> at RS startup time? (wondering how this reconciles with the DNS being flaky
> after the cluster was up and running).
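To make the failure mode concrete, here is a stripped-down model of what
Kannan describes above and Karthik walks through below (plain JDK, invented
names and values, not the actual ServerManager code): the map is keyed by
whatever getServerName() returns, so one flaky reverse lookup makes a
long-running server look brand new.

    import java.util.HashMap;
    import java.util.Map;

    // Stripped-down model of the mismatch: the master's map is keyed by the
    // hostname form of the server name, but a report built during a DNS
    // hiccup carries the IP form, so the lookup misses.
    public class ServerMapMissDemo {
      public static void main(String[] args) {
        Map<String, String> serversToServerInfo = new HashMap<String, String>();

        // Entry created at startup, while reverse DNS was healthy (made-up name).
        serversToServerInfo.put("regionserver1.example.com,60020,1263605543210",
            "stored HServerInfo");

        // Key derived from a later heartbeat, while DNS was flaky: same
        // server, but identified by its IP address.
        String reportedName = "10.129.68.203,60020,1263605543210";

        String storedInfo = serversToServerInfo.get(reportedName);
        if (storedInfo == null) {
          // This is the branch that logs "Received report from unknown server"
          // and replies with MSG_CALL_SERVER_STARTUP.
          System.out.println("Received report from unknown server: " + reportedName);
        }
      }
    }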
> On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>>
>> We did some more digging into this and here is the theory.
>>
>> 1. The regionservers use their local ip to look up their hostnames and pass
>> that to the HMaster. The HMaster finds the server info by using this
>> hostname as the key in the HashMap.
>>
>> HRegionServer.java
>> reinitialize() -
>>   this.serverInfo = new HServerInfo(new HServerAddress(
>>     new InetSocketAddress(address.getBindAddress(),
>>       this.server.getListenerAddress().getPort())),
>>     System.currentTimeMillis(),
>>     this.conf.getInt("hbase.regionserver.info.port", 60030), machineName);
>>
>> In run() -
>>   HMsg msgs[] = hbaseMaster.regionServerReport(
>>     serverInfo, outboundArray, getMostLoadedRegions());
>>
>> 2. I have observed in the past that there could be some DNS flakiness which
>> causes the IP address of the machines to be returned as their hostnames.
>> Guessing this is what happened.
>>
>> 3. The HMaster looks in the map for the above IP address (masquerading as
>> the server name). It does not find the entry in its map, so it assumes that
>> this is a new region server and issues a CALL_SERVER_STARTUP.
>>
>> 4. The region server that receives it is in fact already running (under its
>> real hostname) and enters the "HMaster panic" mode, and bad stuff happens.
>>
>> ServerManager.java, in regionServerReport() -
>>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>   if (storedInfo == null) {
>>     // snip...
>>     return new HMsg[] {CALL_SERVER_STARTUP};
>>   }
>>
>> Any reason why we use the hostname instead of the ip address in the map
>> that stores the regionserver info?
>>
>> Thanks
>> Karthik
>>
>> -----Original Message-----
>> From: Karthik Ranganathan [mailto:kranganat...@facebook.com]
>> Sent: Thursday, January 28, 2010 3:58 PM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Cannot locate root region
>>
>> Hey guys,
>>
>> Ran into some issues while testing and wanted to understand what has
>> happened better. Got the following exception when I went to the web UI:
>>
>> Trying to contact region server 10.129.68.204:60020 for region .META.,,1,
>> row '', but failed after 3 attempts.
>> Exceptions:
>> org.apache.hadoop.hbase.NotServingRegionException:
>> org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
>>   at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>
>> From a program that reads from an HBase table:
>> java.lang.reflect.UndeclaredThrowableException
>>   at $Proxy1.getRegionInfo(Unknown Source)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
>> <snip>
>>
>> Followed up on the hmaster's log:
>>
>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> RegionManager.metaScanner scan of 1 row(s) of meta region {server:
>> 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner:
>> All 1 .META. region(s) scanned
>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager:
>> Received report from unknown server -- telling it to
>> MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>> 2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager:
>> Received start message from:
>> hbasetest004.ash1.facebook.com,60020,1264706494600
>> 2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
>> Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
>> 2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
>> handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME
>> => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647',
>> BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from
>> 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
>> org.apache.hadoop.hbase.TableExistsException: test1
>>   at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
>>   at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
>>   at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>
>> From a hregionserver's logs:
>>
>> 2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408),
>> Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0,
>> Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>> 2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408),
>> Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0,
>> Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>> 2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
>> MSG_CALL_SERVER_STARTUP
>>
>> The code says the following:
>>   case MSG_CALL_SERVER_STARTUP:
>>     // We get the MSG_CALL_SERVER_STARTUP on startup but we can also
>>     // get it when the master is panicking because for instance
>>     // the HDFS has been yanked out from under it. Be wary of
>>     // this message.
>>
>> Any ideas on what is going on? The best I can come up with is perhaps a
>> flaky DNS - would that explain this? This happened on three of our test
>> clusters at almost the same time. Also, what is the most graceful/simplest
>> way to recover from this?
>>
>> Thanks
>> Karthik
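On Joydeep's question about when the reverse lookup happens, and Kannan's
point about avoiding it on every heartbeat: one direction worth considering
(again just a sketch with invented names, not what the code does today) is to
resolve the region server's identity exactly once when it registers and reuse
that cached identity in every subsequent regionServerReport, so that a later
DNS hiccup cannot change the key the master looks up.

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Sketch: resolve the region server's identity exactly once, at startup,
    // and reuse the same immutable key for every regionServerReport afterwards.
    public final class CachedServerIdentity {

      private final String serverKey;  // "host,port,startcode"

      public CachedServerIdentity(String bindAddress, int port, long startCode)
          throws UnknownHostException {
        // One reverse lookup, at construction time; a failure here is loud and
        // immediate instead of silently changing the key on a later heartbeat.
        String host = InetAddress.getByName(bindAddress).getCanonicalHostName();
        this.serverKey = host + "," + port + "," + startCode;
      }

      public String getServerKey() {
        return serverKey;
      }
    }

Keying the master's maps by IP, as Kannan suggests, attacks the same problem
from the other side; either way the point is that the key should be computed
once, in one place.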