Yup, totally - either the name or the IP would work. Not sure if there is a pro or a con to choosing either one, but I thought it better to use the IP, as that always remains the same (no resolve required) and is what is used to open the sockets.
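To make the name-vs-IP trade-off concrete, here is a minimal, hypothetical sketch (not HBase code; the class, key format, and values below are made up). Keying the master's server map by the ip:port,startcode string the region server reported means no name resolution happens on any heartbeat:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical illustration only: an address-derived key is rebuilt from the
    // report itself, so the lookup never depends on DNS answering consistently.
    public class ServerKeySketch {

        static final Map<String, String> serversByAddress = new HashMap<>();

        // Key built purely from what the region server put in its report.
        static String addressKey(String ip, int port, long startCode) {
            return ip + ":" + port + "," + startCode;
        }

        public static void main(String[] args) {
            serversByAddress.put(addressKey("10.129.68.203", 60020, 1263605543210L),
                    "region server state");

            // A later report from the same server rebuilds the identical key,
            // so a flaky resolver cannot turn it into an "unknown server".
            System.out.println(serversByAddress.containsKey(
                    addressKey("10.129.68.203", 60020, 1263605543210L)));  // true
        }
    }

The flip side, raised further down the thread, is that an address key ties the server's identity to a single interface, which is what the multi-homing argument is about.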
@jd-cryans: Saw your JIRA update: "One example of weirdness is when the region server is told which address to use according to the master:" Was meaning to ask about that too :) It's all good now.

Thanks
Karthik

-----Original Message-----
From: Joydeep Sarma [mailto:jsensa...@gmail.com]
Sent: Friday, January 29, 2010 11:01 AM
To: hbase-dev@hadoop.apache.org
Subject: Re: Cannot locate root region

hmmm .. if the master doesn't need the RS ip address at this point, it seems like it should be able to use the hostname offered by the RS directly?

On Fri, Jan 29, 2010 at 10:44 AM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
> The master does another lookup, independent of the region server, using the hostname given by the region server:
>
> ServerManager.java, regionServerReport() does:
>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());  // info.getServerName() is the hostname
>
> Which eventually does:
>   HServerAddress.getHostname()
>
> HServerAddress' constructor creates the InetSocketAddress from the hostname:port, which involves mapping the hostname to the ip address via a lookup.
>
> Thanks
> Karthik
>
>
> -----Original Message-----
> From: Joydeep Sarma [mailto:jsensa...@gmail.com]
> Sent: Friday, January 29, 2010 9:46 AM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: Cannot locate root region
>
> @Kannan - Karthik's mail said the reverse lookup happens in the RS (not the master). The master simply tried to match the offered hostname.
>
> I don't know whose reading is right - but if it's the RS, I didn't understand why that wasn't just the reverse lookup done once at bootstrap time (which wouldn't be affected by ongoing DNS badness).
>
>
> On Fri, Jan 29, 2010 at 9:39 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>> I just created https://issues.apache.org/jira/browse/HBASE-2174
>>
>> We handle addresses in different ways depending on which part of the code you're in. We should correct that everywhere by implementing a solution that also solves what you guys are seeing.
>>
>> J-D
>>
>> On Fri, Jan 29, 2010 at 8:33 AM, Kannan Muthukkaruppan <kan...@facebook.com> wrote:
>>> @Joy: The info stored in .META. for the various regions, as well as in the ephemeral nodes for region servers in ZooKeeper, is already IP-address based in both places. So it doesn't look like multi-homing and/or the other flexibilities you mention were a design goal, as far as I can tell.
>>>
>>> Regarding <<< doesn't the reverse ip lookup happen just once at RS startup time? >>>, what seems to be happening is this:
>>>
>>> A region server periodically sends a regionServerReport (RPC call) to the master. An HServerInfo argument is passed along, and it identifies the sending region server in IP-address format.
>>>
>>> The master, in the ServerManager class, maintains a serversToServerInfo map which is hostname based. Every time the master receives a regionServerReport, it converts the IP-address-based name to a hostname via the info.getServerName() call. Normally this call returns the hostname, but we suspect that during the DNS flakiness it returned an IP-address-based string. And so, this caused ServerManager.java to think that it was hearing from a new server.
>>> And this led to:
>>>
>>>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>>   if (storedInfo == null) {
>>>     if (LOG.isDebugEnabled()) {
>>>       LOG.debug("Received report from unknown server -- telling it " +     <<============
>>>         "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());        <<============
>>>     }
>>>
>>> and bad things down the road.
>>>
>>> The above error message in our logs (example below) indeed identified the host in IP-address syntax, even though normally the getServerName() call would return the info in hostname format.
>>>
>>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>>>
>>> This affected three of our test clusters at the same time!
>>>
>>> Perhaps all we need to do is change the ServerManager's internal maps to all be IP based? That way we avoid/bypass the master having to look up the hostname on every heartbeat.
>>>
>>> regards,
>>> Kannan
>>> ________________________________________
>>> From: Joydeep Sarma [jsensa...@gmail.com]
>>> Sent: Friday, January 29, 2010 1:20 AM
>>> To: hbase-dev@hadoop.apache.org
>>> Subject: Re: Cannot locate root region
>>>
>>> Hadoop also uses the hostnames. If a host is multi-homed, its hostname is a better identifier (which still allows it to use different nics/ips for actual traffic). It can help in the case the cluster is migrated, for example (all the ips change). One could also have the same hostname resolve to different ips depending on who's doing the lookup (this happens in AWS, where the same elastic hostname resolves to a private or public ip depending on where the peer is, so clients can talk from outside AWS via public ips and the master etc. can talk over private ips).
>>>
>>> So lots of reasons, I guess. Doesn't the reverse ip lookup happen just once at RS startup time? (Wondering how this reconciles with the DNS being flaky after the cluster was up and running.)
>>>
>>> On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>>>>
>>>> We did some more digging into this and here is the theory.
>>>>
>>>> 1. The region servers use their local ip to look up their hostnames and pass that to the HMaster. The HMaster finds the server info by using this hostname as the key in the HashMap.
>>>>
>>>> HRegionServer.java, in reinitialize() -
>>>>   this.serverInfo = new HServerInfo(new HServerAddress(
>>>>       new InetSocketAddress(address.getBindAddress(),
>>>>         this.server.getListenerAddress().getPort())),
>>>>     System.currentTimeMillis(),
>>>>     this.conf.getInt("hbase.regionserver.info.port", 60030), machineName);
>>>>
>>>> In run() -
>>>>   HMsg msgs[] = hbaseMaster.regionServerReport(
>>>>     serverInfo, outboundArray, getMostLoadedRegions());
>>>>
>>>>
>>>> 2. I have observed in the past that there can be some DNS flakiness which causes the IP addresses of the machines to be returned as their hostnames. Guessing this is what happened.
>>>>
>>>>
>>>> 3. The HMaster looks in the map for the above IP address (masquerading as the server name) and does not find the entry. So it assumes that this is a new region server and issues a CALL_SERVER_STARTUP.
>>>>
>>>>
>>>> 4. The region server that receives it is in fact already running (under its real hostname), enters the "HMaster panic" mode, and bad stuff happens.
>>>>
>>>> ServerManager.java, in regionServerReport() -
>>>>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>>>   if (storedInfo == null) {
>>>>     // snip...
>>>>     return new HMsg[] {CALL_SERVER_STARTUP};
>>>>   }
>>>>
>>>>
>>>> Any reason why we use the hostname instead of the ip address in the map that stores the region server info?
>>>>
>>>> Thanks
>>>> Karthik
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Karthik Ranganathan [mailto:kranganat...@facebook.com]
>>>> Sent: Thursday, January 28, 2010 3:58 PM
>>>> To: hbase-dev@hadoop.apache.org
>>>> Subject: Cannot locate root region
>>>>
>>>> Hey guys,
>>>>
>>>> Ran into some issues while testing and wanted to understand better what happened. Got the following exception when I went to the web UI:
>>>>
>>>> Trying to contact region server 10.129.68.204:60020 for region .META.,,1, row '', but failed after 3 attempts.
>>>> Exceptions:
>>>> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
>>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
>>>>   at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>
>>>>
>>>> From a program that reads from an HBase table:
>>>> java.lang.reflect.UndeclaredThrowableException
>>>>   at $Proxy1.getRegionInfo(Unknown Source)
>>>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
>>>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
>>>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
>>>>   at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
>>>>   <snip>
>>>>
>>>>
>>>> Followed up in the HMaster's log:
>>>>
>>>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
>>>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>>>> 2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest004.ash1.facebook.com,60020,1264706494600
>>>> 2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
>>>> 2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
>>>> org.apache.hadoop.hbase.TableExistsException: test1
>>>>   at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
>>>>   at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
>>>>   at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>
>>>> From a region server's logs:
>>>>
>>>> 2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>>>> 2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>>>> 2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP
>>>>
>>>>
>>>> The code says the following:
>>>>   case MSG_CALL_SERVER_STARTUP:
>>>>     // We get the MSG_CALL_SERVER_STARTUP on startup but we can also
>>>>     // get it when the master is panicking because for instance
>>>>     // the HDFS has been yanked out from under it. Be wary of
>>>>     // this message.
>>>>
>>>> Any ideas on what is going on? The best I can come up with is perhaps flaky DNS - would that explain this? This happened on three of our test clusters at almost the same time. Also, what is the most graceful/simplest way to recover from this?
>>>>
>>>>
>>>> Thanks
>>>> Karthik
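To summarize the failure mode the thread converges on, here is a self-contained, hypothetical sketch (not HBase code; the hostname and key format are made up) of a hostname-keyed server map that stops matching once the resolver starts handing back IP-based strings:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical reproduction of the "unknown server" symptom described above.
    public class UnknownServerSketch {

        static final Map<String, String> serversToServerInfo = new HashMap<>();

        public static void main(String[] args) {
            // Registration while DNS is healthy: the key is hostname based.
            serversToServerInfo.put("hbasetest003.example.com,60020,1263605543210", "server info");

            // A later heartbeat during DNS flakiness identifies the same server
            // with an IP-based name, so the lookup misses ...
            String reported = "10.129.68.203,60020,1263605543210";
            if (serversToServerInfo.get(reported) == null) {
                // ... and the master would treat it as a new server and answer with
                // MSG_CALL_SERVER_STARTUP, which the already-running region server
                // handles as the panic case noted in the code comment above.
                System.out.println("Received report from unknown server: " + reported);
            }
        }
    }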