On Fri, Jan 29, 2010 at 11:29 AM, Joydeep Sarma <jsensa...@gmail.com> wrote:
> i meant even if we were using hostnames for RS registration (which i think has a lot of advantages - not necessarily in our environment though) -

Agreed, we should use hostnames for the advantages listed earlier in this thread...

> the master processing of the heartbeat (or whatever it's processing) shouldn't require a forward lookup. if it needs the ip address - it already has that via the connection object.
>

Agreed. There is an old issue to address the current dumb lookup on each heartbeat, which opens us to damage if DNS starts flapping: https://issues.apache.org/jira/browse/HBASE-1679. It's not yet fixed.

St.Ack

> On Fri, Jan 29, 2010 at 11:19 AM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>> Yup totally - either name or ip would work. Not sure if there is a pro or a con to choosing either one - but thought it better to use the ip as that always remains the same (no resolve required) and is used to open the sockets.
>>
>> @jd-cryans: Saw your JIRA update: "One example of weirdness is when the region server is told which address to use according to the master:"
>>
>> Was meaning to ask about that too :) it's all good now.
>>
>> Thanks
>> Karthik
>>
>>
>> -----Original Message-----
>> From: Joydeep Sarma [mailto:jsensa...@gmail.com]
>> Sent: Friday, January 29, 2010 11:01 AM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: Cannot locate root region
>>
>> hmmm .. if the master doesn't need the RS ip address at this point - seems like it should be able to use the hostname offered by the RS directly?
>>
>> On Fri, Jan 29, 2010 at 10:44 AM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>>> The master does another lookup, independent of the region server, using the hostname given by the region server:
>>>
>>> ServerManager.java, regionServerReport() does:
>>> HServerInfo storedInfo = serversToServerInfo.get(info.getServerName()); // info.getServerName() is hostname
>>>
>>> Which eventually does:
>>> HServerAddress.getHostname()
>>>
>>> HServerAddress' constructor creates the InetSocketAddress from the hostname:port, which involves mapping the hostname to the ip address using a lookup.
>>>
>>> Thanks
>>> Karthik
>>>
>>>
>>> -----Original Message-----
>>> From: Joydeep Sarma [mailto:jsensa...@gmail.com]
>>> Sent: Friday, January 29, 2010 9:46 AM
>>> To: hbase-dev@hadoop.apache.org
>>> Subject: Re: Cannot locate root region
>>>
>>> @Kannan - Karthik's mail said the reverse lookup happens in the RS (not the master). the master simply tried to match the offered hostname.
>>>
>>> i don't know whose reading is right - but if it's the RS - i didn't understand why that wasn't just the reverse lookup done once at bootstrap time (which wouldn't be affected by ongoing DNS badness).
>>>
>>>
>>> On Fri, Jan 29, 2010 at 9:39 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>> I just created https://issues.apache.org/jira/browse/HBASE-2174
>>>>
>>>> We handle addresses in different ways depending on which part of the code you're in. We should correct that everywhere by implementing a solution that also solves what you guys are seeing.
>>>>
>>>> J-D
>>>>
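For reference, the lookup Karthik describes above is the JDK's own behavior: constructing an InetSocketAddress from a hostname string resolves it eagerly, so doing that on every heartbeat repeats the forward DNS lookup. A minimal standalone sketch of the difference (plain JDK only, no HBase classes; "example.org" is just a placeholder):

    import java.net.InetSocketAddress;

    public class LookupDemo {
      public static void main(String[] args) {
        // This constructor performs a forward DNS lookup immediately, so
        // calling it for every report repeats the resolution.
        InetSocketAddress eager = new InetSocketAddress("example.org", 60020);
        System.out.println("resolved? " + !eager.isUnresolved());

        // createUnresolved() defers resolution entirely; alternatively the
        // address could be resolved once at registration time and cached.
        InetSocketAddress lazy =
            InetSocketAddress.createUnresolved("example.org", 60020);
        System.out.println("resolved? " + !lazy.isUnresolved());
      }
    }
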
>>>> On Fri, Jan 29, 2010 at 8:33 AM, Kannan Muthukkaruppan <kan...@facebook.com> wrote:
>>>>> @Joy: The info stored in .META. for the various regions, as well as in the ephemeral nodes for region servers in zookeeper, is already IP address based. So it doesn't look like multi-homing and/or the other flexibilities you mention were a design goal, as far as I can tell.
>>>>>
>>>>> Regarding <<< doesn't the reverse ip lookup happen just once at RS startup time? >>>, what seems to be happening is this:
>>>>>
>>>>> A regionServer periodically sends a regionServerReport (RPC call) to the master. An HServerInfo object is passed as an argument and it identifies the sending region server in IP address format.
>>>>>
>>>>> The master, in the ServerManager class, maintains a serversToServerInfo map which is hostname based. Every time the master receives a regionServerReport it converts the IP address based name to a hostname via the info.getServerName() call. Normally this call returns the hostname, but we suspect that during the DNS flakiness it returned an IP address based string. This caused ServerManager.java to think that it was hearing from a new server, and led to:
>>>>>
>>>>>   HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>>>>   if (storedInfo == null) {
>>>>>     if (LOG.isDebugEnabled()) {
>>>>>       LOG.debug("Received report from unknown server -- telling it " +    <<============
>>>>>         "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());       <<============
>>>>>     }
>>>>>
>>>>> and bad things down the road.
>>>>>
>>>>> The above error message in our logs (example below) indeed identified the host in IP address syntax, even though normally the getServerName call would return the info in hostname format.
>>>>>
>>>>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>>>>>
>>>>> This affected three of our test clusters at the same time!
>>>>>
>>>>> Perhaps all we need to do is change the ServerManager's internal maps to all be IP based? That way we avoid/bypass the master having to look up the hostname on every heartbeat.
>>>>>
>>>>> regards,
>>>>> Kannan
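To make the key mismatch Kannan describes concrete, here is a toy, self-contained sketch (not HBase code; the map and values are made up, and the host/IP strings are borrowed from the logs in this thread purely for illustration). The registry is keyed by a "name,port,startcode" string that is normally hostname based; if the name degrades to the textual IP, the lookup misses and the server looks unknown:

    import java.util.HashMap;
    import java.util.Map;

    public class ServerKeyMismatchDemo {
      public static void main(String[] args) {
        // Registry keyed by "name,port,startcode", analogous to the
        // master's serversToServerInfo map keyed by getServerName().
        Map<String, String> servers = new HashMap<String, String>();
        servers.put("hbasetest004.ash1.facebook.com,60020,1263605543210", "known region server");

        // Normal case: the report resolves to the same hostname-based key.
        System.out.println(servers.containsKey(
            "hbasetest004.ash1.facebook.com,60020,1263605543210"));   // prints true

        // DNS trouble: the name comes back as the textual IP, the key no
        // longer matches, and the server is treated as unknown.
        System.out.println(servers.containsKey(
            "10.129.68.203,60020,1263605543210"));                    // prints false
      }
    }
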
>>>>> ________________________________________
>>>>> From: Joydeep Sarma [jsensa...@gmail.com]
>>>>> Sent: Friday, January 29, 2010 1:20 AM
>>>>> To: hbase-dev@hadoop.apache.org
>>>>> Subject: Re: Cannot locate root region
>>>>>
>>>>> hadoop also uses the hostnames. if a host is multi-homed - its hostname is a better identifier (which still allows it to use different nics/ips for actual traffic). it can help in the case the cluster is migrated, for example (all the ips change). one could have the same hostname resolve to different ips depending on who's doing the lookup (this happens in AWS, where the same elastic hostname resolves to a private or public ip depending on where the peer is. so clients can talk from outside AWS via public ips and the master etc. can talk over private ips).
>>>>>
>>>>> so lots of reasons i guess. doesn't the reverse ip lookup happen just once at RS startup time? (wondering how this reconciles with the DNS being flaky after the cluster was up and running).
>>>>>
>>>>> On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan <kranganat...@facebook.com> wrote:
>>>>>>
>>>>>> We did some more digging into this and here is the theory.
>>>>>>
>>>>>> 1. The regionservers use their local ip to look up their hostnames and pass that to the HMaster. The HMaster finds the server info by using this hostname as the key in the HashMap.
>>>>>>
>>>>>> HRegionServer.java
>>>>>> reinitialize() -
>>>>>>     this.serverInfo = new HServerInfo(new HServerAddress(
>>>>>>       new InetSocketAddress(address.getBindAddress(),
>>>>>>       this.server.getListenerAddress().getPort())),
>>>>>>       System.currentTimeMillis(),
>>>>>>       this.conf.getInt("hbase.regionserver.info.port", 60030),
>>>>>>       machineName);
>>>>>>
>>>>>> In run() -
>>>>>>       HMsg msgs[] = hbaseMaster.regionServerReport(
>>>>>>         serverInfo, outboundArray, getMostLoadedRegions());
>>>>>>
>>>>>>
>>>>>> 2. I have observed in the past that there could be some DNS flakiness which causes the IP address of the machines to be returned as their hostnames. Guessing this is what happened.
>>>>>>
>>>>>>
>>>>>> 3. The HMaster looks in the map for the above IP address (masquerading as the server name). It does a get and does not find the entry in its map. So it assumes that this is a new region server and issues a CALL_SERVER_STARTUP.
>>>>>>
>>>>>>
>>>>>> 4. The region server that receives it is in fact already running (under its real hostname), enters the "HMaster panic" mode, and bad stuff happens.
>>>>>>
>>>>>> ServerManager.java, in regionServerReport() -
>>>>>>       HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
>>>>>>       if (storedInfo == null) {
>>>>>>         // snip...
>>>>>>         return new HMsg[] {CALL_SERVER_STARTUP};
>>>>>>       }
>>>>>>
>>>>>>
>>>>>> Any reason why we use the hostname instead of the ip address in the map that stores the regionserver info?
>>>>>>
>>>>>> Thanks
>>>>>> Karthik
>>>>>>
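Step 2 of the theory is consistent with documented JDK behavior: when a reverse lookup fails, InetAddress.getHostName()/getCanonicalHostName() fall back to returning the textual IP address, so a machine name obtained this way can silently flip from hostname to IP while DNS is flaky. A small standalone illustration (plain JDK, not the actual HBase code path; the IP is taken from the logs in this thread just as an example):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class ReverseLookupDemo {
      public static void main(String[] args) throws UnknownHostException {
        // An address built from raw bytes carries no name, so asking for
        // its name forces a reverse DNS (PTR) lookup.
        InetAddress addr = InetAddress.getByAddress(
            new byte[] {10, (byte) 129, 68, (byte) 203});

        // If the reverse lookup succeeds this prints a hostname; if DNS is
        // flaky or there is no PTR record, the JDK returns the literal
        // "10.129.68.203" instead - the flip described in step 2 above.
        System.out.println(addr.getCanonicalHostName());
      }
    }
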
>>>>>> -----Original Message-----
>>>>>> From: Karthik Ranganathan [mailto:kranganat...@facebook.com]
>>>>>> Sent: Thursday, January 28, 2010 3:58 PM
>>>>>> To: hbase-dev@hadoop.apache.org
>>>>>> Subject: Cannot locate root region
>>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> Ran into some issues while testing and wanted to understand what has happened better. Got the following exception when I went to the web UI:
>>>>>>
>>>>>> Trying to contact region server 10.129.68.204:60020 for region .META.,,1, row '', but failed after 3 attempts.
>>>>>> Exceptions:
>>>>>> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: .META.,,1
>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2254)
>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1837)
>>>>>>         at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>
>>>>>> From a program that reads from a HBase table:
>>>>>> java.lang.reflect.UndeclaredThrowableException
>>>>>>         at $Proxy1.getRegionInfo(Unknown Source)
>>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:985)
>>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:625)
>>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:601)
>>>>>>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:675)
>>>>>> <snip>
>>>>>>
>>>>>> Followed up on the hmaster's log:
>>>>>>
>>>>>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 10.129.68.204:60020, regionname: .META.,,1, startKey: <>} complete
>>>>>> 2010-01-28 11:21:16,148 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>>>> 2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP: 10.129.68.203,60020,1263605543210
>>>>>> 2010-01-28 11:21:35,622 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: hbasetest004.ash1.facebook.com,60020,1264706494600
>>>>>> 2010-01-28 11:21:36,649 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1264706494600 with data 10.129.68.203:60020
>>>>>> 2010-01-28 11:21:40,704 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60000, call createTable({NAME => 'test1', FAMILIES => [{NAME => 'cf1', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}) from 10.131.29.183:63308: error: org.apache.hadoop.hbase.TableExistsException: test1
>>>>>> org.apache.hadoop.hbase.TableExistsException: test1
>>>>>>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:792)
>>>>>>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:756)
>>>>>>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>>>>
>>>>>> From a hregionserver's logs:
>>>>>>
>>>>>> 2010-01-28 11:20:22,589 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>>>>>> 2010-01-28 11:21:22,588 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=19.661453MB (20616528), Free=2377.0137MB (2492479408), Max=2396.675MB (2513095936), Counts: Blocks=0, Access=0, Hit=0, Miss=0, Evictions=0, Evicted=0, Ratios: Hit Ratio=NaN%, Miss Ratio=NaN%, Evicted/Run=NaN
>>>>>> 2010-01-28 11:22:18,794 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP
>>>>>>
>>>>>> The code says the following:
>>>>>>           case MSG_CALL_SERVER_STARTUP:
>>>>>>             // We the MSG_CALL_SERVER_STARTUP on startup but we can also
>>>>>>             // get it when the master is panicking because for instance
>>>>>>             // the HDFS has been yanked out from under it. Be wary of
>>>>>>             // this message.
>>>>>>
>>>>>> Any ideas on what is going on? The best I can come up with is perhaps a flaky DNS - would that explain this? This happened on three of our test clusters at almost the same time. Also, what is the most graceful/simplest way to recover from this?
>>>>>>
>>>>>> Thanks
>>>>>> Karthik