Thanks Bryan, you pointed me in the right direction. Now the problem is gone: I removed hbase.regionserver.dns.interface and hbase.master.dns.interface from my configs; both were set to eth0 before.
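For reference, the entries I removed looked roughly like this in my hbase-site.xml (reconstructed from memory, so treat it as a sketch of what I had rather than a verbatim copy):

<!-- Both properties forced HBase to pick its hostname from eth0;
     deleting them lets HBase fall back to the default interface/resolver. -->
<property>
  <name>hbase.master.dns.interface</name>
  <value>eth0</value>
</property>
<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth0</value>
</property>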
Here's a snippet of code that helped me find the problem by doing exactly the same DNS queries HBase does:

import java.net.InetSocketAddress;

import org.apache.hadoop.net.DNS;

class Address {
  public static void main(String[] args) {
    InetSocketAddress address;
    try {
      // Forward and reverse lookups for the "default" interface...
      String machineName = DNS.getDefaultHost("default", "default");
      System.out.println("my default machineName=" + machineName);
      address = new InetSocketAddress(machineName, 80);
      System.out.println("my default hostname=" + address.getAddress().getHostName()
          + " ip=" + address.getAddress().getHostAddress());
      address = new InetSocketAddress(address.getAddress().getHostAddress(), 80);
      System.out.println("my default reverse hostname=" + address.getAddress().getHostName()
          + " ip=" + address.getAddress().getHostAddress());

      // ...and the same lookups for eth0, which is what my old config forced HBase to use.
      machineName = DNS.getDefaultHost("eth0", "default");
      System.out.println("my eth0 machineName=" + machineName);
      address = new InetSocketAddress(machineName, 80);
      System.out.println("my eth0 hostname=" + address.getAddress().getHostName()
          + " ip=" + address.getAddress().getHostAddress());
      address = new InetSocketAddress(address.getAddress().getHostAddress(), 80);
      System.out.println("my eth0 reverse hostname=" + address.getAddress().getHostName()
          + " ip=" + address.getAddress().getHostAddress());
    } catch (java.net.UnknownHostException e) {
      throw new RuntimeException(e);
    }
  }
}

Just an idea: perhaps HBase could do a DNS sanity check when starting, by calling DNS.getDefaultHost(), resolving the hostname to an IP, resolving that IP back to a hostname, and comparing the result with the original name?
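Something along these lines is what I have in mind (just a rough sketch; the class name and the exact checks are mine, not anything that exists in HBase today):

import java.net.InetAddress;
import java.net.UnknownHostException;

import org.apache.hadoop.net.DNS;

class DnsStartupCheck {
  public static void main(String[] args) throws UnknownHostException {
    // Resolve the default hostname to an IP and back again,
    // then complain if the round trip does not end up at the same name.
    String hostname = DNS.getDefaultHost("default", "default");
    InetAddress addr = InetAddress.getByName(hostname);            // hostname -> IP
    String reverse = InetAddress.getByName(addr.getHostAddress())
        .getCanonicalHostName();                                   // IP -> hostname
    if (hostname.equalsIgnoreCase(reverse)) {
      System.out.println("DNS looks consistent: " + hostname + " <-> " + addr.getHostAddress());
    } else {
      System.err.println("DNS mismatch: " + hostname + " -> " + addr.getHostAddress()
          + " -> " + reverse + " (forward and reverse lookups disagree)");
    }
  }
}

It runs the same way as the snippet above, with the Hadoop jars on the classpath.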
Cheers,
-- Viktors

On Fri, May 14, 2010 at 8:48 AM, Bryan McCormick <br...@readpath.com> wrote:
> I had a similar upgrade experience from 20.3 to 20.4.
>
> The master started off continuously reassigning regions as quickly as it
> could. Looking at the Master web UI for listing a table, it listed the
> regions properly (spread across the regionservers). But looking at the
> individual regionservers' web UI (the list of tables on a regionserver) it
> appeared that each regionserver thought that it had a copy of every region.
> So the total number of regions reported was 5x normal for my 5 node cluster.
>
> After a little while of this continuous reassigning, it appears that the
> regionserver holding .META. would have an issue writing updates to HDFS and
> then force .META. to reassign. Looking at the logs, the only error on the
> regionserver was:
>
> 2010-05-07 22:34:21,699 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
> java.io.IOException: Cannot open filename /hbase/.META./1028785192/info/2937322648368577689
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1824)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
>         at java.io.DataInputStream.read(DataInputStream.java:132)
>         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1018)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:98)
>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:68)
>         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:72)
>         at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1304)
>         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.initHeap(HRegion.java:1850)
>         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1883)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1906)
>         at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
> And the only errors on the datanodes were (there were many of these, I'm
> including just one):
>
> 2010-05-07 22:34:21,701 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.0.0.61:50010, storageID=DS-548401723-10.0.0.61-50010-1258275076629,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-3471558578366937156_600043 is not valid.
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>         at java.lang.Thread.run(Thread.java:619)
>
> After this bad move of .META. I would get errors in the master log stating
> that HRegionInfo was empty for each region, so the regions were being deleted.
> A few minutes after this HBase reported through the web UI and through the
> hbase shell list command that there were no tables on my cluster. Luckily it
> didn't appear that data was erased, and a restart of hbase/hdfs started the
> whole process over again.
>
> Eventually I noticed after my third run through this that hbase seemed to be
> mixing IPs (10.0.0.61) and FQDNs (h1.readpath.com) in the log lines.
> So I made sure to add all hosts to each server's /etc/hosts and then push this out
> to all of the servers (instead of each server only having its own name in
> /etc/hosts, as had been working in 20.3). It appears that 20.4 might be more
> finicky about DNS resolution. Once I did this the master stopped continually
> reassigning the regions.
>
> Bryan
>
>
> On May 13, 2010, at 8:09 PM, Stack wrote:
>
>> What's the shell say? Does it see the tables consistently? Can you
>> count your content consistently?
>> St.Ack
>>
>> On Thu, May 13, 2010 at 4:53 PM, Viktors Rotanovs
>> <viktors.rotan...@gmail.com> wrote:
>>> Hi,
>>>
>>> after upgrading from 0.20.3 to 0.20.4 the list of tables almost
>>> immediately becomes inconsistent - master.jsp shows no tables even
>>> after creating a test table in the hbase shell, tables which were available
>>> before start randomly appearing and disappearing, etc. Upgrading was
>>> done by stopping, upgrading the code, and then starting (no dump/restore
>>> was done).
>>> I didn't investigate yet, just checking if somebody had the same
>>> problem or if I did the upgrade right (I had exactly the same issue in the
>>> past when trying to apply HBASE-2174 manually).
>>>
>>> Environment:
>>> Small tables, <100k rows
>>> Amazon EC2, "c1.xlarge" instance type with Ubuntu 9.10 and EBS root,
>>> HBase installed manually
>>> 1 master (namenode + jobtracker + master), 3 slaves (tasktracker +
>>> datanode + regionserver + zookeeper)
>>> Hadoop 0.20.1+169.68~1.karmic-cdh2 from Cloudera distribution
>>> Flaky DNS issue present, happens about once per day even with dnsmasq
>>> installed (heartbeat every 1s, dnsmasq forwards requests once per
>>> minute), DDNS set for internal hostnames.
>>>
>>> This is a testing cluster, nothing important on it.
>>>
>>> Cheers,
>>> -- Viktors
>>>

--
http://rotanovs.com - personal blog | http://www.hitgeist.com - fastest growing websites