Thanks Bryan, you pointed me in the right direction. Now the problem is gone: I removed hbase.regionserver.dns.interface and hbase.master.dns.interface from my configs; both were set to eth0 before.
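For reference, the entries I removed looked roughly like this in my hbase-site.xml (reconstructed from memory, so treat it as a sketch of what I had rather than a verbatim copy):

<!-- Both properties forced HBase to pick its hostname from eth0;
     deleting them lets HBase fall back to the default interface/resolver. -->
<property>
  <name>hbase.master.dns.interface</name>
  <value>eth0</value>
</property>
<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth0</value>
</property>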
Here's a snippet of code that helped me find the problem by doing exactly the same DNS queries HBase does:

import java.net.InetSocketAddress;

import org.apache.hadoop.net.DNS;

class Address {
  public static void main(String[] args) {
    InetSocketAddress address;
    try {
      // Forward and reverse lookups for the "default" interface...
      String machineName = DNS.getDefaultHost("default", "default");
      System.out.println("my default machineName=" + machineName);
      address = new InetSocketAddress(machineName, 80);
      System.out.println("my default hostname=" + address.getAddress().getHostName()
          + " ip=" + address.getAddress().getHostAddress());
      address = new InetSocketAddress(address.getAddress().getHostAddress(), 80);
      System.out.println("my default reverse hostname=" + address.getAddress().getHostName()
          + " ip=" + address.getAddress().getHostAddress());

      // ...and the same lookups for eth0, which is what my old config forced HBase to use.
      machineName = DNS.getDefaultHost("eth0", "default");
      System.out.println("my eth0 machineName=" + machineName);
      address = new InetSocketAddress(machineName, 80);
      System.out.println("my eth0 hostname=" + address.getAddress().getHostName()
          + " ip=" + address.getAddress().getHostAddress());
      address = new InetSocketAddress(address.getAddress().getHostAddress(), 80);
      System.out.println("my eth0 reverse hostname=" + address.getAddress().getHostName()
          + " ip=" + address.getAddress().getHostAddress());
    } catch (java.net.UnknownHostException e) {
      throw new RuntimeException(e);
    }
  }
}

Just an idea: perhaps HBase could do a DNS sanity check when starting, by calling DNS.getDefaultHost(), resolving the hostname to an IP, resolving that IP back to a hostname, and comparing the result with the original name?
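Something along these lines is what I have in mind (just a rough sketch; the class name and the exact checks are mine, not anything that exists in HBase today):

import java.net.InetAddress;
import java.net.UnknownHostException;

import org.apache.hadoop.net.DNS;

class DnsStartupCheck {
  public static void main(String[] args) throws UnknownHostException {
    // Resolve the default hostname to an IP and back again,
    // then complain if the round trip does not end up at the same name.
    String hostname = DNS.getDefaultHost("default", "default");
    InetAddress addr = InetAddress.getByName(hostname);            // hostname -> IP
    String reverse = InetAddress.getByName(addr.getHostAddress())
        .getCanonicalHostName();                                   // IP -> hostname
    if (hostname.equalsIgnoreCase(reverse)) {
      System.out.println("DNS looks consistent: " + hostname + " <-> " + addr.getHostAddress());
    } else {
      System.err.println("DNS mismatch: " + hostname + " -> " + addr.getHostAddress()
          + " -> " + reverse + " (forward and reverse lookups disagree)");
    }
  }
}

It runs the same way as the snippet above, with the Hadoop jars on the classpath.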
Cheers,
-- Viktors

On Fri, May 14, 2010 at 8:48 AM, Bryan McCormick <br...@readpath.com> wrote:
> I had a similar upgrade experience from 20.3 to 20.4.
>
> The master started off continuously reassigning regions as quickly as it
> could. Looking at the Master web UI for listing a table, it listed the
> regions properly (spread across the regionservers). But looking at the
> individual regionservers' web UI (the list of tables on a regionserver) it
> appeared that each regionserver thought that it had a copy of every region.
> So the total number of regions reported was 5x normal for my 5 node cluster.
>
> After a little while of this continuous reassigning, it appears that the
> regionserver holding .META. would have an issue writing updates to HDFS and
> then force .META. to reassign. Looking at the logs, the only error on the
> regionserver was:
>
> 2010-05-07 22:34:21,699 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read:
> java.io.IOException: Cannot open filename /hbase/.META./1028785192/info/2937322648368577689
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1824)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
>         at java.io.DataInputStream.read(DataInputStream.java:132)
>         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1018)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:98)
>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:68)
>         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:72)
>         at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1304)
>         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.initHeap(HRegion.java:1850)
>         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1883)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1906)
>         at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>
> And the only errors on the datanodes were (there were many of these, I'm
> including just one):
>
> 2010-05-07 22:34:21,701 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.0.0.61:50010, storageID=DS-548401723-10.0.0.61-50010-1258275076629,
> infoPort=50075, ipcPort=50020):DataXceiver
> java.io.IOException: Block blk_-3471558578366937156_600043 is not valid.
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
>         at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>         at java.lang.Thread.run(Thread.java:619)
>
> After this bad move of .META. I would get errors in the master log stating
> that HRegionInfo was empty for each region, so the regions were being deleted.
> A few minutes after this HBase reported through the web UI and through the
> hbase shell list command that there were no tables on my cluster. Luckily it
> didn't appear that data was erased, and a restart of hbase/hdfs started the
> whole process over again.
>
> Eventually I noticed after my third run through this that hbase seemed to be
> mixing IPs (10.0.0.61) and FQDNs (h1.readpath.com) in the log lines.
> So I made sure to add all hosts to each server's /etc/hosts and then push this out
> to all of the servers (instead of each server only having its own name in
> /etc/hosts, as had been working in 20.3). It appears that 20.4 might be more
> finicky about DNS resolution. Once I did this the master stopped continually
> reassigning the regions.
>
> Bryan
>
>
> On May 13, 2010, at 8:09 PM, Stack wrote:
>
>> What's the shell say? Does it see the tables consistently? Can you
>> count your content consistently?
>> St.Ack
>>
>> On Thu, May 13, 2010 at 4:53 PM, Viktors Rotanovs
>> <viktors.rotan...@gmail.com> wrote:
>>> Hi,
>>>
>>> after upgrading from 0.20.3 to 0.20.4 the list of tables almost
>>> immediately becomes inconsistent - master.jsp shows no tables even
>>> after creating a test table in the hbase shell, tables which were available
>>> before start randomly appearing and disappearing, etc. Upgrading was
>>> done by stopping, upgrading the code, and then starting (no dump/restore
>>> was done).
>>> I didn't investigate yet, just checking if somebody had the same
>>> problem or if I did the upgrade right (I had exactly the same issue in the
>>> past when trying to apply HBASE-2174 manually).
>>>
>>> Environment:
>>> Small tables, <100k rows
>>> Amazon EC2, "c1.xlarge" instance type with Ubuntu 9.10 and EBS root,
>>> HBase installed manually
>>> 1 master (namenode + jobtracker + master), 3 slaves (tasktracker +
>>> datanode + regionserver + zookeeper)
>>> Hadoop 0.20.1+169.68~1.karmic-cdh2 from Cloudera distribution
>>> Flaky DNS issue present, happens about once per day even with dnsmasq
>>> installed (heartbeat every 1s, dnsmasq forwards requests once per
>>> minute), DDNS set for internal hostnames.
>>>
>>> This is a testing cluster, nothing important on it.
>>>
>>> Cheers,
>>> -- Viktors
>>>

--
http://rotanovs.com - personal blog | http://www.hitgeist.com - fastest growing websites