I had a similar upgrade experience going from 0.20.3 to 0.20.4. The master started off continuously reassigning regions as quickly as it could. The master web UI listed a table's regions properly (spread across the regionservers), but looking at the individual regionservers' web UIs (the list of regions served by each regionserver), it appeared that every regionserver thought it had a copy of every region, so the total number of regions reported was 5x normal for my 5-node cluster.
After a little while of this continuous reassigning, the regionserver holding .META. would hit a problem writing updates to HDFS and force .META. to reassign. The only error in that regionserver's log was:

2010-05-07 22:34:21,699 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Cannot open filename /hbase/.META./1028785192/info/2937322648368577689
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1824)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1018)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
        at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1291)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:98)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:68)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:72)
        at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1304)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.initHeap(HRegion.java:1850)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1883)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1906)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

The only errors on the datanodes were (there were many of these; I'm including just one):

2010-05-07 22:34:21,701 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.0.0.61:50010, storageID=DS-548401723-10.0.0.61-50010-1258275076629, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Block blk_-3471558578366937156_600043 is not valid.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:734)
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.getLength(FSDataset.java:722)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:92)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:172)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
        at java.lang.Thread.run(Thread.java:619)

After this bad move of .META., I would get errors in the master log stating that HRegionInfo was empty for each region, so the regions were being deleted. A few minutes later HBase reported, both through the web UI and through the hbase shell's list command, that there were no tables on my cluster. Luckily it didn't appear that any data was erased, and a restart of HBase/HDFS started the whole process over again. On my third run through this cycle I noticed that HBase seemed to be mixing IPs (10.0.0.61) and FQDNs (h1.readpath.com) in the log lines.
So I made sure to add all hosts to each server's /etc/hosts and then pushed that out to all of the servers (instead of each server only having its own name in /etc/hosts, as had worked in 0.20.3). It appears that 0.20.4 might be more finicky about DNS resolution. Once I did this, the master stopped continually reassigning the regions. (A rough way to sanity-check resolution on each node is sketched after the quoted thread below.)

Bryan

On May 13, 2010, at 8:09 PM, Stack wrote:

> What's the shell say? Does it see the tables consistently? Can you
> count your content consistently?
> St.Ack
>
> On Thu, May 13, 2010 at 4:53 PM, Viktors Rotanovs
> <viktors.rotan...@gmail.com> wrote:
>> Hi,
>>
>> after upgrading from 0.20.3 to 0.20.4 the list of tables almost
>> immediately becomes inconsistent - master.jsp shows no tables even
>> after creating a test table in the hbase shell, tables which were
>> available before start randomly appearing and disappearing, etc.
>> Upgrading was done by stopping, upgrading the code, and then
>> starting (no dump/restore was done).
>> I didn't investigate yet, just checking if somebody had the same
>> problem or if I did the upgrade right (I had exactly the same issue
>> in the past when trying to apply HBASE-2174 manually).
>>
>> Environment:
>> Small tables, <100k rows
>> Amazon EC2, "c1.xlarge" instance type with Ubuntu 9.10 and EBS root,
>> HBase installed manually
>> 1 master (namenode + jobtracker + master), 3 slaves (tasktracker +
>> datanode + regionserver + zookeeper)
>> Hadoop 0.20.1+169.68~1.karmic-cdh2 from the Cloudera distribution
>> Flaky DNS issue present, happens about once per day even with dnsmasq
>> installed (heartbeat every 1s, dnsmasq forwards requests once per
>> minute), DDNS set for internal hostnames.
>>
>> This is a testing cluster, nothing important on it.
>>
>> Cheers,
>> -- Viktors
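P.S. In case it helps anyone hitting the same thing, here's a rough sketch of the resolution check I mean. The hostnames are placeholders (only h1.readpath.com appears above; the rest are made up, so substitute your own master and regionserver names). It just does a forward lookup for each node and then reverse-resolves the address, flagging any host where the two don't agree:

import java.net.InetAddress;

public class DnsSanityCheck {
    // Placeholder host list: every master and regionserver hostname,
    // matching what you put in /etc/hosts on each node, e.g.
    //   10.0.0.61  h1.readpath.com  h1
    private static final String[] HOSTS = {
        "h1.readpath.com", "h2.readpath.com", "h3.readpath.com"
    };

    public static void main(String[] args) throws Exception {
        for (String name : HOSTS) {
            // Forward lookup: hostname -> IP
            InetAddress forward = InetAddress.getByName(name);
            // Reverse lookup on the raw address: IP -> canonical hostname
            String reverse = InetAddress.getByAddress(forward.getAddress())
                                        .getCanonicalHostName();
            boolean consistent = name.equalsIgnoreCase(reverse);
            System.out.println(name + " -> " + forward.getHostAddress()
                + " -> " + reverse + (consistent ? " [ok]" : " [MISMATCH]"));
        }
    }
}

Run it on every node; if any host reverse-resolves to a bare IP or to a different name than you looked up, that's the same ip-vs-fqdn mixing that showed up in my logs.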