Please ignore my earlier response. That does look suspiciously like client error.
Can you correlate the block
/mnt/8d8aa987-017c-4f4c-822c-fe53f4163cee/current/subdir18/blk_-6090437519946022724
to a file?  And this file to a region?  What is in the master log as it
relates to this region around the time of the block errors?

- Andy

________________________________
From: elsif <[email protected]>
To: [email protected]
Sent: Wed, October 21, 2009 2:36:12 PM
Subject: Re: HBase Exceptions on version 0.20.1

stack wrote:
> On Wed, Oct 21, 2009 at 8:16 AM, elsif <[email protected]> wrote:
>
>> There are 239 "Block blk_-xxx is not valid" errors, 522 "BlockInfo not
>> found in volumeMap" errors, and 208 "BlockAlreadyExistsException" found
>> in the hadoop logs over 12 hours of running the test.
>>
>
> Above are from application-level (hbase) or datanode logs?  If you trace
> any of the above -- follow the block name -- in the NN are the blocks lost,
> or do you see replicas taking over or recoveries triggered?
>

The above error counts are from the datanode logs.  It looks like blocks
are being requested after they are deleted (HBASE-1894):

hdfs:2009-10-21 11:30:55,580 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.29.12:50010, storageID=DS-1209147803-10.1.29.12-50010-1651186044136, infoPort=50075, ipcPort=50020):Transmitted block blk_-6090437519946022724_469173 to /10.1.29.20:50010
hdfs:2009-10-21 11:30:59,938 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleting block blk_-6090437519946022724_469173 file /mnt/8d8aa987-017c-4f4c-822c-fe53f4163cee/current/subdir18/blk_-6090437519946022724
hdfs:2009-10-21 11:59:09,509 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.1.29.12:50010, storageID=DS-1209147803-10.1.29.12-50010-1651186044136, infoPort=50075, ipcPort=50020):Got exception while serving blk_-6090437519946022724_469173 to /10.1.29.14:
hdfs:java.io.IOException: Block blk_-6090437519946022724_469173 is not valid.

>> I understand that I am loading the cluster - that is the point of the
>> test - but I don't think that this should result in data loss.  Failed
>> inserts at the client level I can handle, but loss of data that was
>> previously thought to be stored in hbase is a major issue.  Are there
>> plans to make hbase more resilient to load-based failures?
>>
>
> It looks like there'll be data loss going by a few of the exceptions you
> provided originally.  Here are a couple of comments:
>
> "No live nodes contain current block"  Usually we see this if the
> client-side hadoop has not been patched with hdfs-127/hadoop-4681.  Your
> test program doesn't seem to have come across.  Mind attaching it to an
> issue so I can try it?  Going by the way you started your test program,
> you should have the hbase-patched hadoop first in your CLASSPATH, so you
> should be ok, but maybe there is something about your environment
> frustrating hbase's use of a patched hadoop?
>

The test program is short and is in-line at the end of this message.  In
this last run, the client was running against the hadoop-0.20.1-core.jar
instead of the hadoop-0.20.1-hdfs127-core.jar.  I will retest using the
hdfs-127 jar.  (A quick way to check which jar is actually being picked
up is sketched below.)

> "java.io.IOException: TIMED OUT"  Your regionserver or master timed out
> its zk session.  GC or swapping, or is the disk used by zk under heavy
> i/o load?
>

There are five zk servers in the test setup that share hardware with
nodes that are also datanodes and regionservers.  There are only 8
instances of this error, and all of them occurred within 30 seconds of
each other on the same node.
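One quick way to check which hadoop jar the client is actually loading is to
print the code source of a core HDFS class at startup.  This is a minimal
sketch; the CheckHadoopJar class name is made up for illustration and is not
part of the test program:

import org.apache.hadoop.hdfs.DFSClient;

// Hypothetical helper: prints the jar (or directory) that DFSClient was
// loaded from, so you can confirm whether the hdfs-127 patched hadoop jar
// is the one ahead on the CLASSPATH.
public class CheckHadoopJar {
  public static void main(String[] args) {
    System.out.println(DFSClient.class.getProtectionDomain()
        .getCodeSource().getLocation());
  }
}

If this prints the path to hadoop-0.20.1-core.jar rather than
hadoop-0.20.1-hdfs127-core.jar, the unpatched jar is ahead on the CLASSPATH.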
> "ClosedChannelException" Probably symptom of a RS shutdown because of events > such as above. > > "Abandoning block..." Did this write to HLog fail? Its just an INFO level > log out of DFSClient. > Since these are just INFO level messages, they are not an issue? hbase:2009-10-21 11:40:57,082 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-1365065510928989381_470690 hdfs:2009-10-21 11:40:57,080 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-1365065510928989381_470690 src: /10.1.29.23:56117 dest: /10.1.29.23:50010 hdfs:2009-10-21 11:40:57,081 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-1365065510928989381_470690 received exception org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/mnt/eac8b34a-a4a4-495b-8241-43cd4b21c28f/current/blk_-8854288022995097022_462986.meta': No such file or directory hbase:2009-10-21 11:36:12,609 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_4473876596208463127_470012 hdfs:2009-10-21 11:35:44,842 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_4473876596208463127_470012 src: /10.1.29.22:57916 dest: /10.1.29.14:50010 hdfs:2009-10-21 11:35:44,844 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_4473876596208463127_470012 received exception java.io.EOFException hdfs:2009-10-21 11:36:26,105 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_4473876596208463127_470012 src: /10.1.29.14:34537 dest: /10.1.29.21:50010 hdfs:2009-10-21 11:36:26,105 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_4473876596208463127_470012 received exception org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/mnt/e9726517-c84c-4326-8859-44f8261a14fa/current/subdir26/subdir62/blk_4151284369367444478': No such file or directory hdfs:2009-10-21 11:36:12,605 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_4473876596208463127_470012 src: /10.1.29.22:34103 dest: /10.1.29.22:50010 hdfs:2009-10-21 11:36:12,609 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_4473876596208463127_470012 2 Exception java.io.EOFException hdfs:2009-10-21 11:36:12,609 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_4473876596208463127_470012 terminating hbase:2009-10-21 06:47:10,189 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6810931018745175038_422166 hdfs:2009-10-21 06:47:10,171 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-6810931018745175038_422166 src: /10.1.29.12:44933 dest: /10.1.29.12:50010 hdfs:2009-10-21 06:47:10,188 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-6810931018745175038_422166 received exception java.io.EOFException hdfs:2009-10-21 06:46:43,362 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-6810931018745175038_422166 src: /10.1.29.12:48908 dest: /10.1.29.14:50010 hdfs:2009-10-21 06:46:43,362 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-6810931018745175038_422166 received exception org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/mnt/1a3e57e4-b514-4a85-83b1-ae9d12e3fb7a/current/subdir27/blk_-3769845572103484957_410554.meta': No such file or directory > "file system not available" What happened before this? Was this just an > emission on regionserver shutdown? > I don't have any "file system not available" messages this time around. 
> St.Ack
>

Here is the source for the test program.  For my tests I run two
instances with the default settings:

--
package org.apache.hadoop.hbase.util;

import java.io.IOException;
import java.io.FileInputStream;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/* create 'fc_test', {NAME => 'json', VERSIONS => 1}, {NAME => 'data', VERSIONS => 1} */

public class LoadTest extends Thread {

  protected int datasize = 4000000;

  public LoadTest(int datasize) {
    this.datasize = datasize;
  }

  public void run() {
    try {
      FileInputStream random = new FileInputStream("/dev/urandom");
      byte[] key = new byte[32];
      byte[] data = new byte[datasize];

      HBaseConfiguration hbaseConfiguration = new HBaseConfiguration();
      HTable table = new HTable(hbaseConfiguration, "fc_test");

      // One random payload per thread; a fresh random 32-byte row key per put.
      random.read(data, 0, datasize);
      while (true) {
        random.read(key, 0, 32);
        Put update = new Put(Bytes.toBytes(getHexString(key)));
        update.add(Bytes.toBytes("data"), Bytes.toBytes(""), data);
        table.put(update);
      }
    } catch (Exception e) {
      System.out.println(e.toString());
    }
  }

  public static String getHexString(byte[] b) {
    String result = "";
    for (int i = 0; i < b.length; i++) {
      result += Integer.toString((b[i] & 0xff) + 0x100, 16).substring(1);
    }
    return result;
  }

  public static void main(String args[]) throws IOException {
    Thread t = null;
    int i;
    int threads = 30;
    int datasize = 400000;

    // Optional args: thread count and payload size; fall back to defaults.
    try {
      threads = Integer.parseInt(args[0]);
      datasize = Integer.parseInt(args[1]);
    } catch (Exception e) {
    }

    System.out.println("Starting " + threads + " threads");

    for (i = 0; i < threads; i++) {
      t = new LoadTest(datasize);
      t.start();
    }

    // wait for last thread
    try {
      t.join();
    } catch (InterruptedException e) {
    }
  }
}
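As a cross-check on the data-loss concern above, a small read-back pass
against the same 'fc_test' table can show whether rows written earlier are
still retrievable.  This is only a sketch, assuming the row keys written by
the load test are known (for example, logged by a modified writer); the
VerifyRow class name is made up for illustration and uses the same 0.20
client API as the test program:

package org.apache.hadoop.hbase.util;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical read-back check: pass a hex row key that the load test
// previously wrote and report whether it can still be fetched from 'fc_test'.
public class VerifyRow {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "fc_test");
    Result result = table.get(new Get(Bytes.toBytes(args[0])));
    if (result == null || result.isEmpty()) {
      System.out.println("MISSING: " + args[0]);
    } else {
      byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes(""));
      System.out.println("OK: " + args[0] + " bytes="
          + (value == null ? 0 : value.length));
    }
  }
}

Running this over a sample of keys written before the block errors would show
whether any previously stored rows are actually unreadable.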
