If you can grep for '4841840178880951849' as well as /hbase/users/73382377/data/312780071564432169 across all of your datanode logs plus your NN, and put that online somewhere, that would be great. If you can grep with -C 20 to get some context that would help as well.
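Something along these lines should do it (the log paths below are only a guess for a typical CDH layout -- adjust for wherever your datanode/NN/regionserver logs actually live):

  # on each datanode:
  grep -C 20 -e 4841840178880951849 \
      -e /hbase/users/73382377/data/312780071564432169 \
      /var/log/hadoop/*datanode*.log*

  # on the NN:
  grep -C 20 -e 4841840178880951849 \
      -e /hbase/users/73382377/data/312780071564432169 \
      /var/log/hadoop/*namenode*.log*

  # on each regionserver, for the region mentioned below:
  grep -C 20 73382377 /var/log/hbase/*regionserver*.log*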
Grepping for the region in question (73382377) in the RS logs would also be
helpful.

Thanks
-Todd

On Fri, May 7, 2010 at 9:16 PM, James Baldassari <jbaldass...@gmail.com> wrote:

> On Sat, May 8, 2010 at 12:02 AM, Stack <st...@duboce.net> wrote:
>
> > On Fri, May 7, 2010 at 8:27 PM, James Baldassari <jbaldass...@gmail.com>
> > wrote:
> > > java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> >
> > This is the regionserver log? Is this deploying the region? It fails?
>
> This error is on the client accessing HBase. This exception was thrown on
> a get call to an HTable instance. I'm not sure if it was deploying the
> region. All I know is that the system had been running with all regions
> available (as far as I know), and then all of a sudden these errors
> started showing up on the client.
>
> > > Our cluster throughput goes from around 3k requests/second down to
> > > 500-1000 and does not recover without manual intervention. The region
> > > server log for that region says:
> > >
> > > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
> > > /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169
> > > for block -4841840178880951849:java.io.IOException: Got error in response
> > > to OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169
> > > for block -4841840178880951849
> > >
> > > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020,
> > > call get([...@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd,
> > > maxVersions=1, timeRange=[0,9223372036854775807),
> > > families={(family=data, columns=ALL}) from 10.24.117.100:2365: error:
> > > java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> > > java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> > >
> > > The datanode log for 10.24.166.74 says:
> > >
> > > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > > 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> > > infoPort=50075, ipcPort=50020):
> > > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
> > > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> >
> > What's your hadoop? Is it 0.20.2 or CDH? Any patches?
>
> Hadoop is vanilla CDH 2. HBase is 0.20.3 + HBASE-2180.
>
> > > Running a major compaction on the users table fixed the problem the
> > > first time it happened, but this time the major compaction didn't fix
> > > it, so we're in the process of rebooting the whole cluster. I'm
> > > wondering a few things:
> > >
> > > 1. What could trigger this problem?
> > > 2. Why can't the system fail over to another block/file/datanode/region
> > > server? We're using 3x replication in HDFS, and we have 8 data nodes
> > > which double as our region servers.
> > > 3. Are there any best practices for achieving high availability in an
> > > HBase cluster? How can I configure the system to gracefully (and
> > > automatically) handle these types of problems?
> >
> > Let us know what your hadoop is and then we figure more on the issues
> > above.
>
> If you need complete stack traces or any additional information, please
> let me know.
>
> > Thanks James,
> > St.Ack
> >
> > P.S. It's an eight node cluster on what kinda hw? (You've probably said
> > in the past and I can dig through mail -- just say -- and then what kind
> > of loading are you applying? Ditto for if you've said this already)

--
Todd Lipcon
Software Engineer, Cloudera
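For reference, the major compaction James mentions above can be kicked off from the stock hbase shell -- a minimal sketch, assuming the 0.20-era major_compact shell command and the table name 'users' from the thread:

  $ hbase shell
  hbase(main):001:0> major_compact 'users'

or non-interactively:

  $ echo "major_compact 'users'" | hbase shell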