On Sat, May 8, 2010 at 12:02 AM, Stack <st...@duboce.net> wrote:
> On Fri, May 7, 2010 at 8:27 PM, James Baldassari <jbaldass...@gmail.com> wrote:
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
>
> This is the regionserver log?  Is this deploying the region?  It fails?
This error is on the client accessing HBase.  The exception was thrown on a
get call to an HTable instance.  I'm not sure whether it was deploying the
region.  All I know is that the system had been running with all regions
available (as far as I know), and then all of a sudden these errors started
showing up on the client.

> > Our cluster throughput goes from around 3k requests/second down to
> > 500-1000 and does not recover without manual intervention.  The region
> > server log for that region says:
> >
> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
> > /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169
> > for block -4841840178880951849:java.io.IOException: Got error in response
> > to OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169
> > for block -4841840178880951849
> >
> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020,
> > call get([...@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd,
> > maxVersions=1, timeRange=[0,9223372036854775807),
> > families={(family=data, columns=ALL)}) from 10.24.117.100:2365: error:
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
> >
> > The datanode log for 10.24.166.74 says:
> >
> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(10.24.166.74:50010,
> > storageID=DS-14401423-10.24.166.74-50010-1270741415211, infoPort=50075,
> > ipcPort=50020):
> > Got exception while serving blk_-4841840178880951849_50277 to
> > /10.25.119.113:
> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
>
> What's your hadoop?  Is it 0.20.2 or CDH?  Any patches?

Hadoop is vanilla CDH 2.
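[Editor's note, not part of the original exchange: when a datanode reports
"Block ... is not valid" for a file a regionserver is trying to read, a
common first diagnostic step is to ask HDFS what it knows about that file
and where its block replicas live.  A minimal sketch, assuming the
Hadoop-0.20-era fsck syntax and the store-file path from the logs above;
this requires a live cluster:]

```shell
# Ask the namenode which blocks make up the store file the regionserver
# cannot open, and which datanodes hold each replica.  With 3x replication,
# fsck should list three locations per block; a block reported as
# missing/corrupt here points at an HDFS-side problem rather than HBase.
hadoop fsck /hbase/users/73382377/data/312780071564432169 \
    -files -blocks -locations
```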
HBase is 0.20.3 + HBASE-2180.

> > Running a major compaction on the users table fixed the problem the
> > first time it happened, but this time the major compaction didn't fix
> > it, so we're in the process of rebooting the whole cluster.  I'm
> > wondering a few things:
> >
> > 1. What could trigger this problem?
> > 2. Why can't the system fail over to another block/file/datanode/region
> > server?  We're using 3x replication in HDFS, and we have 8 data nodes
> > which double as our region servers.
> > 3. Are there any best practices for achieving high availability in an
> > HBase cluster?  How can I configure the system to gracefully (and
> > automatically) handle these types of problems?
>
> Let us know what your hadoop is and then we figure more on the issues
> above.

If you need complete stack traces or any additional information, please let
me know.

> Thanks James,
> St.Ack
> P.S. It's an eight-node cluster on what kinda hw?  (You've probably said
> in the past and I can dig through mail -- just say -- and then what kind
> of loading are you applying?  Ditto for if you've said this already)
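[Editor's note, not part of the original exchange: the "manual
intervention" mentioned above, a major compaction of the users table, can
be issued non-interactively from the HBase shell.  A minimal sketch,
assuming the 0.20.x shell command names; this requires a running cluster:]

```shell
# Trigger a major compaction of the 'users' table.  Major compaction
# rewrites every store file in each region, which is why it can clear up
# references to stale/invalid HDFS blocks -- the old files are replaced.
echo "major_compact 'users'" | hbase shell
```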