Thanks, I'll check out HBASE-2231. Prior to this problem occurring, our cluster had been running for almost two weeks with no problems. I'm not sure about the GC pauses, but I'll look through the logs. I've never noticed that before, though.
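If we don't already have GC logging turned on, I'll enable it on the region servers first so any long pauses are easy to spot. Something like this in conf/hbase-env.sh should do it (the log path is just an example):

    # conf/hbase-env.sh -- turn on verbose GC logging for the HBase daemons
    # (log path below is only an example; adjust per host)
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
        -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-regionserver.log"

Then I can line up any long pauses or full GCs in that log against the times when throughput drops.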
Also, maybe it would help to understand how we're using HBase. Mostly we're doing random reads, but once an hour we do a bulk update of a few million rows. I wonder if the problem is triggered when a bulk update coincides with a major compaction. We used to do these updates only once per day and had a problem once when an update was going through at the same time as a major compaction. I'm guessing the block cache overflowed. Now that we're doing these updates hourly they're much smaller, so we didn't think it would continue to be a problem.
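(In case it's relevant: the major compaction I mention below was kicked off by hand. From the hbase shell that's just something like

    $ hbase shell
    hbase> major_compact 'users'

assuming the major_compact tool is available in the shell for our version.)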
-James

On Sat, May 8, 2010 at 12:15 AM, Todd Lipcon <t...@cloudera.com> wrote:
> This could very well be HBASE-2231.
>
> Do you find that region servers occasionally crash after going into GC
> pauses?
>
> -Todd
>
> On Fri, May 7, 2010 at 9:02 PM, Stack <st...@duboce.net> wrote:
>
> > On Fri, May 7, 2010 at 8:27 PM, James Baldassari <jbaldass...@gmail.com> wrote:
> > > java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> >
> > This is the regionserver log? Is this deploying the region? It fails?
> >
> > > Our cluster throughput goes from around 3k requests/second down to 500-1000
> > > and does not recover without manual intervention. The region server log for
> > > that region says:
> > >
> > > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
> > > /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169
> > > for block -4841840178880951849:java.io.IOException: Got error in response to
> > > OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for
> > > block -4841840178880951849
> > >
> > > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
> > > get([...@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> > > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
> > > from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> > > java.io.IOException: Cannot open filename
> > > /hbase/users/73382377/data/312780071564432169
> > >
> > > The datanode log for 10.24.166.74 says:
> > >
> > > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > > 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> > > infoPort=50075, ipcPort=50020):
> > > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
> > > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> >
> > What's your hadoop? Is it 0.20.2 or CDH? Any patches?
> >
> > > Running a major compaction on the users table fixed the problem the first
> > > time it happened, but this time the major compaction didn't fix it, so we're
> > > in the process of rebooting the whole cluster. I'm wondering a few things:
> > >
> > > 1. What could trigger this problem?
> > > 2. Why can't the system fail over to another block/file/datanode/region
> > > server? We're using 3x replication in HDFS, and we have 8 data nodes which
> > > double as our region servers.
> > > 3. Are there any best practices for achieving high availability in an HBase
> > > cluster? How can I configure the system to gracefully (and automatically)
> > > handle these types of problems?
> >
> > Let us know what your hadoop is and then we figure more on the issues above.
> > Thanks James,
> > St.Ack
> >
> > P.S. It's an eight node cluster on what kinda hw? (You've probably said in
> > the past and I can dig through mail -- just say -- and then what kind of
> > loading are you applying? Ditto for if you've said this already)
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
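P.S. Before we reboot everything next time, I'll also run an fsck against the store file the region server can't open, to see whether HDFS itself still thinks that block is healthy -- something like:

    $ hadoop fsck /hbase/users/73382377/data/312780071564432169 -files -blocks -locations

If fsck reports the block as missing or corrupt, that would at least tell us whether this is an HDFS-side problem rather than an HBase one.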