Hi,

First of all, thanks to all the HBase contributors for getting 0.20.4 out. We're planning on upgrading soon, and we're also looking forward to 0.20.5.

Recently we've had a couple of problems where HBase (0.20.3) can't seem to read a file, and the client spews errors like this:
    java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169

Our cluster throughput drops from around 3k requests/second down to 500-1000 and does not recover without manual intervention.

The region server log for that region says:

    WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169 for block -4841840178880951849:java.io.IOException: Got error in response to OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for block -4841840178880951849
    INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call get([...@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1, timeRange=[0,9223372036854775807), families={(family=data, columns=ALL}) from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169
    java.io.IOException: Cannot open filename /hbase/users/73382377/data/312780071564432169

The datanode log for 10.24.166.74 says:

    WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211, infoPort=50075, ipcPort=50020): Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113 : java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.

Running a major compaction on the users table fixed the problem the first time it happened, but this time the major compaction didn't fix it, so we're in the process of rebooting the whole cluster.

I'm wondering a few things:

1. What could trigger this problem?
2. Why can't the system fail over to another block/file/datanode/region server? We're using 3x replication in HDFS, and we have 8 data nodes which double as our region servers.
3. Are there any best practices for achieving high availability in an HBase cluster? How can I configure the system to gracefully (and automatically) handle these types of problems?

I'd appreciate any ideas you might have. Oh, and we've already done a lot of tuning on this cluster, so we've taken care of all the standard stuff like increasing max xcievers (a rough sketch of our settings is in the P.S. below).

Thanks,
James
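P.S. For reference, here is roughly what that tuning looks like on our side. The property names are the standard Hadoop ones, but the values shown are only representative of what we run, not a recommendation. In hdfs-site.xml on the datanodes:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

and the major compaction mentioned above was kicked off from the HBase shell with:

    major_compact 'users'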