On Sat, May 8, 2010 at 12:02 AM, Stack <st...@duboce.net> wrote:
> On Fri, May 7, 2010 at 8:27 PM, James Baldassari <jbaldass...@gmail.com> wrote:
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
>
> This is the regionserver log?  Is this deploying the region?  It fails?
This error is on the client accessing HBase.  The exception was thrown on a
get call to an HTable instance.  I'm not sure whether it was deploying the
region.  All I know is that the system had been running with all regions
available (as far as I know), and then all of a sudden these errors started
showing up on the client.

> > Our cluster throughput goes from around 3k requests/second down to
> > 500-1000 and does not recover without manual intervention.  The region
> > server log for that region says:
> >
> > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
> > /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169
> > for block -4841840178880951849:java.io.IOException: Got error in response
> > to OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169
> > for block -4841840178880951849
> >
> > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020,
> > call get([...@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd,
> > maxVersions=1, timeRange=[0,9223372036854775807),
> > families={(family=data, columns=ALL)}) from 10.24.117.100:2365: error:
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
> > java.io.IOException: Cannot open filename
> > /hbase/users/73382377/data/312780071564432169
> >
> > The datanode log for 10.24.166.74 says:
> >
> > WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> > DatanodeRegistration(10.24.166.74:50010,
> > storageID=DS-14401423-10.24.166.74-50010-1270741415211, infoPort=50075,
> > ipcPort=50020):
> > Got exception while serving blk_-4841840178880951849_50277 to
> > /10.25.119.113:
> > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
>
> What's your hadoop?  Is it 0.20.2 or CDH?  Any patches?

Hadoop is vanilla CDH 2.
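[Editor's note, not part of the original exchange: when a datanode reports
"Block ... is not valid" for a file a regionserver is trying to read, a
common first diagnostic step is to ask HDFS what it knows about that file
and where its block replicas live.  A minimal sketch, assuming the
Hadoop-0.20-era fsck syntax and the store-file path from the logs above;
this requires a live cluster:]

```shell
# Ask the namenode which blocks make up the store file the regionserver
# cannot open, and which datanodes hold each replica.  With 3x replication,
# fsck should list three locations per block; a block reported as
# missing/corrupt here points at an HDFS-side problem rather than HBase.
hadoop fsck /hbase/users/73382377/data/312780071564432169 \
    -files -blocks -locations
```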
HBase is 0.20.3 + HBASE-2180.

> > Running a major compaction on the users table fixed the problem the
> > first time it happened, but this time the major compaction didn't fix
> > it, so we're in the process of rebooting the whole cluster.  I'm
> > wondering a few things:
> >
> > 1. What could trigger this problem?
> > 2. Why can't the system fail over to another block/file/datanode/region
> > server?  We're using 3x replication in HDFS, and we have 8 data nodes
> > which double as our region servers.
> > 3. Are there any best practices for achieving high availability in an
> > HBase cluster?  How can I configure the system to gracefully (and
> > automatically) handle these types of problems?
>
> Let us know what your hadoop is and then we figure more on the issues
> above.

If you need complete stack traces or any additional information, please let
me know.

> Thanks James,
> St.Ack
> P.S. It's an eight-node cluster on what kinda hw?  (You've probably said
> in the past and I can dig through mail -- just say -- and then what kind
> of loading are you applying?  Ditto for if you've said this already)
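[Editor's note, not part of the original exchange: the "manual
intervention" mentioned above, a major compaction of the users table, can
be issued non-interactively from the HBase shell.  A minimal sketch,
assuming the 0.20.x shell command names; this requires a running cluster:]

```shell
# Trigger a major compaction of the 'users' table.  Major compaction
# rewrites every store file in each region, which is why it can clear up
# references to stale/invalid HDFS blocks -- the old files are replaced.
echo "major_compact 'users'" | hbase shell
```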