Re: HBase fail-over/reliability issues

James Baldassari Fri, 07 May 2010 23:17:53 -0700

OK, these logs are huge, so I'm just going to post the first 1,000 lines
from each for now.  Let me know if it would be helpful to have more.  The
namenode logs didn't contain either of the strings you were interested in.
A few of the datanode logs had '4841840178880951849':


http://pastebin.com/4MxzkxjW
http://pastebin.com/C7XErfTJ
http://pastebin.com/tn8pk2Xa

One of the region servers had log entries about the region 73382377:

http://pastebin.com/8uh6hQxQ

We rebooted the name node as well as all data nodes and region servers; then
we ran a major compaction on both of our tables.  The cluster seems stable
at this point, but I have a bad feeling that this problem may hit us again
soon because it has happened twice in the past two days now.

-James


On Sat, May 8, 2010 at 12:30 AM, Todd Lipcon <t...@cloudera.com> wrote:

> If you can grep for '4841840178880951849' as well as
> /hbase/users/73382377/data/312780071564432169 across all of your datanode
> logs plus your NN, and put that online somewhere, that would be great. If
> you can grep with -C 20 to get some context that would help as well.
>
> Grepping for the region in question (73382377) in the RS logs would also be
> helpful.
>
> Thanks
> -Todd
>
> On Fri, May 7, 2010 at 9:16 PM, James Baldassari <jbaldass...@gmail.com> wrote:
>
> > On Sat, May 8, 2010 at 12:02 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Fri, May 7, 2010 at 8:27 PM, James Baldassari <jbaldass...@gmail.com> wrote:
> > > > java.io.IOException: Cannot open filename
> > > > /hbase/users/73382377/data/312780071564432169
> > > >
> > > This is the regionserver log?  Is this deploying the region?  It fails?
> > >
> >
> > This error is on the client accessing HBase.  This exception was thrown on a
> > get call to an HTable instance.  I'm not sure if it was deploying the
> > region.  All I know is that the system had been running with all regions
> > available (as far as I know), and then all of a sudden these errors started
> > showing up on the client.
> >
> >
> > >
> > > > Our cluster throughput goes from around 3k requests/second down to 500-1000
> > > > and does not recover without manual intervention.  The region server log for
> > > > that region says:
> > > >
> > > > WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
> > > > /10.24.166.74:50010 for file /hbase/users/73382377/data/312780071564432169
> > > > for block -4841840178880951849:java.io.IOException: Got error in response to
> > > > OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169 for
> > > > block -4841840178880951849
> > > >
> > > > INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on 60020, call
> > > > get([...@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd, maxVersions=1,
> > > > timeRange=[0,9223372036854775807), families={(family=data, columns=ALL})
> > > > from 10.24.117.100:2365: error: java.io.IOException: Cannot open filename
> > > > /hbase/users/73382377/data/312780071564432169
> > > > java.io.IOException: Cannot open filename
> > > > /hbase/users/73382377/data/312780071564432169
> > > >
> > > > The datanode log for 10.24.166.74 says:
> > > >
> > > > WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> > > > 10.24.166.74:50010, storageID=DS-14401423-10.24.166.74-50010-1270741415211,
> > > > infoPort=50075, ipcPort=50020):
> > > > Got exception while serving blk_-4841840178880951849_50277 to /10.25.119.113:
> > > > java.io.IOException: Block blk_-4841840178880951849_50277 is not valid.
> > > >
> > >
> > > What's your hadoop?  Is it 0.20.2 or CDH?  Any patches?
> > >
> >
> > Hadoop is vanilla CDH 2.  HBase is 0.20.3 + HBase-2180
> >
> >
> > >
> > >
> > > > Running a major compaction on the users table fixed the problem the first
> > > > time it happened, but this time the major compaction didn't fix it, so we're
> > > > in the process of rebooting the whole cluster.  I'm wondering a few things:
> > > >
> > > > 1. What could trigger this problem?
> > > > 2. Why can't the system fail over to another block/file/datanode/region
> > > > server?  We're using 3x replication in HDFS, and we have 8 data nodes which
> > > > double as our region servers.
> > > > 3. Are there any best practices for achieving high availability in an HBase
> > > > cluster?  How can I configure the system to gracefully (and automatically)
> > > > handle these types of problems?
> > > >
> > >
> > > Let us know what your hadoop is and then we figure more on the issues
> > > above.
> > >
> >
> > If you need complete stack traces or any additional information, please let
> > me know.
> >
> >
> > > Thanks James,
> > > St.Ack
> > > P.S. It's an eight node cluster on what kinda hw? (You've probably said in
> > > the past and I can dig through mail -- just say -- and then what kind
> > > of loading are you applying?  Ditto for if you've said this already)
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
