Hi James, I'd recommend just the following in your log4j properties to tone down the log volume:
log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN
log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN

This will keep the INFO-level logs that are very useful for debugging the issue, but shouldn't generate GB after GB.

I'm both sorry and glad to hear your problem is reproducing every day. The bad news is you've got a problem; the good news is we should be able to get to the bottom of it much faster! :)

-Todd

On Mon, May 10, 2010 at 9:59 AM, James Baldassari <jbaldass...@gmail.com> wrote:
> Hi Todd,
>
> Our log files were getting to be several gigabytes in size at the INFO
> level (particularly the datanode logs), so we changed the log level in all
> log4j configs to WARN. Do you think we're potentially missing some useful
> information at INFO and lower? I could lower the log level if you think it
> would help.
>
> I did go back as far as I could in the logs to try to find out what was
> going on as soon as the problem appeared. I didn't see anything
> interesting, possibly because the log level was at WARN. The only two
> errors I see are these:
>
> 2010-05-06 19:15:56,176 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.25.119.115:50010,
> storageID=DS-2139827131-10.25.119.115-50010-1266376030910, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException
>
> This happens once every 5 minutes like clockwork. I'm not sure if this has
> anything to do with the problem. Then all of a sudden those errors about
> the block being invalid started showing up:
>
> 2010-05-06 19:17:37,946 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> 10.25.119.115:50010,
> storageID=DS-2139827131-10.25.119.115-50010-1266376030910, infoPort=50075,
> ipcPort=50020):Got exception while serving blk_-26696347696547536_49275 to
> /10.24.166.67:
> java.io.IOException: Block blk_-26696347696547536_49275 is not valid.
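To trace what happened to this newer block before it went invalid, the kind of grep I suggested earlier in the thread can be sketched like this. A hedged, self-contained sketch: the log directory and file name are assumptions (point LOGDIR at your real Hadoop log directory), and the printf line is just a demo fixture so the commands are runnable as shown:

```shell
# Sketch: trace a block ID across NN/DN logs with context.
# LOGDIR and the log file name are assumptions -- point LOGDIR at your
# real Hadoop log directory (e.g. /var/log/hadoop) on each node.
LOGDIR="${LOGDIR:-$(mktemp -d)}"

# Demo fixture so the grep below has something to match:
printf '%s\n' \
  '2010-05-06 19:17:37,946 WARN DataNode: Got exception while serving blk_-26696347696547536_49275' \
  > "$LOGDIR/hadoop-datanode.log"

# '--' ends option parsing (the block ID starts with '-');
# -C 20 prints 20 lines of context around each match:
grep -C 20 -- '-26696347696547536' "$LOGDIR"/*.log
```

Running the same grep over the NN log from the days before 2010-05-06 is what should show where the block got scheduled for deletion.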
>
> So let me know if I should change the log level and whether I need to
> enable NN clienttrace. I don't think I'll have to wait long for this
> problem to reappear. It seems to be happening almost every day now.
>
> Thanks,
> James
>
> On Sat, May 8, 2010 at 6:43 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> Hi James,
>>
>> You'll need to go farther back in the logs to find what happened to the
>> block that caused it to get deleted. All of the logs below are too late
>> (the block's already gone, we need to figure out why).
>>
>> Can you look backwards through the past several days of the NN logs? Have
>> you disabled the NN clienttrace log in log4j.properties?
>>
>> -Todd
>>
>> On Fri, May 7, 2010 at 11:17 PM, James Baldassari <jbaldass...@gmail.com>
>> wrote:
>>
>>> OK, these logs are huge, so I'm just going to post the first 1,000 lines
>>> from each for now. Let me know if it would be helpful to have more. The
>>> namenode logs didn't contain either of the strings you were interested
>>> in. A few of the datanode logs had '4841840178880951849':
>>>
>>> http://pastebin.com/4MxzkxjW
>>> http://pastebin.com/C7XErfTJ
>>> http://pastebin.com/tn8pk2Xa
>>>
>>> One of the region servers had log entries about the region 73382377:
>>>
>>> http://pastebin.com/8uh6hQxQ
>>>
>>> We rebooted the name node as well as all data nodes and region servers;
>>> then we ran a major compaction on both of our tables. The cluster seems
>>> stable at this point, but I have a bad feeling that this problem may hit
>>> us again soon because it has happened twice in the past two days now.
>>>
>>> -James
>>>
>>> On Sat, May 8, 2010 at 12:30 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>>> If you can grep for '4841840178880951849' as well
>>>> as /hbase/users/73382377/data/312780071564432169 across all of your
>>>> datanode logs plus your NN, and put that online somewhere, that would
>>>> be great.
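The greps being requested here might be sketched as follows. This is a hedged sketch only: the log directory and file names are assumptions (HLOGS stands in for wherever your DN/NN/RS logs live), and the echo lines are demo fixtures so the commands are runnable as shown:

```shell
# Sketch of the requested greps: block ID + store-file path across the
# DN/NN logs, and the region number in the RS logs. HLOGS and the file
# names are assumptions; the echo lines are demo fixtures only.
HLOGS="${HLOGS:-$(mktemp -d)}"

echo 'demo NN line: deleting blk_-4841840178880951849 replica' > "$HLOGS/nn.log"
echo 'demo RS line: opened /hbase/users/73382377/data/312780071564432169' > "$HLOGS/rs.log"

# Block ID and store-file path, with 20 lines of context
# (-e lets one grep carry multiple patterns):
grep -C 20 -e '4841840178880951849' \
           -e '/hbase/users/73382377/data/312780071564432169' "$HLOGS"/*.log

# The region in question, in the regionserver log:
grep -C 20 '73382377' "$HLOGS/rs.log"
```

On a real cluster you would run this on every datanode plus the NN and collect the output, e.g. with a for loop over hosts and ssh.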
>>>> If you can grep with -C 20 to get some context, that would help as
>>>> well.
>>>>
>>>> Grepping for the region in question (73382377) in the RS logs would
>>>> also be helpful.
>>>>
>>>> Thanks
>>>> -Todd
>>>>
>>>> On Fri, May 7, 2010 at 9:16 PM, James Baldassari
>>>> <jbaldass...@gmail.com> wrote:
>>>>
>>>>> On Sat, May 8, 2010 at 12:02 AM, Stack <st...@duboce.net> wrote:
>>>>>
>>>>>> On Fri, May 7, 2010 at 8:27 PM, James Baldassari
>>>>>> <jbaldass...@gmail.com> wrote:
>>>>>>> java.io.IOException: Cannot open filename
>>>>>>> /hbase/users/73382377/data/312780071564432169
>>>>>>
>>>>>> This is the regionserver log? Is this deploying the region? It fails?
>>>>>
>>>>> This error is on the client accessing HBase. This exception was thrown
>>>>> on a get call to an HTable instance. I'm not sure if it was deploying
>>>>> the region. All I know is that the system had been running with all
>>>>> regions available (as far as I know), and then all of a sudden these
>>>>> errors started showing up on the client.
>>>>>
>>>>>>> Our cluster throughput goes from around 3k requests/second down to
>>>>>>> 500-1000 and does not recover without manual intervention.
>>>>>>> The region server log for that region says:
>>>>>>>
>>>>>>> WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to
>>>>>>> /10.24.166.74:50010 for file
>>>>>>> /hbase/users/73382377/data/312780071564432169 for block
>>>>>>> -4841840178880951849:java.io.IOException: Got error in response to
>>>>>>> OP_READ_BLOCK for file /hbase/users/73382377/data/312780071564432169
>>>>>>> for block -4841840178880951849
>>>>>>>
>>>>>>> INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 40 on
>>>>>>> 60020, call get([...@25f907b4, row=963aba6c5f351f5655abdc9db82a4cbd,
>>>>>>> maxVersions=1, timeRange=[0,9223372036854775807),
>>>>>>> families={(family=data, columns=ALL}) from 10.24.117.100:2365:
>>>>>>> error: java.io.IOException: Cannot open filename
>>>>>>> /hbase/users/73382377/data/312780071564432169
>>>>>>> java.io.IOException: Cannot open filename
>>>>>>> /hbase/users/73382377/data/312780071564432169
>>>>>>>
>>>>>>> The datanode log for 10.24.166.74 says:
>>>>>>>
>>>>>>> WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
>>>>>>> DatanodeRegistration(10.24.166.74:50010,
>>>>>>> storageID=DS-14401423-10.24.166.74-50010-1270741415211,
>>>>>>> infoPort=50075, ipcPort=50020):
>>>>>>> Got exception while serving blk_-4841840178880951849_50277 to
>>>>>>> /10.25.119.113:
>>>>>>> java.io.IOException: Block blk_-4841840178880951849_50277 is not
>>>>>>> valid.
>>>>>>
>>>>>> What's your hadoop? Is it 0.20.2 or CDH? Any patches?
>>>>>
>>>>> Hadoop is vanilla CDH 2.
>>>>> HBase is 0.20.3 + HBASE-2180.
>>>>>
>>>>>>> Running a major compaction on the users table fixed the problem the
>>>>>>> first time it happened, but this time the major compaction didn't
>>>>>>> fix it, so we're in the process of rebooting the whole cluster. I'm
>>>>>>> wondering a few things:
>>>>>>>
>>>>>>> 1. What could trigger this problem?
>>>>>>> 2. Why can't the system fail over to another
>>>>>>> block/file/datanode/region server? We're using 3x replication in
>>>>>>> HDFS, and we have 8 data nodes which double as our region servers.
>>>>>>> 3. Are there any best practices for achieving high availability in
>>>>>>> an HBase cluster? How can I configure the system to gracefully (and
>>>>>>> automatically) handle these types of problems?
>>>>>>
>>>>>> Let us know what your hadoop is and then we can figure more on the
>>>>>> issues above.
>>>>>
>>>>> If you need complete stack traces or any additional information,
>>>>> please let me know.
>>>>>
>>>>>> Thanks James,
>>>>>> St.Ack
>>>>>> P.S. It's an eight-node cluster on what kinda hw? (You've probably
>>>>>> said in the past and I can dig through mail -- just say -- and then
>>>>>> what kind of loading are you applying? Ditto if you've said this
>>>>>> already.)
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
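On question 2 above (failover with 3x replication), one way to check whether the replicas actually existed when the block went "not valid" is an fsck over the table directory. A hedged sketch: the path is an assumption, the flags are as in Hadoop 0.20-era fsck (verify against your version), and the guard just keeps the sketch runnable on a machine without a hadoop install:

```shell
# Sketch: confirm replication actually held for the table's store files.
# 'hadoop fsck <path> -files -blocks -locations' lists, per file, each
# block and the datanodes holding a replica, so under-replicated or
# missing blocks for the region's files stand out.
# The /hbase/users path is an assumption based on the paths in this thread.
FSCK_CMD='hadoop fsck /hbase/users -files -blocks -locations'
if command -v hadoop >/dev/null 2>&1; then
  $FSCK_CMD
else
  # No hadoop on this machine; just show what would be run.
  echo "would run: $FSCK_CMD"
fi
```

Run it from the NN host; if fsck reports the blocks healthy with 3 locations while the datanode still throws "is not valid", that points at stale block metadata on that one DN rather than actual data loss.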