Re: HBase Exceptions on version 0.20.1

Jonathan Gray Mon, 09 Nov 2009 11:24:34 -0800

It's fairly easy to run HDFS into the ground if you eat up all the resources.

It's also fairly easy to run a Linux machine into the ground if you eat up all the resources; or just about anything by starving it of CPU.

I don't disagree with a read-only mode if the server is full, but in general believe admins of any production cluster need to be constantly aware of capacity and usage and plan properly. Not just so things don't fill up, but also from a performance POV.

I completely disagree with the statement "any hbase cluster will reach this tipping point at some point in its lifetime as more and more data is added". This will happen to anyone who continuously adds data without paying any attention to capacity, but that is the case with anything that has finite resources.

We should certainly do everything we can to prevent data loss and corruption in all cases, but we must also set realistic expectations. One should not assume they can fill their HBase cluster to the brink and have everything always be okay (even if this is the case).


JG

stack wrote:

Agreed.  Please make an issue.

Meantime, it should be possible to have a cron run a script that checks
cluster resources from time-to-time -- e.g. how full hdfs is, how much each
regionserver is carrying -- and when it determines the needle is in the red,
flip the cluster to be read-only.
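
As a rough illustration of the check such a cron job might make, here is a
minimal Python sketch. It only decides whether "the needle is in the red".
The ClusterSnapshot type, the needle_in_red function, and both thresholds
are hypothetical names and values chosen for this example; how the numbers
are collected and how the cluster is actually flipped to read-only are left
out, since the thread does not name a mechanism for either.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical thresholds for illustration only; tune them for your cluster.
MAX_DFS_USED_FRACTION = 0.85
MAX_REGIONS_PER_SERVER = 1000


@dataclass
class ClusterSnapshot:
    """Numbers an operator-supplied collector would fill in (assumed shape)."""
    dfs_used_bytes: int
    dfs_capacity_bytes: int
    regions_per_server: Dict[str, int]


def needle_in_red(snap: ClusterSnapshot,
                  max_used: float = MAX_DFS_USED_FRACTION,
                  max_regions: int = MAX_REGIONS_PER_SERVER) -> List[str]:
    """Return the reasons the cluster looks overloaded; empty list means OK."""
    reasons = []
    if snap.dfs_capacity_bytes <= 0:
        # Without a capacity figure we cannot judge fullness, so flag it.
        reasons.append("DFS capacity unknown or zero")
    else:
        used = snap.dfs_used_bytes / snap.dfs_capacity_bytes
        if used > max_used:
            reasons.append(f"HDFS is {used:.0%} full")
    for server, count in sorted(snap.regions_per_server.items()):
        if count > max_regions:
            reasons.append(f"{server} carries {count} regions")
    return reasons
```

A cron wrapper would build a snapshot, call needle_in_red, and trigger
whatever read-only switch the site uses when the returned list is non-empty.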

St.Ack

On Mon, Nov 9, 2009 at 9:25 AM, elsif <[email protected]> wrote:

The larger issue here is that any hbase cluster will reach this tipping
point at some point in its lifetime as more and more data is added.  We
need to have a graceful method to put the cluster into safe mode until
more resources can be added or the load on the cluster has been
reduced.  We cannot allow hbase to run itself into the ground causing
data loss or corruption under any circumstances.
Andrew Purtell wrote:

You should consider provisioning more nodes to get beyond this ceiling
you encountered.

DFS write latency spikes from 3 seconds to 6 seconds, to 15! Flushing
cannot happen fast enough to avoid an OOME. Possibly there was even
insufficient CPU to GC. The log entries you highlighted indicate the load
you are exerting on your current cluster needs to be spread out over more
resources than currently allocated.

This:

2009-11-06 09:15:37,144 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 286007ms, ten times longer than scheduled: 10000

indicates a thread that wanted to sleep for 10 seconds was starved for
CPU for 286 seconds. Obviously Zookeeper timeouts and resulting HBase
process shutdowns, missed DFS heartbeats possibly resulting in spurious
declaration of dead datanodes, and other serious problems will result from
this.
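
To spot this kind of starvation across a large log, a small parser can pull
the two numbers out of the Sleeper warning. This is only a sketch in Python:
the regular expression is written against the single line quoted above, and
it is an assumption that other HBase versions word the message the same way.
The name parse_sleeper_warning is made up for this example.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the one log line quoted in this thread (assumption:
# the wording is stable across the logs being scanned).
SLEEPER_RE = re.compile(
    r"Sleeper: We slept (\d+)ms, .*? longer than scheduled: (\d+)"
)


def parse_sleeper_warning(line: str) -> Optional[Tuple[int, int]]:
    """Return (slept_ms, scheduled_ms), or None if the line does not match."""
    match = SLEEPER_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = ("2009-11-06 09:15:37,144 WARN org.apache.hadoop.hbase.util.Sleeper: "
        "We slept 286007ms, ten times longer than scheduled: 10000")
slept_ms, scheduled_ms = parse_sleeper_warning(line)
# How many times longer than intended the thread actually slept.
overrun_factor = slept_ms / scheduled_ms
```

For the line above the parser yields 286007 ms slept against 10000 ms
scheduled, matching the 286 seconds and 10 seconds discussed here.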

Did your systems start to swap?

When region servers shut down, the master notices this and splits their
HLogs into per-region reconstruction logs. These are the "oldlogfile.log"
files. The master log will shed light on why this particular reconstruction
log was botched. That would have happened at the master. The region server
probably did do a clean shutdown. I suspect DFS was in extremis due to
overloading so the split failed. The checksum error indicates incomplete
write at the OS level. Did a datanode crash?

HBASE-1956 is about making the DFS latency metric exportable via the
Hadoop metrics layer, perhaps via Ganglia. Write latency above 1 or 2
seconds is a warning. Anything above 5 seconds is an alarm.  It's a
good indication that an overloading condition is in progress.
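
The thresholds above can be sketched as a tiny classifier, here in Python.
The function and constant names are invented for this example, and the
warning cutoff of 1 second is an assumption: the text says "1 or 2 seconds",
so pick whichever suits your cluster. This is not how HBASE-1956 exports the
metric, only an illustration of the rule of thumb.

```python
# Assumed cutoff: the thread says "above 1 or 2 seconds" is a warning.
WARNING_SECONDS = 1.0
# From the thread: anything above 5 seconds is an alarm.
ALARM_SECONDS = 5.0


def classify_write_latency(seconds: float) -> str:
    """Map a DFS write latency in seconds to "ok", "warning", or "alarm"."""
    if seconds > ALARM_SECONDS:
        return "alarm"
    if seconds > WARNING_SECONDS:
        return "warning"
    return "ok"


# The spikes mentioned earlier in this message: 3, 6, and 15 seconds.
levels = [classify_write_latency(s) for s in (3, 6, 15)]
# levels == ["warning", "alarm", "alarm"]
```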

The Hadoop stack, being pre 1.0, has some rough edges. Response to
overloading is one of them. For one thing, HBase could be better about
applying backpressure to writing clients when the system is under stress. We
will get there. HBASE-1956 is a start.

    - Andy
