Once the table has split more, you might look into using
org.apache.hadoop.hbase.mapred.HRegionPartitioner

It splits up the data so that only one reduce runs per region, which means all of a region's rows are sent to just one reducer. It does not help much while the table is still small and you have a lot of reduce tasks.
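To make the idea concrete, here is a rough sketch of region-based partitioning in plain Java. It is not the HRegionPartitioner source: the class name, the method names and the start keys are made up for illustration, and the modulo fallback for when there are fewer reducers than regions is an assumption.

```java
import java.util.Arrays;

public class RegionPartitionSketch {

    // Find the region holding rowKey, given the sorted region start keys.
    // The first region is assumed to start at the empty key "".
    static int regionIndexFor(String[] startKeys, String rowKey) {
        int pos = Arrays.binarySearch(startKeys, rowKey);
        // Exact hit: rowKey is the first row of that region.
        // Otherwise binarySearch returns -(insertionPoint) - 1, and the
        // owning region is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // One reducer per region; wrap around if there are fewer reducers
    // than regions (assumption, see the note above).
    static int partitionFor(String[] startKeys, String rowKey, int numReducers) {
        return regionIndexFor(startKeys, rowKey) % numReducers;
    }

    public static void main(String[] args) {
        // Hypothetical table with four regions.
        String[] startKeys = {"", "4", "8", "C"};
        String[] rows = {"055E51294F9D9CA331D968D04B72A11C", "4", "9ABC", "FFFF"};
        for (String row : rows) {
            System.out.println(row + " -> reducer " + partitionFor(startKeys, row, 4));
        }
    }
}
```

With a single region every row maps to reducer 0, which is why this does not buy you anything until the table has split a few times.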

The benefit is that once one region is done, it will likely be flushed as the memcache fills up, so it can start compactions and splits without having to worry about more data coming. Right now every reduce sorts its data by key, so all the reduce tasks start writing to the same regions as they go: because the data is sorted, each one works from the start of the table to the end.
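A small simulation of that effect, again only as a sketch: the keys are 256 two-digit hex strings, the region start keys are invented, and `i % numReducers` stands in for the default hash partitioning. Each reducer gets a spread of keys from the whole table, sorts them, and so its first write lands in the first region.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedReducerHotspotSketch {

    // Hypothetical region start keys for a table with four regions.
    static final String[] START_KEYS = {"", "40", "80", "C0"};

    static int regionIndexFor(String rowKey) {
        int pos = Arrays.binarySearch(START_KEYS, rowKey);
        return pos >= 0 ? pos : -pos - 2;
    }

    // For each reducer, return the region that receives its first write.
    static int[] firstRegionPerReducer(int numReducers) {
        List<List<String>> perReducer = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            perReducer.add(new ArrayList<>());
        }
        // Spread the keys "00".."FF" across reducers (stand-in for hashing).
        for (int i = 0; i < 256; i++) {
            perReducer.get(i % numReducers).add(String.format("%02X", i));
        }
        int[] first = new int[numReducers];
        for (int r = 0; r < numReducers; r++) {
            // Reduce input arrives sorted by key.
            Collections.sort(perReducer.get(r));
            first[r] = regionIndexFor(perReducer.get(r).get(0));
        }
        return first;
    }

    public static void main(String[] args) {
        // Seven reducers, as in the job described in the quoted message.
        System.out.println(Arrays.toString(firstRegionPerReducer(7)));
    }
}
```

Every entry in the printed array is the same region index, so all seven reducers begin by loading one region, and they move through the later regions more or less together.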

Billy


"Bradford Stephens" <[email protected]> wrote in message news:[email protected]...
Hey there,

So, I wiped my HDFS and reinstalled everything, and am running smaller
loads... so far, so good. I've got 7 regionservers.

My job basically takes a lot of documents and metadata with unique
binary keys (like "055E51294F9D9CA331D968D04B72A11C"), combines them
all in a reducer, then writes it to HBase.

What I'm noticing is that it's writing to mostly one or two regions on
one box at a time, even though I have 7 reducers running. Monitoring
everything with dstat -v, I notice that only 2 of my servers are doing
much. These boxes have very low CPU idling, and high disk output (a
few GB a minute).

Everything else has a little bit of disk activity (maybe 500
MB/minute), but very idle CPUs.

Is this normal behavior? I guess as more data is loaded, more
regionservers are split, so over time, more boxen will be loading
data?

Cheers,
Bradford


