You should write to HBase from the mapper and not use a reducer. By the time data gets to the reducer it is sorted, and sorted inserts into HBase cause one or two regions to be hot spots.
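
As a rough illustration of why sorted writes pile onto one or two regions, here is a toy model in Python. It is not HBase code: the eight pre-split regions, the 32-bit key space, the window size, and the helper names `region_of` and `regions_hit_per_window` are all assumptions made up for the sketch, and it does not model region splitting at all. It only counts how many distinct regions a run of consecutive writes touches when the keys arrive sorted versus in random order.

```python
import bisect
import random


def region_of(key, split_points):
    """Return the index of the region whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def regions_hit_per_window(keys, split_points, window):
    """For each consecutive window of writes, count the distinct regions touched."""
    counts = []
    for start in range(0, len(keys), window):
        batch = keys[start:start + window]
        counts.append(len({region_of(k, split_points) for k in batch}))
    return counts


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical table: key space 0..2**32 divided into 8 equal regions.
num_regions = 8
key_space = 2 ** 32
split_points = [i * key_space // num_regions for i in range(1, num_regions)]

# Uniformly distributed keys, standing in for hash-like row keys.
keys = [random.randrange(key_space) for _ in range(8000)]

# Reducer-style output: keys arrive in sorted order.
sorted_counts = regions_hit_per_window(sorted(keys), split_points, 500)
# Mapper-style output: keys arrive in arbitrary order.
random_counts = regions_hit_per_window(keys, split_points, 500)

print("regions touched per 500 writes, sorted keys:", sorted_counts)
print("regions touched per 500 writes, random keys:", random_counts)
```

In this model every window of sorted writes lands on at most two adjacent regions, while every window of randomly ordered writes is spread across all eight.
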
By inserting random data, regions split faster and then the load will get distributed over more region servers.

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)

> -----Original Message-----
> From: Bradford Stephens [mailto:[email protected]]
> Sent: Thursday, June 11, 2009 6:10 PM
> To: [email protected]
> Subject: HBase Write to Regionservers behavior
>
> Hey there,
>
> So, I wiped my HDFS and reinstalled everything, and am running smaller
> loads... so far, so good. I've got 7 regionservers.
>
> My job basically takes a lot of documents and metadata with unique
> binary keys (like "055E51294F9D9CA331D968D04B72A11C"), combines them
> all in a reducer, then writes it to HBase.
>
> What I'm noticing is that it's writing to mostly one or two regions on
> one box at a time, even though I have 7 reducers running. Monitoring
> everything with dstat -v, I notice that only 2 of my servers are doing
> much. These boxes have very low CPU idling, and high disk output (a
> few GB a minute).
>
> Everything else has a little bit of disk activity (maybe 500
> MB/minute), but very idle CPUs.
>
> Is this normal behavior? I guess as more data is loaded, more
> regionservers are split, so over time, more boxen will be loading
> data?
>
> Cheers,
> Bradford
