Thanks for sharing your experiences. Have you considered upgrading to HBase 0.92 or 0.94? There have been several bug fixes and enhancements to the LoadIncrementalHFiles.bulkLoad() API in newer HBase releases.
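For reference, here is a minimal sketch of invoking the bulk load programmatically (in the Apache HBase mapreduce package the class is LoadIncrementalHFiles and the method is doBulkLoad(); the table name and HFile directory below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadOnly {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");        // placeholder table name
    // Directory that HFileOutputFormat wrote the HFiles to (placeholder path).
    Path hfileDir = new Path("/user/me/hfile-output");
    // Moves the generated HFiles into the table's regions, splitting any
    // file that straddles a region boundary before handing it over.
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    table.close();
  }
}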
Cheers

On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <[email protected]> wrote:

> Hi All,
> I have a question about improving Map/Reduce job performance while
> ingesting a huge amount of data into HBase using HFileOutputFormat. Here is
> what we are using:
>
> 1) *Cloudera hadoop-0.20.2-cdh3u*
> 2) *hbase-0.90.40cdh3u2*
>
> I've used 2 different strategies, as described below:
>
> *Strategy #1:* Pre-split the regions with 10 regions per region server,
> and then kick off the Hadoop job with
> HFileOutputFormat.configureIncrementalLoad. This mechanism creates one
> reduce task per region (i.e., 10 per region server). We used the "hash" of
> each record as the map output key. Each mapper finished in an acceptable
> amount of time, but the reduce tasks took forever: first the copy/shuffle
> phase took considerable time, and then the sort phase took forever to
> finish.
> We tried to address this by constructing the key as "fixedhash1"_"hash2",
> where "fixedhash1" is fixed for all the records of a given mapper. The idea
> was to reduce shuffling/copying from each mapper. But even this didn't save
> us any time, and the reduce step still took a significant amount of time to
> finish. I played with adjusting the number of pre-split regions in both
> directions, but to no avail.
> This led us to move to Strategy #2, where we got rid of the reduce step.
>
> *QUESTION:* Is there anything I could have done better in this strategy to
> make the reduce step finish faster? Do I need to produce row keys
> differently than "hash1"_"hash2" of the text? Is this a known issue with
> CDH3 or HBase 0.90? Please help me troubleshoot.
>
> *Strategy #2:* Pre-split the regions with 10 regions per region server,
> and then kick off the Hadoop job with
> HFileOutputFormat.configureIncrementalLoad, but set the number of reducers
> to 0. In this strategy (the current one), I pre-sorted all the mapper input
> using a TreeSet before writing the output. With the number of reducers set
> to 0, the mappers write directly to HFiles. This was cool because the job
> (with no reduce phase) finished very fast and we noticed the HFiles got
> written very quickly. Then I used the *LoadIncrementalHFiles.bulkLoad()*
> API to move the HFiles into HBase, calling it from the driver class on
> successful completion of the job. This works much better than Strategy #1
> in terms of performance, but the bulkLoad() call in the driver sometimes
> takes a long time when there is a huge amount of data.
>
> *QUESTION:* Is there any way to make bulkLoad() run faster? Can I call
> this API from the mapper directly, instead of waiting for the whole job to
> finish first? I've used the HBase "completebulkload" utility, but it has
> two issues. First, I do not see any performance improvement with it.
> Second, it needs to be run separately from the Hadoop job driver class, and
> we wanted to integrate the two pieces. So we used
> *LoadIncrementalHFiles.bulkLoad()*.
>
> Also, we used the HBase RegionSplitter to pre-split the regions, but HBase
> 0.90 doesn't have the option to pass an ALGORITHM. Is that something we
> need to worry about?
>
> Please point me in the right direction to address this problem.
>
> Thanks
> Upen
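For anyone finding this thread later, here is a rough end-to-end sketch of the driver flow discussed above (the Strategy #1 shape: configureIncrementalLoad installs HFileOutputFormat, the TotalOrderPartitioner and one reduce task per region, and the HFiles are bulk-loaded once the job succeeds). The table name, column family, qualifier, input layout and mapper are assumptions for illustration only:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileIngestDriver {

  // Illustrative mapper: expects "rowkey<TAB>value" lines and emits one Put per line.
  public static class IngestMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");          // placeholder table name

    Job job = new Job(conf, "hfile-ingest");
    job.setJarByClass(HFileIngestDriver.class);
    job.setMapperClass(IngestMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    Path hfileDir = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, hfileDir);

    // Sets HFileOutputFormat as the output format, installs the
    // TotalOrderPartitioner and a sorting reducer, and sets the number of
    // reduce tasks to the table's current region count.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Hand the generated HFiles to the region servers.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
    table.close();
  }
}

For the Strategy #2 variant described above, the same driver would call job.setNumReduceTasks(0) after configureIncrementalLoad, and each mapper would then have to emit its keys in sorted order itself (hence the TreeSet in the original post), since HFileOutputFormat requires its input keys to arrive sorted.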
