Sure, I can try that. Just curious, out of these two strategies, which one do you think is better? Do you have any experience trying one or the other?
Thanks
Upen

On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[email protected]> wrote:

> Thanks for sharing your experiences.
>
> Have you considered upgrading to HBase 0.92 or 0.94? There have been
> several bug fixes / enhancements to the LoadIncrementalHFiles bulk load
> API in newer HBase releases.
>
> Cheers
>
> On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
> [email protected]> wrote:
>
> > Hi All,
> > I have a question about improving Map/Reduce job performance while
> > ingesting a huge amount of data into HBase using HFileOutputFormat.
> > Here is what we are using:
> >
> > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > 2) *hbase-0.90.4-cdh3u2*
> >
> > I've used two different strategies, as described below:
> >
> > *Strategy #1:* Pre-split the table with 10 regions per region server,
> > then kick off the Hadoop job with
> > HFileOutputFormat.configureIncrementalLoad(). This creates reduce
> > tasks equal to the number of regions (region servers * 10). We used a
> > hash of each record as the map output key. Each mapper finished in an
> > acceptable amount of time, but the reduce tasks took forever: first
> > the copy/shuffle phase took considerable time, and then the sort
> > phase took even longer to finish.
> > We tried to address this by constructing the key as
> > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all records of
> > a given mapper. The idea was to reduce shuffling / copying from each
> > mapper. But even this didn't save us any time, and the reduce step
> > still took a significant amount of time to finish. I played with the
> > number of pre-split regions in both directions, but to no avail.
> > This led us to Strategy #2, where we got rid of the reduce step.
> >
> > *QUESTION:* Is there anything I could have done better in this
> > strategy to make the reduce step finish faster?
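The "fixedhash1"_"hash2" key construction described in Strategy #1 can be sketched in plain Java. This is a hypothetical illustration, not the original job's code: the use of MD5 and the 4-hex-character mapper prefix are assumptions, and `mapperId` stands in for whatever per-mapper identifier (e.g. the input split name) the real job would use.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CompositeKey {

    // Hex-encoded MD5 of the input; stable across JVMs.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] d = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, d));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // "fixedhash1" is constant for one mapper; "hash2" varies per record.
    // All records from one mapper therefore share a key prefix, which was
    // the idea for cutting down cross-node shuffle traffic.
    static String rowKey(String mapperId, String record) {
        String fixedhash1 = md5Hex(mapperId).substring(0, 4);
        String hash2 = md5Hex(record);
        return fixedhash1 + "_" + hash2;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("split-00042", "some record text"));
        System.out.println(rowKey("split-00042", "another record"));
    }
}
```

Note the trade-off this illustrates: a shared per-mapper prefix concentrates each mapper's output into fewer reduce partitions, but it also means each reducer still has to sort its full share, which matches the observation above that the sort phase stayed slow.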
> > Do I need to construct row keys differently than "hash1"_"hash2" of
> > the text? Is this a known issue with CDH3 or HBase 0.90? Please help
> > me troubleshoot.
> >
> > *Strategy #2:* Pre-split the table with 10 regions per region server,
> > then kick off the Hadoop job with
> > HFileOutputFormat.configureIncrementalLoad(), but set the number of
> > reducers to 0. In this (current) strategy, I pre-sort all the mapper
> > output using a TreeSet before writing it out. With the number of
> > reducers = 0, the mappers write HFiles directly. This was great: the
> > job (with no reduce phase) finished very fast, and we noticed the
> > HFiles got written very quickly. Then I used the
> > *LoadIncrementalHFiles.doBulkLoad()* API to move the HFiles into
> > HBase, calling it from the driver class on successful completion of
> > the job. This is working much better than Strategy #1 in terms of
> > performance, but the doBulkLoad() call in the driver sometimes takes
> > a long time when there is a huge amount of data.
> >
> > *QUESTION:* Is there any way to make the bulk load run faster? Can I
> > call this API directly from the mapper instead of waiting for the
> > whole job to finish first? I've also used the HBase
> > "completebulkload" utility, but it has two issues: first, I don't see
> > any performance improvement with it; second, it needs to be run
> > separately from the Hadoop job driver class, and we wanted to
> > integrate the two pieces. So we used
> > *LoadIncrementalHFiles.doBulkLoad()*.
> > Also, we used the HBase RegionSplitter to pre-split the regions, but
> > the 0.90 version doesn't have the option to pass an ALGORITHM. Is
> > that something we need to worry about?
> >
> > Please help point me in the right direction to address this problem.
> >
> > Thanks
> > Upen
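On the TreeSet pre-sort in Strategy #2: with zero reducers there is no framework sort, so the mapper itself must emit rows in the unsigned byte-lexicographic order that HFiles require. A minimal sketch of that ordering in plain Java, assuming UTF-8 string keys and using a TreeMap as a stand-in for the real mapper-side buffer (the key values are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

public class SortedWrite {

    // Unsigned lexicographic byte comparison, the order HBase uses for
    // row keys; HFiles must be written with keys in this order.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Buffer rows in sorted order, then emit them sequentially --
        // standing in for the mapper's TreeSet plus direct HFile write.
        TreeMap<byte[], String> rows = new TreeMap<>(SortedWrite::compareBytes);
        for (String k : new String[] {"f2_9a", "f1_07", "f1_ff", "f2_00"}) {
            rows.put(k.getBytes(StandardCharsets.UTF_8), "value-for-" + k);
        }
        for (Map.Entry<byte[], String> e : rows.entrySet()) {
            // Prints f1_07, f1_ff, f2_00, f2_9a in that order.
            System.out.println(new String(e.getKey(), StandardCharsets.UTF_8));
        }
    }
}
```

The catch this buffering implies: the sorted set lives in mapper memory, so each mapper's output must fit in heap (or be flushed in sorted runs), which is part of why the reducer-based path exists in the first place.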
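On the RegionSplitter question: 0.90's RegionSplitter indeed lacks the pluggable ALGORITHM option of later releases, but for hex-encoded hash keys the equivalent of the later HexStringSplit can be computed by hand: divide the 128-bit MD5 key space into N uniform ranges and use the boundaries as split keys. A hypothetical sketch (the region count is illustrative, not from the thread):

```java
import java.math.BigInteger;

public class HexSplits {

    // Returns numRegions - 1 evenly spaced split keys over the
    // 32-hex-char (128-bit) key space, mirroring what HexStringSplit
    // produces in later HBase releases.
    static String[] splitKeys(int numRegions) {
        BigInteger max = BigInteger.ONE.shiftLeft(128);   // 2^128
        BigInteger step = max.divide(BigInteger.valueOf(numRegions));
        String[] splits = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = String.format("%032x",
                    step.multiply(BigInteger.valueOf(i)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. 4 region servers * 10 regions each would be
        // splitKeys(40); 4 is used here just to keep the output short.
        for (String s : splitKeys(4)) {
            System.out.println(s);
        }
    }
}
```

Uniform splits like these only balance load if the row keys really are uniformly distributed hashes, which the "hash1"_"hash2" scheme above should satisfy; a fixed per-mapper prefix, by contrast, skews writes toward the regions covering the live prefixes.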
