I think the second approach is better. Cheers
On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <[email protected]> wrote:

> Sure. I can try that. Just curious: out of these 2 strategies, which one
> do you think is better? Do you have any experience of trying one or the
> other?
>
> Thanks
> Upen
>
> On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[email protected]> wrote:
>
> > Thanks for sharing your experiences.
> >
> > Have you considered upgrading to HBase 0.92 or 0.94? There have been
> > several bug fixes / enhancements to the
> > LoadIncrementalHFiles.doBulkLoad() API in newer HBase releases.
> >
> > Cheers
> >
> > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar
> > <[email protected]> wrote:
> >
> > > Hi All,
> > > I have a question about improving Map/Reduce job performance while
> > > ingesting a huge amount of data into HBase using HFileOutputFormat.
> > > Here is what we are using:
> > >
> > > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > > 2) *hbase-0.90.4-cdh3u2*
> > >
> > > I've used 2 different strategies, described below:
> > >
> > > *Strategy#1:* Pre-split the regions, with 10 regions per region
> > > server, and then kick off the Hadoop job with
> > > HFileOutputFormat.configureIncrementalLoad(). This mechanism creates
> > > reduce tasks equal to the number of regions (region servers * 10).
> > > We used the "hash" of each record as the key of the map output. With
> > > this, each mapper finished in an acceptable amount of time, but the
> > > reduce tasks took forever: first the copy/shuffle phase took a
> > > considerable amount of time, and then the sort phase took forever to
> > > finish.
> > > We tried to address this by constructing the key as
> > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the
> > > records of a given mapper. The idea was to reduce shuffling /
> > > copying from each mapper. But even this solution didn't save us any
> > > time, and the reduce step still took a significant amount of time to
> > > finish.
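[The salted key construction described above can be sketched as follows. This is a minimal stand-alone illustration, not the actual job code: the class and method names and the use of MD5 for both hashes are assumptions.]

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class SaltedKey {

    // Hex-encoded MD5 of the input, used here as a stand-in for "hash".
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] d = md.digest(s.getBytes("UTF-8"));
            return String.format("%032x", new BigInteger(1, d));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // "fixedhash1"_"hash2": a prefix that is constant for the whole map
    // task, followed by the per-record hash. All of one mapper's output
    // therefore sorts into a single contiguous key range.
    static String makeRowKey(String mapperId, String record) {
        return md5Hex(mapperId) + "_" + md5Hex(record);
    }

    public static void main(String[] args) {
        String k1 = makeRowKey("task_0001", "record A");
        String k2 = makeRowKey("task_0001", "record B");
        // Same mapper => same 32-char prefix before the underscore.
        System.out.println(k1.substring(0, 32).equals(k2.substring(0, 32)));
    }
}
```

[One caveat with this scheme: configureIncrementalLoad() installs TotalOrderPartitioner, which partitions by key range, so a prefix that is fixed per mapper tends to funnel that mapper's entire output to a single reducer, trading shuffle volume for reducer skew.]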
> > > I played with adjusting the number of pre-split regions in both
> > > directions, but to no avail. This led us to Strategy#2, in which we
> > > got rid of the reduce step.
> > >
> > > *QUESTION:* Is there anything I could've done better in this
> > > strategy to make the reduce step finish faster? Do I need to produce
> > > row keys differently than "hash1"_"hash2" of the text? Is this a
> > > known issue with CDH3 or HBase 0.90? Please help me troubleshoot.
> > >
> > > *Strategy#2:* Pre-split the regions, with 10 regions per region
> > > server, and then kick off the Hadoop job with
> > > HFileOutputFormat.configureIncrementalLoad(), but set the number of
> > > reducers = 0. In this (current) strategy, I pre-sorted all the
> > > mapper output using a TreeSet before writing it out. With the number
> > > of reducers = 0, the mappers wrote directly to HFiles. This was
> > > cool, because the map/reduce job (with no reduce phase) finished
> > > very fast and we noticed the HFiles got written very quickly. Then I
> > > used the *LoadIncrementalHFiles.doBulkLoad()* API to move the HFiles
> > > into HBase, calling it from the driver class on successful
> > > completion of the job. This works much better than Strategy#1 in
> > > terms of performance, but the doBulkLoad() call in the driver
> > > sometimes takes a long time when there is a huge amount of data.
> > >
> > > *QUESTION:* Is there any way to make doBulkLoad() run faster? Can I
> > > call this API directly from the mapper, instead of waiting for the
> > > whole job to finish first? I've also used the HBase
> > > "completebulkload" utility, but it has two issues. First, I do not
> > > see any performance improvement with it. Second, it needs to be run
> > > separately from the Hadoop job driver class, and we wanted to
> > > integrate the two pieces. So we used
> > > *LoadIncrementalHFiles.doBulkLoad()*.
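[The mapper-side pre-sort above matters because HFiles must be written in strictly ascending key order. A stand-alone sketch of the ordering involved, using a plain TreeMap with an unsigned lexicographic comparator, which is the same ordering HBase's Bytes.BYTES_COMPARATOR applies to row keys; the HBase classes themselves are omitted so this compiles on its own:]

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

public class PreSortedBuffer {

    // Unsigned lexicographic byte[] comparison -- the row-key ordering
    // HBase expects (what Bytes.BYTES_COMPARATOR implements).
    static final Comparator<byte[]> UNSIGNED_LEX = new Comparator<byte[]>() {
        public int compare(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            for (int i = 0; i < n; i++) {
                int d = (a[i] & 0xff) - (b[i] & 0xff);
                if (d != 0) return d;
            }
            return a.length - b.length;
        }
    };

    public static void main(String[] args) {
        // Buffer records in sorted order, as the mapper did with a
        // TreeSet, so they can be emitted in ascending key order.
        Map<byte[], byte[]> buffer = new TreeMap<byte[], byte[]>(UNSIGNED_LEX);
        buffer.put(new byte[] {(byte) 0x80}, "late".getBytes());  // 0x80 is large unsigned
        buffer.put(new byte[] {0x01}, "early".getBytes());

        for (byte[] key : buffer.keySet()) {
            System.out.printf("%02x%n", key[0]);  // 01 first, then 80
        }
    }
}
```

[One likely reason the driver-side doBulkLoad() is slow in this setup: with zero reducers there is no total ordering across mappers, so each mapper's HFile can span many regions, and LoadIncrementalHFiles must split any file that crosses a region boundary before it can be moved in.]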
> > > Also, we used the HBase RegionSplitter to pre-split the regions, but
> > > the 0.90 version doesn't have the option to pass an ALGORITHM. Is
> > > that something we need to worry about?
> > >
> > > Please help point me in the right direction to address this problem.
> > >
> > > Thanks
> > > Upen
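[On the RegionSplitter ALGORITHM question: newer HBase versions let you choose how split points are computed, e.g. HexStringSplit for hex-encoded keys such as MD5 hashes. If you compute split points yourself, the idea is just to divide the key space evenly. A stand-alone sketch of HexStringSplit-style split points, written for this thread rather than taken from the HBase code:]

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class HexSplitPoints {

    // Evenly spaced split points over the 128-bit hex key space
    // [000...0, fff...f], in the spirit of HexStringSplit. These suit
    // row keys that are 32-char hex strings (e.g. MD5 hashes).
    static List<String> splitPoints(int numRegions) {
        BigInteger max = BigInteger.ONE.shiftLeft(128);  // 2^128
        List<String> points = new ArrayList<String>();
        for (int i = 1; i < numRegions; i++) {
            BigInteger p = max.multiply(BigInteger.valueOf(i))
                              .divide(BigInteger.valueOf(numRegions));
            points.add(String.format("%032x", p));
        }
        return points;
    }

    public static void main(String[] args) {
        // 4 regions need 3 split points.
        for (String p : splitPoints(4)) {
            System.out.println(p);
        }
        // -> 40000000000000000000000000000000
        //    80000000000000000000000000000000
        //    c0000000000000000000000000000000
    }
}
```

[The choice matters because split points must match the distribution of your keys: uniform hex boundaries only balance load if the keys themselves are uniformly distributed hex strings, which hashed keys are.]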
