Sure, I can try that. Just curious, out of these two strategies, which one do you think is better? Do you have any experience trying one or the other?
Thanks
Upen

On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[email protected]> wrote:

> Thanks for sharing your experiences.
>
> Have you considered upgrading to HBase 0.92 or 0.94? There have been
> several bug fixes / enhancements to the LoadIncrementalHFiles bulk load
> API in newer HBase releases.
>
> Cheers
>
> On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
> [email protected]> wrote:
>
> > Hi All,
> > I have a question about improving Map/Reduce job performance while
> > ingesting a huge amount of data into HBase using HFileOutputFormat.
> > Here is what we are using:
> >
> > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > 2) *hbase-0.90.4-cdh3u2*
> >
> > I've used two different strategies, as described below:
> >
> > *Strategy #1:* Pre-split the table with 10 regions per region server,
> > then kick off the Hadoop job with
> > HFileOutputFormat.configureIncrementalLoad(). This creates reduce
> > tasks equal to the number of regions (region servers * 10). We used a
> > hash of each record as the map output key. Each mapper finished in an
> > acceptable amount of time, but the reduce tasks took forever: first
> > the copy/shuffle phase took considerable time, and then the sort
> > phase took even longer to finish.
> > We tried to address this by constructing the key as
> > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all records of
> > a given mapper. The idea was to reduce shuffling / copying from each
> > mapper. But even this didn't save us any time, and the reduce step
> > still took a significant amount of time to finish. I played with the
> > number of pre-split regions in both directions, but to no avail.
> > This led us to Strategy #2, where we got rid of the reduce step.
> >
> > *QUESTION:* Is there anything I could have done better in this
> > strategy to make the reduce step finish faster?
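The "fixedhash1"_"hash2" key construction described in Strategy #1 can be sketched in plain Java. This is a hypothetical illustration, not the original job's code: the use of MD5 and the 4-hex-character mapper prefix are assumptions, and `mapperId` stands in for whatever per-mapper identifier (e.g. the input split name) the real job would use.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CompositeKey {

    // Hex-encoded MD5 of the input; stable across JVMs.
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] d = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, d));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    // "fixedhash1" is constant for one mapper; "hash2" varies per record.
    // All records from one mapper therefore share a key prefix, which was
    // the idea for cutting down cross-node shuffle traffic.
    static String rowKey(String mapperId, String record) {
        String fixedhash1 = md5Hex(mapperId).substring(0, 4);
        String hash2 = md5Hex(record);
        return fixedhash1 + "_" + hash2;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("split-00042", "some record text"));
        System.out.println(rowKey("split-00042", "another record"));
    }
}
```

Note the trade-off this illustrates: a shared per-mapper prefix concentrates each mapper's output into fewer reduce partitions, but it also means each reducer still has to sort its full share, which matches the observation above that the sort phase stayed slow.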
> > Do I need to construct row keys differently than "hash1"_"hash2" of
> > the text? Is this a known issue with CDH3 or HBase 0.90? Please help
> > me troubleshoot.
> >
> > *Strategy #2:* Pre-split the table with 10 regions per region server,
> > then kick off the Hadoop job with
> > HFileOutputFormat.configureIncrementalLoad(), but set the number of
> > reducers to 0. In this (current) strategy, I pre-sort all the mapper
> > output using a TreeSet before writing it out. With the number of
> > reducers = 0, the mappers write HFiles directly. This was great: the
> > job (with no reduce phase) finished very fast, and we noticed the
> > HFiles got written very quickly. Then I used the
> > *LoadIncrementalHFiles.doBulkLoad()* API to move the HFiles into
> > HBase, calling it from the driver class on successful completion of
> > the job. This is working much better than Strategy #1 in terms of
> > performance, but the doBulkLoad() call in the driver sometimes takes
> > a long time when there is a huge amount of data.
> >
> > *QUESTION:* Is there any way to make the bulk load run faster? Can I
> > call this API directly from the mapper instead of waiting for the
> > whole job to finish first? I've also used the HBase
> > "completebulkload" utility, but it has two issues: first, I don't see
> > any performance improvement with it; second, it needs to be run
> > separately from the Hadoop job driver class, and we wanted to
> > integrate the two pieces. So we used
> > *LoadIncrementalHFiles.doBulkLoad()*.
> > Also, we used the HBase RegionSplitter to pre-split the regions, but
> > the 0.90 version doesn't have the option to pass an ALGORITHM. Is
> > that something we need to worry about?
> >
> > Please help point me in the right direction to address this problem.
> >
> > Thanks
> > Upen
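On the TreeSet pre-sort in Strategy #2: with zero reducers there is no framework sort, so the mapper itself must emit rows in the unsigned byte-lexicographic order that HFiles require. A minimal sketch of that ordering in plain Java, assuming UTF-8 string keys and using a TreeMap as a stand-in for the real mapper-side buffer (the key values are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

public class SortedWrite {

    // Unsigned lexicographic byte comparison, the order HBase uses for
    // row keys; HFiles must be written with keys in this order.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Buffer rows in sorted order, then emit them sequentially --
        // standing in for the mapper's TreeSet plus direct HFile write.
        TreeMap<byte[], String> rows = new TreeMap<>(SortedWrite::compareBytes);
        for (String k : new String[] {"f2_9a", "f1_07", "f1_ff", "f2_00"}) {
            rows.put(k.getBytes(StandardCharsets.UTF_8), "value-for-" + k);
        }
        for (Map.Entry<byte[], String> e : rows.entrySet()) {
            // Prints f1_07, f1_ff, f2_00, f2_9a in that order.
            System.out.println(new String(e.getKey(), StandardCharsets.UTF_8));
        }
    }
}
```

The catch this buffering implies: the sorted set lives in mapper memory, so each mapper's output must fit in heap (or be flushed in sorted runs), which is part of why the reducer-based path exists in the first place.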
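On the RegionSplitter question: 0.90's RegionSplitter indeed lacks the pluggable ALGORITHM option of later releases, but for hex-encoded hash keys the equivalent of the later HexStringSplit can be computed by hand: divide the 128-bit MD5 key space into N uniform ranges and use the boundaries as split keys. A hypothetical sketch (the region count is illustrative, not from the thread):

```java
import java.math.BigInteger;

public class HexSplits {

    // Returns numRegions - 1 evenly spaced split keys over the
    // 32-hex-char (128-bit) key space, mirroring what HexStringSplit
    // produces in later HBase releases.
    static String[] splitKeys(int numRegions) {
        BigInteger max = BigInteger.ONE.shiftLeft(128);   // 2^128
        BigInteger step = max.divide(BigInteger.valueOf(numRegions));
        String[] splits = new String[numRegions - 1];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = String.format("%032x",
                    step.multiply(BigInteger.valueOf(i)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. 4 region servers * 10 regions each would be
        // splitKeys(40); 4 is used here just to keep the output short.
        for (String s : splitKeys(4)) {
            System.out.println(s);
        }
    }
}
```

Uniform splits like these only balance load if the row keys really are uniformly distributed hashes, which the "hash1"_"hash2" scheme above should satisfy; a fixed per-mapper prefix, by contrast, skews writes toward the regions covering the live prefixes.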
