Thanks for sharing your experiences. Have you considered upgrading to HBase 0.92 or 0.94? There have been several bug fixes and enhancements to the LoadIncrementalHFiles.bulkLoad() API in newer HBase releases.
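For reference, here is a minimal sketch of invoking the bulk load programmatically (in the Apache HBase mapreduce package the class is LoadIncrementalHFiles and the method is doBulkLoad(); the table name and HFile directory below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadOnly {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");        // placeholder table name
    // Directory that HFileOutputFormat wrote the HFiles to (placeholder path).
    Path hfileDir = new Path("/user/me/hfile-output");
    // Moves the generated HFiles into the table's regions, splitting any
    // file that straddles a region boundary before handing it over.
    new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    table.close();
  }
}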
Cheers

On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <[email protected]> wrote:

> Hi All,
> I have a question about improving Map/Reduce job performance while
> ingesting a huge amount of data into HBase using HFileOutputFormat. Here is
> what we are using:
>
> 1) *Cloudera hadoop-0.20.2-cdh3u*
> 2) *hbase-0.90.40cdh3u2*
>
> I've used 2 different strategies, as described below:
>
> *Strategy #1:* Pre-split the regions with 10 regions per region server,
> and then kick off the Hadoop job with
> HFileOutputFormat.configureIncrementalLoad. This mechanism creates one
> reduce task per region (i.e., 10 per region server). We used the "hash" of
> each record as the map output key. Each mapper finished in an acceptable
> amount of time, but the reduce tasks took forever: first the copy/shuffle
> phase took considerable time, and then the sort phase took forever to
> finish.
> We tried to address this by constructing the key as "fixedhash1"_"hash2",
> where "fixedhash1" is fixed for all the records of a given mapper. The idea
> was to reduce shuffling/copying from each mapper. But even this didn't save
> us any time, and the reduce step still took a significant amount of time to
> finish. I played with adjusting the number of pre-split regions in both
> directions, but to no avail.
> This led us to move to Strategy #2, where we got rid of the reduce step.
>
> *QUESTION:* Is there anything I could have done better in this strategy to
> make the reduce step finish faster? Do I need to produce row keys
> differently than "hash1"_"hash2" of the text? Is this a known issue with
> CDH3 or HBase 0.90? Please help me troubleshoot.
>
> *Strategy #2:* Pre-split the regions with 10 regions per region server,
> and then kick off the Hadoop job with
> HFileOutputFormat.configureIncrementalLoad, but set the number of reducers
> to 0. In this strategy (the current one), I pre-sorted all the mapper input
> using a TreeSet before writing the output. With the number of reducers set
> to 0, the mappers write directly to HFiles. This was cool because the job
> (with no reduce phase) finished very fast and we noticed the HFiles got
> written very quickly. Then I used the *LoadIncrementalHFiles.bulkLoad()*
> API to move the HFiles into HBase, calling it from the driver class on
> successful completion of the job. This works much better than Strategy #1
> in terms of performance, but the bulkLoad() call in the driver sometimes
> takes a long time when there is a huge amount of data.
>
> *QUESTION:* Is there any way to make bulkLoad() run faster? Can I call
> this API from the mapper directly, instead of waiting for the whole job to
> finish first? I've used the HBase "completebulkload" utility, but it has
> two issues. First, I do not see any performance improvement with it.
> Second, it needs to be run separately from the Hadoop job driver class, and
> we wanted to integrate the two pieces. So we used
> *LoadIncrementalHFiles.bulkLoad()*.
>
> Also, we used the HBase RegionSplitter to pre-split the regions, but HBase
> 0.90 doesn't have the option to pass an ALGORITHM. Is that something we
> need to worry about?
>
> Please point me in the right direction to address this problem.
>
> Thanks
> Upen
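For anyone finding this thread later, here is a rough end-to-end sketch of the driver flow discussed above (the Strategy #1 shape: configureIncrementalLoad installs HFileOutputFormat, the TotalOrderPartitioner and one reduce task per region, and the HFiles are bulk-loaded once the job succeeds). The table name, column family, qualifier, input layout and mapper are assumptions for illustration only:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileIngestDriver {

  // Illustrative mapper: expects "rowkey<TAB>value" lines and emits one Put per line.
  public static class IngestMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");          // placeholder table name

    Job job = new Job(conf, "hfile-ingest");
    job.setJarByClass(HFileIngestDriver.class);
    job.setMapperClass(IngestMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    Path hfileDir = new Path(args[1]);
    FileOutputFormat.setOutputPath(job, hfileDir);

    // Sets HFileOutputFormat as the output format, installs the
    // TotalOrderPartitioner and a sorting reducer, and sets the number of
    // reduce tasks to the table's current region count.
    HFileOutputFormat.configureIncrementalLoad(job, table);

    if (job.waitForCompletion(true)) {
      // Hand the generated HFiles to the region servers.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, table);
    }
    table.close();
  }
}

For the Strategy #2 variant described above, the same driver would call job.setNumReduceTasks(0) after configureIncrementalLoad, and each mapper would then have to emit its keys in sorted order itself (hence the TreeSet in the original post), since HFileOutputFormat requires its input keys to arrive sorted.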
