Re: HBase Map/Reduce Data Ingest Performance

Nick Dimiduk Tue, 18 Dec 2012 11:21:24 -0800

Please forgive my poor choice of words; I meant no disrespect.

-n


On Tue, Dec 18, 2012 at 11:06 AM, Upender K. Nimbekar <
[email protected]> wrote:

> I would ask that you respect the people asking questions on this forum.
> Let's not start the thread in the wrong direction.
> I wish it were a dumb question. I did chmod 777 prior to calling bulkLoad().
> The chmod call succeeded, but the bulkLoad() call still threw an exception.
> However, it does work if I do the chmod and bulkLoad() from the Hadoop
> driver after the job is finished.
> BTW, the HBase user needs WRITE permission and NOT read, because it creates
> some _tmp directories.
>
> Upen
>
> On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[email protected]> wrote:
>
> > Dumb question: what are the filesystem permissions of your generated
> > HFiles? Can the HBase process read them? Maybe a simple chmod or chown
> > will get you the rest of the way there.
> >
> > On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <
> >  [email protected]> wrote:
> >
> > > Thanks! I'm calling doBulkLoad() from the mapper's cleanup() method, but
> > > I'm running into permission issues when the hbase user tries to import
> > > the HFiles into HBase. I'm not sure whether there is a way to change the
> > > target HDFS file permissions via HFileOutputFormat.
> > >
> > >
> > > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[email protected]> wrote:
> > >
> > > > I think the second approach is better.
> > > >
> > > > Cheers
> > > >
> > > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
> > > > [email protected]> wrote:
> > > >
> > > > > Sure. I can try that. Just curious: of these 2 strategies, which one
> > > > > do you think is better? Do you have any experience trying one or the
> > > > > other?
> > > > >
> > > > > Thanks
> > > > > Upen
> > > > >
> > > > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[email protected]> wrote:
> > > > >
> > > > > > Thanks for sharing your experiences.
> > > > > >
> > > > > > Have you considered upgrading to HBase 0.92 or 0.94?
> > > > > > There have been several bug fixes and enhancements to the
> > > > > > LoadIncrementalHFiles.bulkLoad() API in newer HBase releases.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
> > > > > > [email protected]> wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > > I have a question about improving Map/Reduce job performance
> > > > > > > while ingesting a huge amount of data into HBase using
> > > > > > > HFileOutputFormat. Here is what we are using:
> > > > > > >
> > > > > > > 1) *Cloudera hadoop-0.20.2-cdh3u*
> > > > > > > 2) *hbase-0.90.40cdh3u2*
> > > > > > >
> > > > > > > I've used 2 different strategies as described below:
> > > > > > >
> > > > > > > *Strategy#1:* Pre-split the table with 10 regions per region
> > > > > > > server, then kick off the Hadoop job with
> > > > > > > HFileOutputFormat.configureIncrementalLoad. This mechanism
> > > > > > > creates reduce tasks equal to the number of regions * 10. We
> > > > > > > used the "hash" of each record as the key for the map output.
> > > > > > > With this approach each mapper finished in an acceptable amount
> > > > > > > of time, but the reduce tasks took forever to finish. We found
> > > > > > > that first the copy/shuffle phase took a considerable amount of
> > > > > > > time, and then the sort phase took forever to finish.
> > > > > > > We tried to address this issue by constructing the key as
> > > > > > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the
> > > > > > > records of a given mapper. The idea was to reduce shuffling /
> > > > > > > copying from each mapper. But even this solution didn't save us
> > > > > > > any time, and the reduce step still took a significant amount of
> > > > > > > time to finish. I played with adjusting the number of pre-split
> > > > > > > regions in both directions, but to no avail.
> > > > > > > This led us to move to Strategy#2, where we got rid of the
> > > > > > > reduce step.
> > > > > > >
> > > > > > > *QUESTION:* Is there anything I could've done better in this
> > > > > > > strategy to make the reduce step finish faster? Do I need to
> > > > > > > produce row keys differently than "hash1"_"hash2" of the text?
> > > > > > > Is it a known issue with CDH3 or HBase 0.90? Please help me
> > > > > > > troubleshoot.
> > > > > > >
> > > > > > > *Strategy#2:* Pre-split the table with 10 regions per region
> > > > > > > server, then kick off the Hadoop job with
> > > > > > > HFileOutputFormat.configureIncrementalLoad, but set the number
> > > > > > > of reducers to 0. In this (current) strategy, I pre-sorted all
> > > > > > > the mapper input using a TreeSet before writing it to the
> > > > > > > output. With the number of reducers set to 0, the mappers write
> > > > > > > directly to HFiles. This was cool because the map/reduce job
> > > > > > > (with no reduce phase, actually) finished very fast, and we
> > > > > > > noticed the HFiles got written very quickly. Then I used the
> > > > > > > *hbase.utils.LoadIncrementalHFiles.bulkLoad()* API to move the
> > > > > > > HFiles into HBase. I called this method in the driver class on
> > > > > > > successful completion of the job. This is working much better
> > > > > > > than Strategy#1 in terms of performance. But the bulkLoad() call
> > > > > > > in the driver sometimes takes longer if there is a huge amount
> > > > > > > of data.
> > > > > > >
> > > > > > > *QUESTION:* Is there any way to make bulkLoad() run faster? Can
> > > > > > > I call this API from the mapper directly, instead of waiting for
> > > > > > > the whole job to finish first? I've used the HBase
> > > > > > > "completebulkload" utility, but it has two issues. First, I do
> > > > > > > not see any performance improvement with it. Second, it needs to
> > > > > > > be run separately from the Hadoop job driver class, and we
> > > > > > > wanted to integrate both pieces. So we used
> > > > > > > *hbase.utils.LoadIncrementalHFiles.bulkLoad()*.
> > > > > > > Also, we used the HBase RegionSplitter to pre-split the regions,
> > > > > > > but the HBase 0.90 version doesn't have the option to pass
> > > > > > > ALGORITHM. Is that something we need to worry about?
> > > > > > >
> > > > > > > Please help point me in the right direction to address this
> > > > > > > problem.
> > > > > > >
> > > > > > > Thanks
> > > > > > > Upen
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
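
For anyone trying to picture the "fixedhash1"_"hash2" key from Strategy#1, here is a minimal sketch in plain Java. The thread does not say which hash function was used or how the fixed part was derived, so MD5, the 8-character prefix, and the names CompositeKeySketch, md5Hex, and compositeKey are all assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CompositeKeySketch {

    // Hex-encoded MD5 of a string. MD5 is an assumption; the thread only says "hash".
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every standard JRE ships MD5, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical key layout: a prefix that is constant for one mapper,
    // an underscore, then the hash of the record itself.
    static String compositeKey(String mapperId, String record) {
        return md5Hex(mapperId).substring(0, 8) + "_" + md5Hex(record);
    }

    public static void main(String[] args) {
        // Two records from the same (hypothetical) mapper share the prefix.
        System.out.println(compositeKey("mapper-0", "record A"));
        System.out.println(compositeKey("mapper-0", "record B"));
    }
}
```

One thing to keep in mind: configureIncrementalLoad partitions map output by region start keys, so keys that share a prefix tend to land on the same reducer. A per-mapper prefix therefore changes which reducer receives the data but not how many bytes have to be shuffled, which may be part of why it did not help here.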

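The pre-sorting step in Strategy#2 matters because an HFile writer expects keys in ascending order, and HBase compares row keys as unsigned bytes, not as Java strings. Below is a rough, stdlib-only sketch of sorting keys that way with a TreeSet. The class and method names (PresortSketch, ROW_ORDER, sortedRowKeys) are made up for illustration, and a real mapper would be sorting KeyValue objects, not bare strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class PresortSketch {

    // Unsigned lexicographic comparison of byte arrays, which is the
    // ordering HBase applies to row keys.
    static final Comparator<byte[]> ROW_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    };

    // Buffers the keys in a TreeSet and returns them in row-key order.
    // Note that a TreeSet silently drops keys that compare as equal.
    static List<String> sortedRowKeys(List<String> keys) {
        TreeSet<byte[]> sorted = new TreeSet<>(ROW_ORDER);
        for (String k : keys) {
            sorted.add(k.getBytes(StandardCharsets.UTF_8));
        }
        List<String> out = new ArrayList<>();
        for (byte[] k : sorted) {
            out.add(new String(k, StandardCharsets.UTF_8));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortedRowKeys(Arrays.asList("b_2", "a_9", "a_10", "B_1")));
    }
}
```

Buffering everything in a TreeSet means the whole input split has to fit in the mapper's heap, so this approach probably only works while splits stay reasonably small.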