I don't think Nick was being disrespectful. Usually, when people prefix a question with "Dumb question," it means they think their own question is dumb, but they ask it anyway in case something basic wasn't covered.
J-D

On Tue, Dec 18, 2012 at 11:06 AM, Upender K. Nimbekar <[email protected]> wrote:
> I would like to request that you maintain respect for the people asking questions
> on this forum. Let's not start the thread in the wrong direction.
> I wish it were a dumb question. I did chmod 777 prior to calling bulkLoad.
> The call succeeded, but the bulkLoad call still threw an exception. However, it does
> work if I do chmod and bulkLoad() from the Hadoop driver after the job is
> finished.
> BTW, the HBase user needs WRITE permission, and NOT just read, because it creates some
> _tmp directories.
>
> Upen
>
> On Tue, Dec 18, 2012 at 12:31 PM, Nick Dimiduk <[email protected]> wrote:
>
>> Dumb question: what are the filesystem permissions of your generated HFiles?
>> Can the HBase process read them? Maybe a simple chmod or chown will get you
>> the rest of the way there.
>>
>> On Mon, Dec 17, 2012 at 6:30 PM, Upender K. Nimbekar <
>> [email protected]> wrote:
>>
>> > Thanks! I'm calling doBulkLoad() from the mapper cleanup() method. But I'm running
>> > into permission issues when the hbase user tries to import the HFiles into HBase.
>> > Not sure if there is a way to change the target HDFS file permissions via
>> > HFileOutputFormat.
>> >
>> > On Mon, Dec 17, 2012 at 7:52 PM, Ted Yu <[email protected]> wrote:
>> >
>> > > I think the second approach is better.
>> > >
>> > > Cheers
>> > >
>> > > On Mon, Dec 17, 2012 at 11:11 AM, Upender K. Nimbekar <
>> > > [email protected]> wrote:
>> > >
>> > > > Sure. I can try that. Just curious: out of these 2 strategies, which one
>> > > > do you think is better? Do you have any experience trying one or the
>> > > > other?
>> > > >
>> > > > Thanks
>> > > > Upen
>> > > >
>> > > > On Mon, Dec 17, 2012 at 12:45 PM, Ted Yu <[email protected]> wrote:
>> > > >
>> > > > > Thanks for sharing your experiences.
>> > > > >
>> > > > > Have you considered upgrading to HBase 0.92 or 0.94?
>> > > > > There have been several bug fixes / enhancements
>> > > > > to the LoadIncrementalHFiles.bulkLoad() API in newer HBase releases.
>> > > > >
>> > > > > Cheers
>> > > > >
>> > > > > On Mon, Dec 17, 2012 at 7:34 AM, Upender K. Nimbekar <
>> > > > > [email protected]> wrote:
>> > > > >
>> > > > > > Hi All,
>> > > > > > I have a question about improving Map/Reduce job performance while
>> > > > > > ingesting a huge amount of data into HBase using HFileOutputFormat.
>> > > > > > Here is what we are using:
>> > > > > >
>> > > > > > 1) *Cloudera hadoop-0.20.2-cdh3u*
>> > > > > > 2) *hbase-0.90.4-cdh3u2*
>> > > > > >
>> > > > > > I've used 2 different strategies, as described below:
>> > > > > >
>> > > > > > *Strategy #1:* Pre-split the regions, with 10 regions per region
>> > > > > > server, and then kick off the Hadoop job with
>> > > > > > HFileOutputFormat.configureIncrementalLoad(). This mechanism creates
>> > > > > > a number of reduce tasks equal to the number of regions (region
>> > > > > > servers * 10). We used the "hash" of each record as the key of the
>> > > > > > map output. Each mapper finished in an acceptable amount of time,
>> > > > > > but the reduce tasks take forever to finish. We found that first the
>> > > > > > copy/shuffle phase took a considerable amount of time, and then the
>> > > > > > sort phase took forever to finish.
>> > > > > > We tried to address this by constructing the key as
>> > > > > > "fixedhash1"_"hash2", where "fixedhash1" is fixed for all the records
>> > > > > > of a given mapper. The idea was to reduce shuffling/copying from each
>> > > > > > mapper. But even this didn't save us any time, and the reduce step
>> > > > > > still took a significant amount of time to finish. I played with the
>> > > > > > number of pre-split regions in both directions, but to no avail.
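The "fixedhash1"_"hash2" key scheme described above can be sketched in plain Java. This is a minimal, hypothetical reconstruction (the thread does not show the actual code); the names and the choice of the task-attempt ID as the per-mapper salt are assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of the salted row-key scheme from the thread: a fixed per-mapper
// prefix ("fixedhash1") followed by a per-record hash ("hash2").
public class SaltedKey {
    // Hex-encoded MD5 of a string; 32 hex characters.
    static String md5Hex(String s) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Row key = fixed per-mapper salt + "_" + hash of the record itself.
    static String rowKey(String mapperSalt, String record) throws Exception {
        return mapperSalt + "_" + md5Hex(record);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical per-mapper salt, e.g. derived from the task attempt id.
        String salt = md5Hex("attempt_201212170001_m_000004");
        String k1 = rowKey(salt, "record-a");
        String k2 = rowKey(salt, "record-b");
        // All keys from one mapper share a 32-char prefix, so they sort into a
        // narrow key range. That is also why it did not cut shuffle cost: the
        // data still has to travel to whichever reducers own that range.
        assert k1.substring(0, 32).equals(k2.substring(0, 32));
        System.out.println(k1);
    }
}
```

Note that such a prefix concentrates each mapper's output into few reducers rather than eliminating the transfer, which is consistent with the observation above that it "didn't save us any time."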
>> > > > > > This led us to move to Strategy #2, where we got rid of the reduce step.
>> > > > > >
>> > > > > > *QUESTION:* Is there anything I could've done better in this strategy
>> > > > > > to make the reduce step finish faster? Do I need to produce row keys
>> > > > > > differently than "hash1"_"hash2" of the text? Is it a known issue with
>> > > > > > CDH3 or HBase 0.90? Please help me troubleshoot.
>> > > > > >
>> > > > > > *Strategy #2:* Pre-split the regions, with 10 regions per region
>> > > > > > server, and then kick off the Hadoop job with
>> > > > > > HFileOutputFormat.configureIncrementalLoad(), but set the number of
>> > > > > > reducers = 0. In this (current) strategy, I pre-sorted all the mapper
>> > > > > > input using a TreeSet before writing the output. With the number of
>> > > > > > reducers = 0, the mappers write directly to HFiles. This was cool
>> > > > > > because the map/reduce job (no reduce phase, actually) finished very
>> > > > > > fast, and we noticed the HFiles got written very quickly. Then I used
>> > > > > > the *LoadIncrementalHFiles.bulkLoad()* API to move the HFiles into
>> > > > > > HBase. I called this method on successful completion of the job in the
>> > > > > > driver class. This is working much better than Strategy #1 in terms of
>> > > > > > performance, but the bulkLoad() call in the driver sometimes takes
>> > > > > > longer if there is a huge amount of data.
>> > > > > >
>> > > > > > *QUESTION:* Is there any way to make bulkLoad() run faster? Can I
>> > > > > > call this API from the mapper directly, instead of waiting for the
>> > > > > > whole job to finish first? I've used the HBase "completebulkload"
>> > > > > > utility, but it has two issues. First, I do not see any performance
>> > > > > > improvement with it.
>> > > > > > Second, it needs to be run separately from the Hadoop job driver
>> > > > > > class, and we wanted to integrate both pieces. So we used
>> > > > > > *LoadIncrementalHFiles.bulkLoad()*.
>> > > > > > Also, we used the HBase RegionSplitter to pre-split the regions, but
>> > > > > > the HBase 0.90 version doesn't have the option to pass an ALGORITHM.
>> > > > > > Is that something we need to worry about?
>> > > > > >
>> > > > > > Please point me in the right direction to address this problem.
>> > > > > >
>> > > > > > Thanks
>> > > > > > Upen
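On Strategy #2 above: with zero reducers there is no framework sort, so each mapper must emit its records in row-key order itself, because HFile writers reject out-of-order keys. A minimal plain-Java sketch of that buffering step, with no HBase dependencies (the comparator mirrors the unsigned lexicographic byte ordering HBase uses, as in Bytes.compareTo; class and method names here are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the "pre-sort in the mapper" step from Strategy #2: buffer
// records in a sorted map keyed by row key, then flush in sorted order
// (e.g. from the mapper's cleanup() method).
public class MapperSideSort {
    // Unsigned, lexicographic byte[] comparison, like HBase's Bytes.compareTo.
    static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    };

    private final TreeMap<byte[], String> buffer = new TreeMap<>(UNSIGNED_LEX);

    // map(): buffer the record instead of writing it straight to the output.
    void add(String rowKey, String value) {
        buffer.put(rowKey.getBytes(StandardCharsets.UTF_8), value);
    }

    // cleanup(): iterate in row-key order -- the order HFile writers require.
    Iterable<Map.Entry<byte[], String>> sortedEntries() {
        return buffer.entrySet();
    }
}
```

One trade-off worth noting: buffering everything in a TreeMap/TreeSet holds the whole map task's output in heap, so this only works when each mapper's slice of data fits in memory.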
