Oh, and the reason to use an MR job to count rows is that if the table has many rows, a single process would take too long. (If you know you have a small table, use the 'count' command in the shell.)
St.Ack

On Wed, Apr 22, 2009 at 9:06 AM, Stack <[email protected]> wrote:

> If you run
>
>   ./bin/hadoop -jar hbase.jar rowcounter
>
> It will emit usage. You are a smart fellow. I think you can take it from
> there.
>
> Stack
>
> On Apr 22, 2009, at 5:48, Rakhi Khatwani <[email protected]> wrote:
>
>> Hi Lars,
>> Thanks for the suggestion. I also figured out my problem using
>> TableInputFormatBase.
>>
>> My table had only one region, but I still wanted to split the input into
>> 4 maps, so I am basically overriding the getInputSplits() method in
>> TableInputFormatBase.
>>
>> One more question:
>> is there any method in the HBase API which can count the number of rows
>> in a table?
>> I tried googling it, and all I came across is a RowCounter class, which
>> is a mapreduce job to count the number of rows. But I really don't know
>> how to use it. Any suggestions?
>>
>> Thanks,
>> Raakhi
>>
>> On Wed, Apr 22, 2009 at 4:30 AM, Lars George <[email protected]> wrote:
>>
>>> Hi Rakhi,
>>>
>>> This is all done in the TableInputFormatBase class, which you can extend
>>> and then override the getSplits() function:
>>>
>>> http://hadoop.apache.org/hbase/docs/r0.19.1/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
>>>
>>> This is where you can then specify how many rows per map are assigned.
>>> Really straightforward as I see it. I have used it to implement a special
>>> "only use N regions" support where I can run a sample subset against an
>>> MR job. For example, only map 5 out of 8K regions of a table.
>>>
>>> The default one will always split all regions into N maps. Hence the
>>> recommendation to set the number of maps to the number of regions in a
>>> table. If you set it to something lower, then it will split the regions
>>> into a smaller number but with more rows per map, i.e. each map gets
>>> more than one region to process.
>>>
>>> Look into the source of the above class and it should be obvious - I
>>> hope.
>>>
>>> Lars
>>>
>>> Rakhi Khatwani wrote:
>>>
>>>> Hi,
>>>> I have a table with N records.
>>>> Now I want to run a map reduce job with 4 maps and 0 reduces.
>>>> Is there a way I can create my own custom input split so that I can
>>>> send 'n' records to each map?
>>>> If there is a way, can I have a sample code snippet to gain a better
>>>> understanding?
>>>>
>>>> Thanks,
>>>> Raakhi
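
For anyone reading this thread later, here is a rough sketch of the grouping Lars describes, where asking for fewer maps than regions gives each map more than one region. It is plain Python that models only the arithmetic; it is not HBase code and does not call any HBase API. The names group_regions and group_to_split are made up for illustration, and the even distribution of regions is an assumption, so check the TableInputFormatBase source for what getSplits() really does.

```python
from typing import List, Tuple

# A region is modelled as (start_key, end_key); "" means "open-ended".
# This is an illustrative model only, not an HBase type.
Region = Tuple[str, str]


def group_regions(regions: List[Region], num_maps: int) -> List[List[Region]]:
    """Spread contiguous regions over at most num_maps groups.

    Assumption: regions are sorted by start key and are distributed as
    evenly as possible, with earlier groups taking any remainder.
    """
    if num_maps < 1:
        raise ValueError("num_maps must be at least 1")
    # There can never be more groups than regions in this model.
    n = min(num_maps, len(regions))
    if n == 0:
        return []
    base, extra = divmod(len(regions), n)
    groups = []
    start = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        groups.append(regions[start:start + size])
        start += size
    return groups


def group_to_split(group: List[Region]) -> Region:
    """Collapse one group of contiguous regions into a single key range."""
    return (group[0][0], group[-1][1])


if __name__ == "__main__":
    regions = [("", "b"), ("b", "d"), ("d", "f"), ("f", "h"), ("h", "")]
    for group in group_regions(regions, 2):
        print(group_to_split(group), "covers", len(group), "regions")
```

Note that in this model a table with a single region always ends up as one split no matter how many maps are requested, which is why getting 4 maps out of a one-region table needs a custom override that cuts the region's key range into smaller pieces.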
