Hi St Ack,
Well, I did go through the usage... we are supposed to pass 3 parameters:
OutputDir, TableName and Columns.
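So I take it the invocation is roughly the following, where the output path,
table name and column are just placeholders of my own:

  ./bin/hadoop jar hbase.jar rowcounter /tmp/rowcounter_out mytable colfam: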
What I actually want is an int value, count, which holds the number of rows
in the table.
I gather this program stores its output in some output dir... correct me if I
am going wrong.
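In other words, what I am after is something along the lines of this rough,
untested sketch (going by the 0.19 client API docs; the table and column
family names here are made up):

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Scanner;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CountRows {
    public static void main(String[] args) throws IOException {
      HBaseConfiguration conf = new HBaseConfiguration();
      HTable table = new HTable(conf, "mytable");   // made-up table name
      // scan one column family; rows with no cell in it would be missed
      Scanner scanner =
          table.getScanner(new byte[][] { Bytes.toBytes("colfam:") });
      int count = 0;
      for (RowResult row : scanner) {               // one RowResult per row
        count++;
      }
      scanner.close();
      System.out.println("row count = " + count);
    }
  }

I realise a plain client-side scan like this walks the whole table in one
process, which is presumably why the MR job exists for big tables.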
Thanks,
Raakhi
On Wed, Apr 22, 2009 at 8:25 AM, stack <[email protected]> wrote:
> Oh, and the reason to use an MR job for counting rows is that, if there are
> many rows, a single process would take too long (if you know you have a
> small table, use the 'count' command in the shell).
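> For example, from ./bin/hbase shell, something like the below (table name
> made up):
>
>   hbase> count 'mytable'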
>
> St.Ack
>
> On Wed, Apr 22, 2009 at 9:06 AM, Stack <[email protected]> wrote:
>
> > If you run
> >
> > ./bin/hadoop jar hbase.jar rowcounter
> >
> > It will emit usage. You are a smart fellow. I think you can take it from
> > there.
> >
> > Stack
> >
> >
> >
> >
> > On Apr 22, 2009, at 5:48, Rakhi Khatwani <[email protected]> wrote:
> >
> >> Hi Lars,
> >> Thanks for the suggestion, I also figured out my problem using
> >> TableInputFormatBase.
> >>
> >> My table has only one region, but I still wanted to split the input into
> >> 4 maps, so I am basically overriding the getSplits() method in
> >> TableInputFormatBase.
> >>
> >> One more question: is there any method in the HBase API which can count
> >> the number of rows in a table?
> >> I tried googling it and all I came across was a RowCounter class, which
> >> is a MapReduce job to count the number of rows, but I really don't know
> >> how to use it. Any suggestions?
> >>
> >> thanks,
> >> Raakhi
> >>
> >>
> >> On Wed, Apr 22, 2009 at 4:30 AM, Lars George <[email protected]> wrote:
> >>
> >>> Hi Rakhi,
> >>>
> >>> This is all done in the TableInputFormatBase class, which you can extend
> >>> and then override the getSplits() function:
> >>>
> >>> http://hadoop.apache.org/hbase/docs/r0.19.1/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
> >>>
> >>> This is where you can then specify how many rows per map are assigned.
> >>> Really straightforward as I see it. I have used it to implement a
> >>> special "only use N regions" support where I can run a sample subset
> >>> through an MR job, for example mapping only 5 out of 8K regions of a
> >>> table.
> >>>
> >>> The default one will always split all regions into N maps, hence the
> >>> recommendation to set the number of maps to the number of regions in a
> >>> table. If you set it to something lower, then it will split the regions
> >>> into a smaller number of splits with more rows per map, i.e. each map
> >>> gets more than one region to process.
> >>>
> >>> Look into the source of the above class and it should be obvious - I
> >>> hope.
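> >>>
> >>> For what it is worth, the "only use N regions" trick boils down to
> >>> something like the rough, untested sketch below (written against the
> >>> 0.19 mapred API; the class name and sample size are made up):
> >>>
> >>> import java.io.IOException;
> >>> import org.apache.hadoop.hbase.mapred.TableInputFormat;
> >>> import org.apache.hadoop.mapred.InputSplit;
> >>> import org.apache.hadoop.mapred.JobConf;
> >>>
> >>> public class SampledTableInputFormat extends TableInputFormat {
> >>>   private static final int MAX_SPLITS = 5;  // made-up sample size
> >>>
> >>>   public InputSplit[] getSplits(JobConf job, int numSplits)
> >>>       throws IOException {
> >>>     // the base class hands back one split per region of the table
> >>>     InputSplit[] all = super.getSplits(job, numSplits);
> >>>     if (all.length <= MAX_SPLITS) {
> >>>       return all;
> >>>     }
> >>>     // keep only the first MAX_SPLITS region splits as the sample
> >>>     InputSplit[] sample = new InputSplit[MAX_SPLITS];
> >>>     System.arraycopy(all, 0, sample, 0, MAX_SPLITS);
> >>>     return sample;
> >>>   }
> >>> }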
> >>>
> >>> Lars
> >>>
> >>>
> >>>
> >>> Rakhi Khatwani wrote:
> >>>
> >>>> Hi,
> >>>> I have a table with N records, and I want to run a MapReduce job with
> >>>> 4 maps and 0 reduces.
> >>>> Is there a way I can create my own custom input split so that I can
> >>>> send 'n' records to each map?
> >>>> If there is a way, can I have a sample code snippet to gain a better
> >>>> understanding?
> >>>>
> >>>> Thanks
> >>>> Raakhi.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
>