Re: Custom Input Split

stack Wed, 22 Apr 2009 09:25:51 -0700

Oh, and the reason to use an MR job to count rows is that if the table has
many rows, a single process would take too long. (If you know you have a
small table, use the 'count' command in the shell.)


St.Ack

On Wed, Apr 22, 2009 at 9:06 AM, Stack <[email protected]> wrote:

> If you run
>
> ./bin/hadoop jar hbase.jar rowcounter
>
> It will emit usage.  You are a smart fellow. I think you can take it from
> there.
>
> Stack
>
>
>
>
> On Apr 22, 2009, at 5:48, Rakhi Khatwani <[email protected]> wrote:
>
>> Hi Lars,
>> Thanks for the suggestion. I also figured out my problem using
>> TableInputFormatBase.
>>
>> My table has only one region, but I still wanted to split the input into
>> 4 maps, so I am overriding the getSplits() method in TableInputFormatBase.
>>
>> One more question:
>> is there any method in the HBase API that can count the number of rows in a
>> table? I tried googling it, and all I came across is the RowCounter class,
>> which is a MapReduce job to count the number of rows, but I really don't
>> know how to use it. Any suggestions?
>>
>> thanks,
>> Raakhi
>>
>>
>> On Wed, Apr 22, 2009 at 4:30 AM, Lars George <[email protected]> wrote:
>>
>>> Hi Rakhi,
>>>
>>> This is all done in the TableInputFormatBase class, which you can extend
>>> and then override the getSplits() function:
>>>
>>>
>>>
>>> http://hadoop.apache.org/hbase/docs/r0.19.1/api/org/apache/hadoop/hbase/mapred/TableInputFormatBase.html
>>>
>>> This is where you can then specify how many rows per map are assigned.
>>> It is really straightforward as I see it. I have used it to implement
>>> special "only use N regions" support so that I can run an MR job against
>>> a sample subset, for example mapping only 5 out of 8K regions of a table.
>>>
>>> The default one will always split all regions into N maps. Hence the
>>> recommendation to set the number of maps to the number of regions in a
>>> table. If you set it to something lower, then it will split the regions
>>> into a smaller number of maps with more rows per map, i.e. each map gets
>>> more than one region to process.
>>>
>>> Look into the source of the above class and it should be obvious, I hope.
>>>
>>> Lars
>>>
>>>
>>>
>>> Rakhi Khatwani wrote:
>>>
>>>> Hi,
>>>> I have a table with N records, and I want to run a MapReduce job with
>>>> 4 maps and 0 reduces. Is there a way I can create my own custom input
>>>> split so that I can send 'n' records to each map? If there is, could I
>>>> have a sample code snippet to gain a better understanding?
>>>>
>>>> Thanks
>>>> Raakhi.
>>>>
>>>>
>>>>
>>>>
>>>
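
The grouping Lars describes (fewer maps than regions, so each map covers
more than one region) can be sketched roughly as follows. This is a
simplified, self-contained illustration and not the HBase source: plain
strings stand in for region start keys, and the SplitSketch, KeyRange, and
groupRegions names are hypothetical. In a real job the equivalent logic
would presumably live in an overridden getSplits() of a class extending
TableInputFormatBase.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    /**
     * Hypothetical stand-in for a table split: rows in [startRow, endRow).
     * An empty endRow means "up to the end of the table".
     */
    static final class KeyRange {
        final String startRow;
        final String endRow;

        KeyRange(String startRow, String endRow) {
            this.startRow = startRow;
            this.endRow = endRow;
        }

        @Override
        public String toString() {
            return "[" + startRow + ", " + endRow + ")";
        }
    }

    /**
     * Groups consecutive regions into at most numSplits contiguous key ranges.
     * regionStartKeys must be sorted; the first region starts at "".
     */
    static List<KeyRange> groupRegions(String[] regionStartKeys, int numSplits) {
        List<KeyRange> splits = new ArrayList<>();
        if (regionStartKeys.length == 0 || numSplits < 1) {
            return splits;
        }
        // There cannot be more splits than regions in this scheme.
        int realNumSplits = Math.min(numSplits, regionStartKeys.length);
        int base = regionStartKeys.length / realNumSplits;
        int remainder = regionStartKeys.length % realNumSplits;

        int startPos = 0;
        for (int i = 0; i < realNumSplits; i++) {
            // The first 'remainder' splits take one extra region each.
            int lastPos = startPos + base + (i < remainder ? 1 : 0);
            // The last split is open-ended; the others end where the next begins.
            String endRow = (i + 1 < realNumSplits) ? regionStartKeys[lastPos] : "";
            splits.add(new KeyRange(regionStartKeys[startPos], endRow));
            startPos = lastPos;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Six made-up region start keys, grouped into four map inputs.
        String[] startKeys = {"", "d", "h", "m", "r", "w"};
        for (KeyRange range : groupRegions(startKeys, 4)) {
            System.out.println(range);
        }
    }
}
```

With six regions and four maps, the first two splits each cover two
regions and the last two cover one region each. With a single region this
kind of grouping can only ever produce one split, which is why the
one-region table in this thread needs a custom getSplits() that supplies
its own row-key boundaries.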
