Actually, the following solved my problem, but I'm a little suspicious of the
side effects of doing this instead of writing my own InputSplit of 5 lines.
// NLineInputFormat: one map task per N lines of input (N defaults to 1)
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
// N = 5, i.e. 5 lines per mapper
conf.setInt("mapred.line.input.format.linespermap", 5);
If you have any thoughts on whether the solution above is worse than writing my
own InputSplit of about 5 lines, let me know.
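
To make the question concrete, here is a rough plain-Java sketch of how I
understand the setting to behave. It uses no Hadoop classes; splitByLines is a
made-up helper for illustration, not a Hadoop API. The assumption is that each
map task receives up to 5 lines, while the map function is still called once
per line inside each task.

```java
import java.util.ArrayList;
import java.util.List;

public class NLineSplitSketch {

    // Hypothetical helper (not part of Hadoop): groups the input lines into
    // consecutive chunks of at most linesPerMap lines, one chunk per map task.
    static List<List<String>> splitByLines(List<String> lines, int linesPerMap) {
        if (linesPerMap <= 0) {
            throw new IllegalArgumentException("linesPerMap must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerMap) {
            int end = Math.min(start + linesPerMap, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 16-line input file, as in the earlier message in this thread.
        List<String> lines = new ArrayList<>();
        for (int i = 1; i <= 16; i++) {
            lines.add("line " + i);
        }

        List<List<String>> splits = splitByLines(lines, 5);

        int mapCalls = 0;
        for (int task = 0; task < splits.size(); task++) {
            // Assumed behavior: one map() call per line within each task.
            mapCalls += splits.get(task).size();
            System.out.println("map task " + task + " gets "
                    + splits.get(task).size() + " lines");
        }
        System.out.println("map tasks: " + splits.size()
                + ", map() calls: " + mapCalls);
    }
}
```

With 16 lines and 5 lines per mapper this gives 4 map tasks (5, 5, 5 and 1
lines) and still 16 map() calls. If that picture is right, handling 5 lines in
a single call would still need a custom RecordReader or buffering inside the
mapper.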
Thanks everyone !
Maha
On Feb 20, 2011, at 11:47 AM, maha wrote:
> Hi again Jim and Ted,
>
> I understood that each mapper gets a block of lines... but even though I had
> only 2 mappers for a 16-line input file with TextInputFormat, the map
> function was invoked for each of those 16 lines!
>
> I wanted a block of lines per map ... hence something like map1 has 8 lines
> and map2 has 8 lines.
>
> So first question: is there a difference between Mappers and maps ?
>
> Second: does that mean I need to write my own InputFormat to make the
> InputSplit span multiple lines?
>
> Thank you,
>
> Maha
>
>
> On Feb 18, 2011, at 11:55 AM, Jim Falgout wrote:
>
>> That's right. The TextInputFormat handles situations where records cross
>> split boundaries. What your mapper will see is "whole" records.
>>
>> -----Original Message-----
>> From: maha [mailto:[email protected]]
>> Sent: Friday, February 18, 2011 1:14 PM
>> To: common-user
>> Subject: Quick question
>>
>> Hi all,
>>
>> I want to check if the following statement is right:
>>
>> If I use TextInputFormat to process a text file with 2000 lines (each ending
>> with \n) with 20 mappers, then each map will get a sequence of COMPLETE
>> LINES.
>>
>> In other words, the input is not split byte-wise but by lines.
>>
>> Is that right?
>>
>>
>> Thank you,
>> Maha
>>
>