Hello Harsh,

In the meantime I figured out what the problem was (it was my fault, I was 
mixing the two APIs). However, I read somewhere that using the old-API 
NLineInputFormat in 0.20.2 can cause problems, so I took NLineInputFormat.java 
from the 2.0 branch and simply added it to my project, and everything went 
fine.

However, I notice that although as many map tasks are generated as there are 
lines in my input file, the whole job still gets executed on a single node (a 
single slave) - at least only one job shows up on my jobtracker, running on 
one of my slaves. What I want is distribution, so that for the very same 
(single) input file all of my running slaves get involved and process the 
lines of this input file separately. I don't even have a reduce phase at the 
moment; I only want to do the processing on the input, through the mapper. Is 
the scenario I described achievable? How should I proceed?

Thank you,
Lehel.

--- On Sun, 5/20/12, Harsh J <ha...@cloudera.com> wrote:

From: Harsh J <ha...@cloudera.com>
Subject: Re: Set number of mappers by the number of input lines for a single 
file?
To: common-user@hadoop.apache.org
Date: Sunday, May 20, 2012, 1:54 PM

Biro,

0.20.2 did carry NLineInputFormat, but in the older/stable API package (it
was marked deprecated there, though it was un-deprecated subsequently). See
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
which confirms that 0.20.2 carried it. For 0.20.2, I recommend sticking to
the mapred.* API package.
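
With the old API, a driver configured for one line per map task would look
roughly like the sketch below (the mapper is only illustrative; the
lines-per-split setting for the mapred.* NLineInputFormat is the
mapred.line.input.format.linespermap property, and setting the reducer count
to zero makes it a map-only job):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class OneLinePerMapDriver {

  // Illustrative mapper: with NLineInputFormat and linespermap=1, each map
  // task receives exactly one line of the input file.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // ... process the line here ...
      output.collect(new Text("processed"), line);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OneLinePerMapDriver.class);
    conf.setJobName("one-line-per-map");

    // One input line per split, so one map task per line.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    // Map-only job: mapper output is written directly.
    conf.setNumReduceTasks(0);

    conf.setMapperClass(LineMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}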

For the new API (mapreduce.* package) version, you can also grab the source
from here and include it in your project along with its license (following
whatever the license requires in doing so):
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-1/src/mapred/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.java

Hope this helps.

On Sun, May 20, 2012 at 4:03 PM, biro lehel <lehel.b...@yahoo.com> wrote:
> Hello Harsh,
>
> Thanks for your answer. The problem is that I'm using version 0.20.2, and, 
> as I checked, NLineInputFormat is not implemented there (at least I couldn't 
> find it). Switching to another version would be kind of a big deal in my 
> infrastructure, since I'm using VMs deployed from images already 
> pre-configured with 0.20.2, so it is not an option at the moment. What 
> should I do?
>
> Thanks,
> Lehel.
>
> --- On Sun, 5/20/12, Harsh J <ha...@cloudera.com> wrote:
>
> From: Harsh J <ha...@cloudera.com>
> Subject: Re: Set number of mappers by the number of input lines for a single 
> file?
> To: common-user@hadoop.apache.org
> Date: Sunday, May 20, 2012, 12:52 PM
>
> Lehel,
>
> You may use the NLineInputFormat with N=1:
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> On Sun, May 20, 2012 at 2:48 PM, biro lehel <lehel.b...@yahoo.com> wrote:
>> Dear all,
>>
>> I have a single input file which contains, on every line, some 
>> hydrological calibration models (data). Each line of the file should be 
>> processed, and the output from every line then written to another single 
>> output file.
>>
>> I understood that Hadoop spawns as many mapper tasks as there are input 
>> files (meaning, in my case, a single mapper would be generated). However, I 
>> want each mapper to deal with only a single line from my input file (nr. of 
>> mapper tasks = number of lines in my file).
>>
>> What is the best way to obtain such behavior? How should I specify this to 
>> Hadoop?
>>
>> Any suggestions are more than welcome.
>>
>> Thank you,
>> Lehel.
>
>
>
> --
> Harsh J



-- 
Harsh J
