Harsh, thanks. I went into the code for FileInputFormat.addInputPath(Job, Path) and it is as you stated. That makes sense now. I simply commented out FileInputFormat.addInputPath(job, input) and FileOutputFormat.setOutputPath(job, output) and everything automagically works now.

Thanks a bunch!
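For reference, a minimal sketch of what the run() method looks like after that change. It is just the MyJob class from the quoted message below with the two calls commented out; the rest of the class is unchanged, and the paths are picked up solely from the -D options.

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        Job job = new Job(conf, "dummy job");
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Both paths now come straight from -Dmapred.input.dir and
        // -Dmapred.output.dir, which the Tool/GenericOptionsParser has
        // already placed into conf, so nothing is added a second time.
        // FileInputFormat.addInputPath(job, input);
        // FileOutputFormat.setOutputPath(job, output);

        job.setJarByClass(MyJob.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }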
On Tue, Mar 6, 2012 at 2:06 AM, Harsh J <ha...@cloudera.com> wrote:
> It's your use of the mapred.input.dir property, which is a reserved
> name in the framework (it's what FileInputFormat uses).
>
> You have a config you extract a path from:
>
> Path input = new Path(conf.get("mapred.input.dir"));
>
> Then you do:
>
> FileInputFormat.addInputPath(job, input);
>
> Which internally simply appends a path to a config prop called
> "mapred.input.dir". Hence your job gets launched with two input files
> (the very same) - one added by the default Tool-provided configuration
> (because of your -Dmapred.input.dir) and the other added by you.
>
> Fix the input path line to use a different config:
>
> Path input = new Path(conf.get("input.path"));
>
> And run the job as:
>
> hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt -Dmapred.output.dir=result
>
> On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne <jane.wayne2...@gmail.com> wrote:
> > I have code that reads in a text file. I notice that each line in the
> > text file is somehow being read twice. Why is this happening?
> >
> > My mapper class looks like the following:
> >
> > public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
> >
> >     private static final Log _log = LogFactory.getLog(MyMapper.class);
> >
> >     @Override
> >     public void map(LongWritable key, Text value, Context context)
> >             throws IOException, InterruptedException {
> >         String s = (new StringBuilder()).append(value.toString()).append("m").toString();
> >         context.write(key, new Text(s));
> >         _log.debug(key.toString() + " => " + s);
> >     }
> > }
> >
> > My reducer class looks like the following:
> >
> > public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
> >
> >     private static final Log _log = LogFactory.getLog(MyReducer.class);
> >
> >     @Override
> >     public void reduce(LongWritable key, Iterable<Text> values, Context context)
> >             throws IOException, InterruptedException {
> >         for (Iterator<Text> it = values.iterator(); it.hasNext();) {
> >             Text txt = it.next();
> >             String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
> >             context.write(key, new Text(s));
> >             _log.debug(key.toString() + " => " + s);
> >         }
> >     }
> > }
> >
> > My job class looks like the following:
> >
> > public class MyJob extends Configured implements Tool {
> >
> >     public static void main(String[] args) throws Exception {
> >         ToolRunner.run(new Configuration(), new MyJob(), args);
> >     }
> >
> >     @Override
> >     public int run(String[] args) throws Exception {
> >         Configuration conf = getConf();
> >         Path input = new Path(conf.get("mapred.input.dir"));
> >         Path output = new Path(conf.get("mapred.output.dir"));
> >
> >         Job job = new Job(conf, "dummy job");
> >         job.setMapOutputKeyClass(LongWritable.class);
> >         job.setMapOutputValueClass(Text.class);
> >         job.setOutputKeyClass(LongWritable.class);
> >         job.setOutputValueClass(Text.class);
> >
> >         job.setMapperClass(MyMapper.class);
> >         job.setReducerClass(MyReducer.class);
> >
> >         FileInputFormat.addInputPath(job, input);
> >         FileOutputFormat.setOutputPath(job, output);
> >
> >         job.setJarByClass(MyJob.class);
> >
> >         return job.waitForCompletion(true) ? 0 : 1;
> >     }
> > }
> >
> > The text file that I am trying to read in looks like the following. As you
> > can see, there are 9 lines.
> >
> > T, T
> > T, T
> > T, T
> > F, F
> > F, F
> > F, F
> > F, F
> > T, F
> > F, T
> >
> > The output file that I get after my Job runs looks like the following. As
> > you can see, there are 18 lines. Each key is emitted twice from the mapper
> > to the reducer.
> >
> > 0    T, Tmr
> > 0    T, Tmr
> > 6    T, Tmr
> > 6    T, Tmr
> > 12   T, Tmr
> > 12   T, Tmr
> > 18   F, Fmr
> > 18   F, Fmr
> > 24   F, Fmr
> > 24   F, Fmr
> > 30   F, Fmr
> > 30   F, Fmr
> > 36   F, Fmr
> > 36   F, Fmr
> > 42   T, Fmr
> > 42   T, Fmr
> > 48   F, Tmr
> > 48   F, Tmr
> >
> > The way I execute my Job is as follows (cygwin + hadoop 0.20.2):
> >
> > hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt -Dmapred.output.dir=result
> >
> > Originally, this happened when I read in a sequence file, but even for a
> > text file this problem is still happening. Is it the way I have set up my
> > Job?
> >
> --
> Harsh J
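A minimal sketch (not code from the thread above) of how the duplicated input path Harsh describes can be observed inside run() before the job is submitted; it assumes the original -Dmapred.input.dir=data/dummy.txt invocation, so getConf() already carries that property.

    Configuration conf = getConf();      // already holds mapred.input.dir=data/dummy.txt from the -D option
    Job job = new Job(conf, "dummy job");

    // Set once by the command line.
    System.out.println(job.getConfiguration().get("mapred.input.dir"));
    // prints something like: data/dummy.txt

    // addInputPath() appends to the same property instead of replacing it.
    FileInputFormat.addInputPath(job, new Path(conf.get("mapred.input.dir")));

    System.out.println(job.getConfiguration().get("mapred.input.dir"));
    // now lists the same file twice (one copy may appear as a fully
    // qualified URI), so the job is fed two identical inputs and every
    // line is mapped twice.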