Harsh, thanks. I went into the code for FileInputFormat.addInputPath(Job, Path) and it is as you stated. That makes sense now. I simply commented out FileInputFormat.addInputPath(job, input) and FileOutputFormat.setOutputPath(job, output) and everything automagically works now.

Thanks a bunch!
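For reference, a minimal sketch of what the run() method looks like after that change. It is just the MyJob class from the quoted message below with the two calls commented out; the rest of the class is unchanged, and the paths are picked up solely from the -D options.

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        Job job = new Job(conf, "dummy job");
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Both paths now come straight from -Dmapred.input.dir and
        // -Dmapred.output.dir, which the Tool/GenericOptionsParser has
        // already placed into conf, so nothing is added a second time.
        // FileInputFormat.addInputPath(job, input);
        // FileOutputFormat.setOutputPath(job, output);

        job.setJarByClass(MyJob.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }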
On Tue, Mar 6, 2012 at 2:06 AM, Harsh J <ha...@cloudera.com> wrote:
> It's your use of the mapred.input.dir property, which is a reserved
> name in the framework (it's what FileInputFormat uses).
>
> You have a config you extract a path from:
>
> Path input = new Path(conf.get("mapred.input.dir"));
>
> Then you do:
>
> FileInputFormat.addInputPath(job, input);
>
> Which internally simply appends a path to a config prop called
> "mapred.input.dir". Hence your job gets launched with two input files
> (the very same) - one added by the default Tool-provided configuration
> (because of your -Dmapred.input.dir) and the other added by you.
>
> Fix the input path line to use a different config:
>
> Path input = new Path(conf.get("input.path"));
>
> And run the job as:
>
> hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt -Dmapred.output.dir=result
>
> On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne <jane.wayne2...@gmail.com> wrote:
> > I have code that reads in a text file. I notice that each line in the
> > text file is somehow being read twice. Why is this happening?
> >
> > My mapper class looks like the following:
> >
> > public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
> >
> >     private static final Log _log = LogFactory.getLog(MyMapper.class);
> >
> >     @Override
> >     public void map(LongWritable key, Text value, Context context)
> >             throws IOException, InterruptedException {
> >         String s = (new StringBuilder()).append(value.toString()).append("m").toString();
> >         context.write(key, new Text(s));
> >         _log.debug(key.toString() + " => " + s);
> >     }
> > }
> >
> > My reducer class looks like the following:
> >
> > public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
> >
> >     private static final Log _log = LogFactory.getLog(MyReducer.class);
> >
> >     @Override
> >     public void reduce(LongWritable key, Iterable<Text> values, Context context)
> >             throws IOException, InterruptedException {
> >         for (Iterator<Text> it = values.iterator(); it.hasNext();) {
> >             Text txt = it.next();
> >             String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
> >             context.write(key, new Text(s));
> >             _log.debug(key.toString() + " => " + s);
> >         }
> >     }
> > }
> >
> > My job class looks like the following:
> >
> > public class MyJob extends Configured implements Tool {
> >
> >     public static void main(String[] args) throws Exception {
> >         ToolRunner.run(new Configuration(), new MyJob(), args);
> >     }
> >
> >     @Override
> >     public int run(String[] args) throws Exception {
> >         Configuration conf = getConf();
> >         Path input = new Path(conf.get("mapred.input.dir"));
> >         Path output = new Path(conf.get("mapred.output.dir"));
> >
> >         Job job = new Job(conf, "dummy job");
> >         job.setMapOutputKeyClass(LongWritable.class);
> >         job.setMapOutputValueClass(Text.class);
> >         job.setOutputKeyClass(LongWritable.class);
> >         job.setOutputValueClass(Text.class);
> >
> >         job.setMapperClass(MyMapper.class);
> >         job.setReducerClass(MyReducer.class);
> >
> >         FileInputFormat.addInputPath(job, input);
> >         FileOutputFormat.setOutputPath(job, output);
> >
> >         job.setJarByClass(MyJob.class);
> >
> >         return job.waitForCompletion(true) ? 0 : 1;
> >     }
> > }
> >
> > The text file that I am trying to read in looks like the following. As you
> > can see, there are 9 lines.
> >
> > T, T
> > T, T
> > T, T
> > F, F
> > F, F
> > F, F
> > F, F
> > T, F
> > F, T
> >
> > The output file that I get after my Job runs looks like the following. As
> > you can see, there are 18 lines. Each key is emitted twice from the mapper
> > to the reducer.
> >
> > 0    T, Tmr
> > 0    T, Tmr
> > 6    T, Tmr
> > 6    T, Tmr
> > 12   T, Tmr
> > 12   T, Tmr
> > 18   F, Fmr
> > 18   F, Fmr
> > 24   F, Fmr
> > 24   F, Fmr
> > 30   F, Fmr
> > 30   F, Fmr
> > 36   F, Fmr
> > 36   F, Fmr
> > 42   T, Fmr
> > 42   T, Fmr
> > 48   F, Tmr
> > 48   F, Tmr
> >
> > The way I execute my Job is as follows (cygwin + hadoop 0.20.2):
> >
> > hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt -Dmapred.output.dir=result
> >
> > Originally, this happened when I read in a sequence file, but even for a
> > text file this problem is still happening. Is it the way I have set up my
> > Job?
> >
> --
> Harsh J
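A minimal sketch (not code from the thread above) of how the duplicated input path Harsh describes can be observed inside run() before the job is submitted; it assumes the original -Dmapred.input.dir=data/dummy.txt invocation, so getConf() already carries that property.

    Configuration conf = getConf();      // already holds mapred.input.dir=data/dummy.txt from the -D option
    Job job = new Job(conf, "dummy job");

    // Set once by the command line.
    System.out.println(job.getConfiguration().get("mapred.input.dir"));
    // prints something like: data/dummy.txt

    // addInputPath() appends to the same property instead of replacing it.
    FileInputFormat.addInputPath(job, new Path(conf.get("mapred.input.dir")));

    System.out.println(job.getConfiguration().get("mapred.input.dir"));
    // now lists the same file twice (one copy may appear as a fully
    // qualified URI), so the job is fed two identical inputs and every
    // line is mapped twice.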