Anil, Thanks for your suggestion. The NLineInputFormat code actually helped.
In case anybody has the same problem, here's a custom OneLineInputFormat (it splits the file so that each split contains exactly one line) you can use:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
    import org.apache.hadoop.util.LineReader;

    public class OneLineInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // from here - https://issues.apache.org/jira/secure/attachment/12413533/patch-375.txt
            context.setStatus(split.toString());
            return new LineRecordReader();
        }

        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (FileStatus status : listStatus(job)) {
                Path fileName = status.getPath();
                if (status.isDir()) {
                    throw new IOException("Not a file: " + fileName);
                }
                FileSystem fs = fileName.getFileSystem(job.getConfiguration());
                LineReader lr = null;
                try {
                    FSDataInputStream in = fs.open(fileName);
                    lr = new LineReader(in, job.getConfiguration());
                    Text line = new Text();
                    long begin = 0;   // byte offset at which the current line starts
                    long length = 0;  // bytes consumed for the current line
                    int num = -1;
                    // readLine() returns the number of bytes consumed, 0 at end of file.
                    while ((num = lr.readLine(line)) > 0) {
                        length += num;
                        // One split per line. Offsets follow the NLineInputFormat patch:
                        // every split after the first starts one byte early, and every
                        // split stops one byte short of where the next line begins.
                        if (begin == 0) {
                            splits.add(new FileSplit(fileName, begin, length - 1, new String[] {}));
                        } else {
                            splits.add(new FileSplit(fileName, begin - 1, length, new String[] {}));
                        }
                        begin += length;
                        length = 0;
                    }
                } finally {
                    if (lr != null) {
                        lr.close();
                    }
                }
            }
            return splits;
        }
    }

On Thu, Mar 15, 2012 at 9:38 PM, anil gupta <anilg...@buffalo.edu> wrote:

> Have a look at the NLineInputFormat class in Hadoop. It is built to split
> the input on the basis of the number of lines.
>
> On Thu, Mar 15, 2012 at 6:13 PM, Deepak Nettem <deepaknet...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I have this use case - I need to spawn as many mappers as the number of
> > lines in a file in HDFS. This file isn't big (only 10-50 lines). Actually
> > each line represents the path of another data source that the Mappers
> > will work on. So each mapper will read 1 line (the map() method will
> > need to be called only once), and work on the data source.
> >
> > What's the best way to construct InputSplit, InputFormat and RecordReader
> > to achieve this? I would appreciate any example code :)
> >
> > Best,
> > Deepak
>
> --
> Thanks & Regards,
> Anil Gupta
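P.S. In case the begin - 1 / length - 1 arithmetic in getSplits() looks odd, here is a rough plain-Java sketch (no Hadoop needed) that repeats the same bookkeeping on an in-memory byte array and prints the resulting ranges. The class and method names (OneLineSplitOffsets, splitOffsets) are made up for illustration, and it assumes '\n' line terminators. My understanding, which I have not verified against every Hadoop version, is that LineRecordReader discards everything up to the first newline when a split does not start at offset 0, which is why each later split starts on the previous line's terminator.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the offset arithmetic used in getSplits().
// Hypothetical helper for illustration only; not part of Hadoop.
class OneLineSplitOffsets {

    // Returns one {start, length} pair per line, using the same begin/length
    // bookkeeping as the getSplits() loop. Assumes '\n' line terminators.
    static List<long[]> splitOffsets(byte[] data) {
        List<long[]> splits = new ArrayList<>();
        long begin = 0; // byte offset at which the current line starts
        int pos = 0;    // read position in the array
        while (pos < data.length) {
            // Bytes consumed for this line, including its '\n' if present.
            // This stands in for the value returned by lr.readLine(line).
            int num = 0;
            while (pos + num < data.length) {
                num++;
                if (data[pos + num - 1] == '\n') {
                    break;
                }
            }
            long length = num;
            if (begin == 0) {
                // First line: start at 0, stop just before its terminator.
                splits.add(new long[] {begin, length - 1});
            } else {
                // Later lines: start on the previous line's terminator.
                splits.add(new long[] {begin - 1, length});
            }
            begin += length;
            pos += num;
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        for (long[] s : splitOffsets(data)) {
            System.out.println("start=" + s[0] + ", length=" + s[1]);
        }
        // prints:
        // start=0, length=1
        // start=1, length=3
        // start=4, length=4
    }
}
```

For that three-line input the byte ranges are [0,1), [1,4) and [4,8): each range after the first begins on the newline that ends the previous line and stops just before its own newline.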