Re: Suggestion for InputSplit and InputFormat - Split every line.

Deepak Nettem Fri, 16 Mar 2012 14:32:55 -0700

Anil,

Thanks for your suggestion. The NLineInputFormat code actually helped.


In case anybody has the same problem, here's a custom OneLineInputFormat
(it splits the file so that each split contains exactly one line) that you
can use:

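import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.util.LineReader;

public class OneLineInputFormat extends FileInputFormat<LongWritable, Text> {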

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {

        // from here:
        // https://issues.apache.org/jira/secure/attachment/12413533/patch-375.txt
        context.setStatus(split.toString());
        return new LineRecordReader();
    }

    @Override
    public List<InputSplit> getSplits(JobContext job)
      throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
          Path fileName = status.getPath();
          if (status.isDir()) {
            throw new IOException("Not a file: " + fileName);
          }
          FileSystem  fs = fileName.getFileSystem(job.getConfiguration());
          LineReader lr = null;

          try {
            FSDataInputStream in  = fs.open(fileName);
            lr = new LineReader(in, job.getConfiguration());
            Text line = new Text();
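            long begin = 0;   // byte offset at which the current line starts
            long length = 0;  // bytes consumed for the current line
            int num = -1;
            // readLine() returns the number of bytes consumed, including the
            // line terminator; 0 means end of file.
            while ((num = lr.readLine(line)) > 0) {
                length += num;

                // Same adjustment as in the NLineInputFormat code:
                // LineRecordReader reads past the upper boundary of its
                // split, so each boundary is moved back by one byte to keep
                // exactly one line per split.
                if (begin == 0) {
                    splits.add(new FileSplit(fileName, begin, length - 1,
                    new String[] {}));
                } else {
                    splits.add(new FileSplit(fileName, begin - 1, length,
                    new String[] {}));
                }
                begin += length;
                length = 0;
            }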
            }
          } finally {
            if (lr != null) {
              lr.close();
            }
          }
        }
        return splits;
      }
}
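
For anyone who wants to check the offset arithmetic in getSplits() without a
Hadoop cluster, here is a minimal standalone sketch. It only reproduces the
(start, length) pairs the loop computes; it does not model how
LineRecordReader consumes them. The class OneLineSplitOffsets and the helper
bytesConsumed() are hypothetical names made up for this illustration, and the
helper is a simplified stand-in for LineReader.readLine() that recognises
only '\n' as a line terminator.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OneLineSplitOffsets {

    // Hypothetical stand-in for LineReader.readLine(): returns the number of
    // bytes consumed starting at pos, including the trailing '\n' if there
    // is one, or 0 at end of input. Only '\n' terminators are handled.
    public static int bytesConsumed(byte[] data, int pos) {
        int i = pos;
        while (i < data.length) {
            if (data[i++] == '\n') {
                break;
            }
        }
        return i - pos;
    }

    // Returns one {start, length} pair per line, using the same arithmetic
    // as the getSplits() loop in the message above.
    public static List<long[]> computeSplits(byte[] data) {
        List<long[]> splits = new ArrayList<>();
        long begin = 0;
        int pos = 0;
        int num;
        while ((num = bytesConsumed(data, pos)) > 0) {
            pos += num;
            long length = num;
            if (begin == 0) {
                // First line: shorten the split by one byte.
                splits.add(new long[] {begin, length - 1});
            } else {
                // Later lines: start one byte earlier, keep the length.
                splits.add(new long[] {begin - 1, length});
            }
            begin += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[] data = "a\nbb\nccc\n".getBytes(StandardCharsets.UTF_8);
        for (long[] s : computeSplits(data)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

For the three-line input in main(), this yields the pairs (0, 1), (1, 3) and
(4, 4): one split per line, with every boundary shifted back by one byte.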

On Thu, Mar 15, 2012 at 9:38 PM, anil gupta <anilg...@buffalo.edu> wrote:

> Have a look at the NLineInputFormat class in Hadoop. It is built to split
> the input on the basis of number of lines.
>
> On Thu, Mar 15, 2012 at 6:13 PM, Deepak Nettem <deepaknet...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I have this use case - I need to spawn as many mappers as the number of
> > lines in a file in HDFS. This file isn't big (only 10-50 lines). Actually
> > each line represents the path of another data source that the Mappers
> > will work on. So each mapper will read 1 line, (the map() method will
> > need to be called only once), and work on the data source.
> >
> > What's the best way to construct InputSplit, InputFormat and RecordReader
> > to achieve this? I would appreciate any example code :)
> >
> > Best,
> > Deepak
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
