Hi,

An example that extracts one record per file would be:

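import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;

public class FooInputFormat extends MultiFileInputFormat {

  @Override
  public RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FooRecordReader(job, (MultiFileSplit)split);
  }
}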


// Not public, so it can live in the same source file as FooInputFormat.
class FooRecordReader implements RecordReader {

  private MultiFileSplit split;
  private long offset;
  private long totLength;
  private FileSystem fs;
  private int count = 0;
  private Path[] paths;

  public FooRecordReader(Configuration conf, MultiFileSplit split)
    throws IOException {
    this.split = split;
    fs = FileSystem.get(conf);
    this.paths = split.getPaths();
    this.totLength = split.getLength();
    this.offset = 0;
  }

  public WritableComparable createKey() {
    ..
  }

  public Writable createValue() {
    ..
  }

  public void close() throws IOException { }

  public long getPos() throws IOException {
    return offset;
  }

  public float getProgress() throws IOException {
    // Report completion for an empty split instead of dividing by zero.
    return totLength == 0 ? 1.0f : ((float)offset) / totLength;
  }

  public boolean next(Writable key, Writable value) throws IOException {
    if(offset >= totLength)
      return false;
    if(count >= split.numPaths())
      return false;

    Path file = paths[count];
    FSDataInputStream stream = fs.open(file);
    BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
    try {
      String line = reader.readLine();  // null if the file is empty
      if(line != null) {
        Scanner scanner = new Scanner(line);
        //read from file, fill in key and value
      }
    } finally {
      reader.close();  // also closes the underlying stream
    }
    offset += split.getLength(count);
    count++;
    return true;
  }
}
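
If you want to try the one-record-per-file loop without a cluster, here is a rough plain-JDK sketch of the same idea. It is not Hadoop code: OneRecordPerFileSketch, its next() and getProgress() methods, and the demo() helper are all hypothetical names made up for illustration, and the "record" is assumed to be just the first line of each file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for FooRecordReader; uses only the JDK, no Hadoop classes.
class OneRecordPerFileSketch {
  private final List<Path> paths;
  private final long[] lengths;  // per-file sizes, playing the role of split.getLength(i)
  private long totLength = 0;    // total size, playing the role of split.getLength()
  private long offset = 0;       // bytes accounted for so far
  private int count = 0;         // index of the next file to read

  OneRecordPerFileSketch(List<Path> paths) {
    this.paths = paths;
    this.lengths = new long[paths.size()];
    try {
      for (int i = 0; i < lengths.length; i++) {
        lengths[i] = Files.size(paths.get(i));
        totLength += lengths[i];
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns "fileName<TAB>firstLine" for the next file, or null once every file is consumed.
  String next() {
    if (count >= paths.size()) return null;
    Path file = paths.get(count);
    String line;
    try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
      line = reader.readLine();  // null for an empty file
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    offset += lengths[count];
    count++;
    return file.getFileName() + "\t" + (line == null ? "" : line);
  }

  float getProgress() {
    return totLength == 0 ? 1.0f : ((float) offset) / totLength;
  }

  // Writes two small temp files and returns one record per file, in order.
  static List<String> demo() {
    try {
      Path dir = Files.createTempDirectory("small-files");
      Path a = Files.write(dir.resolve("a.txt"),
          "hello world\nsecond line\n".getBytes(StandardCharsets.UTF_8));
      Path b = Files.write(dir.resolve("b.txt"),
          "foo bar\n".getBytes(StandardCharsets.UTF_8));
      OneRecordPerFileSketch reader = new OneRecordPerFileSketch(Arrays.asList(a, b));
      List<String> records = new ArrayList<>();
      for (String r = reader.next(); r != null; r = reader.next()) {
        records.add(r);
      }
      return records;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    for (String record : demo()) System.out.println(record);
  }
}
```

In a real job the input format itself would be registered on the JobConf with setInputFormat(FooInputFormat.class); the sketch above only mimics the reader loop.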


I guess I should add example code to the mapred tutorial and to the examples directory.

Jason Curtes wrote:
Hello,

I have been trying to run Hadoop on a set of small text files, not larger
than 10k each. The total input size is 15MB. If I try to run the example
word count application, it takes about 2000 seconds, more than half an hour
to complete. However, if I merge all the files into one large file, it takes
much less than a minute. I think using MultiFileInputFormat can be helpful
at this point. However, the API documentation is not really helpful. I
wonder if MultiFileInputFormat can really solve my problem, and if so, can
you suggest a reference on how to use it, or a few lines to be added to
the word count example to make things clearer?

Thanks in advance.

Regards,

Jason Curtes
