Vijay Murthi wrote:
I have just started looking at the Hadoop source code. How can I make
each file a distinct split? My data is already evenly distributed across
these compressed files.

Implement your own InputFormat.

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html

In particular, your getSplits implementation should return a single split per input file, ignoring the numSplits parameter.

You can probably subclass InputFormatBase, and have your getSplits method simply call listPaths() and then construct and return a single split per path returned.
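Untested, but it might look something like the following. The class name
is just a placeholder, and the signatures (listPaths, the FileSplit
constructor, FileSystem.getLength) should be checked against the version
of Hadoop you're running:

  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.*;

  public class GzipFileInputFormat extends InputFormatBase {

    public FileSplit[] getSplits(FileSystem fs, JobConf job, int numSplits)
      throws IOException {
      Path[] files = listPaths(fs, job);       // all input paths for the job
      FileSplit[] splits = new FileSplit[files.length];
      for (int i = 0; i < files.length; i++) {
        // one split spanning the whole file; numSplits is ignored
        splits[i] = new FileSplit(files[i], 0, fs.getLength(files[i]));
      }
      return splits;
    }

    // getRecordReader (below) goes in this class too
  }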

Your RecordReader implementation might then look something like this
(assuming the job's input key class is LongWritable and its value class
is UTF8):

  public RecordReader getRecordReader(FileSystem fs, FileSplit split,
                                      JobConf job, Reporter reporter)
    throws IOException {

    // decompress the whole file as a character stream; gzip is not
    // splittable, but each split is an entire file here anyway
    final BufferedReader in =
      new BufferedReader(new InputStreamReader
        (new GZIPInputStream(fs.open(split.getPath()))));

    return new RecordReader() {
        long position;                    // uncompressed characters read

        public synchronized boolean next(Writable key, Writable value)
          throws IOException {
          String line = in.readLine();
          if (line != null) {
            ((LongWritable)key).set(position);  // key = offset of this line
            position += line.length() + 1;      // +1 for the stripped newline
            ((UTF8)value).set(line);            // value = the line itself
            return true;
          }
          return false;                         // end of file
        }

        public synchronized long getPos() throws IOException {
          return position;  // approximate: uncompressed, not on-disk, offset
        }

        public synchronized void close() throws IOException {
          in.close();
        }

      };
  }

Then include your InputFormat's class file in your job's jar file.
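Something like this in your job setup should wire it in (using the
placeholder class name above; setInputKeyClass/setInputValueClass are
from the old mapred JobConf, so verify them against your version):

  JobConf job = new JobConf(MyJob.class);
  job.setInputFormat(GzipFileInputFormat.class);
  job.setInputKeyClass(LongWritable.class);
  job.setInputValueClass(UTF8.class);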

Doug
