Vijay Murthi wrote:
> I have just started looking at the Hadoop source code. How can I use
> each file as a distinct split? My data is already evenly distributed
> across these compressed files.
Implement your own InputFormat.
http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html
In particular, your getSplits implementation should return a single
split per input file, ignoring the numSplits parameter.
You can probably subclass InputFormatBase, and have your getSplits
method simply call listPaths() and then construct and return a single
split per path returned.
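A rough sketch (the class name GzipFileInputFormat is just for
illustration, and you should check the exact signatures of listPaths()
and FileSystem.getLength() against your Hadoop version):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.*;

public class GzipFileInputFormat extends InputFormatBase {

  public FileSplit[] getSplits(FileSystem fs, JobConf job, int numSplits)
    throws IOException {
    Path[] files = listPaths(fs, job);      // all input files
    FileSplit[] splits = new FileSplit[files.length];
    for (int i = 0; i < files.length; i++) {
      // One split spanning the whole file; numSplits is deliberately
      // ignored, since a gzip stream cannot be read from the middle.
      splits[i] = new FileSplit(files[i], 0, fs.getLength(files[i]));
    }
    return splits;
  }

  // plus the getRecordReader method shown below
}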
Your RecordReader implementation might then look something like:
public RecordReader getRecordReader(FileSystem fs, FileSplit split,
                                    JobConf job, Reporter reporter)
  throws IOException {

  // Decompress the whole file; since gzip cannot be split, the reader
  // always starts at the beginning.
  final BufferedReader in =
    new BufferedReader(new InputStreamReader(
      new GZIPInputStream(fs.open(split.getPath()))));

  return new RecordReader() {
      long position;                     // uncompressed chars read so far

      public synchronized boolean next(Writable key, Writable value)
        throws IOException {
        String line = in.readLine();
        if (line != null) {
          // Assumes the job's key class is LongWritable and its value
          // class is UTF8, as with the standard line-based input.
          ((LongWritable)key).set(position);
          ((UTF8)value).set(line);
          position += line.length() + 1; // +1 for the stripped newline
          return true;
        }
        return false;                    // end of file
      }

      public synchronized long getPos() throws IOException {
        return position;                 // approximate, used for progress
      }

      public synchronized void close() throws IOException {
        in.close();
      }
    };
}
Then include your InputFormat's class file in your job's jar file.
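And tell the job to use it, e.g. (using the illustrative class name
from above; the input path here is hypothetical):

JobConf job = new JobConf();
job.setInputFormat(GzipFileInputFormat.class);
job.setInputPath(new Path("/my/gzipped/input"));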
Doug