Depending on how you store your raw log data, it might or might not be
suitable for repeated reading.  In my case, I have a situation very much
like yours, but my log files are encrypted and compressed using a stream
compression.  That means that I can't split those files, which is a real
pity because any processing job on less than a few days of data takes the
same amount of time.  I would LOVE it if processing an hour of data took a
LOT less time than processing 2 days of data.

As a result, I am looking at converting all those logs once they land in
HDFS.  What I would particularly like is a good compressed format that
handles lots of missing data well (tab-delimited does this well because of
the heavy compression of repeated tabs), but I want to be able to split
input files.  TextInputFormat, unfortunately, has this test in it:

  protected boolean isSplitable(FileSystem fs, Path file) {
    // splittable only when no compression codec matches the file name
    return compressionCodecs.getCodec(file) == null;
  }

This seems to indicate that textual files cannot be both compressed and
split.

On the other hand, SequenceFiles are splittable, but it isn't clear how well
they will handle missing or empty fields.  That is my next experiment,
however.
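For the record, here is a rough sketch of the kind of conversion I have in
mind, using the stock SequenceFile API.  BLOCK compression compresses runs
of records together, so it should squeeze the repeated tabs well while the
file stays splittable at the sync points between blocks.  The class and
argument layout are just my own sketch, not anything in Hadoop:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Sketch: copy a text log into a block-compressed SequenceFile,
  // keyed by byte offset, one log line per value.
  public class LogToSequenceFile {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path in = new Path(args[0]), out = new Path(args[1]);

      // CompressionType.BLOCK => compressed *and* splittable
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, LongWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK);

      BufferedReader reader =
          new BufferedReader(new InputStreamReader(fs.open(in)));
      long offset = 0;
      String line;
      while ((line = reader.readLine()) != null) {
        writer.append(new LongWritable(offset), new Text(line));
        offset += line.length() + 1;   // +1 for the newline
      }
      reader.close();
      writer.close();
    }
  }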

On 9/30/07 3:33 PM, "Stu Hood" <[EMAIL PROTECTED]> wrote:

> Hello,
> 
> I need to write a mapreduce program that begins with 2 jobs:
>  1. Convert raw log data to SequenceFiles
>  2. Read from SequenceFiles, and cherry pick completed events
>   (otherwise, keep them as SequenceFiles to be checked again later)
> But I should be able to compact those 2 jobs into 1 job.
> 
> I just need to figure out how to write an InputFormat that uses 2 types of
> RecordReaders, depending on the input file type. Specifically, the inputs
> would be either raw log data (TextInputFormat), or partially processed log
> data (SequenceFileInputFormat).
> 
> I think I need to extend SequenceFileInputFormat to look for an identifying
> extension on the files. Then I would be able to return either a
> LineRecordReader or a SequenceFileRecordReader, and some logic in Map could
> process the line into a record.
> 
> Am I headed in the right direction? Or should I stick with running 2 jobs
> instead of trying to squash these steps into 1?
> 
> Thanks,
> 
> Stu Hood
> 
> Webmail.us
> 
> "You manage your business. We'll manage your email."®
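For what it's worth, the extension-based dispatch Stu describes might look
roughly like the sketch below, against the old org.apache.hadoop.mapred
API.  The ".seq" naming convention is an assumption on my part, constructor
signatures vary between Hadoop versions, and the key/value types of the
SequenceFiles are assumed to be LongWritable/Text so both readers line up:

  import java.io.IOException;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.LineRecordReader;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.mapred.SequenceFileRecordReader;

  // Sketch: one InputFormat that hands back either a line reader (raw
  // logs) or a SequenceFile reader (partially processed logs), chosen
  // by file extension.
  public class MixedInputFormat extends FileInputFormat<LongWritable, Text> {

    // ".seq" marks partially processed data -- a hypothetical convention
    static boolean isSequenceFile(Path p) {
      return p.getName().endsWith(".seq");
    }

    public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter)
        throws IOException {
      FileSplit fileSplit = (FileSplit) split;
      if (isSequenceFile(fileSplit.getPath())) {
        return new SequenceFileRecordReader<LongWritable, Text>(
            job, fileSplit);
      }
      return new LineRecordReader(job, fileSplit);
    }
  }

The map function would then have to look at the value it gets and decide
whether it still needs to parse a raw text line or already has a record.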
