Ah yeah, I forgot to mention that part. Our raw logs are obviously compressed as
well, but I'm pushing them frequently enough that there are always enough
splits/maps to saturate the nodes.

I heard some discussion a little while back about development of a bz2/zip
codec, which would be splittable (see
http://www.nabble.com/Compression-using-Hadoop...-tf4354954.html#a12432166 ).
But I don't know how much good that would do me... I really need to be able to
compress in an 'online' fashion, which seems more difficult to achieve with
bzip2.
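
For what it's worth, here's a rough, untested sketch of what I mean by
'online' (the class name, path, and event source are all invented): with
gzip I can wrap the HDFS output stream in a codec and compress each event
as it arrives, then close and push the file on whatever schedule I like.

  import java.io.OutputStreamWriter;
  import java.io.Writer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.util.ReflectionUtils;

  public class OnlineLogWriter {                      // hypothetical class
    public static void push(Iterable<String> events, Path path)
        throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      CompressionCodec codec =
          (CompressionCodec) ReflectionUtils.newInstance(GzipCodec.class, conf);
      // Each event is compressed as it is written, so the file can be
      // closed and pushed as often as needed to keep the maps saturated.
      Writer out =
          new OutputStreamWriter(codec.createOutputStream(fs.create(path)));
      for (String event : events) {
        out.write(event);
        out.write('\n');
      }
      out.close();
    }
  }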

So repeatedly reading the raw logs is out, both because they are compressed and
because only a very small number of events aren't emitted on the first
go-round.

Any ideas?

Thanks,
Stu


-----Original Message-----
From: Ted Dunning <[EMAIL PROTECTED]>
Sent: Sunday, September 30, 2007 7:19pm
To: [email protected]
Subject: Re: InputFormat for Two Types



Depending on how you store your raw log data, it might or might not be
suitable for repeated reading.  In my case, I have a situation very much
like yours, but my log files are encrypted and compressed using stream
compression.  That means that I can't split those files, which is a real
pity because any processing job on less than a few days of data takes about
the same amount of time.  I would LOVE it if processing an hour of data took
a LOT less time than processing 2 days of data.

As a result, I am looking at converting all those logs once they are in
HDFS.  What I would particularly like is a good compressed format that
handles lots of missing data well (tab-delimited does this well because of
the heavy compression of repeated tabs), but I want to be able to split
input files.  TextInputFormat, unfortunately, has this test in it:

  
  // A file is splittable only when no compression codec matches its
  // extension:
  protected boolean isSplitable(FileSystem fs, Path file) {
    return compressionCodecs.getCodec(file) == null;
  }

This seems to indicate that textual files can be compressed or split, but not
both.
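
To make the conversion I mentioned concrete, here is a minimal sketch of
the kind of job I have in mind (untested; the paths and driver class name
are made up): a map-only pass that reads the unsplittable compressed text
and rewrites it as block-compressed SequenceFiles, which later jobs can
split.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.IdentityMapper;

  public class LogConverter {                      // hypothetical driver class
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(LogConverter.class);
      conf.setJobName("log-conversion");
      conf.setInputFormat(TextInputFormat.class);  // codec applied on read
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      conf.setOutputKeyClass(LongWritable.class);
      conf.setOutputValueClass(Text.class);
      conf.setMapperClass(IdentityMapper.class);   // pass records through unchanged
      conf.setNumReduceTasks(0);                   // map-only: no sort/shuffle
      conf.setBoolean("mapred.output.compress", true);
      SequenceFile.setCompressionType(conf,
          SequenceFile.CompressionType.BLOCK);
      conf.setInputPath(new Path("/logs/raw"));    // hypothetical paths
      conf.setOutputPath(new Path("/logs/converted"));
      JobClient.runJob(conf);
    }
  }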

On the other hand, SequenceFiles are splittable, but it isn't clear how well
they will handle missing or empty fields.  That is my next experiment,
however.
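
Something like this is what I have in mind for that experiment (again a
rough, untested sketch with made-up paths and field layout): write records
with empty tab-delimited fields into a block-compressed SequenceFile and
check that they round-trip.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class SparseFieldTest {                   // hypothetical test class
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("/tmp/sparse-test.seq"); // hypothetical path

      // Block compression batches many records together, so runs of
      // empty fields should compress well.
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, path, LongWritable.class, Text.class,
          SequenceFile.CompressionType.BLOCK);
      writer.append(new LongWritable(1), new Text("a\tb\tc"));
      writer.append(new LongWritable(2), new Text("a\t\t")); // empty fields
      writer.close();

      // Read everything back and verify the empty fields survive.
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      LongWritable key = new LongWritable();
      Text value = new Text();
      while (reader.next(key, value)) {
        System.out.println(key + "\t[" + value + "]");
      }
      reader.close();
    }
  }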

On 9/30/07 3:33 PM, "Stu Hood"  wrote:

> Hello,
> 
> I need to write a mapreduce program that begins with 2 jobs:
>  1. Convert raw log data to SequenceFiles
>  2. Read from SequenceFiles, and cherry-pick completed events
>   (otherwise, keep them as SequenceFiles to be checked again later)
> But I should be able to compact those 2 jobs into 1 job.
> 
> I just need to figure out how to write an InputFormat that uses 2 types of
> RecordReaders, depending on the input file type. Specifically, the inputs
> would be either raw log data (TextInputFormat), or partially processed log
> data (SequenceFileInputFormat).
> 
> I think I need to extend SequenceFileInputFormat to look for an identifying
> extension on the files. Then I would be able to return either a
> LineRecordReader or a SequenceFileRecordReader, and some logic in Map could
> process the line into a record.
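> 
> A rough, untested sketch of that idea (the '.seq' extension and class
> name are invented):
> 
>   import java.io.IOException;
>   import org.apache.hadoop.mapred.*;
> 
>   public class DualInputFormat extends SequenceFileInputFormat {
>     public RecordReader getRecordReader(InputSplit split, JobConf job,
>                                         Reporter reporter) throws IOException {
>       FileSplit fileSplit = (FileSplit) split;
>       // Partially processed files carry a '.seq' extension; everything
>       // else is treated as raw text log data.
>       if (fileSplit.getPath().getName().endsWith(".seq")) {
>         return super.getRecordReader(split, job, reporter);
>       }
>       return new LineRecordReader(job, fileSplit);
>     }
>   }
> 
> (If the raw files were compressed, isSplitable() would need attention too.)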
> 
> Am I headed in the right direction? Or should I stick with running 2 jobs
> instead of trying to squash these steps into 1?
> 
> Thanks,
> 
> Stu Hood
> 
> Webmail.us
> 
> "You manage your business. We'll manage your email."®
