Ah yeah, I forgot to mention that part. Our raw logs are obviously compressed as well, but I'm pushing them frequently enough that there are always enough splits/maps to saturate the nodes.
I heard some discussion a little while back about development of a bz2/zip codec, which would be splittable (see http://www.nabble.com/Compression-using-Hadoop...-tf4354954.html#a12432166 ). But I don't know how much good that would do me... I really need to be able to compress in an 'online' fashion, which seems more difficult to achieve with bzip2.

So repeatedly reading the raw logs is out, both because they are compressed and because only a very small number of events are left unemitted after the first pass.

Any ideas?

Thanks,
Stu

-----Original Message-----
From: Ted Dunning <[EMAIL PROTECTED]>
Sent: Sunday, September 30, 2007 7:19pm
To: [email protected]
Subject: Re: InputFormat for Two Types

Depending on how you store your raw log data, it might or might not be suitable for repeated reading. In my case, I have a situation very much like yours, but my log files are encrypted and compressed using stream compression. That means that I can't split those files, which is a real pity, because any processing job on less than a few days of data takes the same amount of time. I would LOVE it if processing an hour of data took a LOT less time than processing 2 days of data.

As a result, I am looking at converting all those logs as soon as they are in HDFS. What I would particularly like is a good compressed format that handles lots of missing data well (tab-delimited does this well because of the heavy compression of repeated tabs), but I also want to be able to split input files.

TextInputFormat, unfortunately, has this test in it:

    protected boolean isSplitable(FileSystem fs, Path file) {
      return compressionCodecs.getCodec(file) == null;
    }

This seems to indicate that textual files cannot be both compressed and split. On the other hand, SequenceFiles are splittable, but it isn't clear how well they will handle missing or empty fields. That is my next experiment, however.

On 9/30/07 3:33 PM, "Stu Hood" wrote:

> Hello,
>
> I need to write a mapreduce program that begins with 2 jobs:
> 1. Convert raw log data to SequenceFiles
> 2. Read from SequenceFiles, and cherry-pick completed events
> (otherwise, keep them as SequenceFiles to be checked again later)
> But I should be able to compact those 2 jobs into 1 job.
>
> I just need to figure out how to write an InputFormat that uses 2 types of
> RecordReaders, depending on the input file type. Specifically, the inputs
> would be either raw log data (TextInputFormat) or partially processed log
> data (SequenceFileInputFormat).
>
> I think I need to extend SequenceFileInputFormat to look for an identifying
> extension on the files. Then I would be able to return either a
> LineRecordReader or a SequenceFileRecordReader, and some logic in Map could
> process the line into a record.
>
> Am I headed in the right direction? Or should I stick with running 2 jobs
> instead of trying to squash these steps into 1?
>
> Thanks,
>
> Stu Hood
>
> Webmail.us
>
> "You manage your business. We'll manage your email."®
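
For reference, a minimal sketch of the extension-dispatching InputFormat Stu describes, against the old org.apache.hadoop.mapred API. The MixedLogInputFormat name and the ".seq" extension are made up for illustration, and it assumes LineRecordReader and SequenceFileRecordReader are available as public classes (true in later 0.x releases):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileRecordReader;

    // Raw types because the two readers emit different key/value classes;
    // the map function has to check what it was handed.
    public class MixedLogInputFormat extends FileInputFormat {

      // Only claim splittability for the SequenceFiles; the raw logs may
      // be compressed with a non-splittable stream codec.
      protected boolean isSplitable(FileSystem fs, Path file) {
        return file.getName().endsWith(".seq");
      }

      public RecordReader getRecordReader(InputSplit split, JobConf job,
                                          Reporter reporter) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        if (fileSplit.getPath().getName().endsWith(".seq")) {
          // Partially processed events, already stored as SequenceFiles.
          return new SequenceFileRecordReader(job, fileSplit);
        }
        // Raw log data: hand each line to the map function, which then
        // parses the line into a record.
        return new LineRecordReader(job, fileSplit);
      }
    }

Extending FileInputFormat rather than SequenceFileInputFormat keeps the isSplitable decision in one place; the map function then has to tolerate both the (LongWritable, Text) pairs from the line reader and whatever key/value classes the SequenceFiles carry.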
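
On Ted's wish for a compressed format that still splits: block-compressed SequenceFiles get this by compressing batches of records and writing sync markers between blocks, so splits can start at any sync point. A minimal converter sketch; the output path, the DefaultCodec choice, and the sample record are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class CompressedLogWriter {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);  // illustrative: destination in HDFS

        // BLOCK compression compresses whole batches of records, so
        // repeated tabs and empty fields compress well, and the sync
        // markers between blocks keep the file splittable.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, LongWritable.class, Text.class,
            SequenceFile.CompressionType.BLOCK, new DefaultCodec());
        try {
          writer.append(new LongWritable(0L), new Text("fieldA\t\tfieldC"));
        } finally {
          writer.close();
        }
      }
    }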
