Hello,

I need to write a MapReduce program that begins with 2 jobs:
 1. Convert raw log data to SequenceFiles
 2. Read from SequenceFiles, and cherry-pick completed events
  (incomplete events stay in SequenceFiles to be checked again later)
But I should be able to combine those 2 jobs into 1 job.

I just need to figure out how to write an InputFormat that uses 2 types of 
RecordReaders, depending on the input file type. Specifically, the inputs would 
be either raw log data (TextInputFormat), or partially processed log data 
(SequenceFileInputFormat).

I think I need to extend SequenceFileInputFormat to look for an identifying 
extension on the files. Then I would be able to return either a 
LineRecordReader or a SequenceFileRecordReader, and some logic in the map 
function could parse each raw line into a record.
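
To make the idea concrete, here is a minimal sketch of just the dispatch 
decision, written in plain Java with no Hadoop dependency. The ".seq" 
extension, the ReaderChooser class and the ReaderKind enum are all made up 
for illustration. In the real InputFormat I assume this check would sit in 
the overridden getRecordReader method, using the path of the split's file, 
and would construct the matching reader instead of returning an enum.

```java
import java.util.Locale;

public class ReaderChooser {
    /** Stand-ins for the two reader types (hypothetical, not Hadoop classes). */
    public enum ReaderKind { LINE, SEQUENCE_FILE }

    // Assumed marker extension for partially processed SequenceFiles.
    public static final String SEQ_EXTENSION = ".seq";

    /** Picks a reader kind from the file name alone. */
    public static ReaderKind chooseReader(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        // Anything without the marker extension is treated as raw log text.
        return lower.endsWith(SEQ_EXTENSION)
                ? ReaderKind.SEQUENCE_FILE
                : ReaderKind.LINE;
    }

    public static void main(String[] args) {
        System.out.println(chooseReader("events-0001.seq")); // SEQUENCE_FILE
        System.out.println(chooseReader("access.log"));      // LINE
    }
}
```

One thing this does not address: the two readers would hand the map 
function different key/value types, so the map logic would have to cope 
with both.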

Am I headed in the right direction? Or should I stick with running 2 jobs 
instead of trying to squash these steps into 1?

Thanks,

Stu Hood

Webmail.us

"You manage your business. We'll manage your email."®
