On 12/22/09 1:36 PM, "Bill Graham" <billgra...@gmail.com> wrote:
> I've written my own Processor to handle my log format per this wiki and
> I've run into a couple of gotchas:
> http://wiki.apache.org/hadoop/DemuxModification
>
> 1. The default processor is not the TsProcessor as documented, but the
> DefaultProcessor (see line 83 of Demux.java). This causes headaches
> because when using DefaultProcessor data always goes under minute "0" in
> hdfs, regardless of when in the hour it was created.

There is a generic method to build the record:

buildGenericRecord(record, recordEntry, timestamp, recordType);

This method builds up a key of the form:

Time partition/Primary Key/timestamp

When all records are rolled up into a large sequence file at the end of
the hour and the end of the day, the sequence file is sorted by time
partition and primary key. This data arrangement was put in place to
assist data scanning. When data is retrieved, use record.getTimestamp()
to find the real timestamp for the record.

TsProcessor is incomplete for now because the key in ChukwaRecord is used
in the hourly and daily rollups. Without buildGenericRecord, the hourly
and daily rollups will not work correctly. A minimal sketch of a parser
that uses buildGenericRecord follows at the end of this message.

> 2. When implementing a custom parser as shown in the wiki, how do you
> register the class so it gets included in the job that's submitted to
> the hadoop cluster? The only way I've been able to do this is to put my
> class in the package
> org.apache.hadoop.chukwa.extraction.demux.processor.mapper and then
> manually add that class to the chukwa-core-0.3.0.jar that is on my data
> processor, which is a pretty rough hack. Otherwise, I get class not
> found exceptions in my mapper.

The demux process is controlled by $CHUKWA_HOME/conf/chukwa-demux-conf.xml,
which maps each recordType to its parser class (see the config sketch at
the end of this message). There is a plan to load parser classes from the
classpath by using Java annotations. It is still in the initial planning
phase; design participation is welcome.

Hope this helps. :)

Regards,
Eric
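
For reference, here is a minimal, untested sketch of a custom parser that
calls buildGenericRecord, written against the 0.3.0 API. The class name
MyLogProcessor, the record type "MyRecordType", and the assumed log line
layout ("yyyy-MM-dd HH:mm:ss <message>") are made up for illustration;
AbstractProcessor and buildGenericRecord come from chukwa-core:

// Untested sketch: MyLogProcessor, "MyRecordType" and the date layout
// are hypothetical; AbstractProcessor and buildGenericRecord are from
// chukwa-core 0.3.0.
package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyLogProcessor extends AbstractProcessor {

  // Assumed log line layout: "yyyy-MM-dd HH:mm:ss <message>"
  private final SimpleDateFormat sdf =
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

  @Override
  protected void parse(String recordEntry,
      OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
      Reporter reporter) throws Throwable {

    // Pull the real event time out of the log line instead of letting
    // everything land under minute "0" of the hour.
    Date d = sdf.parse(recordEntry.substring(0, 19));
    long timestamp = d.getTime();

    ChukwaRecord record = new ChukwaRecord();

    // Builds the key as <time partition>/<primary key>/<timestamp>,
    // which the hourly and daily rollups sort on.
    buildGenericRecord(record, recordEntry, timestamp, "MyRecordType");

    // "key" is the ChukwaRecordKey field inherited from AbstractProcessor,
    // filled in by buildGenericRecord above.
    output.collect(key, record);
  }
}

Because buildGenericRecord fills in the key with Time partition/Primary
Key/timestamp, the hourly and daily rollups can sort and scan the output
correctly.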
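
And here is a hypothetical chukwa-demux-conf.xml entry wiring that record
type to the parser, assuming the file uses the standard Hadoop
configuration format with the property name as the recordType and the
value as the parser class:

<!-- Hypothetical entry in $CHUKWA_HOME/conf/chukwa-demux-conf.xml -->
<property>
  <name>MyRecordType</name>
  <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.MyLogProcessor</value>
  <description>Parser class for MyRecordType log entries</description>
</property>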