I have just sent this mail to Ari, but it is probably wise to share it with all of you:
Hello Ari,

I'm Oded Rosen, from the Legolas Media R&D team. We would like to use Chukwa to move data from our real-time servers into our Hadoop cluster. The data flow already reaches several GB/day, and we are about to grow it in the near future. Our main aim is to process the raw data (lines of the form fieldname1=value1<tab>fieldname2=value2...\n) into a format that fits straight into Hive for later processing.

We are already running a DirTailingAdaptor on our input directory and receive the collected data in the chukwa/logs directory. Now we would like to write our own Demux processor to process the sink data: extract only the fields we need, format them, and write the result to an output directory that will serve as the input directory of a Hive table. We have already written mapper/reducer classes that extract the wanted fields from the raw data and apply the needed formats.

We want to set up a Demux processor with these classes as its map/reduce classes, but we could not find any documentation on how to do that. So far, all we have managed is to run the default demux, which simply copies the data into the output directory.

We would appreciate any help you can offer us.

-- Oded
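P.S. To make the question more concrete, here is a rough sketch of the mapper-side processor we have in mind, modeled on the bundled DefaultProcessor/TsProcessor. The class name and the filtered field names are placeholders, and we have only guessed the parse() signature from the source, so this is untested:

  // Placed in the same package as the bundled processors, so that the
  // inherited chunk/archiveKey/key fields of AbstractProcessor are visible.
  package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;

  import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
  import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class LegolasFieldsProcessor extends AbstractProcessor {

    @Override
    protected void parse(String recordEntry,
                         OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
                         Reporter reporter) throws Throwable {
      ChukwaRecord record = new ChukwaRecord();
      // Fills in the standard record metadata and sets the inherited 'key'.
      buildGenericRecord(record, recordEntry,
                         archiveKey.getTimePartition(), chunk.getDataType());

      // Our raw lines look like: fieldname1=value1<tab>fieldname2=value2...
      for (String pair : recordEntry.split("\t")) {
        int eq = pair.indexOf('=');
        if (eq <= 0) {
          continue; // skip malformed pairs
        }
        String field = pair.substring(0, eq);
        String value = pair.substring(eq + 1);
        // "fieldname1"/"fieldname2" stand in for the fields we actually keep.
        if (field.equals("fieldname1") || field.equals("fieldname2")) {
          record.add(field, value);
        }
      }
      output.collect(key, record);
    }
  }

If we read the Demux code correctly, the processor is then picked per data type through chukwa-demux-conf.xml (a property whose name is our data type and whose value is the processor class name), and the reduce side seems to be chosen by the ReduceType set on the ChukwaRecordKey. Please correct us if we got any of this wrong.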