Thank you for the insight. "Ariel Rabkin" <[email protected]> said:

> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[email protected]> wrote:
>>
>> Hi Matt,
>>
>> The duplicate filtering in Chukwa 0.3.0 depends on data loading into
>> MySQL. Rows sharing the same primary key update the same row, which
>> removes duplicates. It would be possible to build a duplicate-detection
>> step prior to demux that filters data based on sequence id + data type +
>> csource (host), but this hasn't been implemented because the primary-key
>> update method works well for my use case.
>
> This isn't quite right. There is support in 0.3 and later versions for
> doing de-duplication at the collector, in the manner Eric describes.
> It works as a filter in the writer pipeline.
>
> You need the following in your configuration:
>
> <property>
>   <name>chukwaCollector.writerClass</name>
>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
> </property>
>
> <property>
>   <name>chukwaCollector.pipeline</name>
>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
> </property>
>
> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
> background.
>
> --Ari
>
> --
> Ari Rabkin [email protected]
> UC Berkeley Computer Science Department
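For anyone curious how the keying Eric mentions could work, here is a minimal sketch: drop a chunk when its (sequence id, data type, csource) key has already been seen. This is purely illustrative; the class and method names below are hypothetical, not Chukwa's actual Dedup filter API, and a real implementation would need to bound the memory used by the seen-key set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of collector-side de-duplication, keyed the way
// Eric describes: sequence id + data type + csource (host).
public class DedupSketch {
    private final Set<String> seen = new HashSet<>();

    // Returns true if this (seqId, dataType, csource) combination
    // was already observed, i.e. the chunk should be dropped.
    public boolean isDuplicate(long seqId, String dataType, String csource) {
        // The composite key mirrors the MySQL primary key whose row
        // updates would otherwise collapse the duplicates downstream.
        String key = seqId + "|" + dataType + "|" + csource;
        return !seen.add(key); // add() returns false if the key was present
    }

    public static void main(String[] args) {
        DedupSketch d = new DedupSketch();
        System.out.println(d.isDuplicate(1, "HadoopLog", "host1")); // false
        System.out.println(d.isDuplicate(1, "HadoopLog", "host1")); // true
        System.out.println(d.isDuplicate(1, "HadoopLog", "host2")); // false
    }
}
```

The same chunk re-sent from the same host is filtered, while an identical sequence id from a different host or data type passes through, matching the row-update semantics of the MySQL approach.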
