Note that the Dedup filter only works within a single collector. If you run multiple collectors, it will not catch duplicates across them.
Regards,
Eric

On 10/22/10 9:21 AM, "Matt Davies" <[email protected]> wrote:

> Thank you for the insight.
>
> "Ariel Rabkin" <[email protected]> said:
>
>> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[email protected]> wrote:
>>> Hi Matt,
>>>
>>> The duplication filtering in Chukwa 0.3.0 depends on data loading to
>>> mysql. The same primary key will update to the same row to remove
>>> duplicates. It is possible to build a duplication detection process
>>> prior to demux which filters data based on sequence id + data type +
>>> csource (host), but this hasn't been implemented because the primary
>>> key update method works well for my use case.
>>
>> This isn't quite right. There is support in 0.3 and later versions for
>> doing de-duplication at the collector, in the manner Eric describes.
>> It works as a filter in the writer pipeline.
>>
>> You need the following in your configuration:
>>
>> <property>
>>   <name>chukwaCollector.writerClass</name>
>>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
>> </property>
>>
>> <property>
>>   <name>chukwaCollector.pipeline</name>
>>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
>> </property>
>>
>> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
>> background
>>
>> --Ari
>>
>> --
>> Ari Rabkin [email protected]
>> UC Berkeley Computer Science Department
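For readers following along, the dedup key the thread describes (sequence id + data type + csource/host) can be sketched as a small standalone filter. This is a minimal illustration, not Chukwa code: the class and method names below (ChunkDedupFilter, isDuplicate) are hypothetical, and as Eric notes, in-memory state like this only deduplicates within a single collector process.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal sketch of duplicate detection keyed on
 * (sequence id, data type, source host), as described in the thread.
 * Hypothetical names; not part of the Chukwa API.
 */
public class ChunkDedupFilter {
    // Keys seen by THIS collector only; a second collector would have
    // its own set, which is why cross-collector duplicates slip through.
    private final Set<String> seen = new HashSet<>();

    /**
     * Returns true if a chunk with this key was already seen;
     * otherwise records the key and returns false.
     */
    public boolean isDuplicate(long seqId, String dataType, String sourceHost) {
        // Compose the key the way the thread suggests:
        // sequence id + data type + csource (host).
        String key = seqId + "/" + dataType + "/" + sourceHost;
        return !seen.add(key); // Set.add returns false if the key was present
    }

    public static void main(String[] args) {
        ChunkDedupFilter filter = new ChunkDedupFilter();
        System.out.println(filter.isDuplicate(42L, "SysLog", "host1")); // first sighting
        System.out.println(filter.isDuplicate(42L, "SysLog", "host1")); // duplicate
        System.out.println(filter.isDuplicate(42L, "SysLog", "host2")); // different host, not a duplicate
    }
}
```

A production version would also need to bound the in-memory set (e.g. expire old keys), since an unbounded HashSet grows forever under continuous ingest.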
