Note that the Dedup filter only helps within a single collector.  If you run
multiple collectors, it will not help, since each collector only sees its own
stream.

Regards,
Eric

On 10/22/10 9:21 AM, "Matt Davies" <[email protected]> wrote:

> Thank you for the insight.
> 
> "Ariel Rabkin" <[email protected]> said:
> 
>> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[email protected]> wrote:
>>> Hi Matt,
>> 
>> 
>>> 
>>> The duplicate filtering in Chukwa 0.3.0 depends on data loading into
>>> mysql.  Records with the same primary key update the same row, which
>>> removes duplicates.  It would be possible to build a duplicate-detection
>>> process prior to demux that filters data based on sequence id + data
>>> type + csource (host), but this hasn't been implemented because the
>>> primary-key update method works well for my use case.
>> 
>> This isn't quite right. There is support in 0.3 and later versions for
>> doing de-duplication at the collector, in the manner Eric describes.
>> It works as a filter in the writer pipeline.
>> 
>> You need the following in your configuration:
>> 
>> <property>
>>   <name>chukwaCollector.writerClass</name>
>>   
>> <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
>> </property>
>> 
>> <property>
>>   <name>chukwaCollector.pipeline</name>
>> <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
>> </property>
>> 
>> 
>> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
>> background.
>> 
>> 
>> --Ari
>> 
>> --
>> Ari Rabkin [email protected]
>> UC Berkeley Computer Science Department
>> 
> 
> 
> 
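For readers skimming the archive: the Dedup stage discussed above filters on the combination of sequence ID, data type, and source host. A minimal sketch of that idea in plain Java (the class and method names here are hypothetical, not Chukwa's actual Dedup implementation):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: deduplicate chunks keyed on
// (sequence id, data type, source host), as described in the thread.
public class DedupSketch {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a key is seen; false for duplicates.
    public boolean accept(long seqId, String dataType, String source) {
        return seen.add(seqId + "/" + dataType + "/" + source);
    }

    public static void main(String[] args) {
        DedupSketch d = new DedupSketch();
        System.out.println(d.accept(42L, "HadoopLog", "host1")); // true: first sighting
        System.out.println(d.accept(42L, "HadoopLog", "host1")); // false: duplicate dropped
        System.out.println(d.accept(42L, "HadoopLog", "host2")); // true: different host
    }
}
```

The in-memory set also illustrates Eric's point at the top of the thread: because the seen-key state lives inside one process, this scheme can only suppress duplicates that pass through the same collector.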
