Thank you for the insight. "Ariel Rabkin" <[email protected]> said:

> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[email protected]> wrote:
>>
>> Hi Matt,
>>
>> The duplicate filtering in Chukwa 0.3.0 depends on data loading into
>> MySQL. Rows sharing the same primary key update the same row, which
>> removes duplicates. It would be possible to build a duplicate-detection
>> step prior to demux that filters data based on sequence id + data type +
>> csource (host), but this hasn't been implemented because the primary-key
>> update method works well for my use case.
>
> This isn't quite right. There is support in 0.3 and later versions for
> doing de-duplication at the collector, in the manner Eric describes.
> It works as a filter in the writer pipeline.
>
> You need the following in your configuration:
>
> <property>
>   <name>chukwaCollector.writerClass</name>
>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
> </property>
>
> <property>
>   <name>chukwaCollector.pipeline</name>
>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
> </property>
>
> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
> background.
>
> --Ari
>
> --
> Ari Rabkin [email protected]
> UC Berkeley Computer Science Department
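For anyone curious how the keying Eric mentions could work, here is a minimal sketch: drop a chunk when its (sequence id, data type, csource) key has already been seen. This is purely illustrative; the class and method names below are hypothetical, not Chukwa's actual Dedup filter API, and a real implementation would need to bound the memory used by the seen-key set.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of collector-side de-duplication, keyed the way
// Eric describes: sequence id + data type + csource (host).
public class DedupSketch {
    private final Set<String> seen = new HashSet<>();

    // Returns true if this (seqId, dataType, csource) combination
    // was already observed, i.e. the chunk should be dropped.
    public boolean isDuplicate(long seqId, String dataType, String csource) {
        // The composite key mirrors the MySQL primary key whose row
        // updates would otherwise collapse the duplicates downstream.
        String key = seqId + "|" + dataType + "|" + csource;
        return !seen.add(key); // add() returns false if the key was present
    }

    public static void main(String[] args) {
        DedupSketch d = new DedupSketch();
        System.out.println(d.isDuplicate(1, "HadoopLog", "host1")); // false
        System.out.println(d.isDuplicate(1, "HadoopLog", "host1")); // true
        System.out.println(d.isDuplicate(1, "HadoopLog", "host2")); // false
    }
}
```

The same chunk re-sent from the same host is filtered, while an identical sequence id from a different host or data type passes through, matching the row-update semantics of the MySQL approach.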
