Re: Seeing duplicate entries

Matt Davies Fri, 22 Oct 2010 12:23:39 -0700

Eric,

I've been working through several ideas on where to put the correction for our 
system.  Upon investigation, it seems that two separate demux operations each 
see the duplicate record, so doing some sort of distinct within demux seems 
unreliable for our use.


It appears you are putting data into a database and using the db to enforce the 
uniqueness constraint.  Do you see any way we could do a dedup operation after 
demux (within the chukwa environment) if we write our data straight into HDFS?

I could see writing a simple MR job to go and figure this stuff out for me, but 
it seems very inelegant and introduces more delay before I can utilize the data.
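
For what it's worth, the core of such a job would be small.  A rough sketch in 
Python, assuming each record carries a source host, a data type and a sequence 
id (the key Eric describes).  The Record type and its field names are made up 
for illustration and are not Chukwa classes; in a real MR job the map step 
would emit this key and the reduce step would keep one record per key.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    # Hypothetical record layout, for illustration only (not a Chukwa type).
    source: str      # host the data came from
    data_type: str   # record/data type
    seq_id: int      # sequence id of the chunk
    body: str        # payload


def dedup(records: Iterable[Record]) -> Iterator[Record]:
    """Yield the first record seen for each (source, data_type, seq_id) key."""
    seen = set()
    for rec in records:
        key = (rec.source, rec.data_type, rec.seq_id)
        if key in seen:
            continue  # duplicate of a record already emitted
        seen.add(key)
        yield rec


if __name__ == "__main__":
    sample = [
        Record("host1", "SysLog", 100, "a"),
        Record("host1", "SysLog", 100, "a"),  # duplicate
        Record("host2", "SysLog", 100, "b"),
    ]
    for rec in dedup(sample):
        print(rec)
```

This keeps every key in memory, so it only illustrates the logic; it does not 
remove the extra pass over the data that I'm worried about.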

Any other thoughts?

"Eric Yang" <[email protected]> said:

> Note, the Dedup collector is only good for a single collector.  If you use
> multiple collectors, it will not help.
> 
> Regards,
> Eric
> 
> On 10/22/10 9:21 AM, "Matt Davies" <[email protected]> wrote:
> 
>> Thank you for the insight.
>>
>> "Ariel Rabkin" <[email protected]> said:
>>
>>> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[email protected]> wrote:
>>>> Hi Matt,
>>>
>>>
>>>>
>>>> The duplication filtering in Chukwa 0.3.0 depends on data loading to
>>>> MySQL.  Records with the same primary key update the same row, which
>>>> removes duplicates.  It is possible to build a duplication detection
>>>> process prior to demux which filters data based on sequence id + data
>>>> type + csource (host), but this hasn't been implemented because the
>>>> primary key update method works well for my use case.
>>>
>>> This isn't quite right. There is support in 0.3 and later versions for
>>> doing de-duplication at the collector, in the manner Eric describes.
>>> It works as a filter in the writer pipeline.
>>>
>>> You need the following in your configuration:
>>>
>>> <property>
>>>   <name>chukwaCollector.writerClass</name>
>>>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
>>> </property>
>>>
>>> <property>
>>>   <name>chukwaCollector.pipeline</name>
>>>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
>>> </property>
>>>
>>>
>>> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
>>> background
>>>
>>>
>>> --Ari
>>>
>>> --
>>> Ari Rabkin [email protected]
>>> UC Berkeley Computer Science Department
>>>
>>
>>
>>
> 
>
