Eric, I've been working through several ideas about where to put the correction in our system. On investigation, it seems that two separate demux operations see the duplicate record, so doing some sort of distinct in demux seems unreliable for our use.
It appears you are putting data into a database and using the db to enforce the uniqueness constraint. Do you see any way we could do a dedup operation after demux (within the Chukwa environment) if we write our data straight into HDFS? I could write a simple MR job to figure this out for me, but that seems very inelegant and introduces more delay before I can use the data. Any other thoughts?

"Eric Yang" <[email protected]> said:

> Note, the Dedup collector is only good for a single collector. If you use
> multiple collector, it will not help.
>
> Regards,
> Eric
>
> On 10/22/10 9:21 AM, "Matt Davies" <[email protected]> wrote:
>
>> Thank you for the insight.
>>
>> "Ariel Rabkin" <[email protected]> said:
>>
>>> On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[email protected]> wrote:
>>>> Hi Matt,
>>>>
>>>> The duplication filtering in Chukwa 0.3.0 depends on data loading to
>>>> mysql. The same primary key will update to the same row to remove
>>>> duplicates. It is possible to build a duplication detection process
>>>> prior to demux which filter data based on sequence id + data type +
>>>> csource (host), but this hasn't been implemented because primary key
>>>> update method works well for my use case.
>>>
>>> This isn't quite right. There is support in 0.3 and later versions for
>>> doing de-duplication at the collector, in the manner Eric describes.
>>> It works as a filter in the writer pipeline.
>>>
>>> You need the following in your configuration:
>>>
>>> <property>
>>>   <name>chukwaCollector.writerClass</name>
>>>   <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
>>> </property>
>>>
>>> <property>
>>>   <name>chukwaCollector.pipeline</name>
>>>   <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
>>> </property>
>>>
>>> See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for
>>> background
>>>
>>> --Ari
>>>
>>> --
>>> Ari Rabkin [email protected]
>>> UC Berkeley Computer Science Department
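
For reference, the key mentioned in the thread (sequence id + data type + source host) can be illustrated with a small in-memory sketch. This is only an assumption-laden illustration of the idea, not Chukwa code: the `Record` type, its field names, and the `dedup` helper are all hypothetical, and a real post-demux pass would read from HDFS rather than a Python list.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    # Hypothetical record shape; field names are illustrative, not Chukwa's.
    seq_id: int
    data_type: str
    source: str
    body: str


def dedup(records: Iterable[Record]) -> Iterator[Record]:
    """Yield the first record seen for each (seq_id, data_type, source) key."""
    seen = set()
    for rec in records:
        key = (rec.seq_id, rec.data_type, rec.source)
        if key in seen:
            continue  # same key already emitted, so treat this as a duplicate
        seen.add(key)
        yield rec


records = [
    Record(100, "SysLog", "host-a", "line 1"),
    Record(100, "SysLog", "host-a", "line 1"),  # duplicate of the first record
    Record(100, "SysLog", "host-b", "line 1"),  # same seq id, different host
    Record(250, "SysLog", "host-a", "line 2"),
]
unique = list(dedup(records))
```

In an MR job the same composite key would presumably become the map output key, with the reducer emitting one value per key, which lets the dedup span output from multiple collectors or demux runs at the cost of the extra pass noted above.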
