Re: Seeing duplicate entries

Corbin Hoenes Sat, 23 Oct 2010 13:33:37 -0700

+1

I imagine it is just another pipelinable class loaded into the collector? If so, Bill's scenario would work.


Sent from my iPhone

On Oct 23, 2010, at 12:59 PM, Bill Graham <[email protected]> wrote:

Eric, I'm also curious about how the HBase integration works. Do you
have time to write something up on it? I'm interested in the
possibility of extending what's there to write my own custom data into
HBase from a collector, while said data also continues through to HDFS
as it does currently.
On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <[email protected]> wrote:
Eric, in Chukwa 0.5 is HBase the final store instead of HDFS? What format
will the HBase data be in (e.g. a ChukwaRecord object? Something user
configurable?)

Sent from my iPhone

On Oct 22, 2010, at 8:48 AM, Eric Yang <[email protected]> wrote:
Hi Matt,
This is expected in Chukwa archives. When an agent is unable to post to
the collector, it retries posting the same data to another collector, or
retries with the same collector when no other collector is available.
Under high load, a collector may have written the data without sending a
proper acknowledgement back to the agent. The Chukwa philosophy is to
retry until an acknowledgement is received. Duplicate data is filtered
out after the data has been received.
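
A minimal sketch of the retry behaviour described above, to show how a lost acknowledgement turns into duplicate stored data. This is not Chukwa code: `FlakyCollector`, `send_until_acked`, and the `lost_acks` parameter are hypothetical names invented for the illustration, and the collector is modelled as a plain in-memory list.

```python
class FlakyCollector:
    """Hypothetical collector that stores data but loses some acks."""

    def __init__(self, lost_acks):
        self.lost_acks = lost_acks  # number of acks that never reach the agent
        self.stored = []            # everything the collector has written

    def post(self, chunk):
        # The data is written first, so it is stored even if the ack is lost.
        self.stored.append(chunk)
        if self.lost_acks > 0:
            self.lost_acks -= 1
            return False  # agent sees no acknowledgement
        return True


def send_until_acked(collector, chunk, max_attempts=10):
    """Retry the same chunk until an acknowledgement arrives."""
    for attempt in range(1, max_attempts + 1):
        if collector.post(chunk):
            return attempt
    raise RuntimeError("no acknowledgement after %d attempts" % max_attempts)


collector = FlakyCollector(lost_acks=2)
attempts = send_until_acked(collector, "line-1")
print(attempts, collector.stored)
```

With two lost acknowledgements, the single source line is stored three times, which matches the pattern of one unique entry appearing a varying number of times downstream.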

The duplicate filtering in Chukwa 0.3.0 depends on loading the data into
MySQL. Records with the same primary key update the same row, which
removes duplicates. It is possible to build a duplicate detection process
prior to demux that filters data based on sequence id + data type +
csource (host), but this hasn't been implemented because the primary key
update method works well for my use case.
In Chukwa 0.5, we are treating duplication the same as in Chukwa 0.3,
where it will replace any duplicated row in HBase based on timestamp +
HBase row key.
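
A rough sketch of what a pre-demux duplicate filter keyed on sequence id + data type + source could look like. As noted above, this has not been implemented in Chukwa, so everything here is an assumption for illustration: the `Chunk` type, its field names, and `drop_duplicates` are hypothetical, and the seen-key set is kept in memory, which would not scale to a real archive.

```python
from typing import Iterable, Iterator, NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for a collected chunk of data."""
    seq_id: int
    data_type: str
    source: str   # originating host
    payload: str


def drop_duplicates(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Yield each chunk once, keyed on (seq_id, data_type, source)."""
    seen = set()
    for chunk in chunks:
        key = (chunk.seq_id, chunk.data_type, chunk.source)
        if key in seen:
            continue  # a retried post of data already received
        seen.add(key)
        yield chunk


received = [
    Chunk(100, "SysLog", "host-a", "first line"),
    Chunk(100, "SysLog", "host-a", "first line"),   # retry duplicate
    Chunk(100, "SysLog", "host-b", "other host"),   # same seq id, other source
    Chunk(200, "SysLog", "host-a", "second line"),
]
unique = list(drop_duplicates(received))
print(len(unique))
```

The primary key update in MySQL and the row replacement in HBase reach the same result by overwriting rather than skipping: a second write with the same key lands on the same row.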

regards,
Eric
On Thu, Oct 21, 2010 at 8:22 PM, Matt Davies <[email protected]> wrote:
Hey everyone,
I have a situation where I'm seeing duplicated data downstream before the
demux process. It appears this happens during high system loads, and we are
still using the 0.3.0 series.
So, we have validated that there is a single, unique entry in our source
file which then shows up a random number of times before we see it in demux.
So, it appears that there is duplication happening somewhere between the
agent and collector.
Has anyone else seen this? Any ideas as to why we are seeing this during
high system loads, but not during lower loads?

TIA,
Matt
