[https://issues.apache.org/jira/browse/FLUME-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164120#comment-14164120]
Edward Sargisson commented on FLUME-2222:
-----------------------------------------
[~nicktgr15] BTW, the way we deal with that behaviour is to set an ID as early
in the pipeline as we can manage and use that ID when writing to ES. ES
overwrites records with the same ID, so a re-delivered event replaces the
earlier copy instead of creating a duplicate.
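A minimal sketch of that approach, assuming a custom interceptor and a sink
that reads the ID back out of an event header (the {{docId}} header name and
the class below are made up for illustration; Flume's own UUIDInterceptor in
the morphline-solr-sink module does much the same thing):
{code:java}
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

/**
 * Sketch of an interceptor that stamps each event with a stable ID as
 * early in the pipeline as possible. The "docId" header name is an
 * assumption; use whatever key your sink reads.
 */
public class EventIdInterceptor implements Interceptor {

  private static final String ID_HEADER = "docId"; // hypothetical header name

  @Override
  public void initialize() {
    // no state to set up
  }

  @Override
  public Event intercept(Event event) {
    Map<String, String> headers = event.getHeaders();
    // Only assign an ID if one is not already present, so a retried
    // event keeps the same ID it had on the first delivery attempt.
    if (!headers.containsKey(ID_HEADER)) {
      headers.put(ID_HEADER, UUID.randomUUID().toString());
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event event : events) {
      intercept(event);
    }
    return events;
  }

  @Override
  public void close() {
    // nothing to release
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new EventIdInterceptor();
    }

    @Override
    public void configure(Context context) {
      // no configuration needed for this sketch
    }
  }
}
{code}
The important part is assigning the ID before the event enters the channel, so
a rolled-back and re-delivered batch carries the same IDs; the sink then has
to use that header as the Elasticsearch document {{_id}} when indexing.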
> Duplicate entries in Elasticsearch when using Flume elasticsearch-sink
> ----------------------------------------------------------------------
>
> Key: FLUME-2222
> URL: https://issues.apache.org/jira/browse/FLUME-2222
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.4.0
> Environment: centos 6
> Reporter: Nikolaos Tsipas
> Assignee: Ashish Paliwal
> Labels: elasticsearch, sink
> Attachments: Screen Shot 2013-10-29 at 12.36.01.png
>
>
> Hello,
> I'm using the Flume elasticsearch-sink to transfer logs from EC2 instances
> to Elasticsearch, and I get duplicate entries for numerous documents.
> I noticed this issue when I sent a specific number of log lines to
> Elasticsearch through Flume and then counted them in Kibana to verify that
> all of them had arrived. Most of the time, especially when multiple Flume
> instances were used, I got duplicate entries, e.g. instead of receiving
> 10000 documents from an instance I received 10060. The level of duplication
> seems to be proportional to the number of instances sending log data
> simultaneously, e.g. with 3 Flume instances I get 10060, with 50 Flume
> instances I get 10300.
> Is duplication something I should expect when using the Flume
> elasticsearch-sink?
> There is a {{doRollback()}} method that is called on transaction failure,
> but I think it only rolls back the local Flume channel and not
> Elasticsearch.
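> To illustrate why I suspect this, a sink's {{process()}} method follows
> roughly the shape below (a simplified sketch, not the actual
> ElasticSearchSink source; {{indexIntoElasticsearch()}} is a made-up
> stand-in). If indexing fails partway through a batch, the rollback returns
> all of the events to the channel, but the documents that were already
> indexed stay in Elasticsearch, so the retried batch writes them again:
> {code:java}
> import org.apache.flume.Channel;
> import org.apache.flume.Event;
> import org.apache.flume.Transaction;
>
> // Simplified shape of a Flume sink's process() loop.
> public final class SinkLoopSketch {
>
>   static void drainOnce(Channel channel, int batchSize) {
>     Transaction txn = channel.getTransaction();
>     txn.begin();
>     try {
>       for (int i = 0; i < batchSize; i++) {
>         Event event = channel.take();
>         if (event == null) {
>           break; // channel drained for now
>         }
>         indexIntoElasticsearch(event); // hypothetical helper
>       }
>       txn.commit(); // events leave the channel only on commit
>     } catch (Throwable t) {
>       // Rollback puts the whole batch back in the channel, but
>       // Elasticsearch keeps whatever was indexed before the failure,
>       // so the retried batch produces duplicates.
>       txn.rollback();
>       throw new RuntimeException(t);
>     } finally {
>       txn.close();
>     }
>   }
>
>   static void indexIntoElasticsearch(Event event) {
>     // stand-in for the real indexing call
>   }
> }
> {code}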
> Any info/suggestions would be appreciated.
> Regards,
> Nick