[
https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755918#comment-13755918
]
Gabriel Commeau commented on FLUME-2173:
----------------------------------------
Hi Hari & team,
May I suggest the following idea: instead of assigning a UUID to the events,
which I assume would be arbitrary if not random, what about enforcing ordering
of events? Each "ingest" agent/client (i.e. first tier) would have a unique
identifier (e.g. a random UUID , or host name + agent name), and a local
counter, which would increment for every event generated/ingested by that
agent. Consequently, each event has an "ingest" ID and a counter value. In
ZooKeeper, instead of having a long list of UUID for the events recently gone
once, we'd only have as many Z-nodes as ingest agents/clients (let it be N),
which contain the highest counter value of events successfully passed through
from the corresponding ingest agent/client. If an event has successfully been
processed (i.e. the first time), the dedup channel increments the ZK counter to
the counter value of that event. If the ZK counter is equal or greater than the
counter value of the event, it's a duplicate.
The advantage is that a batch of M events successfully processed can be
acknowledged in k <= N ZooKeeper operations, and not in M - usually much larger
than N. The inconvenient is that if some events get stuck in process, the dedup
channels will be waiting on them, and so we'd need a way to resend these events
- and therefore, a channel seems like the appropriate place to do that. The
"ingest" channel can clear the events that have a counter value below the ZK
counter, as they have successfully been through once.
On another note, what about grouping the dedup channels for
exactly-once-semantics? We would define a namespace for this group of channels,
and would guaranty that the events come exactly once in that group of channels;
but it could come twice in 2 distinct groups of channels - say one that goes to
HDFS and one to HBase for instance. The ZK structure detailed above can be
duplicated for each namespace (which would be a parent Z-node); and in order to
clear the events, the "ingest" channel needs to check all existing namespaces.
I hope this helps.
> Exactly once semantics for Flume
> --------------------------------
>
> Key: FLUME-2173
> URL: https://issues.apache.org/jira/browse/FLUME-2173
> Project: Flume
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Hari Shreedharan
>
> Currently Flume guarantees only at least once semantics. This jira is meant
> to track exactly once semantics for Flume. My initial idea is to include uuid
> event ids on events at the original source (use a config to mark a source an
> original source) and identify destination sinks. At the destination sinks,
> use a unique ZK Znode to track the events. If once seen (and configured),
> pull the duplicate out.
> This might need some refactoring, but my belief is we can do this in a
> backward compatible way.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira