[ 
https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755918#comment-13755918
 ] 

Gabriel Commeau commented on FLUME-2173:
----------------------------------------

Hi Hari & team,

May I suggest the following idea: instead of assigning a UUID to the events, 
which I assume would be arbitrary if not random, what about enforcing ordering 
of events? Each "ingest" agent/client (i.e. first tier) would have a unique 
identifier (e.g. a random UUID , or host name + agent name), and a local 
counter, which would increment for every event generated/ingested by that 
agent. Consequently, each event has an "ingest" ID and a counter value. In 
ZooKeeper, instead of having a long list of UUID for the events recently gone 
once, we'd only have as many Z-nodes as ingest agents/clients (let it be N), 
which contain the highest counter value of events successfully passed through 
from the corresponding ingest agent/client. If an event has successfully been 
processed (i.e. the first time), the dedup channel increments the ZK counter to 
the counter value of that event. If the ZK counter is equal or greater than the 
counter value of the event, it's a duplicate.
The advantage is that a batch of M events successfully processed can be 
acknowledged in k <= N ZooKeeper operations, and not in M - usually much larger 
than N. The inconvenient is that if some events get stuck in process, the dedup 
channels will be waiting on them, and so we'd need a way to resend these events 
- and therefore, a channel seems like the appropriate place to do that. The 
"ingest" channel can clear the events that have a counter value below the ZK 
counter, as they have successfully been through once.

On another note, what about grouping the dedup channels for 
exactly-once-semantics? We would define a namespace for this group of channels, 
and would guaranty that the events come exactly once in that group of channels; 
but it could come twice in 2 distinct groups of channels - say one that goes to 
HDFS and one to HBase for instance. The ZK structure detailed above can be 
duplicated for each namespace (which would be a parent Z-node); and in order to 
clear the events, the "ingest" channel needs to check all existing namespaces.

I hope this helps.
                
> Exactly once semantics for Flume
> --------------------------------
>
>                 Key: FLUME-2173
>                 URL: https://issues.apache.org/jira/browse/FLUME-2173
>             Project: Flume
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>
> Currently Flume guarantees only at least once semantics. This jira is meant 
> to track exactly once semantics for Flume. My initial idea is to include uuid 
> event ids on events at the original source (use a config to mark a source an 
> original source) and identify destination sinks. At the destination sinks, 
> use a unique ZK Znode to track the events. If once seen (and configured), 
> pull the duplicate out.
> This might need some refactoring, but my belief is we can do this in a 
> backward compatible way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to