[ https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753337#comment-13753337 ]

Hari Shreedharan commented on FLUME-2173:
-----------------------------------------

Yep, that is what I was thinking about. I was planning to keep these as 
pluggable interfaces, so dropping the once-only requirement is as simple as 
plugging in a pass-through dedupe implementation. Local dedupe is also easy to 
implement, and we can simply suggest that users configure ZK-based dedupe at 
the final channel(s). That gives us low latency as well as local dedupe.
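
To make the pluggable part concrete, here is a rough sketch of what such an 
interface could look like - the Deduplicator name and methods are hypothetical, 
not existing Flume classes. A local in-memory implementation or a ZK-backed one 
would plug in the same way as the pass-through one:

{code:java}
import org.apache.flume.Event;

// Hypothetical pluggable dedupe interface -- a sketch, not an existing Flume
// API. Implementations decide whether an event (keyed by a UUID header
// stamped at the original source) has already been seen.
public interface Deduplicator {

  /** Returns true if the event has not been seen yet and should pass through. */
  boolean accept(Event event);

  /** Records that the event was committed downstream, so later copies are dropped. */
  void markCommitted(Event event);
}

/** Pass-through implementation for flows that do not need once-only semantics. */
class PassThroughDeduplicator implements Deduplicator {
  @Override
  public boolean accept(Event event) {
    return true;   // nothing is ever treated as a duplicate
  }

  @Override
  public void markCommitted(Event event) {
    // nothing to track
  }
}
{code}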

This brings me to another point - having local dedupe actually opens up some 
interesting possibilities. We could use local dedupe to get once-only 
processing of events entering an agent.

Putting in a once-only guarantee lets us do some interesting processing on 
events. For example, if we know that every event arrives exactly once (even 
when there are multiple channels), we could use that to keep "accurate" counts 
of events and event types.

In fact, if we can somehow do some processing on the sink side (after dedupe), 
we could do simple event processing while still moving the events through. I 
am wondering whether it makes sense to allow sinks (or a new sink-based 
component) to do some processing on events picked up from the channel and then 
write the results out to another channel. That would create a workflow that 
could look like this:

AvroSource->Channel->Sink->Channel->Sink->Channel->HDFS.
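
A very rough sketch of what that middle "processing sink" stage could look 
like on top of the existing Sink/Channel transaction semantics - the way the 
downstream channel gets wired in, and the transform itself, are assumptions 
here, not existing Flume configuration:

{code:java}
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.sink.AbstractSink;

// Hypothetical "processing sink": takes an event from its upstream channel,
// transforms a copy, and puts the result into a downstream channel for the
// next sink in the chain. On failure both transactions are rolled back, so
// the original event stays in the upstream channel and nothing is lost.
public class ProcessingSink extends AbstractSink {

  private Channel downstream;   // assumed to be wired in by configuration

  @Override
  public Status process() throws EventDeliveryException {
    Channel upstream = getChannel();
    Transaction inTx = upstream.getTransaction();
    Transaction outTx = downstream.getTransaction();
    inTx.begin();
    outTx.begin();
    try {
      Event event = upstream.take();
      if (event == null) {
        inTx.commit();
        outTx.commit();
        return Status.BACKOFF;
      }
      downstream.put(transform(event));   // copy + process; never mutate in place
      outTx.commit();
      // Note: the two commits are not atomic. A failure right here re-delivers
      // the event, which is exactly where downstream dedupe earns its keep.
      inTx.commit();
      return Status.READY;
    } catch (Throwable t) {
      rollbackQuietly(outTx);
      rollbackQuietly(inTx);
      throw new EventDeliveryException("Processing failed, rolled back", t);
    } finally {
      outTx.close();
      inTx.close();
    }
  }

  // Placeholder for the actual processing step; builds a new event rather
  // than touching the reference handed back by the channel.
  private Event transform(Event in) {
    return EventBuilder.withBody(in.getBody(), in.getHeaders());
  }

  private static void rollbackQuietly(Transaction tx) {
    try {
      tx.rollback();
    } catch (Exception e) {
      // the transaction may already be committed or rolled back
    }
  }
}
{code}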

This allows failed processing to be rolled back without losing data (assuming 
the sink actually copies the data rather than modifying it through references 
returned by a memory channel). This is roughly how classical processing 
systems work: processing code separated by queues. Allowing the sinks to pull 
from multiple channels would even let us do Cartesian-product-style processing 
too - like pseudo-joins.
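
For the multi-channel idea, a very hypothetical sketch of pulling one event 
from each of two upstream channels and merging them - it glosses over the real 
join logic (buffering, matching on a key header, timeouts):

{code:java}
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.event.EventBuilder;

// Hypothetical pseudo-join step inside a processing sink: take one event
// from each upstream channel in its own transaction and concatenate the
// bodies. If either side is empty, roll both back so nothing is dropped.
class PseudoJoin {

  static Event joinOnce(Channel left, Channel right) {
    Transaction lt = left.getTransaction();
    Transaction rt = right.getTransaction();
    lt.begin();
    rt.begin();
    try {
      Event l = left.take();
      Event r = right.take();
      if (l == null || r == null) {
        lt.rollback();   // put back whatever was taken; try again later
        rt.rollback();
        return null;
      }
      byte[] body = new byte[l.getBody().length + r.getBody().length];
      System.arraycopy(l.getBody(), 0, body, 0, l.getBody().length);
      System.arraycopy(r.getBody(), 0, body, l.getBody().length, r.getBody().length);
      Event joined = EventBuilder.withBody(body, l.getHeaders());
      lt.commit();
      rt.commit();
      return joined;
    } catch (RuntimeException e) {
      rollbackQuietly(lt);
      rollbackQuietly(rt);
      throw e;
    } finally {
      lt.close();
      rt.close();
    }
  }

  private static void rollbackQuietly(Transaction tx) {
    try {
      tx.rollback();
    } catch (Exception e) {
      // the transaction may already be committed or rolled back
    }
  }
}
{code}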

Doing this, combined with once-only delivery, would let us do some simple 
event processing quite reliably (I agree that the definition of "simple" 
differs from person to person).

Thoughts?
                
> Exactly once semantics for Flume
> --------------------------------
>
>                 Key: FLUME-2173
>                 URL: https://issues.apache.org/jira/browse/FLUME-2173
>             Project: Flume
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>
> Currently Flume guarantees only at-least-once semantics. This jira is meant 
> to track exactly-once semantics for Flume. My initial idea is to include UUID 
> event ids on events at the original source (using a config to mark a source 
> as an original source) and to identify destination sinks. At the destination 
> sinks, use a unique ZK znode to track the events. If an event has already 
> been seen (and dedupe is configured), pull the duplicate out.
> This might need some refactoring, but my belief is that we can do this in a 
> backward-compatible way.

