[
https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13753337#comment-13753337
]
Hari Shreedharan commented on FLUME-2173:
-----------------------------------------
Yep, that is what I was thinking about. I was planning to keep these as
pluggable interfaces, so an agent that does not need once-only delivery can
simply use a pass-through dedupe implementation. Local dedupe can be
implemented easily, and we can suggest that users configure a ZK-based dedupe
at the final channel(s). That gives us low latency while still allowing local
dedupe.
This brings me to another point: local dedupe actually opens up some
interesting possibilities. We could use it to provide once-only processing of
events entering an agent.
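
To make the idea concrete, here is a rough sketch of what the pluggable dedupe
interface could look like (names and method signatures are placeholders, not a
committed design):

{code:java}
// Hypothetical pluggable dedupe interface - names are placeholders only.
public interface Deduplicator {
  /** Set up whatever state the implementation needs (local map, ZK connection, etc.). */
  void initialize();

  /**
   * Returns true if this event id has not been seen before and records it;
   * returns false if the event is a duplicate and should be dropped.
   */
  boolean markIfNew(String eventId);

  void close();
}

// Pass-through implementation for agents that do not need once-only delivery:
// every event is treated as new, so nothing is ever dropped.
public class PassThroughDeduplicator implements Deduplicator {
  @Override public void initialize() { }
  @Override public boolean markIfNew(String eventId) { return true; }
  @Override public void close() { }
}
{code}

A local implementation could back markIfNew() with an in-memory or on-disk set
on the agent, while the ZK-based implementation at the final channel(s) would
record ids in ZooKeeper so duplicates arriving through different paths are
still caught.
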
A once-only guarantee also lets us do some interesting processing on events.
For example, if we know that each event will arrive exactly once (even when
there are multiple channels), we can produce "accurate" counts of events or
event types.
In fact, if we can do some processing on the sink side (after dedupe), we
could do simple event processing while still moving the events through. I am
wondering whether it makes sense to allow sinks, or a new sink-based
component, to process events picked up from a channel and then write them out
to another channel. That would create a workflow that could look like this:
AvroSource->Channel->Sink->Channel->Sink->Channel->HDFS.
This allows failed processing to be rolled back without losing data (assuming
the sink actually copies events and does not modify them through references
returned by a memory channel). This is roughly how classical processing
systems work: processing code separated by queues. Allowing the sinks to pull
from multiple channels would even let us do cartesian-product-like processing,
such as pseudo-joins.
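
Very roughly, such a processing stage could look like the sketch below. This
is not an existing Flume component - the class name, the processing hook, and
the wiring are all made up just to show the rollback idea using the existing
Channel/Transaction API:

{code:java}
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

// Hypothetical "processing sink" that takes from one channel, processes, and
// puts the result on the next channel. A failure rolls back both transactions,
// so the event stays in the input channel and nothing is lost.
public class ProcessingStage {
  private final Channel in;
  private final Channel out;

  public ProcessingStage(Channel in, Channel out) {
    this.in = in;
    this.out = out;
  }

  public void processOne() {
    Transaction inTx = in.getTransaction();
    Transaction outTx = out.getTransaction();
    inTx.begin();
    outTx.begin();
    try {
      Event event = in.take();
      if (event != null) {
        out.put(process(event));   // must be a copy, not a reference into 'in'
      }
      outTx.commit();
      inTx.commit();
    } catch (Throwable t) {
      outTx.rollback();
      inTx.rollback();
      throw new RuntimeException("Processing failed, event left in channel", t);
    } finally {
      outTx.close();
      inTx.close();
    }
  }

  // Placeholder for whatever per-event processing we decide to allow.
  private Event process(Event event) {
    return event;
  }
}
{code}
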
Combined with once-only delivery, this would let us do some simple event
processing quite reliably (I agree that the definition of "simple" differs
from person to person).
Thoughts?
> Exactly once semantics for Flume
> --------------------------------
>
> Key: FLUME-2173
> URL: https://issues.apache.org/jira/browse/FLUME-2173
> Project: Flume
> Issue Type: Bug
> Reporter: Hari Shreedharan
> Assignee: Hari Shreedharan
>
> Currently Flume guarantees only at-least-once semantics. This jira is meant
> to track exactly-once semantics for Flume. My initial idea is to include
> UUID event IDs on events at the original source (use a config to mark a
> source as an original source) and to identify destination sinks. At the
> destination sinks, use a unique ZK znode to track the events. If an event
> has already been seen (and dedupe is configured), pull the duplicate out.
> This might need some refactoring, but I believe we can do this in a
> backward-compatible way.
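
For illustration only, the znode-based check described above could be as
simple as the sketch below. The znode layout, class name, and error handling
are placeholders; a production version would also need to clean up old znodes
and consider the cost of one znode per event.

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of the destination-side duplicate check: the original source stamps
// each event with a UUID header; the sink tries to create a znode named after
// that UUID and treats "node already exists" as "duplicate, drop it".
public class ZkDuplicateChecker {
  private final ZooKeeper zk;
  private final String basePath;   // e.g. a per-flow path - placeholder layout

  public ZkDuplicateChecker(ZooKeeper zk, String basePath) {
    this.zk = zk;
    this.basePath = basePath;
  }

  /** Returns true if this event id is new; false if it was seen before. */
  public boolean markIfNew(String eventUuid) throws Exception {
    try {
      zk.create(basePath + "/" + eventUuid, new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      return true;
    } catch (KeeperException.NodeExistsException dup) {
      return false;   // already recorded by an earlier delivery - a duplicate
    }
  }
}
{code}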