[ 
https://issues.apache.org/jira/browse/FLUME-2173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13751975#comment-13751975
 ] 

Hari Shreedharan commented on FLUME-2173:
-----------------------------------------

Copying over the discussion from the dev@ list:

{quote}
Hi Arvind,

Thanks for your reply. You are right that the global state check and update at 
the sink will require each sink to explicitly support it. We could, of course, 
put this implementation in an abstract class that sinks inherit, but yes, this 
would still mean code changes. 
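A minimal sketch of that abstract-class idea (hypothetical names, not Flume's actual API; an in-memory set stands in for the global state store, which in practice would be ZooKeeper):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative interface for the global state check; a real implementation
// would be backed by ZooKeeper rather than local memory.
interface GlobalStateStore {
    // true if the id was recorded for the first time, false if already seen.
    boolean recordIfFirstSeen(String eventId);
}

class InMemoryStateStore implements GlobalStateStore {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();
    @Override
    public boolean recordIfFirstSeen(String eventId) {
        return seen.add(eventId);
    }
}

// Hypothetical abstract base class: subclasses only implement write();
// the duplicate check is inherited, so every such sink gets dedup for free.
abstract class DedupSink {
    private final GlobalStateStore store;
    DedupSink(GlobalStateStore store) { this.store = store; }

    protected abstract void write(String eventId, byte[] body);

    public final boolean process(String eventId, byte[] body) {
        if (!store.recordIfFirstSeen(eventId)) {
            return false; // duplicate: drop it
        }
        write(eventId, body);
        return true;
    }
}
```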

It makes sense to check state in the channels, pretty much the same way as in 
the sinks. What is a bit concerning is that we will need to do this check at 
every agent that the event passes through, and probably make some changes in 
the channel interface to get rid of race conditions (not sure if that is the 
case, but I think we will need to). Given that an event is likely to pass 
through 2-3 tiers, each event gets delayed by the time taken by that many ZK 
round-trips. I am open to this as well, especially considering that it is 
likely to be a better OOB experience for many users (the ones who have their 
own custom sinks). Would it suffice to check at the sinks at the terminal agent 
to make sure that an event gets written out only once? 

Thinking about it more, once-only delivery at the channel level also opens up 
some possibilities for processing events. A guarantee of seeing each event 
exactly once allows us to do event processing such as counters. That seems 
like a good side effect to have.

Either way, I am glad we agree on the aspect of checking a global state manager 
to verify that events are deduped.  


Thanks,
Hari

On Tuesday, August 27, 2013 at 2:12 PM, Arvind Prabhakar wrote:

Hi Hari,

Thanks for bringing this up for discussion. I think it will be tremendously
beneficial to Flume users if we can extend a once-only guarantee. Your
initial suggestion of having a Sink trap the events and reference a global
state to drop duplicates seems reasonable. Rather than pushing this
functionality to Sinks, is there any other way we can make it more generally
available? I raise this concern because otherwise this becomes a feature of a
particular sink, and not every sink will have the opportunity to implement it.

Alternatively what do you think about this being done at the channel level?
Since we normally do not see custom implementations of channels, an
implementation that works with the channel will likely be more useful for
the broader community of Flume users.

Regards,
Arvind


On Sun, Aug 25, 2013 at 9:07 AM, Hari Shreedharan <[email protected]>
wrote:

Hi Gabriel,

Thanks for your input. Regarding the case where we use a replicating channel
selector to replicate on purpose: we can easily make it configurable whether
to drop duplicate events or not. That should not be difficult to do.

The second point, where multiple agents/sinks could write the same event, can
be solved by namespacing the events. Each sink checks one namespace for the
event, and multiple sinks can belong to the same namespace. This way, if
multiple sinks are going to write to the same HDFS cluster, we can easily
drop any duplicate that occurs.
Unfortunately, this also does not work around the whole
HDFS-writing-but-throwing issue.
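The namespacing idea could be sketched roughly like this (illustrative names, not Flume APIs; an in-memory set stands in for the shared state):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sinks writing to the same destination (e.g. one HDFS cluster) share a
// namespace, so they agree on whether an event is a duplicate; sinks in
// different namespaces may each deliver their own copy of the event.
class NamespacedDedup {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // true on the first delivery of this event within the namespace.
    public boolean firstDelivery(String namespace, String eventUuid) {
        return seen.add(namespace + "/" + eventUuid);
    }
}
```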

I agree that updating ZK will hurt latency, but that is the cost of building
once-only semantics on a highly flexible system. If you look at the algorithm,
we actually go to ZK only once per event (a single create; there are no
updates). This can even happen per batch if needed, to reduce ZK round trips
(though I am not sure if ZK provides a batch API).
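The per-batch variant mentioned above might look like the following sketch: one call checks a whole batch of event ids, so the round trip to the state store (in-memory here, ZooKeeper in practice) is paid per batch rather than per event. Names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class BatchDedup {
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Records a batch of event ids in one call and returns only the ids
    // seen for the first time; previously seen ids are filtered out.
    public List<String> filterNew(List<String> eventIds) {
        List<String> fresh = new ArrayList<>();
        for (String id : eventIds) {
            if (seen.add(id)) {
                fresh.add(id);
            }
        }
        return fresh;
    }
}
```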

The two-phase commit approach sounds good, but it might require interface
changes, which can now only be made in Flume 2.x. Also, if we use a single
UUID combined with several flags, we might be able to work around duplicates
caused by this replication.


Thanks,
Hari


On Sunday, August 25, 2013 at 7:24 AM, Gabriel Commeau wrote:

Hi Hari,


I deleted my comment (again). The mailing list is probably a better avenue
to discuss this - sorry about that! :)

I can find at least one other way duplicate events can occur, so what I
provided helps to reduce duplicate events but is not sufficient to guarantee
exactly-once semantics. However, I still think that using a two-phase commit
when writing to multiple channels would benefit Flume. This should probably
be a different ticket though.

Concerning the algorithm you offered, the case of the replicating channel
selector should probably be handled by creating a new UUID for each
duplicate message.
I hope this helps.
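Gabriel's replication suggestion could be sketched as follows (the Event class here is a hypothetical stand-in for Flume's event, with the id carried in a header):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for a Flume event: headers plus an opaque body.
class Event {
    final Map<String, String> headers = new HashMap<>();
    final byte[] body;
    Event(byte[] body) { this.body = body; }
}

class ReplicatingSelectorSketch {
    // When fanning an event out on purpose, give each copy a fresh UUID so
    // downstream dedup does not mistake intentional copies for duplicates.
    static List<Event> replicate(Event original, int channelCount) {
        List<Event> copies = new ArrayList<>();
        for (int i = 0; i < channelCount; i++) {
            Event copy = new Event(original.body);
            copy.headers.putAll(original.headers);
            copy.headers.put("eventId", UUID.randomUUID().toString());
            copies.add(copy);
        }
        return copies;
    }
}
```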


Regards,

Gabriel

{quote}
                
> Exactly once semantics for Flume
> --------------------------------
>
>                 Key: FLUME-2173
>                 URL: https://issues.apache.org/jira/browse/FLUME-2173
>             Project: Flume
>          Issue Type: Bug
>            Reporter: Hari Shreedharan
>            Assignee: Hari Shreedharan
>
> Currently Flume guarantees only at-least-once semantics. This jira is meant 
> to track exactly-once semantics for Flume. My initial idea is to include UUID 
> event ids on events at the original source (use a config to mark a source as 
> an original source) and identify destination sinks. At the destination sinks, 
> use a unique ZK znode to track the events. If an event was already seen (and 
> dedup is configured), pull the duplicate out.
> This might need some refactoring, but my belief is we can do this in a 
> backward-compatible way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
