Bosko,

With respect to potential dups generated by the producer, it can always
happen that a producer request is served at the broker but, due to a
failure, the producer never gets the ack. In that case, it's really up to
the producer to check whether the request actually went through. As part
of KAFKA-49, we are planning to return offsets to producers. With that,
the producer could read data from the offset of the last successful
request to check whether a message is already at the broker before
resending it, assuming there is some sort of GUID in the message that it
can check.
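
To make that concrete, here is a rough sketch of what such a check might
look like on the producer side once offsets are returned. The BrokerReader
and Message interfaces below are hypothetical stand-ins for whatever fetch
API the producer has available (not the current Kafka client API), and the
GUID is assumed to be something the producer itself embeds in each message.

  import java.util.List;

  // Sketch only: BrokerReader and Message are hypothetical stand-ins,
  // not the actual Kafka client API.
  public class ResendCheck {

      public interface Message {
          String guid();          // GUID the producer embedded in the payload
      }

      public interface BrokerReader {
          // Fetch all messages from the given offset to the current end of the log.
          List<Message> fetchFrom(long offset);
      }

      // After an unacknowledged send, decide whether the message actually
      // needs to be resent by scanning from the offset of the last acked request.
      public static boolean needsResend(BrokerReader reader,
                                        long lastAckedOffset,
                                        String candidateGuid) {
          for (Message m : reader.fetchFrom(lastAckedOffset)) {
              if (candidateGuid.equals(m.guid())) {
                  return false;   // broker already has it; skip the resend
              }
          }
          return true;            // not found, safe to resend
      }
  }

The cost is an extra fetch from the broker after a lost ack plus a GUID in
every message, but it turns a blind at-least-once resend into a checked one.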

Thanks,

Jun

On Sat, Dec 3, 2011 at 9:45 PM, Bosko Milekic <bosko.mile...@bloom-hq.com> wrote:

> I've started looking at Kafka as a possible replacement for Flume for
> moving event data into Hadoop, for on-the-fly stream processing, and
> for making some of our event data available to external parties via
> pub/sub.
>
> The two main reasons we're looking for alternatives to Flume:
>
> - While trying to avoid hitting the disk, there's a problem when
> moving events downstream if the downstream node (consumer) is unable
> to keep up and empty its socket receive buffers fast enough... in
> Flume's case the problem currently grinds the flow to a halt. This
> manifests itself as an upstream Flume node's socket send buffer
> filling up and blocking the running thread. This is a bummer because
> even "fan-out" nodes in Flume are currently single-threaded, so
> fanning out can completely stop if one of the downstream nodes gets
> bogged down [apparently this is addressed in the ongoing flume-ng
> work (I'm told the fan-out node is multi-threaded there)] -- I was
> pleased to see that this drawback was specifically addressed and
> avoided by Kafka's design (brokers always buffer and consumers pull
> as they see fit);
>
> - Flume guarantees at-least-once event delivery in E2E and DFO modes,
> and at-most-once in BE mode. This means that if you want complete
> data, de-dupping is required. I'm wondering whether de-dupping could
> be avoided with exactly-once delivery, and it seems to me, *at least
> for broker -> consumer communication*, that exactly-once is what you
> get with Kafka (actually, it's not that Kafka provides this
> explicitly, it's that one can build consumers that do).
>
> However, once I started looking at the producer API, I ran into a
> question that my (so far) brief perusal of the source has not yet
> answered - maybe you can help:
>
> The Kafka producer API doesn't appear to provide explicit guarantees
> of successful reception of messages at the broker end. Both
> http://readthedocs.org/docs/brod/en/latest/spec.html and
> https://issues.apache.org/jira/browse/KAFKA-49 confirm this, and upon
> reading the solution being worked on in KAFKA-49, I'm wondering
> whether the simple protocol being implemented will in fact guarantee
> exactly-once delivery.  Is this the goal?  If the acknowledgment from
> the broker gets lost (e.g., the producer sends a message successfully
> to the broker, the broker sends an ack, and the producer goes down
> before receiving the ack and bookkeeping it), then it seems to me
> that the producer would end up with slightly stale state and thus
> re-transmit the last message for which it never recorded an ack.
> This is fine, but it's an at-least-once guarantee, not exactly-once.
>
> Is enabling exactly-once message delivery a goal for Kafka, and if
> so, what am I missing regarding the producer protocol being developed
> in KAFKA-49 and its intended/suggested usage? How do existing Kafka
> setups (at LinkedIn or elsewhere) deal with a producer not being able
> to transmit to a broker for some transient time, without generating
> duplicates?
>
> Apologies if this was better suited for kafka-dev; feel free to
> forward it there if that is the case.
>
> Best,
> b
>
