I've started looking at Kafka as a possible replacement for Flume for moving event data into Hadoop, for on-the-fly stream processing, and for making some of our event data available to external parties via pub/sub.
The two main reasons we're looking for alternatives to Flume:

- While trying to avoid hitting the disk, there's a problem when moving events downstream and the downstream node (consumer) can't keep up and drain its socket receive buffers fast enough: in Flume this currently brings the flow to a halt. It manifests as an upstream Flume node's socket send buffer filling up and blocking the running thread. This is a bummer, as even "fan-out" nodes in Flume are currently single-threaded, so fanning out can stop completely if one of the downstream nodes gets bogged down (apparently this is addressed in the ongoing flume-ng work; I'm told the fan-out node is multi-threaded there). I was pleased to see that this drawback was specifically addressed and avoided by Kafka's design: brokers always buffer, and consumers pull as they see fit.

- Flume guarantees at-least-once event delivery in E2E and DFO modes, and at-most-once in BE mode. This means de-dupping is required if you want complete data. I'm wondering whether de-dupping could be avoided with exactly-once delivery, and it seems to me that, *at least for broker -> consumer communication*, exactly-once is what you get with Kafka. (It's not that Kafka provides this explicitly; it's that one can build consumers that do -- the first sketch in the P.S. below shows what I have in mind.)

However, once I started looking at the producer API, I ran into a question that my (so far) brief perusal of the source has not yet answered -- maybe you can help.

The Kafka producer API doesn't appear to provide explicit guarantees of successful reception of messages at the broker end. Both http://readthedocs.org/docs/brod/en/latest/spec.html and https://issues.apache.org/jira/browse/KAFKA-49 confirm this, and upon reading about the solution being worked on in KAFKA-49, I'm wondering whether the simple protocol being implemented will in fact guarantee exactly-once delivery. Is this the goal? If the acknowledgment from the broker gets lost (e.g., the producer sends a message successfully, the broker sends an ack, but the producer goes down before receiving and bookkeeping the ack), then it seems to me the producer would end up with slightly stale state and would re-transmit the last message whose ack it never saw. This is fine, but it's an at-least-once guarantee, not exactly-once (the second sketch in the P.S. spells out this scenario).

Is exactly-once message delivery a goal for Kafka, and if so, what am I missing regarding the producer protocol being developed in KAFKA-49 and its intended/suggested usage? How do existing Kafka setups (at LinkedIn or elsewhere) deal with a producer being unable to transmit to a broker for some transient period without generating duplicates?

Apologies if this was better suited for kafka-dev; feel free to forward it there if that is the case.

Best,
b
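P.S. To make the two delivery-semantics points above concrete, here are two rough sketches in Java. They use made-up interfaces (SimpleFetcher, OffsetStore, AckingBroker) rather than the real Kafka client classes, so please read them as illustrations of the scenarios, not as code against the actual API.

First, the consumer side: exactly-once seems buildable because the consumer owns its offset and can persist it atomically with whatever it produces, so replaying after a crash never double-applies a message.

    // Hypothetical consumer loop: exactly-once by committing the processed
    // results together with the new offset in one atomic step. SimpleFetcher
    // and OffsetStore are illustrative stand-ins, not real Kafka classes.
    public class ExactlyOnceConsumerSketch {

        interface SimpleFetcher {
            // Fetch the next batch of messages starting at the given offset.
            java.util.List<String> fetch(long offset);
            // Offset immediately after the last message of the previous fetch.
            long nextOffset();
        }

        interface OffsetStore {
            long loadLastCommittedOffset();
            // Atomically persist the results together with the new offset.
            void commit(java.util.List<String> results, long newOffset);
        }

        public static void run(SimpleFetcher fetcher, OffsetStore store) {
            long offset = store.loadLastCommittedOffset();
            while (true) {
                java.util.List<String> batch = fetcher.fetch(offset);
                java.util.List<String> results = new java.util.ArrayList<>();
                for (String msg : batch) {
                    results.add(msg.toUpperCase()); // placeholder "processing"
                }
                long newOffset = fetcher.nextOffset();
                // The atomic commit is the whole trick: after a crash we resume
                // from the committed offset and never apply a message twice.
                store.commit(results, newOffset);
                offset = newOffset;
            }
        }
    }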
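Second, the producer side: if the ack (not the message) is lost, the producer can't tell whether the broker stored the message, so retrying gives at-least-once and giving up gives at-most-once.

    // Toy illustration of the lost-ack case described above. AckingBroker is
    // hypothetical; the real KAFKA-49 protocol may well look different.
    public class LostAckRetrySketch {

        interface AckingBroker {
            // Returns true if an ack made it back to the producer; false if the
            // ack was lost in transit (the message itself may have been stored).
            boolean send(String message);
        }

        public static void produce(AckingBroker broker, String message, int maxRetries) {
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                boolean acked = broker.send(message);
                if (acked) {
                    return; // happy path: one copy delivered and acknowledged
                }
                // No ack: the message may or may not already be on the broker.
                // Retrying here is what turns the guarantee into at-least-once;
                // not retrying would make it at-most-once instead.
            }
        }
    }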