kafka git commit: KAFKA-5021; Update delivery semantics documentation for EoS (KIP-98)

jgus Tue, 20 Jun 2017 16:57:07 -0700

Repository: kafka
Updated Branches:
  refs/heads/trunk f28fc1100 -> cb3952a4f



KAFKA-5021; Update delivery semantics documentation for EoS (KIP-98)

Author: Jason Gustafson <[email protected]>

Reviewers: Apurva Mehta <[email protected]>, Ismael Juma <[email protected]>, 
Guozhang Wang <[email protected]>

Closes #3388 from hachikuji/KAFKA-5021


Project: http://git-wip-us.apache.org/repos/asf/kafka/repo
Commit: http://git-wip-us.apache.org/repos/asf/kafka/commit/cb3952a4
Tree: http://git-wip-us.apache.org/repos/asf/kafka/tree/cb3952a4
Diff: http://git-wip-us.apache.org/repos/asf/kafka/diff/cb3952a4

Branch: refs/heads/trunk
Commit: cb3952a4fd14f72374165acdc3ed777347ffdaba
Parents: f28fc11
Author: Jason Gustafson <[email protected]>
Authored: Tue Jun 20 16:56:38 2017 -0700
Committer: Jason Gustafson <[email protected]>
Committed: Tue Jun 20 16:56:38 2017 -0700

----------------------------------------------------------------------
 docs/design.html | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kafka/blob/cb3952a4/docs/design.html
----------------------------------------------------------------------
diff --git a/docs/design.html b/docs/design.html
index 8814c57..564df38 100644
--- a/docs/design.html
+++ b/docs/design.html
@@ -133,7 +133,7 @@
     the user can always compress its messages one at a time without any 
support needed from Kafka, but this can lead to very poor compression ratios as 
much of the redundancy is due to repetition between messages of
     the same type (e.g. field names in JSON or user agents in web logs or 
common string values). Efficient compression requires compressing multiple 
messages together rather than compressing each message individually.
     <p>
-    Kafka supports this by allowing recursive message sets. A batch of 
messages can be clumped together compressed and sent to the server in this 
form. This batch of messages will be written in compressed form and will
+    Kafka supports this with an efficient batching format. A batch of messages 
can be clumped together compressed and sent to the server in this form. This 
batch of messages will be written in compressed form and will
     remain compressed in the log and will only be decompressed by the consumer.
     <p>
     Kafka supports GZIP, Snappy and LZ4 compression protocols. More details on 
compression can be found <a 
href="https://cwiki.apache.org/confluence/display/KAFKA/Compression">here</a>.
@@ -242,10 +242,11 @@
     described in more detail in the next section. For now let's assume a 
perfect, lossless broker and try to understand the guarantees to the producer 
and consumer. If a producer attempts to publish a message and 
     experiences a network error it cannot be sure if this error happened 
before or after the message was committed. This is similar to the semantics of 
inserting into a database table with an autogenerated key.
     <p>
-    These are not the strongest possible semantics for publishers. Although we 
cannot be sure of what happened in the case of a network error, it is possible 
to allow the producer to generate a sort of "primary key" that
-    makes retrying the produce request idempotent. This feature is not trivial 
for a replicated system because of course it must work even (or especially) in 
the case of a server failure. With this feature it would
-    suffice for the producer to retry until it receives acknowledgement of a 
successfully committed message at which point we would guarantee the message 
had been published exactly once. We hope to add this in a future
-    Kafka version.
+    Prior to 0.11.0.0, if a producer failed to receive a response indicating 
that a message was committed, it had little choice but to resend the message. 
This provides at-least-once delivery semantics since the
+    message may be written to the log again during resending if the original 
request had in fact succeeded. Since 0.11.0.0, the Kafka producer also supports 
an idempotent delivery option which guarantees that resending
+    will not result in duplicate entries in the log. To achieve this, the 
broker assigns each producer an ID and deduplicates messages using a sequence 
number that is sent by the producer along with every message.
+    Also beginning with 0.11.0.0, the producer supports the ability to send 
messages to multiple topic partitions using transaction-like semantics: i.e. 
either all messages are successfully written or none of them are.
+    The main use case for this is exactly-once processing between Kafka topics 
(described below).
     <p>
     Not all use cases require such strong guarantees. For uses which are 
latency sensitive we allow the producer to specify the durability level it 
desires. If the producer specifies that it wants to wait on the message
     being committed this can take on the order of 10 ms. However the producer 
can also specify that it wants to perform the send completely asynchronously or 
that it wants to wait only until the leader (but not
@@ -261,15 +262,24 @@
     <li>It can read the messages, process the messages, and finally save its 
position. In this case there is a possibility that the consumer process crashes 
after processing messages but before saving its position.
     In this case when the new process takes over the first few messages it 
receives will already have been processed. This corresponds to the 
"at-least-once" semantics in the case of consumer failure. In many cases
     messages have a primary key and so the updates are idempotent (receiving 
the same message twice just overwrites a record with another copy of itself).
-    <li>So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to co-ordinate the consumer's position with
-    what is actually stored as output. The classic way of achieving this would 
be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
-    handled more simply and generally by simply letting the consumer store its 
offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
-    support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
-    or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
     </ol>
     <p>
-    So effectively Kafka guarantees at-least-once delivery by default and 
allows the user to implement at most once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-    messages. Exactly-once delivery requires co-operation with the destination 
storage system but Kafka provides the offset which makes implementing this 
straight-forward.
+    So what about exactly once semantics (i.e. the thing you actually want)? 
When consuming from a Kafka topic and producing to another topic (as in a <a 
href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>
+    application), we can leverage the new transactional producer capabilities 
in 0.11.0.0 that were mentioned above. The consumer's position is stored as a 
message in a topic, so we can write the offset to Kafka in the
+    same transaction as the output topics receiving the processed data. If the 
transaction is aborted, the consumer's position will revert to its old value 
and the produced data on the output topics will not be visible
+    to other consumers, depending on their "isolation level." In the default 
"read_uncommitted" isolation level, all messages are visible to consumers even 
if they were part of an aborted transaction,
+    but in "read_committed," the consumer will only return messages from 
transactions which were committed (and any messages which were not part of a 
transaction).
+    <p>
+    When writing to an external system, the limitation is in the need to 
coordinate the consumer's position with what is actually stored as output. The 
classic way of achieving this would be to introduce a two-phase
+    commit between the storage of the consumer position and the storage of the 
consumer's output. But this can be handled more simply and generally by letting 
the consumer store its offset in the same place as
+    its output. This is better because many of the output systems a consumer 
might want to write to will not support a two-phase commit. As an example of 
this, consider a
+    <a href="https://kafka.apache.org/documentation/#connect">Kafka 
Connect</a> connector which populates data in HDFS along with the offsets of 
the data it reads so that it is guaranteed that either data and
+    offsets are both updated or neither is. We follow similar patterns for 
many other data systems which require these stronger semantics and for which 
the messages do not have a primary key to allow for deduplication.
+    <p>
+    So effectively Kafka supports exactly-once delivery in <a 
href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>, and 
the transactional producer/consumer can be used generally to provide
+    exactly-once delivery when transferring and processing data between Kafka 
topics. Exactly-once delivery for other destination systems generally requires 
cooperation with such systems, but Kafka provides the
+    offset which makes implementing this feasible (see also <a 
href="https://kafka.apache.org/documentation/#connect">Kafka Connect</a>). 
Otherwise, Kafka guarantees at-least-once delivery by default, and allows
+    the user to implement at-most-once delivery by disabling retries on the 
producer and committing offsets in the consumer prior to processing a batch of 
messages.
 
     <h3><a id="replication" href="#replication">4.7 Replication</a></h3>
     <p>
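
For readers of this diff: the new text says the broker assigns each producer an ID and deduplicates messages using a sequence number sent with every message. The following is a minimal sketch of that idea only, not Kafka's broker code. The class name IdempotentLogSketch, its methods, and the rule of rejecting gaps in the sequence are assumptions made for illustration; the real broker tracks this state per partition and in batches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of a single partition's log (illustration only).
class IdempotentLogSketch {
    private final List<String> entries = new ArrayList<>();
    // Highest sequence number appended so far, keyed by producer ID.
    private final Map<Long, Integer> lastSequence = new HashMap<>();

    // Appends value unless (producerId, sequence) was already written.
    // Returns true if the value was appended, false if it was a duplicate.
    boolean append(long producerId, int sequence, String value) {
        int last = lastSequence.getOrDefault(producerId, -1);
        if (sequence <= last) {
            return false; // a resend of something already in the log: drop it
        }
        if (sequence != last + 1) {
            // Assumption of this sketch: a gap means an earlier message was lost.
            throw new IllegalStateException("out-of-order sequence " + sequence);
        }
        entries.add(value);
        lastSequence.put(producerId, sequence);
        return true;
    }

    List<String> entries() {
        return entries;
    }

    public static void main(String[] args) {
        IdempotentLogSketch log = new IdempotentLogSketch();
        log.append(7L, 0, "a");
        log.append(7L, 0, "a"); // producer retries after a lost acknowledgement
        log.append(7L, 1, "b");
        System.out.println(log.entries()); // prints [a, b]
    }
}
```

The retry of sequence 0 is acknowledged without being written again, which is why resending no longer produces duplicate log entries. In the real producer this behaviour is switched on with the enable.idempotence setting.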
