[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-20 Thread hachikuji
Github user hachikuji closed the pull request at:

https://github.com/apache/kafka-site/pull/60




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-19 Thread hachikuji
Github user hachikuji commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r122826323
  
--- Diff: 0110/design.html ---
@@ -264,21 +264,22 @@
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
 
 
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
-what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
-handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
-support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
-or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
-A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
-Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
-the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
-"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
-(and any messages which were not part of any transaction).
-
-So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery is supported when processing messages 
between Kafka topics, such as in Kafka Streams applications. Exactly-once 
delivery for other destination storage system generally requires
-cooperation with that system, but Kafka provides the offset which 
makes implementing this straight-forward.
+So what about exactly once semantics (i.e. the thing you actually 
want)? When consuming from a Kafka topic and producing to another topic (as in 
a <a href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>
+application), we can leverage the new transactional producer 
capabilities in 0.11.0.0 that were mentioned above. The consumer's position is 
stored as a message in a topic, so we can write the offset to Kafka in the
+same transaction as the output topics receiving the processed data. If 
the transaction is aborted, the consumer's position will revert to its old 
value and none of the output data will be visible to consumers. Consumers 
support an "isolation level" configuration
+to achieve this. In the default "read_uncommitted" mode, all messages 
are visible to consumers even if they were part of an aborted transaction, but 
in "read_committed" mode, the consumer will only return data from
+transactions which were committed (and any messages which were not 
part of any transaction).
+
+When writing to an external system, the limitation is in the need to 
coordinate the consumer's position with what is actually stored as output. The 
classic way of achieving this would be to introduce a two-phase
+commit between the storage for the consumer position and the storage 
of the consumers output. But this can be handled more simply and generally by 
simply letting the consumer store its offset in the same place as
--- End diff --

I'm not sure I understand the second question. Can you elaborate?
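
For concreteness, the consume-transform-produce flow described in the new
paragraph above could look roughly like the sketch below against the 0.11.0.0
Java clients. This is a hedged illustration, not text from the PR: the broker
address, topic names, group id, and transactional.id are placeholder
assumptions, and the "transform" step is reduced to copying records through.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.*;
    import org.apache.kafka.clients.producer.*;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.TopicPartition;

    public class TransactionalCopy {
        public static void main(String[] args) {
            Properties cProps = new Properties();
            cProps.put("bootstrap.servers", "localhost:9092");
            cProps.put("group.id", "copy-group");
            cProps.put("enable.auto.commit", "false"); // offsets are committed via the transaction instead
            cProps.put("isolation.level", "read_committed");
            cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
            consumer.subscribe(Collections.singletonList("input-topic"));

            Properties pProps = new Properties();
            pProps.put("bootstrap.servers", "localhost:9092");
            pProps.put("transactional.id", "copy-app-1"); // required before any transactional API calls
            pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            KafkaProducer<String, String> producer = new KafkaProducer<>(pProps);
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                if (records.isEmpty())
                    continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                    new OffsetAndMetadata(record.offset() + 1));
                    }
                    // The consumed offsets are committed in the same transaction as
                    // the output records, so both become visible atomically or not
                    // at all.
                    producer.sendOffsetsToTransaction(offsets, "copy-group");
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    // Aborting reverts the consumer's position and hides the output
                    // from read_committed consumers.
                    producer.abortTransaction();
                }
            }
        }
    }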




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-19 Thread hachikuji
Github user hachikuji commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r122825810
  
--- Diff: 0110/design.html ---
@@ -264,21 +264,22 @@
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
 
 
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
-what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
-handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
-support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
-or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
-A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
-Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
-the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
-"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
-(and any messages which were not part of any transaction).
-
-So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery is supported when processing messages 
between Kafka topics, such as in Kafka Streams applications. Exactly-once 
delivery for other destination storage system generally requires
-cooperation with that system, but Kafka provides the offset which 
makes implementing this straight-forward.
+So what about exactly once semantics (i.e. the thing you actually 
want)? When consuming from a Kafka topic and producing to another topic (as in 
a <a href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>
+application), we can leverage the new transactional producer 
capabilities in 0.11.0.0 that were mentioned above. The consumer's position is 
stored as a message in a topic, so we can write the offset to Kafka in the
+same transaction as the output topics receiving the processed data. If 
the transaction is aborted, the consumer's position will revert to its old 
value and none of the output data will be visible to consumers. Consumers 
support an "isolation level" configuration
+to achieve this. In the default "read_uncommitted" mode, all messages 
are visible to consumers even if they were part of an aborted transaction, but 
in "read_committed" mode, the consumer will only return data from
+transactions which were committed (and any messages which were not 
part of any transaction).
+
+When writing to an external system, the limitation is in the need to 
coordinate the consumer's position with what is actually stored as output. The 
classic way of achieving this would be to introduce a two-phase
+commit between the storage for the consumer position and the storage 
of the consumers output. But this can be handled more simply and generally by 
simply letting the consumer store its offset in the same place as
+its output. This is better because many of the output systems a 
consumer might want to write to will not support a two-phase commit. As an 
example of this, our Hadoop ETL that populates data in HDFS stores its
+offsets in HDFS with the data it reads so that it is guaranteed that 
either data and offsets are both updated or neither is. We follow similar 
patterns for many other data systems which require these stronger
+semantics and for which the messages do not have a primary key to allow for deduplication.

[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-14 Thread ijuma
Github user ijuma commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r122096399
  
--- Diff: 0110/design.html ---
@@ -264,21 +264,22 @@
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
 
 
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
-what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
-handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
-support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
-or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
-A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
-Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
-the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
-"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
-(and any messages which were not part of any transaction).
-
-So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery is supported when processing messages 
between Kafka topics, such as in Kafka Streams applications. Exactly-once 
delivery for other destination storage system generally requires
-cooperation with that system, but Kafka provides the offset which 
makes implementing this straight-forward.
+So what about exactly once semantics (i.e. the thing you actually 
want)? When consuming from a Kafka topic and producing to another topic (as in 
a <a href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>
+application), we can leverage the new transactional producer 
capabilities in 0.11.0.0 that were mentioned above. The consumer's position is 
stored as a message in a topic, so we can write the offset to Kafka in the
+same transaction as the output topics receiving the processed data. If 
the transaction is aborted, the consumer's position will revert to its old 
value and none of the output data will be visible to consumers. Consumers 
support an "isolation level" configuration
+to achieve this. In the default "read_uncommitted" mode, all messages 
are visible to consumers even if they were part of an aborted transaction, but 
in "read_committed" mode, the consumer will only return data from
+transactions which were committed (and any messages which were not 
part of any transaction).
+
+When writing to an external system, the limitation is in the need to 
coordinate the consumer's position with what is actually stored as output. The 
classic way of achieving this would be to introduce a two-phase
+commit between the storage for the consumer position and the storage 
of the consumers output. But this can be handled more simply and generally by 
simply letting the consumer store its offset in the same place as
+its output. This is better because many of the output systems a 
consumer might want to write to will not support a two-phase commit. As an 
example of this, our Hadoop ETL that populates data in HDFS stores its
--- End diff --

`our ETL` doesn't seem right because it's a Confluent Connector, right?
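
For what it's worth, the "store the offset in the same place as the output"
pattern from the diff above can be sketched against a relational store.
Everything here (the JDBC connection, the events and offsets tables, their
schemas) is an illustrative assumption, not anything shipped with Kafka:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;

    static void writeBatch(Connection conn, ConsumerRecords<String, String> records) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement insertEvent = conn.prepareStatement(
                 "INSERT INTO events (k, v) VALUES (?, ?)");
             PreparedStatement saveOffset = conn.prepareStatement(
                 "UPDATE offsets SET next_offset = ? WHERE topic = ? AND part = ?")) {
            for (ConsumerRecord<String, String> record : records) {
                insertEvent.setString(1, record.key());
                insertEvent.setString(2, record.value());
                insertEvent.executeUpdate();
                saveOffset.setLong(1, record.offset() + 1);
                saveOffset.setString(2, record.topic());
                saveOffset.setInt(3, record.partition());
                saveOffset.executeUpdate();
            }
            conn.commit();   // data and offset become durable together
        } catch (SQLException e) {
            conn.rollback(); // neither the data nor the offset is updated
            throw e;
        }
    }

On restart the application reads next_offset back from the offsets table and
seek()s the consumer to it, instead of relying on offsets committed to Kafka;
that is what makes the write and the position update atomic without a
two-phase commit.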



[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-14 Thread ijuma
Github user ijuma commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r122096525
  
--- Diff: 0110/design.html ---
@@ -264,21 +264,22 @@
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
 
 
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
-what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
-handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
-support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
-or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
-A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
-Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
-the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
-"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
-(and any messages which were not part of any transaction).
-
-So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery is supported when processing messages 
between Kafka topics, such as in Kafka Streams applications. Exactly-once 
delivery for other destination storage system generally requires
-cooperation with that system, but Kafka provides the offset which 
makes implementing this straight-forward.
+So what about exactly once semantics (i.e. the thing you actually 
want)? When consuming from a Kafka topic and producing to another topic (as in 
a <a href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>
+application), we can leverage the new transactional producer 
capabilities in 0.11.0.0 that were mentioned above. The consumer's position is 
stored as a message in a topic, so we can write the offset to Kafka in the
+same transaction as the output topics receiving the processed data. If 
the transaction is aborted, the consumer's position will revert to its old 
value and none of the output data will be visible to consumers. Consumers 
support an "isolation level" configuration
+to achieve this. In the default "read_uncommitted" mode, all messages 
are visible to consumers even if they were part of an aborted transaction, but 
in "read_committed" mode, the consumer will only return data from
+transactions which were committed (and any messages which were not 
part of any transaction).
+
+When writing to an external system, the limitation is in the need to 
coordinate the consumer's position with what is actually stored as output. The 
classic way of achieving this would be to introduce a two-phase
+commit between the storage for the consumer position and the storage 
of the consumers output. But this can be handled more simply and generally by 
simply letting the consumer store its offset in the same place as
+its output. This is better because many of the output systems a 
consumer might want to write to will not support a two-phase commit. As an 
example of this, our Hadoop ETL that populates data in HDFS stores its
+offsets in HDFS with the data it reads so that it is guaranteed that 
either data and offsets are both updated or neither is. We follow similar 
patterns for many other data systems which require these stronger
+semantics and for which the messages do not have a primary key to allow for deduplication.

[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-14 Thread ijuma
Github user ijuma commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r122096221
  
--- Diff: 0110/design.html ---
@@ -264,21 +264,22 @@
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
 
 
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
-what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
-handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
-support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
-or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
-A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
-Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
-the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
-"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
-(and any messages which were not part of any transaction).
-
-So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery is supported when processing messages 
between Kafka topics, such as in Kafka Streams applications. Exactly-once 
delivery for other destination storage system generally requires
-cooperation with that system, but Kafka provides the offset which 
makes implementing this straight-forward.
+So what about exactly once semantics (i.e. the thing you actually 
want)? When consuming from a Kafka topic and producing to another topic (as in 
a <a href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>
+application), we can leverage the new transactional producer 
capabilities in 0.11.0.0 that were mentioned above. The consumer's position is 
stored as a message in a topic, so we can write the offset to Kafka in the
+same transaction as the output topics receiving the processed data. If 
the transaction is aborted, the consumer's position will revert to its old 
value and none of the output data will be visible to consumers. Consumers 
support an "isolation level" configuration
+to achieve this. In the default "read_uncommitted" mode, all messages 
are visible to consumers even if they were part of an aborted transaction, but 
in "read_committed" mode, the consumer will only return data from
+transactions which were committed (and any messages which were not 
part of any transaction).
+
+When writing to an external system, the limitation is in the need to 
coordinate the consumer's position with what is actually stored as output. The 
classic way of achieving this would be to introduce a two-phase
+commit between the storage for the consumer position and the storage 
of the consumers output. But this can be handled more simply and generally by 
simply letting the consumer store its offset in the same place as
--- End diff --

"storage of the consumer position"? Since we talk about consumer position, 
should we be saying `consumer's output`?




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-14 Thread ijuma
Github user ijuma commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r122096607
  
--- Diff: 0110/design.html ---
@@ -264,21 +264,22 @@
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
 
 
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
-what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
-handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
-support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
-or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
-A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
-Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
-the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
-"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
-(and any messages which were not part of any transaction).
-
-So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery is supported when processing messages 
between Kafka topics, such as in Kafka Streams applications. Exactly-once 
delivery for other destination storage system generally requires
-cooperation with that system, but Kafka provides the offset which 
makes implementing this straight-forward.
+So what about exactly once semantics (i.e. the thing you actually 
want)? When consuming from a Kafka topic and producing to another topic (as in 
a <a href="https://kafka.apache.org/documentation/streams">Kafka Streams</a>
+application), we can leverage the new transactional producer 
capabilities in 0.11.0.0 that were mentioned above. The consumer's position is 
stored as a message in a topic, so we can write the offset to Kafka in the
+same transaction as the output topics receiving the processed data. If 
the transaction is aborted, the consumer's position will revert to its old 
value and none of the output data will be visible to consumers. Consumers 
support an "isolation level" configuration
+to achieve this. In the default "read_uncommitted" mode, all messages 
are visible to consumers even if they were part of an aborted transaction, but 
in "read_committed" mode, the consumer will only return data from
+transactions which were committed (and any messages which were not 
part of any transaction).
+
+When writing to an external system, the limitation is in the need to 
coordinate the consumer's position with what is actually stored as output. The 
classic way of achieving this would be to introduce a two-phase
+commit between the storage for the consumer position and the storage 
of the consumers output. But this can be handled more simply and generally by 
simply letting the consumer store its offset in the same place as
+its output. This is better because many of the output systems a 
consumer might want to write to will not support a two-phase commit. As an 
example of this, our Hadoop ETL that populates data in HDFS stores its
+offsets in HDFS with the data it reads so that it is guaranteed that 
either data and offsets are both updated or neither is. We follow similar 
patterns for many other data systems which require these stronger
+semantics and for which the messages do not have a primary key to allow for deduplication.

[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-12 Thread guozhangwang
Github user guozhangwang commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r121496464
  
--- Diff: 0110/design.html ---
@@ -261,15 +262,23 @@
 It can read the messages, process the messages, and finally save 
its position. In this case there is a possibility that the consumer process 
crashes after processing messages but before saving its position.
 In this case when the new process takes over the first few messages it 
receives will already have been processed. This corresponds to the 
"at-least-once" semantics in the case of consumer failure. In many cases
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to co-ordinate the consumer's position with
+
+
+So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
 what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
 handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
 support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
 or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
 
-So effectively Kafka guarantees at-least-once delivery by default and 
allows the user to implement at most once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery requires co-operation with the 
destination storage system but Kafka provides the offset which makes 
implementing this straight-forward.
+A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
+Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
+the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
+"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
+(and any messages which were not part of any transaction).
+
+So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
+messages. Exactly-once delivery is supported when processing messages 
between Kafka topics, such as in Kafka Streams applications. Exactly-once 
delivery for other destination storage system generally requires
--- End diff --

nit: add a ref link when mentioning Kafka Streams? 
https://kafka.apache.org/documentation/streams
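
On that note, in Kafka Streams itself the transactional machinery is surfaced
as a single config. A minimal sketch, assuming the 0.11.0.0 Streams API; the
application id and bootstrap servers are placeholders:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    Properties config = new Properties();
    config.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
    config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    // Enables the transactional read-process-write path end to end;
    // the default is "at_least_once".
    config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once");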




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-12 Thread guozhangwang
Github user guozhangwang commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r121496370
  
--- Diff: 0110/design.html ---
@@ -261,15 +262,23 @@
 It can read the messages, process the messages, and finally save 
its position. In this case there is a possibility that the consumer process 
crashes after processing messages but before saving its position.
 In this case when the new process takes over the first few messages it 
receives will already have been processed. This corresponds to the 
"at-least-once" semantics in the case of consumer failure. In many cases
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to co-ordinate the consumer's position with
+
+
+So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
 what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
 handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
 support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
 or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
 
-So effectively Kafka guarantees at-least-once delivery by default and 
allows the user to implement at most once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery requires co-operation with the 
destination storage system but Kafka provides the offset which makes 
implementing this straight-forward.
+A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
+Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
+the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
+"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
+(and any messages which were not part of any transaction).
+
+So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
+messages. Exactly-once delivery is supported when processing messages 
between Kafka topics, such as in Kafka Streams applications. Exactly-once 
delivery for other destination storage system generally requires
--- End diff --

By just reading the first sentence, it looks like "exactly-once is 
constrained to only the case when processing messages between Kafka topics", but 
only the second sentence points out that, e.g. even with an external system like 
HDFS, Kafka Connect can also achieve EOS by coordinating with the system.

Maybe we can also re-phrase it as "Exactly-once delivery is supported 
out-of-the-box when ...; for other destination storage systems it is generally 
required to cooperate ... to achieve exactly-once".




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-12 Thread guozhangwang
Github user guozhangwang commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r121492476
  
--- Diff: 0110/design.html ---
@@ -261,15 +262,23 @@
 It can read the messages, process the messages, and finally save 
its position. In this case there is a possibility that the consumer process 
crashes after processing messages but before saving its position.
 In this case when the new process takes over the first few messages it 
receives will already have been processed. This corresponds to the 
"at-least-once" semantics in the case of consumer failure. In many cases
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to co-ordinate the consumer's position with
+
+
+So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
 what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
 handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
 support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
 or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
 
-So effectively Kafka guarantees at-least-once delivery by default and 
allows the user to implement at most once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery requires co-operation with the 
destination storage system but Kafka provides the offset which makes 
implementing this straight-forward.
+A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
+Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
+the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
+"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
+(and any messages which were not part of any transaction).
+
+So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
--- End diff --

+1. Maybe we could say "Kafka guarantees exactly-once delivery, but by default 
it is turned off to optimize performance with at-least-once delivery" blah blah.




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-09 Thread hachikuji
Github user hachikuji commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r121249550
  
--- Diff: 0110/design.html ---
@@ -261,15 +262,23 @@
 It can read the messages, process the messages, and finally save 
its position. In this case there is a possibility that the consumer process 
crashes after processing messages but before saving its position.
 In this case when the new process takes over the first few messages it 
receives will already have been processed. This corresponds to the 
"at-least-once" semantics in the case of consumer failure. In many cases
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to co-ordinate the consumer's position with
+
+
+So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
 what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
 handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
 support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
 or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
 
-So effectively Kafka guarantees at-least-once delivery by default and 
allows the user to implement at most once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery requires co-operation with the 
destination storage system but Kafka provides the offset which makes 
implementing this straight-forward.
+A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
+Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
+the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
--- End diff --

Thanks. I will try to clarify these lines.
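
For reference, the consumer-side knob the diff is describing is the
isolation.level config. A minimal sketch; the broker address, group id, and
topic name are placeholders:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "my-group");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    // Default is "read_uncommitted": all records are returned, including those
    // from aborted transactions. With "read_committed", only records from
    // committed transactions (plus non-transactional records) are returned.
    props.put("isolation.level", "read_committed");
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("output-topic"));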




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-09 Thread hachikuji
Github user hachikuji commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r121249508
  
--- Diff: 0110/design.html ---
@@ -261,15 +262,23 @@
 It can read the messages, process the messages, and finally save 
its position. In this case there is a possibility that the consumer process 
crashes after processing messages but before saving its position.
 In this case when the new process takes over the first few messages it 
receives will already have been processed. This corresponds to the 
"at-least-once" semantics in the case of consumer failure. In many cases
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to co-ordinate the consumer's position with
+
+
+So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
 what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
 handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
 support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
 or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
 
-So effectively Kafka guarantees at-least-once delivery by default and 
allows the user to implement at most once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery requires co-operation with the 
destination storage system but Kafka provides the offset which makes 
implementing this straight-forward.
+A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
+Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
+the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
+"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
+(and any messages which were not part of any transaction).
+
+So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
--- End diff --

Good point.
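
The at-most-once recipe in that paragraph (commit the offset before
processing, disable producer retries) might look like the fragment below;
consumer setup is omitted, and process() is a hypothetical stand-in for the
application's handling:

    // Consumer side: save the position first, so a crash mid-batch skips
    // messages rather than reprocessing them.
    ConsumerRecords<String, String> records = consumer.poll(100);
    consumer.commitSync();
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical handler; a crash here drops the rest of the batch
    }

    // Producer side: disable retries so a failed send is never re-attempted.
    Properties producerProps = new Properties();
    producerProps.put("retries", "0");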




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-09 Thread ijuma
Github user ijuma commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r121246230
  
--- Diff: 0110/design.html ---
@@ -261,15 +262,23 @@
 It can read the messages, process the messages, and finally save 
its position. In this case there is a possibility that the consumer process 
crashes after processing messages but before saving its position.
 In this case when the new process takes over the first few messages it 
receives will already have been processed. This corresponds to the 
"at-least-once" semantics in the case of consumer failure. In many cases
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to co-ordinate the consumer's position with
+
+
+So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
 what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
 handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
 support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
 or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
 
-So effectively Kafka guarantees at-least-once delivery by default and 
allows the user to implement at most once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery requires co-operation with the 
destination storage system but Kafka provides the offset which makes 
implementing this straight-forward.
+A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
+Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
+the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
+"read_uncommitted" mode, all messages are visible to consumers even if 
they were part of an aborted transaction, but in "read_committed" mode, the 
consumer will only return data from transactions which were committed
+(and any messages which were not part of any transaction).
+
+So effectively Kafka guarantees at-least-once delivery by default, and 
allows the user to implement at-most-once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
--- End diff --

I wonder if we should mention the stronger guarantees first.




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-09 Thread ijuma
Github user ijuma commented on a diff in the pull request:

https://github.com/apache/kafka-site/pull/60#discussion_r121246131
  
--- Diff: 0110/design.html ---
@@ -261,15 +262,23 @@
 It can read the messages, process the messages, and finally save 
its position. In this case there is a possibility that the consumer process 
crashes after processing messages but before saving its position.
 In this case when the new process takes over the first few messages it 
receives will already have been processed. This corresponds to the 
"at-least-once" semantics in the case of consumer failure. In many cases
 messages have a primary key and so the updates are idempotent 
(receiving the same message twice just overwrites a record with another copy of 
itself).
-So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to co-ordinate the consumer's position with
+
+
+So what about exactly once semantics (i.e. the thing you actually 
want)? The limitation here is not actually a feature of the messaging system 
but rather the need to coordinate the consumer's position with
 what is actually stored as output. The classic way of achieving this 
would be to introduce a two-phase commit between the storage for the consumer 
position and the storage of the consumers output. But this can be
 handled more simply and generally by simply letting the consumer store 
its offset in the same place as its output. This is better because many of the 
output systems a consumer might want to write to will not
 support a two-phase commit. As an example of this, our Hadoop ETL that 
populates data in HDFS stores its offsets in HDFS with the data it reads so 
that it is guaranteed that either data and offsets are both updated
 or neither is. We follow similar patterns for many other data systems 
which require these stronger semantics and for which the messages do not have a 
primary key to allow for deduplication.
-
 
-So effectively Kafka guarantees at-least-once delivery by default and 
allows the user to implement at most once delivery by disabling retries on the 
producer and committing its offset prior to processing a batch of
-messages. Exactly-once delivery requires co-operation with the 
destination storage system but Kafka provides the offset which makes 
implementing this straight-forward.
+A special case is when the output system is just another Kafka topic 
(e.g. in a Kafka Streams application). Here we can leverage the new 
transactional producer capabilities in 0.11.0.0 that were mentioned above.
+Since the consumer's position is stored as a message in a topic, we 
can ensure that that topic is included in the same transaction as the output 
topics receiving the processed data. If the transaction is aborted,
+the consumer's position will revert to its old value and none of the 
output data will be visible to consumers. To enable this, consumers support an 
"isolation level" to achieve this. In the default
--- End diff --

Seems like this sentence could be clarified a little. `we can ensure that 
that topic is included in the same transaction`, `that` is repeated and it's 
unclear what it means to include a topic in a transaction. Did you mean the 
updates to the topic or something along those lines?

Also `the consumer's position will revert to its old value and none of the 
output data will be visible to consumers`. It may be worth qualifying 
`consumer` and `consumers` in that sentence.




[GitHub] kafka-site pull request #60: Update delivery semantics section for KIP-98

2017-06-08 Thread hachikuji
GitHub user hachikuji opened a pull request:

https://github.com/apache/kafka-site/pull/60

Update delivery semantics section for KIP-98



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hachikuji/kafka-site update-delivery-semantics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka-site/pull/60.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #60


commit 0759f3a22e377e12e857eb6da4977adb62261d30
Author: Jason Gustafson 
Date:   2017-06-08T21:28:13Z

Update delivery semantics section for KIP-98



