This is an automated email from the ASF dual-hosted git repository.

guozhang pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/kafka.git
The following commit(s) were added to refs/heads/trunk by this push:
     new 68b2f49  KAFKA-3514, Documentations: Add out of ordering in concepts. (#5458)
68b2f49 is described below

commit 68b2f49ea75059df5527378e8ae771195029c98a
Author: Guozhang Wang <wangg...@gmail.com>
AuthorDate: Tue Sep 11 16:28:52 2018 -0700

    KAFKA-3514, Documentations: Add out of ordering in concepts. (#5458)

    Reviewers: Matthias J. Sax <matth...@confluent.io>, John Roesler <j...@confluent.io>, Bill Bejeck <b...@confluent.io>
---
 docs/streams/core-concepts.html | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/docs/streams/core-concepts.html b/docs/streams/core-concepts.html
index 3f9eab5..015fbb4 100644
--- a/docs/streams/core-concepts.html
+++ b/docs/streams/core-concepts.html
@@ -172,6 +172,37 @@
     More details can be found in the <a href="/{{version}}/documentation#streamsconfigs"><b>Kafka Streams Configs</b></a> section.
     </p>
+
+    <h3><a id="streams_out_of_ordering" href="#streams_out_of_ordering">Out-of-Order Handling</a></h3>
+
+    <p>
+        Besides the guarantee that each record will be processed exactly-once, another issue that many stream processing applications face is how to
+        handle <a href="tbd">out-of-order data</a> that may impact their business logic. In Kafka Streams, there are two causes that could potentially
+        result in records arriving out of order with respect to their timestamps:
+    </p>
+
+    <ul>
+        <li> Within a topic-partition, a record's timestamp may not be monotonically increasing along with its offset. Since Kafka Streams always processes records within a topic-partition in offset order,
+             records with larger timestamps (but smaller offsets) may be processed earlier than records with smaller timestamps (but larger offsets) in the same topic-partition.
+        </li>
+        <li> Within a <a href="/{{version}}/documentation/streams/architecture#streams_architecture_tasks">stream task</a> that may be processing multiple topic-partitions, if users configure the application not to wait until all partitions contain some buffered data and instead
+             to pick the record with the smallest timestamp among the available partitions to process next, then records fetched later from the other topic-partitions may carry timestamps smaller than those of records already processed from another topic-partition.
+        </li>
+    </ul>
+
+    <p>
+        For stateless operations, out-of-order data does not impact the processing logic, since only one record is considered at a time, without looking into the history of past processed records;
+        for stateful operations such as aggregations and joins, however, out-of-order data could cause the processing logic to be incorrect. To handle such out-of-order data, users generally need to allow their applications
+        to wait longer while bookkeeping state during that wait time, i.e. to make trade-off decisions between latency, cost, and correctness.
+        In Kafka Streams specifically, users can configure the window operators of their windowed aggregations to achieve such trade-offs (details can be found in the <a href="/{{version}}/documentation/streams/developer-guide"><b>Developer Guide</b></a>).
+    </p>
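+
+    <p>
+        As an illustration of this trade-off (a minimal sketch rather than a complete recipe, using a hypothetical input topic named "events", and assuming that default serdes are configured and that the Streams version in use provides the <code>TimeWindows#grace()</code> setting), the following windowed count keeps each window open for out-of-order records for an extra grace period before it is closed:
+    </p>
+
+    <pre><code>
+import java.time.Duration;
+
+import org.apache.kafka.streams.StreamsBuilder;
+import org.apache.kafka.streams.kstream.KStream;
+import org.apache.kafka.streams.kstream.TimeWindows;
+
+StreamsBuilder builder = new StreamsBuilder();
+// Hypothetical input topic; key/value serdes come from the default config.
+KStream&lt;String, String&gt; events = builder.stream("events");
+
+events.groupByKey()
+      // 5-minute windows that keep accepting out-of-order records for one extra
+      // minute before being closed: a larger grace period trades higher latency
+      // and state-retention cost for more complete windowed results.
+      .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(1)))
+      .count();
+</code></pre>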
+    <p>
+        As for joins, users have to be aware that some out-of-order data cannot yet be handled in Streams by increasing latency and cost:
+    </p>
+
+    <ul>
+        <li> For Stream-Stream joins, all three types (inner, outer, left) handle out-of-order records correctly, but the resulting stream may contain unnecessary leftRecord-null results for left joins, and leftRecord-null or null-rightRecord results for outer joins. </li>
+        <li> For Stream-Table joins, out-of-order records are not handled (i.e., Streams applications don't check for out-of-order records and just process all records in offset order), and hence the join may produce unpredictable results. </li>
+        <li> For Table-Table joins, out-of-order records are not handled (i.e., Streams applications don't check for out-of-order records and just process all records in offset order). However, the join result is a changelog stream and hence will be eventually consistent. </li>
+    </ul>
+
     <div class="pagination">
         <a href="/{{version}}/documentation/streams/tutorial" class="pagination__btn pagination__btn__prev">Previous</a>
         <a href="/{{version}}/documentation/streams/architecture" class="pagination__btn pagination__btn__next">Next</a>