[ 
https://issues.apache.org/jira/browse/KAFKA-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456332#comment-15456332
 ] 

Matthias J. Sax commented on KAFKA-4113:
----------------------------------------

Two things to add. (1) prioritizing smaller timestamps in processing is "best 
effort" and thus only a weak guarantee. (2) This feature is useful if you have 
a more "static table" that you need to fully populate on *very first Streams 
application startup* -- this table might get an update every few weeks, but the 
"current content" of the table must be fully available because if I get a first 
KStream record, I need to enrich it via a join to the KTable and if the KTable 
record could be missing, the result would be wrong. (Assume you know that there 
will be a matching KTable record for every KStreams record). Thus, we need to 
have a way to _guarantee_ that the KTable is first populated completely. Our 
current "best effort approach" would start to process all streams from the 
beginning (even if it would prioritize KTable and catch up eventually). 
However, the first KStreams records would be processed incorrectly, as no 
matching KTable record might be found, and thus the KStream record gets dropped 
on inner-join (or has missing right-hand side on left-join an is thus 
corrupted).

> Allow KTable bootstrap
> ----------------------
>
>                 Key: KAFKA-4113
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4113
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: streams
>            Reporter: Matthias J. Sax
>            Assignee: Guozhang Wang
>
> On the mailing list, there are multiple request about the possibility to 
> "fully populate" a KTable before actual stream processing start.
> Even if it is somewhat difficult to define, when the initial populating phase 
> should end, there are multiple possibilities:
> The main idea is, that there is a rarely updated topic that contains the 
> data. Only after this topic got read completely and the KTable is ready, the 
> application should start processing. This would indicate, that on startup, 
> the current partition sizes must be fetched and stored, and after KTable got 
> populated up to those offsets, stream processing can start.
> Other discussed ideas are:
> 1) an initial fixed time period for populating
> (it might be hard for a user to estimate the correct value)
> 2) an "idle" period, ie, if no update to a KTable for a certain time is
> done, we consider it as populated
> 3) a timestamp cut off point, ie, all records with an older timestamp
> belong to the initial populating phase
> The API change is not decided yet, and the API desing is part of this JIRA.
> One suggestion (for option (4)) was:
> {noformat}
> KTable table = builder.table("topic", 1000); // populate the table without 
> reading any other topics until see one record with timestamp 1000.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to