[ 
https://issues.apache.org/jira/browse/KAFKA-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535324#comment-16535324
 ] 

Matthias J. Sax commented on KAFKA-4113:
----------------------------------------

Thanks for the feedback [~twbecker] and [~graphex]! We prioritize features based 
on user feedback. In the beginning, there were not many complaints about this.

For GlobalKTable the behavior is different by design, because GlobalKTables are 
designed for "static" data. From my point of view, the design space has two 
dimensions: partitioned vs broadcast data, and timestamp-aligned vs 
non-aligned processing. Currently, we only offer partitioned plus aligned 
(KTable) and broadcast plus non-aligned (GlobalKTable). Thus, two combinations 
are missing: partitioned plus non-aligned, and broadcast plus aligned.

Bootstrapping/pre-loading only makes sense for the non-aligned cases IMHO.
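
To make the two dimensions concrete, here is a small sketch that enumerates the four combinations and reports which ones have no counterpart today. The class, enum, and method names are made up for illustration and are not part of Kafka Streams; only the KTable and GlobalKTable mapping comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; none of these types exist in Kafka Streams.
final class TableDesignSpace {
    enum Distribution { PARTITIONED, BROADCAST }
    enum Alignment { TIMESTAMP_ALIGNED, NON_ALIGNED }

    // Name of the abstraction that covers a combination today, or null if none does.
    static String existingAbstraction(Distribution d, Alignment a) {
        if (d == Distribution.PARTITIONED && a == Alignment.TIMESTAMP_ALIGNED) {
            return "KTable";
        }
        if (d == Distribution.BROADCAST && a == Alignment.NON_ALIGNED) {
            return "GlobalKTable";
        }
        return null;
    }

    // Lists the combinations that no existing abstraction covers.
    static List<String> missingVariants() {
        List<String> missing = new ArrayList<>();
        for (Distribution d : Distribution.values()) {
            for (Alignment a : Alignment.values()) {
                if (existingAbstraction(d, a) == null) {
                    missing.add(d + "+" + a);
                }
            }
        }
        return missing;
    }
}
```

Calling `TableDesignSpace.missingVariants()` yields the partitioned plus non-aligned and the broadcast plus aligned combinations.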

We brainstormed about making the strategy pluggable at some point, but have 
never pushed it forward so far. I see some additional use cases for which this 
might make sense. It's all about feature prioritization and how much we can get 
done... Of course, it's an open-source project and contributions are very 
welcome :)

I personally believe that the timestamp-aligned semantics are correct and we 
should not sacrifice them. As mentioned above, I am happy to complete the 
design space and offer all 4 KTable variants. The non-timestamp-aligned KTable 
should not be too hard to implement. The broadcast plus timestamp-aligned 
variant is the most difficult one. The pluggable strategy might also not be too 
hard to implement. But all of those would require a KIP to get a sound design.

> Allow KTable bootstrap
> ----------------------
>
>                 Key: KAFKA-4113
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4113
>             Project: Kafka
>          Issue Type: New Feature
>          Components: streams
>            Reporter: Matthias J. Sax
>            Assignee: Guozhang Wang
>            Priority: Major
>
> On the mailing list, there are multiple requests about the possibility to 
> "fully populate" a KTable before actual stream processing starts.
> Even if it is somewhat difficult to define when the initial populating phase 
> should end, there are multiple possibilities:
> The main idea is that there is a rarely updated topic that contains the 
> data. Only after this topic has been read completely and the KTable is ready 
> should the application start processing. This means that on startup the 
> current end offsets of the partitions must be fetched and stored, and after 
> the KTable has been populated up to those offsets, stream processing can start.
> Other discussed ideas are:
> 1) an initial fixed time period for populating
> (it might be hard for a user to estimate the correct value)
> 2) an "idle" period, i.e., if a KTable receives no update for a certain
> time, we consider it populated
> 3) a timestamp cut-off point, i.e., all records with an older timestamp
> belong to the initial populating phase
> The API change is not decided yet, and the API design is part of this JIRA.
> One suggestion (for option 3) was:
> {noformat}
> // Populate the table without reading any other topics until a record
> // with timestamp 1000 is seen.
> KTable table = builder.table("topic", 1000);
> {noformat}
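
The main idea in the quoted description (capture the end offsets at startup, then start processing once the table has caught up to them) could look roughly like the sketch below. `BootstrapTracker` and its methods are hypothetical and not an existing Kafka Streams API. The sketch assumes the end offset of a partition is the offset of the next record to be written, so a partition counts as caught up once the last applied offset plus one reaches it. A real implementation would more likely compare the consumer's position, since compaction and transaction markers can leave gaps in the offsets.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not a Kafka Streams API: tracks whether a table has been
// populated up to the end offsets captured at startup.
final class BootstrapTracker {
    // partition -> end offset captured at startup (offset of the next record to be written)
    private final Map<Integer, Long> targetOffsets;
    // partition -> next offset to read, i.e. last applied offset + 1
    private final Map<Integer, Long> positions = new HashMap<>();

    BootstrapTracker(Map<Integer, Long> endOffsetsAtStartup) {
        this.targetOffsets = new HashMap<>(endOffsetsAtStartup);
    }

    // Record that the record at `offset` of `partition` has been applied to the table.
    void recordApplied(int partition, long offset) {
        positions.merge(partition, offset + 1, Long::max);
    }

    // True once every partition has been read up to its captured end offset.
    boolean isPopulated() {
        for (Map.Entry<Integer, Long> target : targetOffsets.entrySet()) {
            long position = positions.getOrDefault(target.getKey(), 0L);
            if (position < target.getValue()) {
                return false;
            }
        }
        return true;
    }
}
```

Records written after startup do not move the target, so the populating phase has a well-defined end even if the topic keeps receiving occasional updates.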



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
