[jira] [Commented] (KAFKA-4113) Allow KTable bootstrap

Greg Fodor (JIRA) Tue, 01 Nov 2016 16:51:18 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15627097#comment-15627097
 ]


Greg Fodor commented on KAFKA-4113:
-----------------------------------

Hey [~guozhang], I have been able to reproduce a bootstrapping issue on a fresh 
local node, and I think there is something here that I either need clarity on 
or that may even be a bug.

The root cause seems to be here:

https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/AbstractTask.java#L137

For a completely new node/topology with a KTable topic that has existing state, 
there is no consumer metadata, so this initializes the offset limit to 0, which 
causes the state restoration loop to consume essentially no records. 
I've only reproduced this in a local case where I was sinking data to a KTable 
topic and then initialized the topology for the first time, which is a one-time 
event, but I'm wondering if this offset limit default of zero could be causing 
issues later in the lifecycle of the topology as well.
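
To make the suspected failure mode concrete, here is a minimal sketch in plain 
Java. It is a simplified model, not the actual AbstractTask or state manager 
code: the names OffsetLimitSketch, restore and offsetLimit are hypothetical, 
and the Long.MAX_VALUE fallback is only there for contrast, not a proposed fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetLimitSketch {

    // Hypothetical stand-in for a restoration loop: it replays records from
    // the source topic and stops at the first offset that reaches the limit.
    static List<Long> restore(List<Long> topicOffsets, long offsetLimit) {
        List<Long> restored = new ArrayList<>();
        for (long offset : topicOffsets) {
            if (offset >= offsetLimit) {
                break; // limit reached, stop restoring
            }
            restored.add(offset);
        }
        return restored;
    }

    // Hypothetical: choose the limit from the committed offset, using the
    // given fallback when nothing has been committed yet (fresh node).
    static long offsetLimit(Long committedOffset, long fallback) {
        return committedOffset == null ? fallback : committedOffset;
    }

    public static void main(String[] args) {
        // The KTable topic already holds three records at offsets 0, 1, 2.
        List<Long> topic = Arrays.asList(0L, 1L, 2L);

        // Fresh node, no committed offset, fallback of 0: nothing is restored.
        long zeroLimit = offsetLimit(null, 0L);
        System.out.println(restore(topic, zeroLimit).size()); // prints 0

        // Same situation with an unbounded fallback: all records are restored.
        long openLimit = offsetLimit(null, Long.MAX_VALUE);
        System.out.println(restore(topic, openLimit).size()); // prints 3
    }
}
```

In this model, a limit of 0 makes the loop exit before the first record, 
which matches the behavior I am seeing on a fresh node.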

> Allow KTable bootstrap
> ----------------------
>
>                 Key: KAFKA-4113
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4113
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: streams
>            Reporter: Matthias J. Sax
>            Assignee: Guozhang Wang
>
> On the mailing list, there are multiple requests about the possibility to 
> "fully populate" a KTable before actual stream processing starts.
> Even if it is somewhat difficult to define when the initial populating phase 
> should end, there are multiple possibilities:
> The main idea is that there is a rarely updated topic that contains the 
> data. Only after this topic has been read completely and the KTable is ready 
> should the application start processing. This would mean that, on startup, 
> the current partition sizes must be fetched and stored, and after the KTable 
> has been populated up to those offsets, stream processing can start.
> Other discussed ideas are:
> 1) an initial fixed time period for populating
> (it might be hard for a user to estimate the correct value)
> 2) an "idle" period, i.e., if no update is made to a KTable for a certain
> time, we consider it as populated
> 3) a timestamp cut-off point, i.e., all records with an older timestamp
> belong to the initial populating phase
> The API change is not decided yet, and the API design is part of this JIRA.
> One suggestion (for option (4)) was:
> {noformat}
> // populate the table without reading any other topics until a record
> // with timestamp 1000 is seen
> KTable table = builder.table("topic", 1000);
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
