I like the idea; I'll try and implement that now

EDIT: Looking at this, I have some more thoughts.

Why limit this to the case where the user names the repartition topic?  Since we 
have a graph now, we can keep a reference to the repartition graph node and, at 
this point in the code, always re-use that node for repartitioning.  But this 
could be tricky, as it will still affect an existing topology. 

For example, consider a user with multiple `KGroupedStream` calls where a 
repartition is required.  While this means we have created multiple repartition 
topics, this also means that we have incremented the processor counter N times 
(N being the number of repartition topics).  If we adopt this approach, and the 
user names the repartition topic, and we reuse the first created repartition 
topic, we'll no longer increment the counter for the reused topics, which changes 
the numbering of all downstream operations, including changelog topics and any 
other repartition topics.  This "skipping incrementing" is 
similar to what happened when re-using a source topic for source `KTable` 
changelogs.
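
To make the numbering concern concrete, here is a rough sketch. It uses a made-up counter and made-up name prefixes (`NumberingSketch`, `REPARTITION`, `STATE-STORE` are all hypothetical), not the real builder internals; the only thing it tries to show is that skipping an increment shifts every generated name after it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NumberingSketch {
    // Hypothetical stand-in for the shared processor counter.
    private int counter = 0;

    // Generates a name from a prefix and the current counter value, then increments.
    private String nextName(String prefix) {
        return String.format(Locale.ROOT, "%s-%010d", prefix, counter++);
    }

    // Toy topology: each grouping needs a repartition followed by a state store
    // (which would back a changelog topic). With reuse enabled, only the first
    // grouping creates a repartition name; later groupings skip that increment.
    static List<String> build(int groupings, boolean reuseFirstRepartition) {
        NumberingSketch builder = new NumberingSketch();
        List<String> names = new ArrayList<>();
        boolean repartitionCreated = false;
        for (int i = 0; i < groupings; i++) {
            if (!repartitionCreated || !reuseFirstRepartition) {
                names.add(builder.nextName("REPARTITION"));
                repartitionCreated = true;
            }
            names.add(builder.nextName("STATE-STORE"));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(build(2, false)); // one repartition per grouping
        System.out.println(build(2, true));  // first repartition reused
    }
}
```

With two groupings, the second state store gets suffix `0000000003` without reuse and `0000000002` with reuse, so an existing application would see a differently named store after upgrading.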

I realize most users will probably name all repartition topics, but if we reuse 
the repartition topics in-line, they'll also have to make sure they name any 
changelog topics.  With the current optimization approach the numbering isn't 
affected; we just move the nodes around.

Additionally, I'm not sure how this will affect the current optimization 
approach (it might change it, as I think if we keep repartition node references 
as we go we could get "automatic" partial merging?)

I'm thinking this approach could be worth looking into, but as an immediate 
follow-on PR to this one, as it requires some thought.

WDYT?





[ Full content available at: https://github.com/apache/kafka/pull/5709 ]