Re: [DISCUSS] KIP-221: Repartition Topic Hints in Streams

Levani Kokhreidze Wed, 24 Jul 2019 13:16:30 -0700

Hi Matthias,

Thanks for the suggestion. I don’t have a strong opinion on that one.
Agree that avoiding unnecessary method overloads is a good idea.
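
To illustrate, here is a minimal sketch of the chained style. It uses a hypothetical stand-in class, `RepartitionedSketch`, rather than the real Kafka Streams API. The method names follow the ones quoted below, while the immutability and the validation are assumptions made only for this example:

```java
import java.util.Objects;

// Hypothetical, simplified stand-in for the proposed Repartitioned config
// object. It is NOT the Kafka Streams class; it only shows how two static
// entry points plus two chainable setters make a (name, partitions)
// overload unnecessary.
final class RepartitionedSketch {
    private final String name;                // null means "no name given"
    private final Integer numberOfPartitions; // null means "no count given"

    private RepartitionedSketch(final String name, final Integer numberOfPartitions) {
        this.name = name;
        this.numberOfPartitions = numberOfPartitions;
    }

    // Static entry point: start from a name.
    static RepartitionedSketch with(final String name) {
        return new RepartitionedSketch(Objects.requireNonNull(name, "name"), null);
    }

    // Static entry point: start from a partition count.
    static RepartitionedSketch numberOfPartitions(final int numberOfPartitions) {
        return new RepartitionedSketch(null, requirePositive(numberOfPartitions));
    }

    // Chainable setters: each returns a new immutable instance.
    RepartitionedSketch withName(final String name) {
        return new RepartitionedSketch(Objects.requireNonNull(name, "name"), numberOfPartitions);
    }

    RepartitionedSketch withNumberOfPartitions(final int numberOfPartitions) {
        return new RepartitionedSketch(name, requirePositive(numberOfPartitions));
    }

    String name() {
        return name;
    }

    Integer partitions() {
        return numberOfPartitions;
    }

    // Assumed validation: a partition count must be strictly positive.
    private static int requirePositive(final int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("number of partitions must be positive: " + n);
        }
        return n;
    }
}
```

With this shape, `RepartitionedSketch.with("by-user").withNumberOfPartitions(4)` covers what the two-argument overload would have done, and the order of the calls does not matter.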


I’ve updated the KIP.

Regards,
Levani


> On Jul 24, 2019, at 8:50 PM, Matthias J. Sax <[email protected]> wrote:
> 
> One question:
> 
> Why do we add
> 
>> Repartitioned#with(final String name, final int numberOfPartitions)
> 
> It seems that `#with(String name)`, `#numberOfPartitions(int)` in
> combination with `withName()` and `withNumberOfPartitions()` should be
> sufficient. Users can chain the method calls.
> 
> (I think it's valuable to keep the number of overloads small if possible.)
> 
> Otherwise LGTM.
> 
> 
> -Matthias
> 
> 
> On 7/23/19 2:18 PM, Levani Kokhreidze wrote:
>> Hello,
>> 
>> Thanks all for your feedback.
>> I started the voting procedure for this KIP. If there are any other concerns 
>> about this KIP, please let me know.
>> 
>> Regards,
>> Levani
>> 
>>> On Jul 20, 2019, at 8:39 PM, Levani Kokhreidze <[email protected]> 
>>> wrote:
>>> 
>>> Hi Matthias,
>>> 
>>> Thanks for the suggestion, makes sense.
>>> I’ve updated the KIP 
>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint).
>>> 
>>> Regards,
>>> Levani
>>> 
>>> 
>>>> On Jul 20, 2019, at 3:53 AM, Matthias J. Sax <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> Thanks for driving the KIP.
>>>> 
>>>> I agree that users need to be able to specify a partitioning strategy.
>>>> 
>>>> Sophie raises a fair point about topic configs and producer configs. My
>>>> take is that considering `Repartitioned` as an "extension" to `Produced`
>>>> that adds topic configuration is a good way to think about it and helps
>>>> to keep the API "clean".
>>>> 
>>>> 
>>>> With regard to method names, I would prefer to avoid abbreviations. Can
>>>> we rename:
>>>> 
>>>> `withNumOfPartitions` -> `withNumberOfPartitions`
>>>> 
>>>> Furthermore, it might be good to add some more `static` methods:
>>>> 
>>>> - Repartitioned.with(Serde<K>, Serde<V>)
>>>> - Repartitioned.withNumberOfPartitions(int)
>>>> - Repartitioned.streamPartitioner(StreamPartitioner)
>>>> 
>>>> 
>>>> -Matthias
>>>> 
>>>> On 7/19/19 3:33 PM, Levani Kokhreidze wrote:
>>>>> Totally agree. I think in the KStream interface it makes sense to have some 
>>>>> duplicate configurations between operators in order to keep the API simple 
>>>>> and usable.
>>>>> Also, the more surface the API has, the harder it is to maintain proper 
>>>>> backward compatibility.
>>>>> While the initial idea of keeping topic level configs separate was exciting, 
>>>>> having the Repartitioned class encapsulate some producer level configs makes 
>>>>> the API more readable.
>>>>> 
>>>>> Regards,
>>>>> Levani
>>>>> 
>>>>>> On Jul 20, 2019, at 1:15 AM, Sophie Blee-Goldman <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> I think that is a good point about trying to keep producer level
>>>>>> configurations and (repartition) topic level considerations separate.
>>>>>> Number of partitions is definitely purely a topic level configuration. But
>>>>>> on some level, serdes and partitioners are just as much a topic
>>>>>> configuration as a producer one. You could have two producers configured
>>>>>> with different serdes and/or partitioners, but if they are writing to the
>>>>>> same topic the result would be very difficult to parse. So in a sense, these
>>>>>> are configurations of topics in Streams, not just producers.
>>>>>> 
>>>>>> Another way to think of it: while the Streams API is not always true to
>>>>>> this, ideally all the relevant configs for an operator are wrapped into a
>>>>>> single object (in this case, Repartitioned). We could instead split out the
>>>>>> fields in common with Produced into a separate parameter to keep topic and
>>>>>> producer level configurations separate, but this increases the API surface
>>>>>> area by a lot. It's much more straightforward to just say "this is
>>>>>> everything that this particular operator needs" without worrying about what
>>>>>> exactly you're specifying.
>>>>>> 
>>>>>> I suppose you could alternatively make Produced a field of Repartitioned,
>>>>>> but I don't think we do this kind of composition elsewhere in Streams at
>>>>>> the moment.
>>>>>> 
>>>>>> On Fri, Jul 19, 2019 at 1:45 PM Levani Kokhreidze 
>>>>>> <[email protected] <mailto:[email protected]>>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi Bill,
>>>>>>> 
>>>>>>> Thanks a lot for the feedback.
>>>>>>> Yes, that makes sense. I’ve updated the KIP with the `Repartitioned#partitioner`
>>>>>>> configuration.
>>>>>>> In the beginning, I wanted to introduce a class for topic level
>>>>>>> configuration and keep topic level and producer level configurations (such
>>>>>>> as Produced) separate (see my second email in this thread).
>>>>>>> But while looking at the semantics of the KStream interface, I couldn’t really
>>>>>>> figure out a good operation name for a topic level configuration class, and
>>>>>>> just introducing a `Topic` config class was kinda breaking the semantics.
>>>>>>> So I think having a Repartitioned class which encapsulates topic and
>>>>>>> producer level configurations for internal topics is a viable thing to do.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Levani
>>>>>>> 
>>>>>>>> On Jul 19, 2019, at 7:47 PM, Bill Bejeck <[email protected] 
>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>> 
>>>>>>>> Hi Levani,
>>>>>>>> 
>>>>>>>> Thanks for resurrecting this KIP.
>>>>>>>> 
>>>>>>>> I'm also a +1 for adding a partition option.  In addition to the reason
>>>>>>>> provided by John, my reasoning is:
>>>>>>>> 
>>>>>>>> 1. Users may want to use something other than hash-based partitioning
>>>>>>>> 2. Users may wish to partition on something different than the key
>>>>>>>> without having to change the key.  For example:
>>>>>>>>   1. A combination of fields in the value in conjunction with the key
>>>>>>>>   2. Something other than the key
>>>>>>>> 3. We allow users to specify a partitioner on Produced hence in
>>>>>>>> KStream.to and KStream.through, so it makes sense for API consistency.
>>>>>>>> 
>>>>>>>> Just my 2 cents.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Bill
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jul 19, 2019 at 5:46 AM Levani Kokhreidze <
>>>>>>> [email protected] <mailto:[email protected]>>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi John,
>>>>>>>>> 
>>>>>>>>> In my mind it makes sense.
>>>>>>>>> If we add a partitioner configuration to the Repartitioned class, then in
>>>>>>>>> combination with specifying the number of partitions for internal topics, the
>>>>>>>>> user will have the opportunity to ensure co-partitioning before a join operation.
>>>>>>>>> I think this can be quite a powerful feature.
>>>>>>>>> Wondering what others think about this?
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>> Levani
>>>>>>>>> 
>>>>>>>>>> On Jul 18, 2019, at 1:20 AM, John Roesler <[email protected] 
>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Yes, I believe that's what I had in mind. Again, not totally sure it
>>>>>>>>>> makes sense, but I believe something similar is the rationale for
>>>>>>>>>> having the partitioner option in Produced.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> -John
>>>>>>>>>> 
>>>>>>>>>> On Wed, Jul 17, 2019 at 3:20 PM Levani Kokhreidze
>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hey John,
>>>>>>>>>>> 
>>>>>>>>>>> Oh, that’s an interesting use case.
>>>>>>>>>>> Do I understand this correctly: in your example I would first issue
>>>>>>>>>>> repartition(Repartitioned) with a proper partitioner that essentially
>>>>>>>>>>> would be the same as that of the topic I want to join with, and then do
>>>>>>>>>>> the KStream#join with the DSL?
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Levani
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 17, 2019, at 11:11 PM, John Roesler <[email protected] 
>>>>>>>>>>>> <mailto:[email protected]>>
>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hey, all, just to chime in,
>>>>>>>>>>>> 
>>>>>>>>>>>> I think it might be useful to have an option to specify the
>>>>>>>>>>>> partitioner. The case I have in mind is that some data may get
>>>>>>>>>>>> repartitioned and then joined with an input topic. If the right-side
>>>>>>>>>>>> input topic uses a custom partitioning strategy, then the
>>>>>>>>>>>> repartitioned stream also needs to be partitioned with the same
>>>>>>>>>>>> strategy.
>>>>>>>>>>>> 
>>>>>>>>>>>> Does that make sense, or did I maybe miss something important?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -John
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jul 17, 2019 at 2:48 PM Levani Kokhreidze
>>>>>>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yes, I was thinking about it as well. To be honest I’m not sure about
>>>>>>>>>>>>> it yet.
>>>>>>>>>>>>> As a Kafka Streams DSL user, I don’t really think I would need control
>>>>>>>>>>>>> over the partitioner for internal topics.
>>>>>>>>>>>>> As a user, I would assume that Kafka Streams knows best how to
>>>>>>>>>>>>> partition data for internal topics.
>>>>>>>>>>>>> In this KIP I wrote that Produced should be used only for topics that
>>>>>>>>>>>>> are created by the user in advance.
>>>>>>>>>>>>> In those cases maybe it makes sense to have the possibility to specify the
>>>>>>>>>>>>> partitioner.
>>>>>>>>>>>>> I don’t have a clear answer on that yet, but I guess specifying the
>>>>>>>>>>>>> partitioner can be added as well if there’s agreement on this.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Levani
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Jul 17, 2019, at 10:42 PM, Sophie Blee-Goldman <
>>>>>>>>> [email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for clearing that up. I agree that Repartitioned would be a useful
>>>>>>>>>>>>>> addition. I'm wondering if it might also need to have
>>>>>>>>>>>>>> a withStreamPartitioner method/field, similar to Produced? I'm not sure how
>>>>>>>>>>>>>> widely this feature is really used, but seems it should be available for
>>>>>>>>>>>>>> repartition topics.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Levani Kokhreidze <
>>>>>>>>> [email protected] <mailto:[email protected]>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hey Sophie,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> In both cases, KStream#repartition and KStream#repartition(Repartitioned),
>>>>>>>>>>>>>>> the topic will be created and managed by Kafka Streams.
>>>>>>>>>>>>>>> The idea of Repartitioned is to give the user more control over the topic,
>>>>>>>>>>>>>>> such as the number of partitions.
>>>>>>>>>>>>>>> I feel like the Repartitioned parameter is something that is missing in
>>>>>>>>>>>>>>> the current DSL design.
>>>>>>>>>>>>>>> It essentially gives the user control over parallelism by configuring the
>>>>>>>>>>>>>>> number of partitions for internal topics.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hope this answers your question.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Jul 17, 2019, at 9:02 PM, Sophie Blee-Goldman <
>>>>>>>>> [email protected] <mailto:[email protected]>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hey Levani,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks for the KIP! Can you clarify one thing for me -- for the
>>>>>>>>>>>>>>>> KStream#repartition signature taking a Repartitioned, will the topic be
>>>>>>>>>>>>>>>> auto-created by Streams (which seems to be the case for the signature
>>>>>>>>>>>>>>>> without a Repartitioned) or does it have to be pre-created? The wording in
>>>>>>>>>>>>>>>> the KIP makes it seem like one version of the method will auto-create
>>>>>>>>>>>>>>>> topics while the other will not.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Sophie
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:15 AM Levani Kokhreidze <
>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> One more bump about KIP-221
>>>>>>>>>>>>>>>>> (https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint)
>>>>>>>>>>>>>>>>> so it doesn’t get lost in the mailing list :)
>>>>>>>>>>>>>>>>> Would love to hear the community’s opinions/concerns about this KIP.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Jul 12, 2019, at 5:27 PM, Levani Kokhreidze <
>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Kind reminder about this KIP:
>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Jul 9, 2019, at 11:38 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> In order to move this KIP forward, I’ve updated the Confluence page with
>>>>>>>>>>>>>>>>>>> the new proposal:
>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-221%3A+Enhance+KStream+with+Connecting+Topic+Creation+and+Repartition+Hint
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> I’ve also filled in the “Rejected Alternatives” section.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Looking forward to discussing this KIP :)
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 1:08 PM, Levani Kokhreidze <
>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hello Matthias,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the feedback and ideas.
>>>>>>>>>>>>>>>>>>>> I like the idea of introducing a dedicated `Topic` class for topic
>>>>>>>>>>>>>>>>>>>> configuration for internal operators like `groupBy`.
>>>>>>>>>>>>>>>>>>>> Would be great to hear others’ opinions about this as well.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Jul 3, 2019, at 7:00 AM, Matthias J. Sax <
>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Levani,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks for picking up this KIP! And thanks for summarizing everything.
>>>>>>>>>>>>>>>>>>>>> Even if some points may have been discussed already (can't really
>>>>>>>>>>>>>>>>>>>>> remember), it's helpful to get a good summary to refresh the discussion.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> I think your reasoning makes sense. With regard to the distinction
>>>>>>>>>>>>>>>>>>>>> between operators that manage topics and operators that use user-created
>>>>>>>>>>>>>>>>>>>>> topics: Following this argument, it might indicate that leaving
>>>>>>>>>>>>>>>>>>>>> `through()` as-is (as an operator that uses user-defined topics) and
>>>>>>>>>>>>>>>>>>>>> introducing a new `repartition()` operator (an operator that manages
>>>>>>>>>>>>>>>>>>>>> topics itself) might be good. Otherwise, there is one operator
>>>>>>>>>>>>>>>>>>>>> `through()` that sometimes manages topics but sometimes not; a different
>>>>>>>>>>>>>>>>>>>>> name, i.e., a new operator, would make the distinction clearer.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> About adding `numOfPartitions` to `Grouped`: I am wondering if the same
>>>>>>>>>>>>>>>>>>>>> argument as for `Produced` applies and adding it is semantically
>>>>>>>>>>>>>>>>>>>>> questionable? Might be good to get opinions of others on this, too. I am
>>>>>>>>>>>>>>>>>>>>> not sure myself what solution I prefer atm.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> So far, KS uses configuration objects that allow configuring a certain
>>>>>>>>>>>>>>>>>>>>> "entity" like a consumer, producer, or store. If we assume that a topic is
>>>>>>>>>>>>>>>>>>>>> a similar entity, I am wondering if we should have a `Topic` class with a
>>>>>>>>>>>>>>>>>>>>> `Topic#withNumberOfPartitions()` method instead of a plain
>>>>>>>>>>>>>>>>>>>>> integer? This would allow us to add other configs, like replication
>>>>>>>>>>>>>>>>>>>>> factor, retention time, etc., easily, without the need to change the "main
>>>>>>>>>>>>>>>>>>>>> API".
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Just want to give some ideas. Not sure if I like them myself. :)
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 7/1/19 1:04 AM, Levani Kokhreidze wrote:
>>>>>>>>>>>>>>>>>>>>>> Actually, giving it more thought - maybe enhancing Produced with a num
>>>>>>>>>>>>>>>>>>>>>> of partitions configuration is not the best approach. Let me explain why:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 1) If we enhance the Produced class with this configuration, this will
>>>>>>>>>>>>>>>>>>>>>> also affect the KStream#to operation. Since KStream#to is the final sink of
>>>>>>>>>>>>>>>>>>>>>> the topology, it seems to me a reasonable assumption that the user needs to
>>>>>>>>>>>>>>>>>>>>>> manually create the sink topic in advance. And in that case, having a num of
>>>>>>>>>>>>>>>>>>>>>> partitions configuration doesn’t make much sense.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 2) Looking at the Produced class, based on the API contract, it seems like
>>>>>>>>>>>>>>>>>>>>>> Produced is designed to be something that is explicitly for the producer (key
>>>>>>>>>>>>>>>>>>>>>> serializer, value serializer, and partitioner are all producer-specific
>>>>>>>>>>>>>>>>>>>>>> configurations), whereas num of partitions is a topic level configuration. And I
>>>>>>>>>>>>>>>>>>>>>> don’t think mixing topic and producer level configurations together in one
>>>>>>>>>>>>>>>>>>>>>> class is a good approach.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 3) Looking at the KStream interface, it seems like the Produced parameter is
>>>>>>>>>>>>>>>>>>>>>> for operations that work with non-internal topics (i.e. topics that are not
>>>>>>>>>>>>>>>>>>>>>> created and managed internally by Kafka Streams), and I think we should leave
>>>>>>>>>>>>>>>>>>>>>> it as it is in that case.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Taking all these things into account, I think we should distinguish
>>>>>>>>>>>>>>>>>>>>>> between DSL operations where Kafka Streams should create and manage
>>>>>>>>>>>>>>>>>>>>>> internal topics (e.g. KStream#groupBy) and operations on topics that should be
>>>>>>>>>>>>>>>>>>>>>> created in advance (e.g. KStream#to).
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> To sum it up, I think adding a numPartitions configuration to Produced
>>>>>>>>>>>>>>>>>>>>>> will result in mixing topic and producer level configuration in one class,
>>>>>>>>>>>>>>>>>>>>>> and it’s gonna break existing API semantics.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Regarding making the topic name optional in KStream#through - I think the
>>>>>>>>>>>>>>>>>>>>>> underlying idea is very useful, and giving users the possibility to specify
>>>>>>>>>>>>>>>>>>>>>> the num of partitions there is even more useful :) Considering the arguments
>>>>>>>>>>>>>>>>>>>>>> against adding num of partitions to the Produced class, I see two options here:
>>>>>>>>>>>>>>>>>>>>>> 1) Add the following method overloads
>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of partitions
>>>>>>>>>>>>>>>>>>>>>> will be taken from the source topic
>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be auto-generated
>>>>>>>>>>>>>>>>>>>>>> with the specified num of partitions
>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V> produced) -
>>>>>>>>>>>>>>>>>>>>>> topic will be generated with the specified num of partitions
>>>>>>>>>>>>>>>>>>>>>> and configuration taken from the produced parameter.
>>>>>>>>>>>>>>>>>>>>>> 2) Leave KStream#through as it is and introduce a new method -
>>>>>>>>>>>>>>>>>>>>>> KStream#repartition (I think Matthias suggested this in one of the threads)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Considering everything mentioned above, I propose the following plan:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Option A:
>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to the Grouped class (as
>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>> 3) Add the following method overloads to KStream#through
>>>>>>>>>>>>>>>>>>>>>> * through() - topic will be auto-generated and num of partitions
>>>>>>>>>>>>>>>>>>>>>> will be taken from the source topic
>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions) - topic will be auto-generated
>>>>>>>>>>>>>>>>>>>>>> with the specified num of partitions
>>>>>>>>>>>>>>>>>>>>>> * through(final int numOfPartitions, final Produced<K, V> produced) -
>>>>>>>>>>>>>>>>>>>>>> topic will be generated with the specified num of partitions
>>>>>>>>>>>>>>>>>>>>>> and configuration taken from the produced parameter.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Option B:
>>>>>>>>>>>>>>>>>>>>>> 1) Leave Produced as it is
>>>>>>>>>>>>>>>>>>>>>> 2) Add num of partitions configuration to the Grouped class (as
>>>>>>>>>>>>>>>>>>>>>> mentioned in the KIP)
>>>>>>>>>>>>>>>>>>>>>> 3) Add a new operator KStream#repartition for creating and managing
>>>>>>>>>>>>>>>>>>>>>> internal repartition topics
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> P.S. I’m sorry if all of this was already discussed in the mailing
>>>>>>>>>>>>>>>>>>>>>> list, but I kinda got lost with all the threads that were about this KIP :(
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Jul 1, 2019, at 9:56 AM, Levani Kokhreidze <
>>>>>>>>>>>>>>>>> [email protected] <mailto:[email protected]>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I would like to resurrect the discussion around KIP-221. Going through
>>>>>>>>>>>>>>>>>>>>>>> the discussion thread, there seems to be agreement around the usefulness of
>>>>>>>>>>>>>>>>>>>>>>> this feature.
>>>>>>>>>>>>>>>>>>>>>>> Regarding the implementation, as far as I understood, the most
>>>>>>>>>>>>>>>>>>>>>>> optimal solution seems to me to be the following:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 1) Add two method overloads to the KStream#through method (essentially
>>>>>>>>>>>>>>>>>>>>>>> making the topic name optional)
>>>>>>>>>>>>>>>>>>>>>>> 2) Enhance the Produced class with a numOfPartitions configuration field.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Those two changes will allow DSL users to control parallelism and
>>>>>>>>>>>>>>>>>>>>>>> trigger re-partition without doing stateful operations.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> I will update the KIP with interface changes around KStream#through if
>>>>>>>>>>>>>>>>>>>>>>> these changes sound sensible.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>>>>>>>> Levani
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
>
