Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API

Matthias J. Sax Tue, 21 Mar 2017 16:34:03 -0700

@Guozhang:

I understand that you want to have `Topology` in the name. But it seems
that more people prefer not to have it (Jay, Ram, Michael [?], myself).


@Michael:

You seemed to agree with Jay about not exposing the `Topology` concept
in our main entry class (i.e., the current KStreamBuilder), so I took that
to mean you do not want `Topology` in the name either (I am a little
surprised by your last response, which goes in the opposite direction).

>     StreamsBuilder builder = new StreamsBuilder();
> 
>     // And here you'd define your...well, what actually?
>     // Ah right, you are composing a topology here, though you are not
> aware of it.

Yes. You are not aware of it -- that's the whole point -- we don't want to
put the Topology concept in focus...

Furthermore,

>>> So what are you building here with StreamsBuilder?  Streams (hint: No)?
>>> And what about tables -- is there a TableBuilder (hint: No)?

I am not sure this is much of a concern. In contrast to
`KStreamBuilder` (singular), which contains `KStream` and thus puts the
KStream concept in focus and degrades `KTable`, `StreamsBuilder`
(plural) refers to the "Streams API". IMHO, it does not put the focus on
KStream. It's just a builder from the Streams API -- you don't need to
worry about what you are building -- and you don't need to think about the
`Topology` concept (of course, you see that .build() returns a Topology).
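
To make this concrete, here is a minimal sketch. It uses toy stand-in
classes that only mimic the proposed names; they are not the real Kafka
Streams classes, and the method bodies and the "table-topic" name are
assumptions for illustration. User code only touches the builder, and
`Topology` shows up solely as the return type of build():

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BuilderNamingSketch {

    /** Toy stand-in for the logical plan; NOT the real Topology class. */
    static final class Topology {
        private final List<String> sources;

        Topology(List<String> sources) {
            // Copy so later builder calls cannot change an already built plan.
            this.sources = Collections.unmodifiableList(new ArrayList<>(sources));
        }

        List<String> sources() {
            return sources;
        }
    }

    /** Toy stand-in for the DSL entry class under discussion. */
    static final class StreamsBuilder {
        private final List<String> sources = new ArrayList<>();

        // A real DSL would return a stream or table handle here;
        // this sketch only records which sources were registered.
        void stream(String topic) {
            sources.add("stream:" + topic);
        }

        void table(String topic) {
            sources.add("table:" + topic);
        }

        Topology build() {
            return new Topology(sources);
        }
    }

    /** What user code would look like: no mention of Topology until build(). */
    static Topology example() {
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic");
        builder.table("table-topic");
        return builder.build();
    }

    public static void main(String[] args) {
        // prints [stream:input-topic, table:table-topic]
        System.out.println(example().sources());
    }
}
```

The builder registers both streams and tables, which is why a name that
singles out KStream seems less fitting than a plural "Streams" name.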


Personally, I see pros and cons for both `StreamsBuilder` and
`StreamsTopologyBuilder` and thus, I am fine either way. Maybe Jay and
Ram can follow up and share their thoughts?

It would also help a lot if other people voted for a name, too.



-Matthias



On 3/21/17 2:11 PM, Guozhang Wang wrote:
> Just to clarify, I did want to have the term `Topology` as part of the
> class name, for the reasons above. I'm not too worried about being
> consistent with the previous names, but I feel `XXTopologyBuilder` is
> better than `XXStreamsBuilder` since its build() function returns a
> Topology object.
> 
> 
> Guozhang
> 
> 
> On Mon, Mar 20, 2017 at 12:53 PM, Michael Noll <mich...@confluent.io> wrote:
> 
>> Hmm, I must admit I don't like this last update all too much.
>>
>> Basically we would have:
>>
>>     StreamsBuilder builder = new StreamsBuilder();
>>
>>     // And here you'd define your...well, what actually?
>>     // Ah right, you are composing a topology here, though you are not
>> aware of it.
>>
>>     KafkaStreams streams = new KafkaStreams(builder.build(),
>> streamsConfiguration);
>>
>> So what are you building here with StreamsBuilder?  Streams (hint: No)?
>> And what about tables -- is there a TableBuilder (hint: No)?
>>
>> I also interpret Guozhang's last response as that he'd prefer to have
>> "Topology" in the class/interface names.  I am aware that we shouldn't
>> necessarily use the status quo to make decisions about future changes, but
>> the very first concept we explain in the Kafka Streams documentation is
>> "Stream Processing Topology":
>> https://kafka.apache.org/0102/documentation/streams#streams_concepts
>>
>> -Michael
>>
>>
>>
>> On Mon, Mar 20, 2017 at 7:55 PM, Matthias J. Sax <matth...@confluent.io>
>> wrote:
>>
>>> \cc users list
>>>
>>>
>>> -------- Forwarded Message --------
>>> Subject: Re: [DISCUSS] KIP-120: Cleanup Kafka Streams builder API
>>> Date: Mon, 20 Mar 2017 11:51:01 -0700
>>> From: Matthias J. Sax <matth...@confluent.io>
>>> Organization: Confluent Inc
>>> To: dev@kafka.apache.org
>>>
>>> I want to push this discussion further.
>>>
>>> Guozhang's argument about "exposing" the Topology class is valid. It's a
>>> public class anyway, so it's not an issue. However, I think the question
>>> is not so much about exposing it but about "advertising" it (i.e., putting
>>> it into focus) or not at the DSL level.
>>>
>>>
>>> If I interpret the last replies correctly, it seems that we could agree
>>> on "StreamsBuilder" as name. I did update the KIP accordingly. Please
>>> correct me, if I got this wrong.
>>>
>>>
>>> If there are no other objections -- this naming discussion was the last
>>> open point so far -- I would like to start the VOTE thread.
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 3/14/17 2:37 PM, Guozhang Wang wrote:
>>>> I'd like to keep the term "Topology" inside the builder class since, as
>>>> Matthias mentioned, this builder#build() function returns a "Topology"
>>>> object, whose type is a public class anyways. Although you can argue to
>>> let
>>>> users always call
>>>>
>>>> "new KafkaStreams(builder.build())"
>>>>
>>>> I think there is still more benefit in exposing this concept.
>>>>
>>>>
>>>>
>>>> Guozhang
>>>>
>>>> On Tue, Mar 14, 2017 at 10:43 AM, Matthias J. Sax <
>> matth...@confluent.io
>>>>
>>>> wrote:
>>>>
>>>>> Thanks for your input Michael.
>>>>>
>>>>>>> - KafkaStreams as the new name for the builder that creates the
>>> logical
>>>>>>> plan, with e.g. `KafkaStreams.stream("input-topic")` and
>>>>>>> `KafkaStreams.table("input-topic")`.
>>>>>
>>>>> I don't think this is a good idea, for multiple reasons:
>>>>>
>>>>> (1) We would reuse a name for a completely different purpose. This is the
>>>>> same argument as for not renaming KStreamBuilder to TopologyBuilder. The
>>>>> confusion would just be too large.
>>>>>
>>>>> So if we would start from scratch, it might be ok to do so, but now we
>>>>> cannot make this move, IMHO.
>>>>>
>>>>> Also a clarification question: are you suggesting static methods
>>>>> #stream and #table? I am not sure this would work.
>>>>> (Or was your code snippet just a simplification?)
>>>>>
>>>>>
>>>>> (2) Kafka Streams is basically a "processing client" next to the consumer
>>>>> and producer clients. Thus, the name KafkaStreams aligns with the naming
>>>>> scheme of KafkaConsumer and KafkaProducer. I am not sure it would be
>>>>> a good choice to "break" this naming scheme.
>>>>>
>>>>> Btw: this is also the reason, why we have KafkaStreams#close() -- and
>>>>> not KafkaStreams#stop() -- because #close() aligns with consumer and
>>>>> producer client.
>>>>>
>>>>>
>>>>> (3) One more argument against using KafkaStreams as the DSL entry class
>>>>> would be that it would need to create a Topology that can be given to the
>>>>> "runner/processing-client". Thus the pattern would be
>>>>>
>>>>>> Topology topology = streams.build();
>>>>>> KafkaStreamsRunner runner = new KafkaStreamsRunner(..., topology);
>>>>>
>>>>> (or of course as a one-liner).
>>>>>
>>>>>
>>>>>
>>>>> On the other hand, there was the idea (that we intentionally excluded
>>>>> from the KIP), to change the "client instantiation" pattern.
>>>>>
>>>>> Right now, a new client is actively instantiated (i.e., by calling "new")
>>>>> and the topology is provided as a constructor argument. However,
>>>>> especially for the DSL (not sure if it would make sense for PAPI), the DSL
>>>>> builder could create the client for the user.
>>>>>
>>>>> Something like this:
>>>>>
>>>>>> KStreamBuilder builder = new KStreamBuilder();
>>>>>> builder.whatever() // use the builder
>>>>>>
>>>>>> StreamsConfig config = ....
>>>>>> KafkaStreams streams = builder.getKafkaStreams(config);
>>>>>
>>>>> If we change the pattern like this, the notion of the "DSL builder" would
>>>>> change, as it does not create a topology anymore, but it creates the
>>>>> "processing client". This would address Jay's concern about "not
>>>>> exposing concepts users don't need to understand" and would not require
>>>>> including the word "Topology" in the DSL builder class name, because
>>>>> the builder does not build a Topology anymore.
>>>>>
>>>>> I just used the first names that came to mind -- I did not think
>>>>> about good names. This is just to discuss the pattern.
>>>>>
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 3/14/17 3:36 AM, Michael Noll wrote:
>>>>>> I see Jay's point, and I agree with much of it -- notably about being
>>>>>> careful which concepts we do and do not expose, depending on which
>> user
>>>>>> group / user type is affected.  That said, I'm not sure yet whether
>> or
>>>>> not
>>>>>> we should get rid of "Topology" (or a similar term) in the DSL.
>>>>>>
>>>>>> For what it's worth, here's how related technologies define/name
>> their
>>>>>> "topologies" and "builders".  Note that, in all cases, it's about
>>>>>> constructing a logical processing plan, which then is being
>>> executed/run.
>>>>>>
>>>>>> - `Pipeline` (Google Dataflow/Apache Beam)
>>>>>>     - To add a source you first instantiate the Source (e.g.
>>>>>> `TextIO.Read.from("gs://some/inputData.txt")`),
>>>>>>       then attach it to your processing plan via
>>>>> `Pipeline#apply(<source>)`.
>>>>>>       This setup is a bit different to our DSL because in our DSL the
>>>>>> builder does both, i.e.
>>>>>>       instantiating + auto-attaching to itself.
>>>>>>     - To execute the processing plan you call `Pipeline#run()`.
>>>>>> - `StreamingContext` (Spark): This setup is similar to our DSL.
>>>>>>     - To add a source you call e.g.
>>>>>> `StreamingContext#socketTextStream("localhost", 9999)`.
>>>>>>     - To execute the processing plan you call
>>>>>> `StreamingContext#start()`.
>>>>>> - `StreamExecutionEnvironment` (Flink): This setup is similar to our
>>> DSL.
>>>>>>     - To add a source you call e.g.
>>>>>> `StreamExecutionEnvironment#socketTextStream("localhost", 9999)`.
>>>>>>     - To execute the processing plan you call
>>>>>> `StreamExecutionEnvironment#execute()`.
>>>>>> - `Graph`/`Flow` (Akka Streams), as a result of composing Sources (~
>>>>>> `KStreamBuilder.stream()`) and Sinks (~ `KStream#to()`)
>>>>>>   into Flows, which are [Runnable]Graphs.
>>>>>>     - You instantiate a Source directly, and then compose the Source
>>> with
>>>>>> Sinks to create a RunnableGraph:
>>>>>>       see signature `Source#to[Mat2](sink: Graph[SinkShape[Out],
>>> Mat2]):
>>>>>> RunnableGraph[Mat]`.
>>>>>>     - To execute the processing plan you call `RunnableGraph#run()`.
>>>>>>
>>>>>> In our DSL, in comparison, we do:
>>>>>>
>>>>>> - `KStreamBuilder` (Kafka Streams API)
>>>>>>     - To add a source you call e.g. `KStreamBuilder#stream("input-
>>>>> topic")`.
>>>>>>     - To execute the processing plan you create a `KafkaStreams`
>>> instance
>>>>>> from `KStreamBuilder`
>>>>>>       (where the builder will instantiate the topology = processing
>>> plan
>>>>> to
>>>>>> be executed), and then
>>>>>>       call `KafkaStreams#start()`.  Think of `KafkaStreams` as our
>>>>> runner.
>>>>>>
>>>>>> First, I agree with the sentiment that the current name of
>>>>> `KStreamBuilder`
>>>>>> isn't great (which is why we're having this discussion).  Also, that
>>>>>> finding a good name is tricky. ;-)
>>>>>>
>>>>>> Second, even though I agree with many of Jay's points I'm not sure
>>>>> whether
>>>>>> I like the `StreamsBuilder` suggestion (i.e. any name that does not
>>>>> include
>>>>>> "topology" or a similar term) that much more.  It still doesn't
>>> describe
>>>>>> what that class actually does, and what the difference to
>>> `KafkaStreams`
>>>>>> is.  IMHO, the point of `KStreamBuilder` is that it lets you build a
>>>>>> logical plan (what we call "topology"), and `KafkaStreams` is the
>> thing
>>>>>> that executes that plan.  I'm not yet convinced that abstracting
>> these
>>>>> two
>>>>>> points away from the user is a good idea if the argument is that it's
>>>>>> potentially confusing to beginners (a claim which I am not sure is
>>>>> actually
>>>>>> true).
>>>>>>
>>>>>> That said, if we rather favor "good-sounding but perhaps less
>>> technically
>>>>>> correct names", I'd argue we should not even use something like
>>>>> "Builder".
>>>>>> We could, for example, also pick the following names:
>>>>>>
>>>>>> - KafkaStreams as the new name for the builder that creates the
>> logical
>>>>>> plan, with e.g. `KafkaStreams.stream("input-topic")` and
>>>>>> `KafkaStreams.table("input-topic")`.
>>>>>> - KafkaStreamsRunner as the new name for the executioner of the plan,
>>>>> with
>>>>>> `KafkaStreamsRunner(KafkaStreams).run()`.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 14, 2017 at 5:56 AM, Sriram Subramanian <
>> r...@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> StreamsBuilder would be my vote.
>>>>>>>
>>>>>>>> On Mar 13, 2017, at 9:42 PM, Jay Kreps <j...@confluent.io> wrote:
>>>>>>>>
>>>>>>>> Hey Matthias,
>>>>>>>>
>>>>>>>> Makes sense. I'm advocating more for removing the word topology than
>>>>>>>> for any particular new replacement.
>>>>>>>>
>>>>>>>> -Jay
>>>>>>>>
>>>>>>>> On Mon, Mar 13, 2017 at 12:30 PM, Matthias J. Sax <
>>>>> matth...@confluent.io
>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Jay,
>>>>>>>>>
>>>>>>>>> thanks for your feedback
>>>>>>>>>
>>>>>>>>>> What if instead we called it KStreamsBuilder?
>>>>>>>>>
>>>>>>>>> That's the current name and I personally think it's not the best one.
>>>>>>>>> The main reason why I don't like KStreamsBuilder is that we have the
>>>>>>>>> concepts of KStreams and KTables, and the builder creates both. However,
>>>>>>>>> the name puts the focus on KStream and devalues KTable.
>>>>>>>>>
>>>>>>>>> I understand your argument, and I am personally open to removing the
>>>>>>>>> "Topology" part and naming it "StreamsBuilder". Not sure what others
>>>>>>>>> think about this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> About the Processor API: I like the idea in general, but I think it's out
>>>>>>>>> of scope for this KIP. KIP-120 focuses on removing leaking
>>>>>>>>> internal APIs and cleaning up how our API reflects some concepts.
>>>>>>>>>
>>>>>>>>> However, I added your idea to the API discussion wiki page and we can take
>>>>>>>>> it from there:
>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Discussions
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 3/13/17 11:52 AM, Jay Kreps wrote:
>>>>>>>>>> Two things:
>>>>>>>>>>
>>>>>>>>>>   1. This is a minor thing but the proposed new name for
>>>>> KStreamBuilder
>>>>>>>>>>   is StreamsTopologyBuilder. I actually think we should not put
>>>>>>>>> topology in
>>>>>>>>>>   the name as topology is not a concept you need to understand at
>>> the
>>>>>>>>>>   kstreams layer right now. I'd think of three categories of
>>>>> concepts:
>>>>>>>>> (1)
>>>>>>>>>>   concepts you need to understand to get going even for a simple
>>>>>>>>> example, (2)
>>>>>>>>>>   concepts you need to understand to operate and debug a real
>>>>>>>>> production app,
>>>>>>>>>>   (3) concepts we truly abstract and you don't need to ever
>>>>> understand.
>>>>>>>>> I
>>>>>>>>>>   think in the kstream layer topologies are currently category
>> (2),
>>>>> and
>>>>>>>>> this
>>>>>>>>>>   is where they belong. By introducing the name in even the
>>> simplest
>>>>>>>>> example
>>>>>>>>>>   it means the user has to go read about topologies to really
>>>>> understand
>>>>>>>>> even
>>>>>>>>>>   this simple snippet. What if instead we called it
>>> KStreamsBuilder?
>>>>>>>>>>   2. For the processor api, I think this api is mostly not for
>> end
>>>>>>>>> users.
>>>>>>>>>>   However, there are a couple of cases where it might make sense to
>>> expose
>>>>>>>>> it. I
>>>>>>>>>>   think users coming from Samza, or JMS's MessageListener
>>>>>>>>>>   (https://docs.oracle.com/javaee/7/api/javax/jms/MessageListener.html)
>>>>>>>>>>   understand a simple callback interface for message processing.
>> In
>>>>>>>>> fact,
>>>>>>>>>>   people often ask why Kafka's consumer doesn't provide such an
>>>>>>>>> interface.
>>>>>>>>>>   I'd argue we do, it's KafkaStreams. The only issue is that the
>>>>>>>>> processor
>>>>>>>>>>   API documentation is a bit scary for a person implementing this
>>>>> type
>>>>>>>>> of
>>>>>>>>>>   api. My observation is that people using this style of API
>> don't
>>>>> do a
>>>>>>>>> lot
>>>>>>>>>>   of cross-message operations, they just do single message
>>> operations
>>>>>>>>> and use
>>>>>>>>>>   a database for anything that spans messages. They also don't
>>> factor
>>>>>>>>> their
>>>>>>>>>>   code into many MessageListeners and compose them, they just
>> have
>>>>> one
>>>>>>>>>>   listener that has the complete handling logic. Say I am a user
>>> who
>>>>>>>>> wants to
>>>>>>>>>>   implement a single Processor in this style. Do we have an easy
>>> way
>>>>> to
>>>>>>>>> do
>>>>>>>>>>   that today (either with the .transform/.process methods in
>>> kstreams
>>>>>>>>> or with
>>>>>>>>>>   the topology apis)? Is there anything we can do in the way of
>>>>> trivial
>>>>>>>>>>   helper code to make this better? Also, how can we explain that
>>>>>>>>> pattern to
>>>>>>>>>>   people? I think currently we have pretty in-depth docs on our
>>> apis
>>>>>>>>> but I
>>>>>>>>>>   suspect a person trying to figure out how to implement a simple
>>>>>>>>> callback
>>>>>>>>>>   might get a bit lost trying to figure out how to wire it up. A
>>>>> simple
>>>>>>>>> five
>>>>>>>>>>   line example in the docs would probably help a lot. Not sure if
>>>>> this
>>>>>>>>> is
>>>>>>>>>>   best addressed in this KIP or is a side comment.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>>
>>>>>>>>>> -Jay
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 3, 2017 at 3:33 PM, Matthias J. Sax <
>>>>> matth...@confluent.io
>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I prepared a KIP to clean up parts of Kafka's Streaming API.
>>>>>>>>>>>
>>>>>>>>>>> Please have a look here:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-120%3A+Cleanup+Kafka+Streams+builder+API
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to your feedback!
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
> 
> 
>
