I just noticed there is a limitation in the FLIP:

> Generally speaking, the underlying topic of the upsert-kafka source must
be compacted. Besides, the underlying topic must have all the data with the
same key in the same partition, otherwise, the result will be wrong.

According to my understanding, this is not accurate: compaction is an
optimization, not a limitation. It depends on the user.

I don't want to stop the vote, I just want to make this clear.

Best,
Jingsong
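Jingsong's point, that compaction changes what is stored but not what the upsert source computes, can be sketched in a few lines of Python (illustrative only; the helper names and sample data are made up, this is not Flink or Kafka code):

```python
# Sketch: replaying the full log or the compacted log through the upsert
# interpretation yields the same table, so compaction only reduces the
# amount of data to replay. Records are (key, value); None is a tombstone.

def compact(log):
    """Model Kafka log compaction: keep only the latest record per key."""
    latest = {}
    for key, value in log:
        latest[key] = value  # later records overwrite earlier ones
    # Tombstones are eventually dropped by compaction as well.
    return [(k, v) for k, v in latest.items() if v is not None]

def materialize(log):
    """Model the upsert source: replay the log into a keyed table."""
    table = {}
    for key, value in log:
        if value is None:
            table.pop(key, None)   # DELETE
        else:
            table[key] = value     # INSERT or UPDATE
    return table

full_log = [("u1", "silver"), ("u2", "gold"), ("u1", "gold"),
            ("u3", "bronze"), ("u3", None)]

# Same final table either way; only the replay cost differs.
assert materialize(full_log) == materialize(compact(full_log))
```

The same reasoning does not hold for the "same key in the same partition" requirement, which is about correctness, not optimization.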

On Fri, Oct 23, 2020 at 3:16 PM Timo Walther <twal...@apache.org> wrote:

> +1 for voting
>
> Regards,
> Timo
>
> On 23.10.20 09:07, Jark Wu wrote:
> > Thanks Shengkai!
> >
> > +1 to start voting.
> >
> > Best,
> > Jark
> >
> > On Fri, 23 Oct 2020 at 15:02, Shengkai Fang <fskm...@gmail.com> wrote:
> >
> >> Add one more message, I have already updated the FLIP[1].
> >>
> >> [1]
> >>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-149%3A+Introduce+the+upsert-kafka+Connector
> >>
> >> Shengkai Fang <fskm...@gmail.com> 于2020年10月23日周五 下午2:55写道:
> >>
> >>> Hi, all.
> >>> It seems we have reached a consensus on the FLIP. If no one has other
> >>> objections, I would like to start the vote for FLIP-149.
> >>>
> >>> Best,
> >>> Shengkai
> >>>
> >>> Jingsong Li <jingsongl...@gmail.com> 于2020年10月23日周五 下午2:25写道:
> >>>
> >>>> Thanks for the explanation,
> >>>>
> >>>> I am OK with `upsert`. Yes, its concept has been accepted by many
> >> systems.
> >>>>
> >>>> Best,
> >>>> Jingsong
> >>>>
> >>>> On Fri, Oct 23, 2020 at 12:38 PM Jark Wu <imj...@gmail.com> wrote:
> >>>>
> >>>>> Hi Timo,
> >>>>>
> >>>>> I have some concerns about `kafka-cdc`,
> >>>>> 1) cdc is an abbreviation of Change Data Capture which is commonly
> >> used
> >>>> for
> >>>>> databases, not for message queues.
> >>>>> 2) usually, cdc produces full content of changelog, including
> >>>>> UPDATE_BEFORE, however "upsert kafka" doesn't
> >>>>> 3) `kafka-cdc` sounds like native support for the `debezium-json`
> >>>> format,
> >>>>> however, it is not, and we don't even want
> >>>>>     "upsert kafka" to support "debezium-json"
> >>>>>
> >>>>>
> >>>>> Hi Jingsong,
> >>>>>
> >>>>> I think the terminology of "upsert" is fine, because Kafka also uses
> >>>>> "upsert" to define such behavior in their official documentation [1]:
> >>>>>
> >>>>>> a data record in a changelog stream is interpreted as an UPSERT aka
> >>>>> INSERT/UPDATE
> >>>>>
> >>>>> Materialize uses the "UPSERT" keyword to define such behavior too
> [2].
> >>>>> Users have been requesting such a feature using "upsert kafka"
> >>>> terminology in
> >>>>> user mailing lists [3][4].
> >>>>> Many other systems support "UPSERT" statement natively, such as
> impala
> >>>> [5],
> >>>>> SAP [6], Phoenix [7], Oracle NoSQL [8], etc..
> >>>>>
> >>>>> Therefore, I think we don't need to be afraid of introducing "upsert"
> >>>>> terminology; it is widely accepted by users.
> >>>>>
> >>>>> Best,
> >>>>> Jark
> >>>>>
> >>>>>
> >>>>> [1]:
> >>>>>
> >>>>>
> >>>>
> >>
> https://kafka.apache.org/20/documentation/streams/developer-guide/dsl-api.html#streams_concepts_ktable
> >>>>> [2]:
> >>>>>
> >>>>>
> >>>>
> >>
> https://materialize.io/docs/sql/create-source/text-kafka/#upsert-on-a-kafka-topic
> >>>>> [3]:
> >>>>>
> >>>>>
> >>>>
> >>
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/SQL-materialized-upsert-tables-td18482.html#a18503
> >>>>> [4]:
> >>>>>
> >>>>>
> >>>>
> >>
> http://apache-flink.147419.n8.nabble.com/Kafka-Sink-AppendStreamTableSink-doesn-t-support-consuming-update-changes-td5959.html
> >>>>> [5]:
> >>>> https://impala.apache.org/docs/build/html/topics/impala_upsert.html
> >>>>> [6]:
> >>>>>
> >>>>>
> >>>>
> >>
> https://help.sap.com/viewer/7c78579ce9b14a669c1f3295b0d8ca16/Cloud/en-US/ea8b6773be584203bcd99da76844c5ed.html
> >>>>> [7]: https://phoenix.apache.org/atomic_upsert.html
> >>>>> [8]:
> >>>>>
> >>>>>
> >>>>
> >>
> https://docs.oracle.com/en/database/other-databases/nosql-database/18.3/sqlfornosql/adding-table-rows-using-insert-and-upsert-statements.html
> >>>>>
> >>>>> On Fri, 23 Oct 2020 at 10:36, Jingsong Li <jingsongl...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> The `kafka-cdc` looks good to me.
> >>>>>> We can even give an option to indicate whether to turn on compaction,
> >>>> because
> >>>>>> compaction is just an optimization?
> >>>>>>
> >>>>>> - ktable makes me think of KSQL.
> >>>>>> - kafka-compacted: it is not just compacted; more than that, it still
> >>>> has
> >>>>>> the ability of CDC
> >>>>>> - upsert-kafka: upsert is back, and I don't really want to see it
> >>>> again
> >>>>>> since we have CDC
> >>>>>>
> >>>>>> Best,
> >>>>>> Jingsong
> >>>>>>
> >>>>>> On Fri, Oct 23, 2020 at 2:21 AM Timo Walther <twal...@apache.org>
> >>>> wrote:
> >>>>>>
> >>>>>>> Hi Jark,
> >>>>>>>
> >>>>>>> I would be fine with `connector=upsert-kafka`. Another idea would
> >>>> be to
> >>>>>>> align the name to other available Flink connectors [1]:
> >>>>>>>
> >>>>>>> `connector=kafka-cdc`.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Timo
> >>>>>>>
> >>>>>>> [1] https://github.com/ververica/flink-cdc-connectors
> >>>>>>>
> >>>>>>> On 22.10.20 17:17, Jark Wu wrote:
> >>>>>>>> Another name is "connector=upsert-kafka"; I think this can solve
> >>>>> Timo's
> >>>>>>>> concern on the "compacted" word.
> >>>>>>>>
> >>>>>>>> Materialize also uses "ENVELOPE UPSERT" [1] keyword to identify
> >>>> such
> >>>>>>> kafka
> >>>>>>>> sources.
> >>>>>>>> I think "upsert" is a well-known terminology widely used in many
> >>>>>> systems
> >>>>>>>> and matches the
> >>>>>>>>    behavior of how we handle the kafka messages.
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>> [1]:
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> https://materialize.io/docs/sql/create-source/text-kafka/#upsert-on-a-kafka-topic
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Thu, 22 Oct 2020 at 22:53, Kurt Young <ykt...@gmail.com>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>>> Good validation messages can't solve the broken user
> >> experience,
> >>>>>>> especially
> >>>>>>>>> since
> >>>>>>>>> such an update mode option will implicitly make half of the current
> >>>> kafka
> >>>>>>> options
> >>>>>>>>> invalid or
> >>>>>>>>> meaningless.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Kurt
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Oct 22, 2020 at 10:31 PM Jark Wu <imj...@gmail.com>
> >>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Timo, Seth,
> >>>>>>>>>>
> >>>>>>>>>> The default value "inserting" of "mode" might not be suitable,
> >>>>>>>>>> because "debezium-json" emits changelog messages which include
> >>>>>> updates.
> >>>>>>>>>>
> >>>>>>>>>> On Thu, 22 Oct 2020 at 22:10, Seth Wiesman <
> >> s...@ververica.com>
> >>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> +1 for supporting upsert results into Kafka.
> >>>>>>>>>>>
> >>>>>>>>>>> I have no comments on the implementation details.
> >>>>>>>>>>>
> >>>>>>>>>>> As far as configuration goes, I tend to favor Timo's option
> >>>> where
> >>>>> we
> >>>>>>>>> add
> >>>>>>>>>> a
> >>>>>>>>>>> "mode" property to the existing Kafka table with default
> >> value
> >>>>>>>>>> "inserting".
> >>>>>>>>>>> If the mode is set to "updating" then the validation changes
> >> to
> >>>>> the
> >>>>>>> new
> >>>>>>>>>>> requirements. I personally find it more intuitive than a
> >>>> separate
> >>>>>>>>>> connector; my fear is users won't understand it's the same
> >>>> physical
> >>>>>>>>> kafka
> >>>>>>>>>>> sink under the hood and it will lead to other confusion like
> >>>> does
> >>>>> it
> >>>>>>>>>> offer
> >>>>>>>>>>> the same persistence guarantees? I think we are capable of
> >>>> adding
> >>>>>> good
> >>>>>>>>>>> validation messaging that solves Jark's and Kurt's concerns.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Oct 22, 2020 at 8:51 AM Timo Walther <
> >>>> twal...@apache.org>
> >>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Jark,
> >>>>>>>>>>>>
> >>>>>>>>>>>> "calling it "kafka-compacted" can even remind users to
> >> enable
> >>>> log
> >>>>>>>>>>>> compaction"
> >>>>>>>>>>>>
> >>>>>>>>>>>> But sometimes users like to store a lineage of changes in
> >>>> their
> >>>>>>>>> topics.
> >>>>>>>>>>>> Independent of any ktable/kstream interpretation.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll let the majority decide on this topic to not further block
> >>>> this
> >>>>>>>>>>>> effort. But we might find a better name like:
> >>>>>>>>>>>>
> >>>>>>>>>>>> connector = kafka
> >>>>>>>>>>>> mode = updating/inserting
> >>>>>>>>>>>>
> >>>>>>>>>>>> OR
> >>>>>>>>>>>>
> >>>>>>>>>>>> connector = kafka-updating
> >>>>>>>>>>>>
> >>>>>>>>>>>> ...
> >>>>>>>>>>>>
> >>>>>>>>>>>> Regards,
> >>>>>>>>>>>> Timo
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 22.10.20 15:24, Jark Wu wrote:
> >>>>>>>>>>>>> Hi Timo,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for your opinions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1) Implementation
> >>>>>>>>>>>>> We will have a stateful operator to generate INSERT and
> >>>>>>>>>> UPDATE_BEFORE.
> >>>>>>>>>>>>> This operator is keyby-ed (primary key as the shuffle key)
> >>>> after
> >>>>>>>>> the
> >>>>>>>>>>>> source
> >>>>>>>>>>>>> operator.
> >>>>>>>>>>>>> The implementation of this operator is very similar to the
> >>>>>> existing
> >>>>>>>>>>>>> `DeduplicateKeepLastRowFunction`.
> >>>>>>>>>>>>> The operator will register a value state using the primary
> >>>> key
> >>>>>>>>> fields
> >>>>>>>>>>> as
> >>>>>>>>>>>>> keys.
> >>>>>>>>>>>>> When the value state is empty under current key, we will
> >> emit
> >>>>>>>>> INSERT
> >>>>>>>>>>> for
> >>>>>>>>>>>>> the input row.
> >>>>>>>>>>>>> When the value state is not empty under current key, we
> >> will
> >>>>> emit
> >>>>>>>>>>>>> UPDATE_BEFORE using the row in state,
> >>>>>>>>>>>>> and emit UPDATE_AFTER using the input row.
> >>>>>>>>>>>>> When the input row is DELETE, we will clear the state and emit a
> >>>>> DELETE
> >>>>>>>>>> row.
> >>>>>>>>>>>>>
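The state machine Jark describes above can be sketched in Python (an illustrative model only, not the actual DeduplicateKeepLastRowFunction; a plain dict stands in for Flink's keyed value state and the names are made up):

```python
# Sketch of the changelog-generating operator: per primary key, an empty
# state slot yields INSERT, an occupied slot yields UPDATE_BEFORE (old row)
# plus UPDATE_AFTER (new row), and a DELETE clears the slot and emits the
# previously stored row so all columns are filled.

def process(state, key, row, is_delete=False):
    """Return the changelog records emitted for one input row."""
    if is_delete:
        old = state.pop(key, None)
        return [("DELETE", old)] if old is not None else []
    if key not in state:
        state[key] = row
        return [("INSERT", row)]
    old = state[key]
    state[key] = row
    return [("UPDATE_BEFORE", old), ("UPDATE_AFTER", row)]

state = {}
assert process(state, "u1", ("u1", "silver")) == [("INSERT", ("u1", "silver"))]
assert process(state, "u1", ("u1", "gold")) == [
    ("UPDATE_BEFORE", ("u1", "silver")),
    ("UPDATE_AFTER", ("u1", "gold")),
]
assert process(state, "u1", None, is_delete=True) == [("DELETE", ("u1", "gold"))]
```

This also makes the state cost visible: one retained row per distinct primary key, as in the existing deduplication operator.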
> >>>>>>>>>>>>> 2) new option vs new connector
> >>>>>>>>>>>>>> We recently simplified the table options to a minimum
> >>>> amount of
> >>>>>>>>>>>>> characters to be as concise as possible in the DDL.
> >>>>>>>>>>>>> I think this is the reason why we want to introduce a new
> >>>>>>>>> connector,
> >>>>>>>>>>>>> because we can simplify the options in DDL.
> >>>>>>>>>>>>> For example, if using a new option, the DDL may look like
> >>>> this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> CREATE TABLE users (
> >>>>>>>>>>>>>      user_id BIGINT,
> >>>>>>>>>>>>>      user_name STRING,
> >>>>>>>>>>>>>      user_level STRING,
> >>>>>>>>>>>>>      region STRING,
> >>>>>>>>>>>>>      PRIMARY KEY (user_id) NOT ENFORCED
> >>>>>>>>>>>>> ) WITH (
> >>>>>>>>>>>>>      'connector' = 'kafka',
> >>>>>>>>>>>>>      'model' = 'table',
> >>>>>>>>>>>>>      'topic' = 'pageviews_per_region',
> >>>>>>>>>>>>>      'properties.bootstrap.servers' = '...',
> >>>>>>>>>>>>>      'properties.group.id' = 'testGroup',
> >>>>>>>>>>>>>      'scan.startup.mode' = 'earliest',
> >>>>>>>>>>>>>      'key.format' = 'csv',
> >>>>>>>>>>>>>      'key.fields' = 'user_id',
> >>>>>>>>>>>>>      'value.format' = 'avro',
> >>>>>>>>>>>>>      'sink.partitioner' = 'hash'
> >>>>>>>>>>>>> );
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> If using a new connector, we can have a different default
> >>>> value
> >>>>>> for
> >>>>>>>>>> the
> >>>>>>>>>>>>> options and remove unnecessary options,
> >>>>>>>>>>>>> the DDL can look like this which is much more concise:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> CREATE TABLE pageviews_per_region (
> >>>>>>>>>>>>>      user_id BIGINT,
> >>>>>>>>>>>>>      user_name STRING,
> >>>>>>>>>>>>>      user_level STRING,
> >>>>>>>>>>>>>      region STRING,
> >>>>>>>>>>>>>      PRIMARY KEY (user_id) NOT ENFORCED
> >>>>>>>>>>>>> ) WITH (
> >>>>>>>>>>>>>      'connector' = 'kafka-compacted',
> >>>>>>>>>>>>>      'topic' = 'pageviews_per_region',
> >>>>>>>>>>>>>      'properties.bootstrap.servers' = '...',
> >>>>>>>>>>>>>      'key.format' = 'csv',
> >>>>>>>>>>>>>      'value.format' = 'avro'
> >>>>>>>>>>>>> );
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> When people read `connector=kafka-compacted` they might
> >> not
> >>>>> know
> >>>>>>>>>> that
> >>>>>>>>>>> it
> >>>>>>>>>>>>>> has ktable semantics. You don't need to enable log
> >>>> compaction
> >>>>> in
> >>>>>>>>>> order
> >>>>>>>>>>>>>> to use a KTable as far as I know.
> >>>>>>>>>>>>> We don't need to let users know it has ktable semantics, as
> >>>>>>>>>> Konstantin
> >>>>>>>>>>>>> mentioned this may carry more implicit
> >>>>>>>>>>>>> meaning than we want to imply here. I agree users don't
> >> need
> >>>> to
> >>>>>>>>>> enable
> >>>>>>>>>>>> log
> >>>>>>>>>>>>> compaction, but from the production perspective,
> >>>>>>>>>>>>> log compaction should always be enabled if it is used for
> >> this
> >>>>>>>>>> purpose.
> >>>>>>>>>>>>> Calling it "kafka-compacted" can even remind users to
> >> enable
> >>>> log
> >>>>>>>>>>>> compaction.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I don't agree to introduce "model = table/stream" option,
> >> or
> >>>>>>>>>>>>> "connector=kafka-table",
> >>>>>>>>>>>>> because this means we are introducing Table vs Stream
> >> concept
> >>>>> from
> >>>>>>>>>>> KSQL.
> >>>>>>>>>>>>> However, we don't have such top-level concept in Flink SQL
> >>>> now,
> >>>>>>>>> this
> >>>>>>>>>>> will
> >>>>>>>>>>>>> further confuse users.
> >>>>>>>>>>>> In Flink SQL, everything is a STREAM; the differences
> >> are
> >>>>>>>>> whether
> >>>>>>>>>> it
> >>>>>>>>>>>> is
> >>>>>>>>>>>>> bounded or unbounded,
> >>>>>>>>>>>>> and whether it is insert-only or changelog.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best,
> >>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, 22 Oct 2020 at 20:39, Timo Walther <
> >>>> twal...@apache.org>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Shengkai, Hi Jark,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> thanks for this great proposal. It is time to finally
> >>>> connect
> >>>>> the
> >>>>>>>>>>>>>> changelog processor with a compacted Kafka topic.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> "The operator will produce INSERT rows, or additionally
> >>>>> generate
> >>>>>>>>>>>>>> UPDATE_BEFORE rows for the previous image, or produce
> >> DELETE
> >>>>> rows
> >>>>>>>>>> with
> >>>>>>>>>>>>>> all columns filled with values."
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Could you elaborate a bit on the implementation details in
> >>>> the
> >>>>>>>>> FLIP?
> >>>>>>>>>>> How
> >>>>>>>>>>>>>> are UPDATE_BEFOREs generated? How much state is
> >>>> required to
> >>>>>>>>>>> perform
> >>>>>>>>>>>>>> this operation?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>     From a conceptual and semantic point of view, I'm
> >> fine
> >>>>> with
> >>>>>>>>> the
> >>>>>>>>>>>>>> proposal. But I would like to share my opinion about how
> >> we
> >>>>>> expose
> >>>>>>>>>>> this
> >>>>>>>>>>>>>> feature:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> ktable vs kafka-compacted
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm against having an additional connector like `ktable`
> >> or
> >>>>>>>>>>>>>> `kafka-compacted`. We recently simplified the table
> >> options
> >>>> to
> >>>>> a
> >>>>>>>>>>> minimum
> >>>>>>>>>>>>>> amount of characters to be as concise as possible in the
> >>>> DDL.
> >>>>>>>>>>> Therefore,
> >>>>>>>>>>>>>> I would keep the `connector=kafka` and introduce an
> >>>> additional
> >>>>>>>>>> option.
> >>>>>>>>>>>>>> Because a user wants to read "from Kafka". And the "how"
> >>>> should
> >>>>>> be
> >>>>>>>>>>>>>> determined in the lower options.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> When people read `connector=ktable` they might not know
> >> that
> >>>>> this
> >>>>>>>>> is
> >>>>>>>>>>>>>> Kafka. Or they wonder where `kstream` is?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> When people read `connector=kafka-compacted` they might
> >> not
> >>>>> know
> >>>>>>>>>> that
> >>>>>>>>>>> it
> >>>>>>>>>>>>>> has ktable semantics. You don't need to enable log
> >>>> compaction
> >>>>> in
> >>>>>>>>>> order
> >>>>>>>>>>>>>> to use a KTable as far as I know. Log compaction and table
> >>>>>>>>> semantics
> >>>>>>>>>>> are
> >>>>>>>>>>>>>> orthogonal topics.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In the end we will need 3 types of information when
> >>>> declaring a
> >>>>>>>>>> Kafka
> >>>>>>>>>>>>>> connector:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> CREATE TABLE ... WITH (
> >>>>>>>>>>>>>>       connector=kafka        -- Some information about the
> >>>>>> connector
> >>>>>>>>>>>>>>       end-offset = XXXX      -- Some information about the
> >>>>>>>>> boundedness
> >>>>>>>>>>>>>>       model = table/stream   -- Some information about
> >>>>>>>>> interpretation
> >>>>>>>>>>>>>> )
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We can still apply all the constraints mentioned in the
> >>>> FLIP.
> >>>>>> When
> >>>>>>>>>>>>>> `model` is set to `table`.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What do you think?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Timo
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 21.10.20 14:19, Jark Wu wrote:
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> IMO, if we are going to mix them in one connector,
> >>>>>>>>>>>>>>> 1) either users need to set some options to a specific
> >>>> value
> >>>>>>>>>>>> explicitly,
> >>>>>>>>>>>>>>> e.g. "scan.startup.mode=earliest",
> >> "sink.partitioner=hash",
> >>>>>> etc..
> >>>>>>>>>>>>>> This makes the connector awkward to use. Users may have
> >> to
> >>>> fix
> >>>>>>>>>>> options
> >>>>>>>>>>>>>> one
> >>>>>>>>>>>>>>> by one according to the exception.
> >>>>>>>>>>>>>>> Besides, in the future, it is still possible to use
> >>>>>>>>>>>>>>> "sink.partitioner=fixed" (reduce network cost) if users
> >> are
> >>>>>> aware
> >>>>>>>>>> of
> >>>>>>>>>>>>>>> the partition routing,
> >>>>>>>>>>>>>>> however, it's error-prone to have "fixed" as default for
> >>>>>>>>> compacted
> >>>>>>>>>>>> mode.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> 2) or give those options a different default value when
> >>>>>>>>>>>> "compacted=true".
> >>>>>>>>>>>>>>> This would be more confusing and unpredictable if the
> >>>> default
> >>>>>>>>> value
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>> options will change according to other options.
> >>>>>>>>>>>>>>> What happens if we have a third mode in the future?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In terms of usage and options, it's very different from
> >> the
> >>>>>>>>>>>>>>> original "kafka" connector.
> >>>>>>>>>>>>>>> It would be handier to use and less error-prone if
> >>>> separating
> >>>>>>>>> them
> >>>>>>>>>>> into
> >>>>>>>>>>>>>> two
> >>>>>>>>>>>>>>> connectors.
> >>>>>>>>>>>>>>> In the implementation layer, we can reuse code as much as
> >>>>>>>>> possible.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Therefore, I'm still +1 to have a new connector.
> >>>>>>>>>>>>>>> The "kafka-compacted" name sounds good to me.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, 21 Oct 2020 at 17:58, Konstantin Knauf <
> >>>>>>>>> kna...@apache.org>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Kurt, Hi Shengkai,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> thanks for answering my questions and the additional
> >>>>>>>>>>> clarifications. I
> >>>>>>>>>>>>>>>> don't have a strong opinion on whether to extend the
> >>>> "kafka"
> >>>>>>>>>>> connector
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>> to introduce a new connector. So, from my perspective
> >> feel
> >>>>> free
> >>>>>>>>> to
> >>>>>>>>>>> go
> >>>>>>>>>>>>>> with
> >>>>>>>>>>>>>>>> a separate connector. If we do introduce a new
> >> connector I
> >>>>>>>>>> wouldn't
> >>>>>>>>>>>>>> call it
> >>>>>>>>>>>>>>>> "ktable" for aforementioned reasons (In addition, we
> >> might
> >>>>>>>>> suggest
> >>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> there is also a "kstreams" connector for symmetry
> >>>> reasons). I
> >>>>>>>>>> don't
> >>>>>>>>>>>>>> have a
> >>>>>>>>>>>>>>>> good alternative name, though, maybe "kafka-compacted"
> >> or
> >>>>>>>>>>>>>>>> "compacted-kafka".
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Konstantin
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 4:43 AM Kurt Young <
> >>>> ykt...@gmail.com
> >>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I want to describe the discussion process which drove
> >> us
> >>>> to
> >>>>>>>>> have
> >>>>>>>>>>> such a
> >>>>>>>>>>>>>>>>> conclusion; this might make some of
> >>>>>>>>>>>>>>>>> the design choices easier to understand and keep
> >>>> everyone on
> >>>>>>>>> the
> >>>>>>>>>>> same
> >>>>>>>>>>>>>>>> page.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Back to the motivation, what functionality do we want
> >> to
> >>>>>>>>> provide
> >>>>>>>>>> in
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> first place? We got a lot of feedback and
> >>>>>>>>>>>>>>>>> questions from mailing lists that people want to write
> >>>>>>>>>>>> Not-Insert-Only
> >>>>>>>>>>>>>>>>> messages into kafka. They might be
> >>>>>>>>>>>>>>>>> intentional or by accident, e.g. wrote a non-windowed
> >>>>>>>>> aggregate
> >>>>>>>>>>>> query
> >>>>>>>>>>>>>> or
> >>>>>>>>>>>>>>>>> non-windowed left outer join. And
> >>>>>>>>>>>>>>>>> some users from the KSQL world also asked about why Flink
> >>>> didn't
> >>>>>>>>>>> leverage
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> Key concept of every kafka topic
> >>>>>>>>>>>>>>>>> and treat kafka as a dynamically changing keyed table.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To work with kafka better, we were thinking to extend
> >> the
> >>>>>>>>>>>> functionality
> >>>>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> the current kafka connector by letting it
> >>>>>>>>>>>>>>>>> accept updates and deletions. But due to the limitation
> >>>> of
> >>>>>>>>> kafka,
> >>>>>>>>>>> the
> >>>>>>>>>>>>>>>>> update has to be "update by key", aka a table
> >>>>>>>>>>>>>>>>> with primary key.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This introduces a couple of conflicts with current
> >> kafka
> >>>>>>>>> table's
> >>>>>>>>>>>>>> options:
> >>>>>>>>>>>>>>>>> 1. key.fields: as said above, we need the kafka table
> >> to
> >>>>> have
> >>>>>>>>> the
> >>>>>>>>>>>>>> primary
> >>>>>>>>>>>>>>>>> key constraint. And users can also configure
> >>>>>>>>>>>>>>>>> key.fields freely, this might cause friction. (Sure we
> >>>> can
> >>>>> do
> >>>>>>>>>> some
> >>>>>>>>>>>>>> sanity
> >>>>>>>>>>>>>>>>> check on this but it also creates friction.)
> >>>>>>>>>>>>>>>>> 2. sink.partitioner: to make the semantics right, we
> >>>> need to
> >>>>>>>>> make
> >>>>>>>>>>>> sure
> >>>>>>>>>>>>>>>> all
> >>>>>>>>>>>>>>>>> the updates on the same key are written to
> >>>>>>>>>>>>>>>>> the same kafka partition, so we should force the use of a
> >>>> hash
> >>>>>> by
> >>>>>>>>>> key
> >>>>>>>>>>>>>>>>> partitioner inside such a table. Again, this has conflicts
> >>>>>>>>>>>>>>>>> and creates friction with current user options.
> >>>>>>>>>>>>>>>>>
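The hash-by-key requirement Kurt describes can be sketched as follows (illustrative Python only, not Flink's actual sink partitioner; the md5-based hash is an assumption for the sketch, Kafka's default partitioner hashes keys with murmur2):

```python
# Sketch: a hash-by-key sink partitioner sends every record with the same
# primary key to the same Kafka partition, so the per-key update order is
# preserved and downstream upsert semantics stay correct.

import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # A stable hash of the key modulo the partition count (assumed scheme;
    # Kafka's default producer partitioner uses murmur2 instead of md5).
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

records = [("u1", "v1"), ("u2", "v1"), ("u1", "v2"), ("u1", "v3")]
partitions = [partition_for(k) for k, _ in records]

# All updates for key "u1" land in one and the same partition.
u1_parts = {p for (k, _), p in zip(records, partitions) if k == "u1"}
assert len(u1_parts) == 1
```

A round-robin or "fixed" partitioner gives no such per-key guarantee, which is why it conflicts with the upsert mode.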
> >>>>>>>>>>>>>>>>> The above things are solvable, though not perfect or
> >> most
> >>>>> user
> >>>>>>>>>>>>>> friendly.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Let's take a look at the reading side. The keyed kafka
> >>>> table
> >>>>>>>>>>> contains
> >>>>>>>>>>>>>> two
> >>>>>>>>>>>>>>>>> kinds of messages: upsert or deletion. What upsert
> >>>>>>>>>>>>>>>>> means is "If the key doesn't exist yet, it's an insert
> >>>>> record.
> >>>>>>>>>>>>>> Otherwise
> >>>>>>>>>>>>>>>>> it's an update record". For the sake of correctness or
> >>>>>>>>>>>>>>>>> simplicity, the Flink SQL engine also needs such
> >>>>> information.
> >>>>>>>>> If
> >>>>>>>>>> we
> >>>>>>>>>>>>>>>>> interpret all messages to "update record", some queries
> >>>> or
> >>>>>>>>>>>>>>>>> operators may not work properly. It's weird to see an
> >>>> update
> >>>>>>>>>> record
> >>>>>>>>>>> when
> >>>>>>>>>>>>>>>> you
> >>>>>>>>>>>>>>>>> haven't seen the insert record before.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> So what Flink should do is after reading out the
> >> records
> >>>>> from
> >>>>>>>>>> such
> >>>>>>>>>>>>>> table,
> >>>>>>>>>>>>>>>>> it needs to create a state to record which messages
> >> have
> >>>>>>>>>>>>>>>>> been seen and then generate the correct row type
> >>>>>>>>> correspondingly.
> >>>>>>>>>>>> This
> >>>>>>>>>>>>>>>> kind
> >>>>>>>>>>>>>>>>> of couples the state and the data of the message
> >>>>>>>>>>>>>>>>> queue, and it also creates conflicts with current kafka
> >>>>>>>>>> connector.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Think about what happens if users suspend a running job (which
> >>>> contains
> >>>>>> some
> >>>>>>>>>>>> reading
> >>>>>>>>>>>>>>>>> state now), and then change the start offset of the
> >>>> reader.
> >>>>>>>>>>>>>>>>> By changing the reading offset, it actually changes the
> >>>> whole
> >>>>>>>>>> story
> >>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> "which records should be insert messages and which
> >>>> records
> >>>>>>>>>>>>>>>>> should be update messages". And it will also make Flink
> >>>>>> deal
> >>>>> with
> >>>>>>>>>>>>>>>>> another weird situation that it might receive a
> >> deletion
> >>>>>>>>>>>>>>>>> for a non-existing message.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> We were unsatisfied with all the frictions and
> >> conflicts
> >>>> it
> >>>>>>>>> will
> >>>>>>>>>>>> create
> >>>>>>>>>>>>>>>> if
> >>>>>>>>>>>>>>>>> we enable the "upsert & deletion" support to the
> >> current
> >>>>> kafka
> >>>>>>>>>>>>>>>>> connector. And later we began to realize that we
> >>>> shouldn't
> >>>>>>>>> treat
> >>>>>>>>>> it
> >>>>>>>>>>>> as
> >>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>> normal message queue, but should treat it as a changing
> >>>>> keyed
> >>>>>>>>>>>>>>>>> table. We should be able to always get the whole data
> >> of
> >>>>> such
> >>>>>>>>>> table
> >>>>>>>>>>>> (by
> >>>>>>>>>>>>>>>>> disabling the start offset option) and we can also read
> >>>> the
> >>>>>>>>>>>>>>>>> changelog out of such table. It's like a HBase table
> >> with
> >>>>>>>>> binlog
> >>>>>>>>>>>>>> support
> >>>>>>>>>>>>>>>>> but doesn't have random access capability (which can be
> >>>>>>>>> fulfilled
> >>>>>>>>>>>>>>>>> by Flink's state).
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> So our intention was instead of telling and persuading
> >>>> users
> >>>>>>>>> what
> >>>>>>>>>>>> kind
> >>>>>>>>>>>>>> of
> >>>>>>>>>>>>>>>>> options they should or should not use by extending
> >>>>>>>>>>>>>>>>> the current kafka connector when enabling upsert support, we
> >>>> are
> >>>>>>>>>> actually
> >>>>>>>>>>>>>>>> creating
> >>>>>>>>>>>>>>>>> a whole new and different connector that has totally
> >>>>>>>>>>>>>>>>> different abstractions in the SQL layer, and should be
> >>>> treated
> >>>>>>>>>> totally
> >>>>>>>>>>>>>>>>> differently from the current kafka connector.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hope this can clarify some of the concerns.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>> Kurt
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, Oct 20, 2020 at 5:20 PM Shengkai Fang <
> >>>>>>>>> fskm...@gmail.com
> >>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> As many people are still confused about the different
> >>>>> option
> >>>>>>>>>>>>>>>> behaviours
> >>>>>>>>>>>>>>>>>> between the Kafka connector and KTable connector, Jark
> >>>> and
> >>>>> I
> >>>>>>>>>> listed
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>> differences in the doc[1].
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Shengkai
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> https://docs.google.com/document/d/13oAWAwQez0lZLsyfV21BfTEze1fc2cz4AZKiNOyBNPk/edit
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Shengkai Fang <fskm...@gmail.com> 于2020年10月20日周二
> >>>>> 下午12:05写道:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Konstantin,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for your reply.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> It uses the "kafka" connector and does not specify a
> >>>>>> primary
> >>>>>>>>>>> key.
> >>>>>>>>>>>>>>>>>>> The dimensional table `users` is a ktable connector
> >>>> and we
> >>>>>>>>> can
> >>>>>>>>>>>>>>>> specify
> >>>>>>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>>>>> pk on the KTable.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Will it possible to use a "ktable" as a dimensional
> >>>> table
> >>>>>> in
> >>>>>>>>>>>>>>>> FLIP-132
> >>>>>>>>>>>>>>>>>>> Yes. We can specify the watermark on the KTable and
> >> it
> >>>> can
> >>>>>> be
> >>>>>>>>>>> used
> >>>>>>>>>>>>>>>> as a
> >>>>>>>>>>>>>>>>>>> dimension table in temporal join.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Introduce a new connector vs introduce a new
> >> property
> >>>>>>>>>>>>>>>>>>> The main reason behind is that the KTable connector
> >>>> almost
> >>>>>>>>> has
> >>>>>>>>>> no
> >>>>>>>>>>>>>>>>> common
> >>>>>>>>>>>>>>>>>>> options with the Kafka connector. The options that
> >> can
> >>>> be
> >>>>>>>>>> reused
> >>>>>>>>>>> by
> >>>>>>>>>>>>>>>>>> KTable
> >>>>>>>>>>>>>>>>>>> connectors are 'topic',
> >> 'properties.bootstrap.servers'
> >>>> and
> >>>>>>>>>>>>>>>>>> 'value.fields-include'. We can't set a cdc format for
> >>>>>>>>>> 'key.format'
> >>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>> 'value.format' in KTable connector now, which is
> >>>>> available
> >>>>>>>>> in
> >>>>>>>>>>>> Kafka
> >>>>>>>>>>>>>>>>>>> connector. Considering the difference between the
> >>>> options
> >>>>> we
> >>>>>>>>>> can
> >>>>>>>>>>>> use,
> >>>>>>>>>>>>>>>>>> it's
> >>>>>>>>>>>>>>>>>>> more suitable to introduce an another connector
> >> rather
> >>>>> than
> >>>>>> a
> >>>>>>>>>>>>>>>> property.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> We are also fine to use "compacted-kafka" as the name
> >>>> of
> >>>>> the
> >>>>>>>>>> new
> >>>>>>>>>>>>>>>>>>> connector. What do you think?
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>> Shengkai
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Konstantin Knauf <kna...@apache.org> 于2020年10月19日周一
> >>>>>>>>> 下午10:15写道:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Shengkai,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thank you for driving this effort. I believe this a
> >>>> very
> >>>>>>>>>>> important
> >>>>>>>>>>>>>>>>>> feature
> >>>>>>>>>>>>>>>>>>>> for many users who use Kafka and Flink SQL
> >> together. A
> >>>>> few
> >>>>>>>>>>>> questions
> >>>>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>>>>>> thoughts:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> * Is your example "Use KTable as a
> >> reference/dimension
> >>>>>>>>> table"
> >>>>>>>>>>>>>>>> correct?
> >>>>>>>>>>>>>>>>>> It
> >>>>>>>>>>>>>>>>>>>> uses the "kafka" connector and does not specify a
> >>>> primary
> >>>>>>>>> key.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> * Will it be possible to use a "ktable" table
> >> directly
> >>>>> as a
> >>>>>>>>>>>>>>>>> dimensional
> >>>>>>>>>>>>>>>>>>>> table in temporal join (*based on event time*)
> >>>>> (FLIP-132)?
> >>>>>>>>>> This
> >>>>>>>>>>> is
> >>>>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>>>>>> completely clear to me from the FLIP.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> * I'd personally prefer not to introduce a new
> >>>> connector
> >>>>>> and
> >>>>>>>>>>>> instead
> >>>>>>>>>>>>>>>>> to
> >>>>>>>>>>>>>>>>>>>> extend the Kafka connector. We could add an
> >> additional
> >>>>>>>>>> property
> >>>>>>>>>>>>>>>>>>>> "compacted"
> >>>>>>>>>>>>>>>>>>>> = "true"|"false". If it is set to "true", we can add
> >>>>>>>>>> additional
> >>>>>>>>>>>>>>>>>> validation
> >>>>>>>>>>>>>>>>>>>> logic (e.g. "scan.startup.mode" can not be set,
> >>>> primary
> >>>>> key
> >>>>>>>>>>>>>>>> required,
> >>>>>>>>>>>>>>>>>>>> etc.). If we stick to a separate connector I'd not
> >>>> call
> >>>>> it
> >>>>>>>>>>>> "ktable",
> >>>>>>>>>>>>>>>>> but
> >>>>>>>>>>>>>>>>>>>> rather "compacted-kafka" or similar. KTable seems to
> >>>>> carry
> >>>>>>>>>> more
> >>>>>>>>>>>>>>>>> implicit
> >>>>>>>>>>>>>>>>>>>> meaning than we want to imply here.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> * I agree that this is not a bounded source. If we
> >>>> want
> >>>>> to
> >>>>>>>>>>>> support a
> >>>>>>>>>>>>>>>>>>>> bounded mode, this is an orthogonal concern that
> >> also
> >>>>>>>>> applies
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>>>>>> unbounded sources.
> >>>>>>>>>>>>>>>>>>>>
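For concreteness, here is a minimal DDL sketch of that alternative, assuming a hypothetical 'compacted' option on the existing 'kafka' connector (the option name, the validation rules, and the table schema below are illustrative only, not part of the FLIP):

```sql
-- Hypothetical alternative: extend the existing 'kafka' connector
-- with a 'compacted' flag instead of adding a new connector.
-- With 'compacted' = 'true', extra validation would require a
-- primary key and reject options such as 'scan.startup.mode'.
CREATE TABLE users (
  user_id   BIGINT,
  user_name STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'kafka',
  'compacted' = 'true',   -- hypothetical option
  'topic' = 'users',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```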
> >>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Konstantin
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Mon, Oct 19, 2020 at 3:26 PM Jark Wu <
> >>>>> imj...@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Danny,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> First of all, we didn't introduce any concepts from
> >>>> KSQL
> >>>>>>>>>> (e.g.
> >>>>>>>>>>>>>>>>> Stream
> >>>>>>>>>>>>>>>>>> vs
> >>>>>>>>>>>>>>>>>>>>> Table notion).
> >>>>>>>>>>>>>>>>>>>>> This new connector will produce a changelog stream,
> >>>> so
> >>>>>> it's
> >>>>>>>>>>> still
> >>>>>>>>>>>>>>>> a
> >>>>>>>>>>>>>>>>>>>> dynamic
> >>>>>>>>>>>>>>>>>>>>> table and doesn't conflict with Flink core
> >> concepts.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> The "ktable" is just a connector name, we can also
> >>>> call
> >>>>> it
> >>>>>>>>>>>>>>>>>>>>> "compacted-kafka" or something else.
> >>>>>>>>>>>>>>>>>>>>> Calling it "ktable" is just because KSQL users can
> >>>>> migrate
> >>>>>>>>> to
> >>>>>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>>> SQL
> >>>>>>>>>>>>>>>>>>>>> easily.
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Regarding why we introduce a new connector vs. a new
> >>>>>>>>>>>>>>>>>>>>> property in the existing kafka connector:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I think the main reason is that we want to have a clear
> >>>>>>>>>>>>>>>>>>>>> separation between these two use cases, because they are
> >>>>>>>>>>>>>>>>>>>>> very different.
> >>>>>>>>>>>>>>>>>>>>> We also listed the reasons in the FLIP, including:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> 1) It's hard to explain what the behavior is when users
> >>>>>>>>>>>>>>>>>>>>> specify the start offset from a middle position (e.g. how
> >>>>>>>>>>>>>>>>>>>>> to process delete events for non-existent keys).
> >>>>>>>>>>>>>>>>>>>>> It's dangerous if users do that, so we don't provide the
> >>>>>>>>>>>>>>>>>>>>> offset option in the new connector at the moment.
> >>>>>>>>>>>>>>>>>>>>> 2) It's a different perspective/abstraction on the same
> >>>>>>>>>>>>>>>>>>>>> kafka topic (append vs. upsert). It would be easier to
> >>>>>>>>>>>>>>>>>>>>> understand if we can separate them instead of mixing them
> >>>>>>>>>>>>>>>>>>>>> in one connector. The new connector requires the hash
> >>>>>>>>>>>>>>>>>>>>> sink partitioner, a declared primary key, and a regular
> >>>>>>>>>>>>>>>>>>>>> format. If we mix them in one connector, it might be
> >>>>>>>>>>>>>>>>>>>>> confusing how to use the options correctly.
> >>>>>>>>>>>>>>>>>>>>> 3) The semantics of the KTable connector are just the
> >>>>>>>>>>>>>>>>>>>>> same as KTable in Kafka Streams, so it's very handy for
> >>>>>>>>>>>>>>>>>>>>> Kafka Streams and KSQL users. We have seen several
> >>>>>>>>>>>>>>>>>>>>> questions on the mailing list asking how to model a
> >>>>>>>>>>>>>>>>>>>>> KTable and how to join a KTable in Flink SQL.
> >>>>>>>>>>>>>>>>>>>>>
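To make the separation concrete, here is a hedged DDL sketch of the proposed connector, using the working name "ktable" from this thread (the FLIP later settled on "upsert-kafka"); the table schema below is invented for illustration:

```sql
-- Sketch of the proposed connector: a primary key is mandatory,
-- no start-offset option is exposed, and each record is treated
-- as an upsert (or, for a null value, a delete) on the key.
CREATE TABLE user_profiles (
  user_id BIGINT,
  profile STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'ktable',   -- working name in this thread
  'topic' = 'user_profiles',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
```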
> >>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 19:53, Jark Wu <
> >>>> imj...@gmail.com
> >>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Hi Jingsong,
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> As the FLIP describes, "KTable connector produces
> >> a
> >>>>>>>>>> changelog
> >>>>>>>>>>>>>>>>>> stream,
> >>>>>>>>>>>>>>>>>>>>>> where each data record represents an update or
> >>>> delete
> >>>>>>>>>> event.".
> >>>>>>>>>>>>>>>>>>>>>> Therefore, a ktable source is an unbounded stream
> >>>>>>>>>>>>>>>>>>>>>> source. Selecting from a ktable source is similar to
> >>>>>>>>>>>>>>>>>>>>>> selecting from a kafka source with the debezium-json
> >>>>>>>>>>>>>>>>>>>>>> format, in that it never ends and the results are
> >>>>>>>>>>>>>>>>>>>>>> continuously updated.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> It's possible to have a bounded ktable source in the
> >>>>>>>>>>>>>>>>>>>>>> future, for example by adding an option 'bounded=true'
> >>>>>>>>>>>>>>>>>>>>>> or 'end-offset=xxx'. In this way, the ktable will
> >>>>>>>>>>>>>>>>>>>>>> produce a bounded changelog stream, so I think this can
> >>>>>>>>>>>>>>>>>>>>>> be a compatible feature in the future.
> >>>>>>>>>>>>>>>>>>>>>>
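As a sketch, the hypothetical bounded mode mentioned above might surface as an extra option. Neither option below exists; the names and schema are illustrative only:

```sql
-- Hypothetical: neither 'bounded' nor 'end-offset' exists today;
-- either would make the source emit a bounded changelog stream
-- and then finish, instead of running forever.
CREATE TABLE user_snapshot (
  user_id BIGINT,
  profile STRING,
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'ktable',   -- working name in this thread
  'topic' = 'user_profiles',
  'properties.bootstrap.servers' = 'localhost:9092',
  'bounded' = 'true',       -- hypothetical option
  'key.format' = 'json',
  'value.format' = 'json'
);
```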
> >>>>>>>>>>>>>>>>>>>>>> I don't think we should associate it with KSQL-related
> >>>>>>>>>>>>>>>>>>>>>> concepts.
> >>>>>>>>>>>>>>>>>>>> Actually,
> >>>>>>>>>>>>>>>>>>>>> we
> >>>>>>>>>>>>>>>>>>>>>> didn't introduce any concepts from KSQL (e.g.
> >>>> Stream vs
> >>>>>>>>>> Table
> >>>>>>>>>>>>>>>>>> notion).
> >>>>>>>>>>>>>>>>>>>>>> The "ktable" is just a connector name, we can also
> >>>> call
> >>>>>> it
> >>>>>>>>>>>>>>>>>>>>>> "compacted-kafka" or something else.
> >>>>>>>>>>>>>>>>>>>>>> Calling it "ktable" is just because KSQL users can
> >>>>>> migrate
> >>>>>>>>>> to
> >>>>>>>>>>>>>>>>> Flink
> >>>>>>>>>>>>>>>>>>>> SQL
> >>>>>>>>>>>>>>>>>>>>>> easily.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> Regarding "value.fields-include", this is an option
> >>>>>>>>>>>>>>>>>>>>>> introduced in FLIP-107 for the Kafka connector.
> >>>>>>>>>>>>>>>>>>>>>> I think we should keep the same behavior as the Kafka
> >>>>>>>>>>>>>>>>>>>>>> connector. I'm not sure what the default behavior of
> >>>>>>>>>>>>>>>>>>>>>> KSQL is, but I guess it also stores the keys in the
> >>>>>>>>>>>>>>>>>>>>>> value, judging from this example in the docs (see the
> >>>>>>>>>>>>>>>>>>>>>> "users_original" table) [1].
> >>>>>>>>>>>>>>>>>>>>>>
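For reference, a sketch of how the FLIP-107 option is used on the Kafka connector; 'ALL' and 'EXCEPT_KEY' are the documented values, while the table itself is invented for illustration:

```sql
-- 'value.fields-include' controls whether columns that are part
-- of the message key are also serialized into the message value.
CREATE TABLE users (
  user_id   BIGINT,
  user_name STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'users',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'key.fields' = 'user_id',
  'value.format' = 'json',
  -- 'ALL': the value contains user_id and user_name
  -- 'EXCEPT_KEY': the value contains only user_name
  'value.fields-include' = 'ALL'
);
```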
> >>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> [1]:
> >>>>>>>>>>>>>>>>>>>>>> https://docs.confluent.io/current/ksqldb/tutorials/basics-local.html#create-a-stream-and-table
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Mon, 19 Oct 2020 at 18:17, Danny Chan <
> >>>>>>>>>>> yuzhao....@gmail.com>
> >>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> The concept seems to conflict with the Flink
> >>>>>>>>>>>>>>>>>>>>>>> abstraction of a “dynamic table”; in Flink we see both
> >>>>>>>>>>>>>>>>>>>>>>> “stream” and “table” as a dynamic table.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> I think we should first make clear how to express
> >>>>>>>>>>>>>>>>>>>>>>> stream- and table-specific features on one “dynamic
> >>>>>>>>>>>>>>>>>>>>>>> table”. This is more natural for KSQL because KSQL
> >>>>>>>>>>>>>>>>>>>>>>> takes stream and table as different abstractions for
> >>>>>>>>>>>>>>>>>>>>>>> representing collections. In KSQL, only a table is
> >>>>>>>>>>>>>>>>>>>>>>> mutable and can have a primary key.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Does this connector belong to the “table” scope or the
> >>>>>>>>>>>>>>>>>>>>>>> “stream” scope?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Some of the concepts (such as the primary key on a
> >>>>>>>>>>>>>>>>>>>>>>> stream) should be suitable for all the connectors, not
> >>>>>>>>>>>>>>>>>>>>>>> just Kafka. Shouldn't this be an extension of the
> >>>>>>>>>>>>>>>>>>>>>>> existing Kafka connector instead of a totally new
> >>>>>>>>>>>>>>>>>>>>>>> connector? What about the other connectors?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Because this touches the core abstraction of Flink, we
> >>>>>>>>>>>>>>>>>>>>>>> had better have a top-down overall design; following
> >>>>>>>>>>>>>>>>>>>>>>> KSQL directly is not the answer.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> P.S. For the source:
> >>>>>>>>>>>>>>>>>>>>>>>> Shouldn't this be an extension of the existing Kafka
> >>>>>>>>>>>>>>>>>>>>>>>> connector instead of a totally new connector?
> >>>>>>>>>>>>>>>>>>>>>>> How could we achieve that (e.g. set up the parallelism
> >>>>>>>>>>>>>>>>>>>>>>> correctly)?
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>> Danny Chan
> >>>>>>>>>>>>>>>>>>>>>>> On Oct 19, 2020 at 5:17 PM +0800, Jingsong Li <
> >>>>>>>>>>>>>>>>>>>>>>> jingsongl...@gmail.com>, wrote:
> >>>>>>>>>>>>>>>>>>>>>>>> Thanks Shengkai for your proposal.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> +1 for this feature.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Future Work: Support bounded KTable source
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I don't think it should be future work; I think it is
> >>>>>>>>>>>>>>>>>>>>>>>> one of the important concepts of this FLIP. We need to
> >>>>>>>>>>>>>>>>>>>>>>>> understand it now.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Intuitively, a ktable is in my opinion a bounded
> >>>>>>>>>>>>>>>>>>>>>>>> table rather than a stream, so a SELECT should produce
> >>>>>>>>>>>>>>>>>>>>>>>> a bounded table by default.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I think we can list the related Kafka knowledge,
> >>>>>>>>>>>>>>>>>>>>>>>> because the word `ktable` is easy to associate with
> >>>>>>>>>>>>>>>>>>>>>>>> KSQL-related concepts. (If possible, it's better to
> >>>>>>>>>>>>>>>>>>>>>>>> unify with them.)
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> What do you think?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> value.fields-include
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> What about the default behavior of KSQL?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>> Jingsong
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 19, 2020 at 4:33 PM Shengkai Fang <
> >>>>>>>>>>>>>>>>> fskm...@gmail.com
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Hi, devs.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Jark and I want to start a new FLIP to introduce the
> >>>>>>>>>>>>>>>>>>>>>>>>> KTable connector. "KTable" is short for "Kafka
> >>>>>>>>>>>>>>>>>>>>>>>>> Table"; it also has the same semantics as the KTable
> >>>>>>>>>>>>>>>>>>>>>>>>> notion in Kafka Streams.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> FLIP-149:
> >>>>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-149%3A+Introduce+the+KTable+Connector
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Currently many users have expressed their need for
> >>>>>>>>>>>>>>>>>>>>>>>>> upsert Kafka support on the mailing lists and in
> >>>>>>>>>>>>>>>>>>>>>>>>> issues. The KTable connector has several benefits for
> >>>>>>>>>>>>>>>>>>>>>>>>> users:
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> 1. Users are able to interpret a compacted Kafka
> >>>>>>>>>>>>>>>>>>>>>>>>> topic as an upsert stream in Apache Flink, and also
> >>>>>>>>>>>>>>>>>>>>>>>>> to write a changelog stream to Kafka (into a
> >>>>>>>>>>>>>>>>>>>>>>>>> compacted topic).
> >>>>>>>>>>>>>>>>>>>>>>>>> 2. As a part of a real-time pipeline, store join or
> >>>>>>>>>>>>>>>>>>>>>>>>> aggregate results (which may contain updates) in a
> >>>>>>>>>>>>>>>>>>>>>>>>> Kafka topic for further calculation;
> >>>>>>>>>>>>>>>>>>>>>>>>> 3. The semantics of the KTable connector are just
> >>>>>>>>>>>>>>>>>>>>>>>>> the same as KTable in Kafka Streams, so it's very
> >>>>>>>>>>>>>>>>>>>>>>>>> handy for Kafka Streams and KSQL users. We have seen
> >>>>>>>>>>>>>>>>>>>>>>>>> several questions on the mailing list asking how to
> >>>>>>>>>>>>>>>>>>>>>>>>> model a KTable and how to join a KTable in Flink SQL.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> We hope it can expand the usage of the Flink
> >> with
> >>>>>>>>> Kafka.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> I'm looking forward to your feedback.
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>>>>>>>> Shengkai
> >>>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>>>>> Best, Jingsong Lee
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Konstantin Knauf
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> https://twitter.com/snntrable
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> https://github.com/knaufk
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>>
> >>>>>>>>>>> Seth Wiesman | Solutions Architect
> >>>>>>>>>>>
> >>>>>>>>>>> +1 314 387 1463
> >>>>>>>>>>>
> >>>>>>>>>>> <https://www.ververica.com/>
> >>>>>>>>>>>
> >>>>>>>>>>> Follow us @VervericaData
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>>
> >>>>>>>>>>> Join Flink Forward <https://flink-forward.org/> - The Apache
> >>>>> Flink
> >>>>>>>>>>> Conference
> >>>>>>>>>>>
> >>>>>>>>>>> Stream Processing | Event Driven | Real Time
> >>>>>>>>>>>
-- 
Best, Jingsong Lee
