Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Artsem Semianenka Fri, 19 Apr 2019 08:59:13 -0700

Hi guys!

I've created the first draft of the design document [1] and attached it to
the Jira ticket [2]. Please let's continue our discussion in the ticket or
in comments in Google Docs.


[1]
https://docs.google.com/document/d/14thwgV2RY1AA9KgYztv_kLYSz4K1TckJ-YiGfkB5650/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/FLINK-12256

Best regards,
Artsem

On Thu, 18 Apr 2019 at 23:57, Bowen Li <[email protected]> wrote:

> Great! I linked that JIRA to FLINK-11275
> <https://issues.apache.org/jira/browse/FLINK-11275> , and put it along
> with
> JIRAs for HiveCatalog and GenericHiveMetastoreCatalog.
>
> I have some initial thoughts on the solution you described, but I'll wait
> till a more complete google design doc comes up, since this discussion is
> about engaging community interest.
>
> On Thu, Apr 18, 2019 at 8:39 AM Artsem Semianenka <[email protected]>
> wrote:
>
> > Sorry guys I've attached the wrong link for Jira ticket in the
> > previous email. This is the correct link :
> > https://issues.apache.org/jira/browse/FLINK-12256
> >
> > On Thu, 18 Apr 2019 at 18:29, Artsem Semianenka <[email protected]>
> > wrote:
> >
> > > Thank you guys so much!
> > >
> > > You provided me a lot of helpful information.
> > > I've created the Jira ticket[1] and added to it an initial description
> > > only with the main purpose of the new feature. More detailed
> > implementation
> > > description will be added further.
> > >
> > > Hi Rong, to tell the truth, my first idea was to use some predefined
> > > prefix/postfix for topic name and lookup mapping between
> > > topic/schema-subject.  But the idea with a separated view of a logical
> > > table with schema looks more elegant and flexible.
> > >
> > > Also, I thought about other approaches on how to define the mapping
> > > between topic and schema-subject in case if they have different names:
> > > Define the "subject" as a part of the table definition:
> > >
> > > Select * from kafka.topic.subject
> > > or
> > > Select * from kafka.topic#subject
> > >
> > > In case if the subject is not defined try to find a subject with the
> same
> > > name as a topic.
> > > If the subject still not found  -  take one last message and try to
> infer
> > > the schema ( retrieve schema id from the message and get last defined
> > > schema)
> > >
> > > But I see one disadvantage for all of these approaches: the subject
> name
> > > may contain not supported in SQL symbols.
> > >
> > > I try to investigate how to escape the illegal symbols in the table
> name
> > > definition.
> > >
> > > Thanks,
> > > Artsem
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-11275
> > >
> > > On Thu, 18 Apr 2019 at 11:54, Timo Walther <[email protected]> wrote:
> > >
> > >> Hi Artsem,
> > >>
> > >> having a catalog support for Confluent Schema Registry would be a
> great
> > >> addition. Although the implementation of FLIP-30 is still ongoing, we
> > >> merged the stable interfaces today [0]. This should unblock people
> from
> > >> contributing new catalog implementations. So you could already start
> > >> designing an implementation. The implementation could be unit tested
> for
> > >> now until it can also be registered in a table environment for
> > >> integration tests/end-to-end tests.
> > >>
> > >> I hope we can reuse the existing SQL Kafka connector and SQL Avro
> > format?
> > >>
> > >> Looking forward to a JIRA issue and a little design document how to
> > >> connect the APIs.
> > >>
> > >> Thanks,
> > >> Timo
> > >>
> > >> [0] https://github.com/apache/flink/pull/8007
> > >>
> > >> Am 18.04.19 um 07:03 schrieb Bowen Li:
> > >> > Hi,
> > >> >
> > >> > Thanks Artsem and Rong for bringing up the demand from user
> > >> perspective. A
> > >> > Kafka/Confluent Schema Registry catalog would have a good use case
> in
> > >> > Flink. We actually mentioned the potential of Unified Catalog APIs
> for
> > >> > Kafka in our talk a couple weeks ago at Flink Forward SF [1], and
> glad
> > >> to
> > >> > learn you are interested in contributing. I think creating a JIRA
> > ticket
> > >> > with link in FLINK-11275 [2], and starting with discussions and
> design
> > >> > would help to advance the effort.
> > >> >
> > >> > The most interesting part of Confluent Schema Registry, from my
> point
> > of
> > >> > view, is the core idea of smoothing real production experience and
> > >> things
> > >> > built around it, including versioned schemas, schema evolution and
> > >> > compatibility checks, etc. Introducing a confluent-schema-registry
> > >> backed
> > >> > catalog to Flink may also help our design to benefit from those
> ideas.
> > >> >
> > >> > To add on Dawid's points. I assume the MVP for this project would be
> > >> > supporting Kafka as streaming tables thru the new catalog. FLIP-30
> is
> > >> for
> > >> > both streaming and batch tables, thus it won't be blocked by the
> whole
> > >> > FLIP-30. I think as soon as we finish the table operation APIs,
> > finalize
> > >> > properties and formats, and connect the APIs to Calcite, this work
> can
> > >> be
> > >> > unblocked. Timo and Xuefu may have more things to say.
> > >> >
> > >> > [1]
> > >> >
> > >>
> >
> https://www.slideshare.net/BowenLi9/integrating-flink-with-hive-flink-forward-sf-2019/23
> > >> > [2] https://issues.apache.org/jira/browse/FLINK-11275
> > >> >
> > >> > On Wed, Apr 17, 2019 at 6:39 PM Jark Wu <[email protected]> wrote:
> > >> >
> > >> >> Hi Rong,
> > >> >>
> > >> >> Thanks for pointing out the missing FLIPs in the FLIP main page. I
> > >> added
> > >> >> all the missing FLIP (incl. FLIP-14, FLIP-22, FLIP-29, FLIP-30,
> > >> FLIP-31) to
> > >> >> the page.
> > >> >>
> > >> >> I also include @[email protected] <[email protected]>
> > >> and @Bowen
> > >> >> Li <[email protected]>  into the thread who are familiar with
> the
> > >> >> latest catalog design.
> > >> >>
> > >> >> Thanks,
> > >> >> Jark
> > >> >>
> > >> >> On Thu, 18 Apr 2019 at 02:39, Rong Rong <[email protected]>
> wrote:
> > >> >>
> > >> >>> Thanks Artsem for looking into this problem and Thanks Dawid for
> > >> bringing
> > >> >>> up the discussion on FLIP-30.
> > >> >>>
> > >> >>> We've observe similar scenarios when we also would like to reuse
> the
> > >> >>> schema
> > >> >>> registry of both Kafka stream as well as the raw ingested kafka
> > >> messages
> > >> >>> in
> > >> >>> datalake.
> > >> >>> FYI another more catalog-oriented document can be found here [1].
> I
> > do
> > >> >>> have
> > >> >>> one question to follow up with Dawid's point (2): are we
> suggesting
> > >> that
> > >> >>> different kafka topics (e.g. test-topic-prod, test-topic-non-prod,
> > >> etc)
> > >> >>> considered as a "view" of a logical table with schema (e.g.
> > >> test-topic) ?
> > >> >>>
> > >> >>> Also, seems like a few of the FLIPs, like the FLIP-30 page is not
> > >> linked
> > >> >>> in
> > >> >>> the main FLIP confluence wiki page [2] for some reason.
> > >> >>> I tried to fix that be seems like I don't have permission. Maybe
> > >> someone
> > >> >>> can also take a look?
> > >> >>>
> > >> >>> Thanks,
> > >> >>> Rong
> > >> >>>
> > >> >>>
> > >> >>> [1]
> > >> >>>
> > >> >>>
> > >>
> >
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#heading=h.xp424vn7ioei
> > >> >>> [2]
> > >> >>>
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > >> >>>
> > >> >>> On Wed, Apr 17, 2019 at 2:30 AM Artsem Semianenka <
> > >> [email protected]
> > >> >>> wrote:
> > >> >>>
> > >> >>>> Thank you, Dawid!
> > >> >>>> This is very helpful information. I will keep a close eye on the
> > >> >>> updates of
> > >> >>>> FLIP-30 and contribute whenever it possible.
> > >> >>>> I guess I may create a Jira ticket for my proposal in which I
> > >> describe
> > >> >>> the
> > >> >>>> idea and attach intermediate pull request based on current
> API(just
> > >> for
> > >> >>>> initial discuss). But the final pull request definitely will be
> > >> based on
> > >> >>>> FLIP-30 API.
> > >> >>>>
> > >> >>>> Best regards,
> > >> >>>> Artsem
> > >> >>>>
> > >> >>>> On Wed, 17 Apr 2019 at 09:36, Dawid Wysakowicz <
> > >> [email protected]>
> > >> >>>> wrote:
> > >> >>>>
> > >> >>>>> Hi Artsem,
> > >> >>>>>
> > >> >>>>> I think it totally makes sense to have a catalog for the Schema
> > >> >>>>> Registry. It is also good to hear you want to contribute that.
> > There
> > >> >>> is
> > >> >>>>> few important things to consider though:
> > >> >>>>>
> > >> >>>>> 1. The Catalog interface is currently under rework. You make
> take
> > a
> > >> >>> look
> > >> >>>>> at the corresponding FLIP-30[1], and also have a look at the
> first
> > >> PR
> > >> >>>>> that introduces the basic interfaces[2]. I think it would be
> worth
> > >> to
> > >> >>>>> already consider those changes. I cc Xuefu who is participating
> in
> > >> the
> > >> >>>>> efforts of Catalog integration.
> > >> >>>>>
> > >> >>>>> 2. There is still ongoing discussion about what properties
> should
> > we
> > >> >>>>> store for streaming tables and how. I think this might affect
> (but
> > >> >>> maybe
> > >> >>>>> doesn't have to) the design of the Catalog.[3] I cc Timo who
> might
> > >> >>> give
> > >> >>>>> more insights if those should be blocking for the work around
> this
> > >> >>>> Catalog.
> > >> >>>>> Best,
> > >> >>>>>
> > >> >>>>> Dawid
> > >> >>>>>
> > >> >>>>> [1]
> > >> >>>>>
> > >> >>>>>
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs
> > >> >>>>> [2] https://github.com/apache/flink/pull/8007
> > >> >>>>>
> > >> >>>>> [3]
> > >> >>>>>
> > >> >>>>>
> > >> >>>
> > >>
> >
> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit#heading=h.egn858cgizao
> > >> >>>>> On 16/04/2019 17:35, Artsem Semianenka wrote:
> > >> >>>>>> Hi guys!
> > >> >>>>>>
> > >> >>>>>> I'm working on External Catalog for Confluent Kafka. The main
> > idea
> > >> >>> to
> > >> >>>>>> register the external catalog which provides the list of Kafka
> > >> >>> topics
> > >> >>>> and
> > >> >>>>>> execute SQL queries like :
> > >> >>>>>> Select * form kafka.topic_name
> > >> >>>>>>
> > >> >>>>>> I'm going to receive the table schema from Confluent schema
> > >> >>> registry.
> > >> >>>> The
> > >> >>>>>> main disadvantage is: we should have the topic name with the
> same
> > >> >>> name
> > >> >>>>>> (prefix and postfix are accepted ) as this schema subject in
> > Schema
> > >> >>>>>> Registry.
> > >> >>>>>> For example :
> > >> >>>>>> topic: test-topic-prod
> > >> >>>>>> schema subject: test-topic
> > >> >>>>>>
> > >> >>>>>> I would like to contribute this solution into the main Flink
> > branch
> > >> >>> and
> > >> >>>>>> would like to discuss the pros and cons of this approach.
> > >> >>>>>>
> > >> >>>>>> Best regards,
> > >> >>>>>> Artsem
> > >> >>>>>>
> > >> >>>>>
> > >> >>>> --
> > >> >>>>
> > >> >>>> С уважением,
> > >> >>>> Артем Семененко
> > >> >>>>
> > >>
> > >>
> > >
> > > --
> > >
> > > С уважением,
> > > Артем Семененко
> > >
> >
> >
> > --
> >
> > С уважением,
> > Артем Семененко
> >
>


-- 

С уважением,
Артем Семененко

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Reply via email to