Great! I linked that JIRA to FLINK-11275 <https://issues.apache.org/jira/browse/FLINK-11275>, and grouped it with the JIRAs for HiveCatalog and GenericHiveMetastoreCatalog.
I have some initial thoughts on the solution you described, but I'll wait till a more complete Google design doc comes up, since this discussion is about engaging community interest.

On Thu, Apr 18, 2019 at 8:39 AM Artsem Semianenka <[email protected]> wrote:

> Sorry guys, I've attached the wrong link for the Jira ticket in the
> previous email. This is the correct link:
> https://issues.apache.org/jira/browse/FLINK-12256
>
> On Thu, 18 Apr 2019 at 18:29, Artsem Semianenka <[email protected]> wrote:
>
> > Thank you guys so much!
> >
> > You provided me a lot of helpful information.
> > I've created the Jira ticket [1] and added an initial description
> > covering only the main purpose of the new feature. A more detailed
> > implementation description will be added later.
> >
> > Hi Rong, to tell the truth, my first idea was to use some predefined
> > prefix/postfix for the topic name and look up the mapping between
> > topic and schema subject. But the idea of a separate view of a logical
> > table with a schema looks more elegant and flexible.
> >
> > I also thought about other approaches for defining the mapping
> > between topic and schema subject in case they have different names,
> > e.g. defining the "subject" as part of the table definition:
> >
> > Select * from kafka.topic.subject
> > or
> > Select * from kafka.topic#subject
> >
> > If the subject is not defined, try to find a subject with the same
> > name as the topic.
> > If the subject is still not found, take the last message and try to
> > infer the schema (retrieve the schema id from the message and get the
> > latest registered schema).
> >
> > But I see one disadvantage with all of these approaches: the subject
> > name may contain symbols that are not supported in SQL.
> >
> > I am investigating how to escape the illegal symbols in the table name
> > definition.
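[Editor's note: the fallback lookup order described above could be sketched roughly as follows. All class, method, and parameter names here are hypothetical illustrations, not Flink or Confluent API; a real implementation would query the Schema Registry and Kafka instead of the in-memory collections used here.]

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/**
 * Sketch of the proposed topic-to-subject resolution order.
 * Hypothetical helper; not part of Flink or the Schema Registry client.
 */
public class SubjectResolver {

    /**
     * @param tableRef            either "topic" or "topic#subject"
     * @param knownSubjects       subjects registered in the Schema Registry
     * @param schemaIdToSubject   mapping from schema id to subject
     * @param lastMessageSchemaId schema id extracted from the last message
     *                            on the topic, if one could be read
     */
    public static Optional<String> resolve(
            String tableRef,
            List<String> knownSubjects,
            Map<Integer, String> schemaIdToSubject,
            Optional<Integer> lastMessageSchemaId) {

        // 1. An explicitly given subject wins: "topic#subject".
        int sep = tableRef.indexOf('#');
        if (sep >= 0) {
            return Optional.of(tableRef.substring(sep + 1));
        }

        // 2. Otherwise look for a subject named exactly like the topic.
        if (knownSubjects.contains(tableRef)) {
            return Optional.of(tableRef);
        }

        // 3. Last resort: infer the subject from the schema id carried by
        //    the last message on the topic (Optional.map turns an unknown
        //    id into an empty result).
        return lastMessageSchemaId.map(schemaIdToSubject::get);
    }
}
```

On the escaping concern: Flink SQL allows quoting identifiers with backticks, which may cover at least some subject names containing characters that are illegal in plain identifiers.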
> >
> > Thanks,
> > Artsem
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-11275
> >
> > On Thu, 18 Apr 2019 at 11:54, Timo Walther <[email protected]> wrote:
> >
> >> Hi Artsem,
> >>
> >> having catalog support for the Confluent Schema Registry would be a
> >> great addition. Although the implementation of FLIP-30 is still
> >> ongoing, we merged the stable interfaces today [0]. This should
> >> unblock people from contributing new catalog implementations, so you
> >> could already start designing an implementation. The implementation
> >> could be unit tested for now, until it can also be registered in a
> >> table environment for integration/end-to-end tests.
> >>
> >> I hope we can reuse the existing SQL Kafka connector and SQL Avro
> >> format?
> >>
> >> Looking forward to a JIRA issue and a little design document on how
> >> to connect the APIs.
> >>
> >> Thanks,
> >> Timo
> >>
> >> [0] https://github.com/apache/flink/pull/8007
> >>
> >> On 18.04.19 at 07:03, Bowen Li wrote:
> >> > Hi,
> >> >
> >> > Thanks Artsem and Rong for bringing up the demand from the user
> >> > perspective. A Kafka/Confluent Schema Registry catalog would have a
> >> > good use case in Flink. We actually mentioned the potential of the
> >> > unified Catalog APIs for Kafka in our talk a couple weeks ago at
> >> > Flink Forward SF [1], and we're glad to learn you are interested in
> >> > contributing. I think creating a JIRA ticket linked in FLINK-11275
> >> > [2], and starting with discussion and design, would help to advance
> >> > the effort.
> >> >
> >> > The most interesting part of the Confluent Schema Registry, from my
> >> > point of view, is the core idea of smoothing the real production
> >> > experience and the things built around it, including versioned
> >> > schemas, schema evolution, compatibility checks, etc. Introducing a
> >> > confluent-schema-registry-backed catalog to Flink may also help our
> >> > design benefit from those ideas.
> >> >
> >> > To add to Dawid's points: I assume the MVP for this project would
> >> > be supporting Kafka as streaming tables through the new catalog.
> >> > FLIP-30 is for both streaming and batch tables, thus it won't be
> >> > blocked by the whole of FLIP-30. I think as soon as we finish the
> >> > table operation APIs, finalize properties and formats, and connect
> >> > the APIs to Calcite, this work can be unblocked. Timo and Xuefu may
> >> > have more to say.
> >> >
> >> > [1]
> >> > https://www.slideshare.net/BowenLi9/integrating-flink-with-hive-flink-forward-sf-2019/23
> >> > [2] https://issues.apache.org/jira/browse/FLINK-11275
> >> >
> >> > On Wed, Apr 17, 2019 at 6:39 PM Jark Wu <[email protected]> wrote:
> >> >
> >> >> Hi Rong,
> >> >>
> >> >> Thanks for pointing out the missing FLIPs on the FLIP main page. I
> >> >> added all the missing FLIPs (incl. FLIP-14, FLIP-22, FLIP-29,
> >> >> FLIP-30, FLIP-31) to the page.
> >> >>
> >> >> I also included @[email protected] <[email protected]>
> >> >> and @Bowen Li <[email protected]> in the thread, who are familiar
> >> >> with the latest catalog design.
> >> >>
> >> >> Thanks,
> >> >> Jark
> >> >>
> >> >> On Thu, 18 Apr 2019 at 02:39, Rong Rong <[email protected]> wrote:
> >> >>
> >> >>> Thanks Artsem for looking into this problem, and thanks Dawid for
> >> >>> bringing up the discussion on FLIP-30.
> >> >>>
> >> >>> We've observed similar scenarios where we would also like to
> >> >>> reuse the schema registry for both Kafka streams and the raw
> >> >>> ingested Kafka messages in the data lake.
> >> >>> FYI, another, more catalog-oriented document can be found here
> >> >>> [1]. I do have one question to follow up on Dawid's point (2):
> >> >>> are we suggesting that different Kafka topics (e.g.
> >> >>> test-topic-prod, test-topic-non-prod, etc.) be considered as
> >> >>> "views" of a logical table with a schema (e.g. test-topic)?
> >> >>>
> >> >>> Also, it seems a few of the FLIPs, like the FLIP-30 page, are not
> >> >>> linked in the main FLIP confluence wiki page [2] for some reason.
> >> >>> I tried to fix that, but it seems I don't have permission. Maybe
> >> >>> someone can also take a look?
> >> >>>
> >> >>> Thanks,
> >> >>> Rong
> >> >>>
> >> >>> [1]
> >> >>> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#heading=h.xp424vn7ioei
> >> >>> [2]
> >> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> >> >>>
> >> >>> On Wed, Apr 17, 2019 at 2:30 AM Artsem Semianenka <
> >> >>> [email protected]> wrote:
> >> >>>
> >> >>>> Thank you, Dawid!
> >> >>>> This is very helpful information. I will keep a close eye on the
> >> >>>> updates to FLIP-30 and contribute whenever possible.
> >> >>>> I guess I may create a Jira ticket for my proposal in which I
> >> >>>> describe the idea and attach an intermediate pull request based
> >> >>>> on the current API (just for initial discussion). But the final
> >> >>>> pull request will definitely be based on the FLIP-30 API.
> >> >>>>
> >> >>>> Best regards,
> >> >>>> Artsem
> >> >>>>
> >> >>>> On Wed, 17 Apr 2019 at 09:36, Dawid Wysakowicz <
> >> >>>> [email protected]> wrote:
> >> >>>>
> >> >>>>> Hi Artsem,
> >> >>>>>
> >> >>>>> I think it totally makes sense to have a catalog for the Schema
> >> >>>>> Registry. It is also good to hear you want to contribute that.
> >> >>>>> There are a few important things to consider though:
> >> >>>>>
> >> >>>>> 1. The Catalog interface is currently under rework. You may
> >> >>>>> take a look at the corresponding FLIP-30 [1], and also have a
> >> >>>>> look at the first PR that introduces the basic interfaces [2].
> >> >>>>> I think it would be worth considering those changes already.
> >> >>>>> I cc Xuefu, who is participating in the efforts of Catalog
> >> >>>>> integration.
> >> >>>>>
> >> >>>>> 2. There is still an ongoing discussion about which properties
> >> >>>>> we should store for streaming tables, and how. I think this
> >> >>>>> might affect (but maybe doesn't have to) the design of the
> >> >>>>> Catalog [3]. I cc Timo, who might give more insight into
> >> >>>>> whether those should be blocking for the work around this
> >> >>>>> Catalog.
> >> >>>>>
> >> >>>>> Best,
> >> >>>>>
> >> >>>>> Dawid
> >> >>>>>
> >> >>>>> [1]
> >> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs
> >> >>>>> [2] https://github.com/apache/flink/pull/8007
> >> >>>>> [3]
> >> >>>>> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit#heading=h.egn858cgizao
> >> >>>>>
> >> >>>>> On 16/04/2019 17:35, Artsem Semianenka wrote:
> >> >>>>>> Hi guys!
> >> >>>>>>
> >> >>>>>> I'm working on an external catalog for Confluent Kafka. The
> >> >>>>>> main idea is to register the external catalog, which provides
> >> >>>>>> the list of Kafka topics, and execute SQL queries like:
> >> >>>>>> Select * from kafka.topic_name
> >> >>>>>>
> >> >>>>>> I'm going to retrieve the table schema from the Confluent
> >> >>>>>> Schema Registry. The main disadvantage is that the topic must
> >> >>>>>> have the same name (prefixes and postfixes are accepted) as
> >> >>>>>> its schema subject in the Schema Registry.
> >> >>>>>> For example:
> >> >>>>>> topic: test-topic-prod
> >> >>>>>> schema subject: test-topic
> >> >>>>>>
> >> >>>>>> I would like to contribute this solution to the main Flink
> >> >>>>>> branch and would like to discuss the pros and cons of this
> >> >>>>>> approach.
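[Editor's note: the "same name with prefix/postfix accepted" convention above could be checked with a small helper like the following. The class and method names are hypothetical; in practice the subject list would come from the Schema Registry (e.g. its documented GET /subjects REST endpoint) rather than being passed in.]

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/**
 * Sketch of matching a Kafka topic name to a Schema Registry subject,
 * tolerating extra prefixes/postfixes on the topic (e.g. topic
 * "test-topic-prod" matches subject "test-topic"). Hypothetical helper,
 * not Flink or Confluent API.
 */
public class TopicSubjectMatcher {

    public static Optional<String> match(String topic, List<String> subjects) {
        // Prefer an exact match.
        if (subjects.contains(topic)) {
            return Optional.of(topic);
        }
        // Otherwise pick the longest subject contained in the topic name,
        // so that surrounding prefixes/postfixes are accepted.
        return subjects.stream()
                .filter(topic::contains)
                .max(Comparator.comparingInt(String::length));
    }
}
```

One design question this sketch surfaces is ambiguity: if several subjects are substrings of the topic name, "longest match" is only one possible tiebreak, which is part of why an explicit topic-to-subject mapping may end up being the safer contract.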
> >> >>>>>>
> >> >>>>>> Best regards,
> >> >>>>>> Artsem
> >> >>>>>>
> >> >>>>>
> >> >>>> --
> >> >>>>
> >> >>>> Best regards,
> >> >>>> Artsem Semianenka
> >> >>>>
> >>
> >
> > --
> >
> > Best regards,
> > Artsem Semianenka
> >
>
> --
>
> Best regards,
> Artsem Semianenka
>
