Hi guys! I've created the first draft of the design document [1] and attached it to the Jira ticket [2]. Please let's continue our discussion in the ticket or in comments in Google Docs.
[1] https://docs.google.com/document/d/14thwgV2RY1AA9KgYztv_kLYSz4K1TckJ-YiGfkB5650/edit?usp=sharing [2] https://issues.apache.org/jira/browse/FLINK-12256 Best regards, Artsem On Thu, 18 Apr 2019 at 23:57, Bowen Li <[email protected]> wrote: > Great! I linked that JIRA to FLINK-11275 > <https://issues.apache.org/jira/browse/FLINK-11275> , and put it along > with > JIRAs for HiveCatalog and GenericHiveMetastoreCatalog. > > I have some initial thoughts on the solution you described, but I'll wait > till a more complete google design doc comes up, since this discussion is > about engaging community interest. > > On Thu, Apr 18, 2019 at 8:39 AM Artsem Semianenka <[email protected]> > wrote: > > > Sorry guys I've attached the wrong link for Jira ticket in the > > previous email. This is the correct link : > > https://issues.apache.org/jira/browse/FLINK-12256 > > > > On Thu, 18 Apr 2019 at 18:29, Artsem Semianenka <[email protected]> > > wrote: > > > > > Thank you guys so much! > > > > > > You provided me a lot of helpful information. > > > I've created the Jira ticket[1] and added to it an initial description > > > only with the main purpose of the new feature. More detailed > > implementation > > > description will be added further. > > > > > > Hi Rong, to tell the truth, my first idea was to use some predefined > > > prefix/postfix for topic name and lookup mapping between > > > topic/schema-subject. But the idea with a separated view of a logical > > > table with schema looks more elegant and flexible. > > > > > > Also, I thought about other approaches on how to define the mapping > > > between topic and schema-subject in case if they have different names: > > > Define the "subject" as a part of the table definition: > > > > > > Select * from kafka.topic.subject > > > or > > > Select * from kafka.topic#subject > > > > > > In case if the subject is not defined try to find a subject with the > same > > > name as a topic. > > > If the subject still not found - take one last message and try to > infer > > > the schema ( retrieve schema id from the message and get last defined > > > schema) > > > > > > But I see one disadvantage for all of these approaches: the subject > name > > > may contain not supported in SQL symbols. > > > > > > I try to investigate how to escape the illegal symbols in the table > name > > > definition. > > > > > > Thanks, > > > Artsem > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-11275 > > > > > > On Thu, 18 Apr 2019 at 11:54, Timo Walther <[email protected]> wrote: > > > > > >> Hi Artsem, > > >> > > >> having a catalog support for Confluent Schema Registry would be a > great > > >> addition. Although the implementation of FLIP-30 is still ongoing, we > > >> merged the stable interfaces today [0]. This should unblock people > from > > >> contributing new catalog implementations. So you could already start > > >> designing an implementation. The implementation could be unit tested > for > > >> now until it can also be registered in a table environment for > > >> integration tests/end-to-end tests. > > >> > > >> I hope we can reuse the existing SQL Kafka connector and SQL Avro > > format? > > >> > > >> Looking forward to a JIRA issue and a little design document how to > > >> connect the APIs. > > >> > > >> Thanks, > > >> Timo > > >> > > >> [0] https://github.com/apache/flink/pull/8007 > > >> > > >> Am 18.04.19 um 07:03 schrieb Bowen Li: > > >> > Hi, > > >> > > > >> > Thanks Artsem and Rong for bringing up the demand from user > > >> perspective. A > > >> > Kafka/Confluent Schema Registry catalog would have a good use case > in > > >> > Flink. We actually mentioned the potential of Unified Catalog APIs > for > > >> > Kafka in our talk a couple weeks ago at Flink Forward SF [1], and > glad > > >> to > > >> > learn you are interested in contributing. I think creating a JIRA > > ticket > > >> > with link in FLINK-11275 [2], and starting with discussions and > design > > >> > would help to advance the effort. > > >> > > > >> > The most interesting part of Confluent Schema Registry, from my > point > > of > > >> > view, is the core idea of smoothing real production experience and > > >> things > > >> > built around it, including versioned schemas, schema evolution and > > >> > compatibility checks, etc. Introducing a confluent-schema-registry > > >> backed > > >> > catalog to Flink may also help our design to benefit from those > ideas. > > >> > > > >> > To add on Dawid's points. I assume the MVP for this project would be > > >> > supporting Kafka as streaming tables thru the new catalog. FLIP-30 > is > > >> for > > >> > both streaming and batch tables, thus it won't be blocked by the > whole > > >> > FLIP-30. I think as soon as we finish the table operation APIs, > > finalize > > >> > properties and formats, and connect the APIs to Calcite, this work > can > > >> be > > >> > unblocked. Timo and Xuefu may have more things to say. > > >> > > > >> > [1] > > >> > > > >> > > > https://www.slideshare.net/BowenLi9/integrating-flink-with-hive-flink-forward-sf-2019/23 > > >> > [2] https://issues.apache.org/jira/browse/FLINK-11275 > > >> > > > >> > On Wed, Apr 17, 2019 at 6:39 PM Jark Wu <[email protected]> wrote: > > >> > > > >> >> Hi Rong, > > >> >> > > >> >> Thanks for pointing out the missing FLIPs in the FLIP main page. I > > >> added > > >> >> all the missing FLIP (incl. FLIP-14, FLIP-22, FLIP-29, FLIP-30, > > >> FLIP-31) to > > >> >> the page. > > >> >> > > >> >> I also include @[email protected] <[email protected]> > > >> and @Bowen > > >> >> Li <[email protected]> into the thread who are familiar with > the > > >> >> latest catalog design. > > >> >> > > >> >> Thanks, > > >> >> Jark > > >> >> > > >> >> On Thu, 18 Apr 2019 at 02:39, Rong Rong <[email protected]> > wrote: > > >> >> > > >> >>> Thanks Artsem for looking into this problem and Thanks Dawid for > > >> bringing > > >> >>> up the discussion on FLIP-30. > > >> >>> > > >> >>> We've observe similar scenarios when we also would like to reuse > the > > >> >>> schema > > >> >>> registry of both Kafka stream as well as the raw ingested kafka > > >> messages > > >> >>> in > > >> >>> datalake. > > >> >>> FYI another more catalog-oriented document can be found here [1]. > I > > do > > >> >>> have > > >> >>> one question to follow up with Dawid's point (2): are we > suggesting > > >> that > > >> >>> different kafka topics (e.g. test-topic-prod, test-topic-non-prod, > > >> etc) > > >> >>> considered as a "view" of a logical table with schema (e.g. > > >> test-topic) ? > > >> >>> > > >> >>> Also, seems like a few of the FLIPs, like the FLIP-30 page is not > > >> linked > > >> >>> in > > >> >>> the main FLIP confluence wiki page [2] for some reason. > > >> >>> I tried to fix that be seems like I don't have permission. Maybe > > >> someone > > >> >>> can also take a look? > > >> >>> > > >> >>> Thanks, > > >> >>> Rong > > >> >>> > > >> >>> > > >> >>> [1] > > >> >>> > > >> >>> > > >> > > > https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#heading=h.xp424vn7ioei > > >> >>> [2] > > >> >>> > > >> >>> > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > > >> >>> > > >> >>> On Wed, Apr 17, 2019 at 2:30 AM Artsem Semianenka < > > >> [email protected] > > >> >>> wrote: > > >> >>> > > >> >>>> Thank you, Dawid! > > >> >>>> This is very helpful information. I will keep a close eye on the > > >> >>> updates of > > >> >>>> FLIP-30 and contribute whenever it possible. > > >> >>>> I guess I may create a Jira ticket for my proposal in which I > > >> describe > > >> >>> the > > >> >>>> idea and attach intermediate pull request based on current > API(just > > >> for > > >> >>>> initial discuss). But the final pull request definitely will be > > >> based on > > >> >>>> FLIP-30 API. > > >> >>>> > > >> >>>> Best regards, > > >> >>>> Artsem > > >> >>>> > > >> >>>> On Wed, 17 Apr 2019 at 09:36, Dawid Wysakowicz < > > >> [email protected]> > > >> >>>> wrote: > > >> >>>> > > >> >>>>> Hi Artsem, > > >> >>>>> > > >> >>>>> I think it totally makes sense to have a catalog for the Schema > > >> >>>>> Registry. It is also good to hear you want to contribute that. > > There > > >> >>> is > > >> >>>>> few important things to consider though: > > >> >>>>> > > >> >>>>> 1. The Catalog interface is currently under rework. You make > take > > a > > >> >>> look > > >> >>>>> at the corresponding FLIP-30[1], and also have a look at the > first > > >> PR > > >> >>>>> that introduces the basic interfaces[2]. I think it would be > worth > > >> to > > >> >>>>> already consider those changes. I cc Xuefu who is participating > in > > >> the > > >> >>>>> efforts of Catalog integration. > > >> >>>>> > > >> >>>>> 2. There is still ongoing discussion about what properties > should > > we > > >> >>>>> store for streaming tables and how. I think this might affect > (but > > >> >>> maybe > > >> >>>>> doesn't have to) the design of the Catalog.[3] I cc Timo who > might > > >> >>> give > > >> >>>>> more insights if those should be blocking for the work around > this > > >> >>>> Catalog. > > >> >>>>> Best, > > >> >>>>> > > >> >>>>> Dawid > > >> >>>>> > > >> >>>>> [1] > > >> >>>>> > > >> >>>>> > > >> >>> > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs > > >> >>>>> [2] https://github.com/apache/flink/pull/8007 > > >> >>>>> > > >> >>>>> [3] > > >> >>>>> > > >> >>>>> > > >> >>> > > >> > > > https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit#heading=h.egn858cgizao > > >> >>>>> On 16/04/2019 17:35, Artsem Semianenka wrote: > > >> >>>>>> Hi guys! > > >> >>>>>> > > >> >>>>>> I'm working on External Catalog for Confluent Kafka. The main > > idea > > >> >>> to > > >> >>>>>> register the external catalog which provides the list of Kafka > > >> >>> topics > > >> >>>> and > > >> >>>>>> execute SQL queries like : > > >> >>>>>> Select * form kafka.topic_name > > >> >>>>>> > > >> >>>>>> I'm going to receive the table schema from Confluent schema > > >> >>> registry. > > >> >>>> The > > >> >>>>>> main disadvantage is: we should have the topic name with the > same > > >> >>> name > > >> >>>>>> (prefix and postfix are accepted ) as this schema subject in > > Schema > > >> >>>>>> Registry. > > >> >>>>>> For example : > > >> >>>>>> topic: test-topic-prod > > >> >>>>>> schema subject: test-topic > > >> >>>>>> > > >> >>>>>> I would like to contribute this solution into the main Flink > > branch > > >> >>> and > > >> >>>>>> would like to discuss the pros and cons of this approach. > > >> >>>>>> > > >> >>>>>> Best regards, > > >> >>>>>> Artsem > > >> >>>>>> > > >> >>>>> > > >> >>>> -- > > >> >>>> > > >> >>>> С уважением, > > >> >>>> Артем Семененко > > >> >>>> > > >> > > >> > > > > > > -- > > > > > > С уважением, > > > Артем Семененко > > > > > > > > > -- > > > > С уважением, > > Артем Семененко > > > -- С уважением, Артем Семененко
