Hey Kurt,

> I don't think DataStream should see some SQL specific concepts such as
> Filtering or ComputedColumn.


Projectable and Filterable do not seem to be necessarily SQL concepts; they could
be applicable to a DataStream source as well, to reduce the network load. For
example, ORC and Parquet should probably also be readable from DataStream,
right?
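
As a rough illustration only (the interface and method below are hypothetical,
not something that exists in FLIP-27 or FLIP-95 as written), a format-aware
DataStream source could expose projection pushdown just as well as a table source:

// Hypothetical sketch: a source that prunes columns before emitting records,
// e.g. ORC/Parquet column pruning to reduce IO and network load.
public interface ProjectableSource {
    // Restrict the source to the given top-level fields.
    void applyProjection(int[] projectedFields);
}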

ComputedColumn is not part of the Source. It is an interface that extends the
Deserializer, which is a pluggable component of the Source. From SQL's
perspective there is the concept of a computed column, but from the Source's
perspective it is essentially a Deserializer that also converts the
records internally, assuming we allow some conversion to be embedded in
the source in addition to pure deserialization.
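
To make that concrete, here is a minimal sketch (the interface name is made up;
I am using Flink's existing DeserializationSchema as a stand-in for the Source's
Deserializer plugin):

import org.apache.flink.api.common.serialization.DeserializationSchema;

// Hypothetical: a Deserializer that, besides plain deserialization, also derives
// the computed fields internally before emitting records of type T.
public interface ComputedColumnDeserializer<T> extends DeserializationSchema<T> {
    // No extra methods needed here; implementations evaluate the computed
    // columns as part of deserialize().
}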

Thanks,

Jiangjie (Becket) Qin

On Tue, Mar 24, 2020 at 9:36 PM Jark Wu <imj...@gmail.com> wrote:

> Thanks Timo for updating the formats section. That would be very helpful
> for changelog support (FLIP-105).
>
> I just left 2 minor comments about some method names. In general, I'm +1 to
> start a vote.
>
>
> --------------------------------------------------------------------------------------------------
>
> Hi Becket,
>
> I agree we shouldn't duplicate code, especially the runtime
> implementations.
> However, the interfaces proposed by FLIP-95 are mainly used during
> optimization (compiling), not at runtime.
> I don't think there is much to share there, because table/sql
> is declarative, but DataStream is imperative.
> For example, for filter push down, a DataStream FilterableSource may
> accept a FilterFunction (which is a black box for the source).
> However, table sources should pick from the pushed filter expressions; some
> sources may only support "=", "<", ">" conditions.
> Pushing a FilterFunction doesn't work in the table ecosystem. That means the
> connectors have to have some table-specific implementations.
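>
> To illustrate the contrast with a rough sketch (the shapes are illustrative,
> not the exact signatures from the FLIP; FilterFunction and ResolvedExpression
> are the existing Flink/Table classes):
>
> // DataStream style: the filter is an opaque function; the source can only
> // call it, it cannot reason about it.
> interface FilterableSource<T> {
>     void applyFilter(FilterFunction<T> filter);
> }
>
> // Table style: the planner hands over parsed expressions (e.g. "a = 1 AND b > 5")
> // and the source reports which of them it could actually push down.
> interface SupportsFilterPushDown {
>     Result applyFilters(List<ResolvedExpression> filters);
> }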
>
>
> Best,
> Jark
>
> On Tue, 24 Mar 2020 at 20:41, Kurt Young <ykt...@gmail.com> wrote:
>
> > Hi Becket,
> >
> > I don't think DataStream should see some SQL specific concepts such as
> > Filtering or ComputedColumn. It's
> > better to stay within the SQL area and translate to more generic concepts
> > when translating to the DataStream/Runtime
> > layer, such as using a MapFunction to represent computed column logic.
> >
> > Best,
> > Kurt
> >
> >
> > On Tue, Mar 24, 2020 at 5:47 PM Becket Qin <becket....@gmail.com> wrote:
> >
> > > Hi Timo and Dawid,
> > >
> > > It's really great that we have the same goal. I am actually wondering
> > > if we can go one step further to avoid some of the interfaces in Table
> > > as well.
> > >
> > > For example, if we have the FilterableSource, do we still need the
> > > FilterableTableSource? Should DynamicTableSource just become a
> > > Source<Row, SourceSplitT, EnumChkT>?
> > >
> > > Can you help me understand a bit more about the reason we need the
> > > following relational representation / wrapper interfaces vs. the
> > > interfaces that we could put into the Source in FLIP-27?
> > >
> > > DynamicTableSource vs. Source<Row, SourceSplitT, EnumChkT>
> > > SupportsFilterablePushDown vs. FilterableSource
> > > SupportsProjectablePushDown vs. ProjectableSource
> > > SupportsWatermarkPushDown vs. WithWatermarkAssigner
> > > SupportsComputedColumnPushDown vs. ComputedColumnDeserializer
> > > ScanTableSource vs. ChangeLogDeserializer
> > > LookUpTableSource vs. LookUpSource
> > >
> > > Assuming we have all the interfaces on the right side, do we still need
> > > the interfaces on the left side? Note that the interfaces on the right
> > > can be used by both DataStream and Table. If we do this, there will only
> > > be one set of Source interfaces for Table and DataStream; the only
> > > difference is that the Source for Table will have some specific plugins
> > > and configurations. An omnipotent Source can implement all the above
> > > interfaces and take a Deserializer that implements both
> > > ComputedColumnDeserializer and ChangeLogDeserializer.
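> > >
> > > A rough sketch of what such an omnipotent source could look like (all the
> > > names are the hypothetical ones from the list above):
> > >
> > > class MyUnifiedSource<SourceSplitT, EnumChkT>
> > >         implements Source<Row, SourceSplitT, EnumChkT>,
> > >                    FilterableSource, ProjectableSource, WithWatermarkAssigner {
> > >     // The deserializer is just a plugin; for SQL it would additionally
> > >     // implement ComputedColumnDeserializer and ChangeLogDeserializer.
> > >     private final Deserializer<Row> deserializer;
> > >     ...
> > > }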
> > >
> > > Would the SQL planner work with that?
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Tue, Mar 24, 2020 at 5:03 PM Jingsong Li <jingsongl...@gmail.com>
> > > wrote:
> > >
> > > > +1. Thanks Timo for the design doc.
> > > >
> > > > We can also consider @Experimental. But I am +1 to @PublicEvolving;
> > > > we should be confident in the current change.
> > > >
> > > > Best,
> > > > Jingsong Lee
> > > >
> > > > On Tue, Mar 24, 2020 at 4:30 PM Timo Walther <twal...@apache.org>
> > wrote:
> > > >
> > > > > @Becket: We totally agree that we don't need table specific
> > > > > connectors during runtime. As Dawid said, the interfaces proposed
> > > > > here are just for communication with the planner. Once the
> > > > > properties (watermarks, computed columns, filters, projection, etc.)
> > > > > are negotiated, we can configure a regular Flink connector.
> > > > >
> > > > > E.g. setting the watermark assigner and deserialization schema of a
> > > > > Kafka connector.
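> > > > >
> > > > > A minimal sketch of that hand-over (the "negotiated*" variables are
> > > > > placeholders, not an actual API):
> > > > >
> > > > > // Illustrative only: once the planner has negotiated the watermarks and
> > > > > // the deserialization schema, the table source instantiates the plain
> > > > > // Kafka connector with those pieces.
> > > > > FlinkKafkaConsumer<Row> consumer =
> > > > >     new FlinkKafkaConsumer<>(topic, negotiatedDeserializationSchema, kafkaProps);
> > > > > consumer.assignTimestampsAndWatermarks(negotiatedWatermarkAssigner);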
> > > > >
> > > > > For better separation of concerns, Flink connectors should not
> > include
> > > > > relational interfaces and depend on flink-table. This is the
> > > > > responsibility of table source/sink.
> > > > >
> > > > > @Kurt: I would like to mark them @PublicEvolving already because we
> > > > > need to deprecate the old interfaces as early as possible. We cannot
> > > > > redirect to @Internal interfaces. They are not marked @Public, so we
> > > > > can still evolve them. But a core design shift should not happen
> > > > > again; it would leave a bad impression if we redesign over and over
> > > > > again. Instead we should be confident in the current change.
> > > > >
> > > > > Regards,
> > > > > Timo
> > > > >
> > > > >
> > > > > On 24.03.20 09:20, Dawid Wysakowicz wrote:
> > > > > > Hi Becket,
> > > > > >
> > > > > > Answering your question, we have the same intention not to
> > > > > > duplicate connectors between the DataStream and Table APIs. The
> > > > > > interfaces proposed in the FLIP are a way to describe relational
> > > > > > properties of a source. The intention is, as you described, to
> > > > > > translate all of those properties, expressed as expressions or
> > > > > > other Table specific structures, into a DataStream source. In
> > > > > > other words, I think what we are doing here is in line with what
> > > > > > you described.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Dawid
> > > > > >
> > > > > > On 24/03/2020 02:23, Becket Qin wrote:
> > > > > >> Hi Timo,
> > > > > >>
> > > > > >> Thanks for the proposal. I completely agree that the current
> Table
> > > > > >> connectors could be simplified quite a bit. I haven't finished
> > > reading
> > > > > >> everything, but here are some quick thoughts.
> > > > > >>
> > > > > >> Actually, to me the biggest question is: why should there be two
> > > > > >> different connector systems for DataStream and Table? What is the
> > > > > >> fundamental reason preventing us from merging them into one?
> > > > > >>
> > > > > >> The basic functionality of a connector is to provide
> capabilities
> > to
> > > > do
> > > > > IO
> > > > > >> and Serde. Conceptually, Table connectors should just be
> > DataStream
> > > > > >> connectors that are dealing with Rows. It seems that quite a few
> > of
> > > > the
> > > > > >> special connector requirements are just a specific way to do IO
> /
> > > > Serde.
> > > > > >> Taking SupportsFilterPushDown as an example, imagine we have the
> > > > > following
> > > > > >> interface:
> > > > > >>
> > > > > >> interface FilterableSource<PREDICATE> {
> > > > > >>      void applyFilterable(Supplier<PREDICATE> predicate);
> > > > > >> }
> > > > > >>
> > > > > >> And if a ParquetSource would like to support filtering, it would
> > > > > >> become:
> > > > > >>
> > > > > >> class ParquetSource implements Source,
> > > > > >>     FilterableSource<FilterPredicate> {
> > > > > >>      ...
> > > > > >> }
> > > > > >>
> > > > > >> For Table, one just needs to provide a predicate supplier that
> > > > > >> converts an Expression to the specified predicate type. This has
> > > > > >> a few benefits:
> > > > > >> 1. Same unified filterable API for sources, regardless of
> > > > > >> DataStream or Table.
> > > > > >> 2. The DataStream users can now also use the ExpressionToPredicate
> > > > > >> supplier if they want to.
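> > > > > >>
> > > > > >> A rough sketch of such a supplier for the Parquet case (hypothetical
> > > > > >> class; the real translation from expressions to Parquet's FilterApi
> > > > > >> is of course more involved):
> > > > > >>
> > > > > >> class ExpressionToPredicate implements Supplier<FilterPredicate> {
> > > > > >>     private final List<ResolvedExpression> pushedFilters; // from the planner
> > > > > >>
> > > > > >>     public FilterPredicate get() {
> > > > > >>         // translate e.g. "a = 1" into FilterApi.eq(intColumn("a"), 1)
> > > > > >>         ...
> > > > > >>     }
> > > > > >> }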
> > > > > >>
> > > > > >> To summarize, my main point is that I am wondering if it is
> > > > > >> possible to have a single set of connector interfaces for both
> > > > > >> Table and DataStream, rather than having two hierarchies. I am
> > > > > >> not 100% sure if this would work, but if it works, this would be
> > > > > >> a huge win from both a code maintenance and a user experience
> > > > > >> perspective.
> > > > > >>
> > > > > >> Thanks,
> > > > > >>
> > > > > >> Jiangjie (Becket) Qin
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Tue, Mar 24, 2020 at 2:03 AM Dawid Wysakowicz <
> > > > > dwysakow...@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi Timo,
> > > > > >>>
> > > > > >>> Thank you for the proposal. I think it is an important
> > improvement
> > > > that
> > > > > >>> will benefit many parts of the Table API. The proposal looks
> > really
> > > > > good
> > > > > >>> to me and personally I would be comfortable with voting on the
> > > > current
> > > > > >>> state.
> > > > > >>>
> > > > > >>> Best,
> > > > > >>>
> > > > > >>> Dawid
> > > > > >>>
> > > > > >>> On 23/03/2020 18:53, Timo Walther wrote:
> > > > > >>>> Hi everyone,
> > > > > >>>>
> > > > > >>>> I received some questions around how the new interfaces play
> > > > together
> > > > > >>>> with formats and their factories.
> > > > > >>>>
> > > > > >>>> Furthermore, for MySQL or Postgres CDC logs, the format should
> > be
> > > > able
> > > > > >>>> to return a `ChangelogMode`.
> > > > > >>>>
> > > > > >>>> Also, I incorporated the feedback around the factory design in
> > > > > general.
> > > > > >>>>
> > > > > >>>> I added a new section `Factory Interfaces` to the design
> > document.
> > > > > >>>> This should be helpful for understanding the big picture and
> > > > > >>>> connecting the concepts.
> > > > > >>>>
> > > > > >>>> Please let me know what you think.
> > > > > >>>>
> > > > > >>>> Thanks,
> > > > > >>>> Timo
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 18.03.20 13:43, Timo Walther wrote:
> > > > > >>>>> Hi Benchao,
> > > > > >>>>>
> > > > > >>>>> this is a very good question. I will update the FLIP about
> > this.
> > > > > >>>>>
> > > > > >>>>> The legacy planner will not support the new interfaces. It
> will
> > > > only
> > > > > >>>>> support the old interfaces. With the next release, I think
> the
> > > > Blink
> > > > > >>>>> planner is stable enough to be the default one as well.
> > > > > >>>>>
> > > > > >>>>> Regards,
> > > > > >>>>> Timo
> > > > > >>>>>
> > > > > >>>>> On 18.03.20 08:45, Benchao Li wrote:
> > > > > >>>>>> Hi Timo,
> > > > > >>>>>>
> > > > > >>>>>> Thank you and others for the efforts to prepare this FLIP.
> > > > > >>>>>>
> > > > > >>>>>> The FLIP LGTM generally.
> > > > > >>>>>>
> > > > > >>>>>> +1 for moving the Blink data structures to table-common; it
> > > > > >>>>>> will be useful for UDFs too in the future.
> > > > > >>>>>> A little question is, do we plan to support the new
> > > > > >>>>>> interfaces and data types in the legacy planner? Or do we
> > > > > >>>>>> only plan to support these new interfaces in the Blink
> > > > > >>>>>> planner?
> > > > > >>>>>>
> > > > > >>>>>> And using primary keys from the DDL instead of key
> > > > > >>>>>> information derived from each query is also a good idea; we
> > > > > >>>>>> have met some use cases where this did not work very well
> > > > > >>>>>> before.
> > > > > >>>>>>
> > > > > >>>>>> This FLIP also makes the dependencies of the table modules
> > > > > >>>>>> clearer; I like it very much.
> > > > > >>>>>>
> > > > > >>>>>> Timo Walther <twal...@apache.org> wrote on Tue, Mar 17, 2020 at 1:36 AM:
> > > > > >>>>>>
> > > > > >>>>>>> Hi everyone,
> > > > > >>>>>>>
> > > > > >>>>>>> I'm happy to present the results of long discussions that
> we
> > > had
> > > > > >>>>>>> internally. Jark, Dawid, Aljoscha, Kurt, Jingsong, me, and
> > many
> > > > > more
> > > > > >>>>>>> have contributed to this design document.
> > > > > >>>>>>>
> > > > > >>>>>>> We would like to propose new long-term table source and
> table
> > > > sink
> > > > > >>>>>>> interfaces:
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-95%3A+New+TableSource+and+TableSink+interfaces
> > > > > >>>>>>>
> > > > > >>>>>>> This is a requirement for FLIP-105 and finalizing FLIP-32.
> > > > > >>>>>>>
> > > > > >>>>>>> The goals of this FLIP are:
> > > > > >>>>>>>
> > > > > >>>>>>> - Simplify the current interface architecture:
> > > > > >>>>>>>        - Merge upsert, retract, and append sinks.
> > > > > >>>>>>>        - Unify batch and streaming sources.
> > > > > >>>>>>>        - Unify batch and streaming sinks.
> > > > > >>>>>>>
> > > > > >>>>>>> - Allow sources to produce a changelog:
> > > > > >>>>>>>        - UpsertTableSources have been requested a lot by
> > users.
> > > > Now
> > > > > >>>>>>> is the
> > > > > >>>>>>> time to open the internal planner capabilities via the new
> > > > > interfaces.
> > > > > >>>>>>>        - According to FLIP-105, we would like to support
> > > > > changelogs for
> > > > > >>>>>>> processing formats such as Debezium.
> > > > > >>>>>>>
> > > > > >>>>>>> - Don't rely on DataStream API for sources and sinks:
> > > > > >>>>>>>        - According to FLIP-32, the Table API and SQL should
> > be
> > > > > >>>>>>> independent
> > > > > >>>>>>> of the DataStream API which is why the `table-common`
> module
> > > has
> > > > no
> > > > > >>>>>>> dependencies on `flink-streaming-java`.
> > > > > >>>>>>>        - Source and sink implementations should only depend
> > on
> > > > the
> > > > > >>>>>>> `table-common` module after FLIP-27.
> > > > > >>>>>>>        - Until FLIP-27 is ready, we still put most of the
> > > > > >>>>>>> interfaces in `table-common` and strictly separate interfaces
> > > > > >>>>>>> that communicate with the planner from the actual runtime
> > > > > >>>>>>> readers/writers.
> > > > > >>>>>>>
> > > > > >>>>>>> - Implement efficient sources and sinks without planner
> > > > > dependencies:
> > > > > >>>>>>>        - Make Blink's internal data structures available to
> > > > > connectors.
> > > > > >>>>>>>        - Introduce stable interfaces for data structures
> that
> > > can
> > > > > be
> > > > > >>>>>>> marked as `@PublicEvolving`.
> > > > > >>>>>>>        - Only require dependencies on `flink-table-common`
> in
> > > the
> > > > > >>>>>>> future
> > > > > >>>>>>>
> > > > > >>>>>>> It finalizes the concept of dynamic tables and considers how
> > > > > >>>>>>> all source/sink related classes play together.
> > > > > >>>>>>>
> > > > > >>>>>>> We look forward to your feedback.
> > > > > >>>>>>>
> > > > > >>>>>>> Regards,
> > > > > >>>>>>> Timo
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>
> > > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Best, Jingsong Lee
> > > >
> > >
> >
>
