Hi Etienne,

The `toDataStream` method supports converting to concrete Java types, not
just Row, which can include your Avro specific records. See example 2:
https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream

On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot <echauc...@apache.org> wrote:

> Hi Timo,
>
> Thanks for your answers, no problem with the delay, I was on vacation
> too last week :)
>
> My comments are inline.
>
> Best,
>
> Etienne
>
> On 07/07/2021 16:48, Timo Walther wrote:
> > Hi Etienne,
> >
> > sorry for the late reply due to my vacation last week.
> >
> > Regarding: "support of aggregations in batch mode for DataStream API
> > [...] is there a plan to solve it before the actual drop of DataSet API"
> >
> > Just to clarify it again: we will not drop the DataSet API any time
> > soon, so users will have enough time to update their pipelines. There
> > are a couple of features missing to fully switch to DataStream API in
> > batch mode. Thanks for opening an issue; this is helpful for us to
> > gradually remove those barriers. They don't need to have a "Blocker"
> > priority in JIRA for now.
>
> OK, I thought the drop was sooner; no problem then.
>
> > But aggregations are a good example of where we should discuss whether
> > it would be easier to simply switch to Table API for that. Table API
> > has a lot of aggregation optimizations and can work on binary data.
> > Joins should also be easier in Table API. DataStream API can become a
> > very low-level API in the near future, and most use cases (esp. the
> > batch ones) should be possible in Table API.
>
> Yes, sure. As a matter of fact, my point was to use the low-level
> DataStream API in a benchmark to compare with Table API, but I guess
> that is not a common user behavior.
>
> > Regarding: "Is it needed to port these Avro enhancements to new
> > DataStream connectors (add a new equivalent of
> > ParquetColumnarRowInputFormat but for Avro)"
> >
> > We should definitely not lose functionality. The same functionality
> > should be present in the new connectors.
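For illustration, the conversion mentioned at the top of this reply might look roughly like this. This is only a sketch, assuming Flink 1.13+; `User` is a hypothetical Avro-generated specific record class and `UserTable` a hypothetical registered table:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// "UserTable" is a hypothetical table registered elsewhere in the catalog.
Table users = tableEnv.sqlQuery("SELECT name, age FROM UserTable");

// Passing the target class maps rows onto the Avro-generated specific
// record instead of the generic Row type.
DataStream<User> stream = tableEnv.toDataStream(users, User.class);
```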
> > The question is rather whether we need to offer a DataStream API
> > connector or whether a Table API connector would be nicer to use (also
> > nicely integrated with catalogs).
> >
> > So a user can use a simple CREATE TABLE statement to configure the
> > connector; an easier abstraction is almost not possible. With
> > `tableEnv.toDataStream(table)` you can then continue in DataStream API
> > if there is still a need for it.
>
> Yes, I agree, there is no easier connector setup than CREATE TABLE, and
> with tableEnv.toDataStream(table), if one wanted to stay with DataStream
> (in my bench, for example), it would still be possible. And by the way,
> the doc for Parquet (1) for example only mentions using the Table API
> connector.
>
> So I guess the table connector would be nicer for the user indeed. But
> if we use tableEnv.toDataStream(table), we would only be able to produce
> the usual types like Row, Tuples or POJOs; we would still need to add
> Avro support, right?
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/
>
> > Regarding: "there are parquet bugs still open on the deprecated
> > parquet connector"
> >
> > Yes, bugs should still be fixed in 1.13.
>
> OK
>
> > Regarding: "I've been doing TPCDS benchmarks with Flink lately"
> >
> > Great to hear that :-)
>
> And congrats again on Blink's performance!
>
> > Did you also see the recent discussion? A TPC-DS benchmark can further
> > be improved by providing statistics. Maybe this is helpful to you:
> >
> > https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E
>
> No, I missed that thread; thanks for the pointer. I'll read it and
> comment if I have something to add.
>
> > I will prepare a blog post shortly.
>
> Good to hear :)
>
> > Regards,
> > Timo
> >
> >
> > On 06.07.21 15:05, Etienne Chauchot wrote:
> >> Hi all,
> >>
> >> Any comments?
> >>
> >> cheers,
> >>
> >> Etienne
> >>
> >> On 25/06/2021 15:09, Etienne Chauchot wrote:
> >>> Hi everyone,
> >>>
> >>> @Timo, my comments are inline for steps 2, 4 and 5; please tell me
> >>> what you think.
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>> On 23/06/2021 15:27, Chesnay Schepler wrote:
> >>>> If we want to publicize this plan more, shouldn't we have a rough
> >>>> timeline for when 2.0 is on the table?
> >>>>
> >>>> On 6/23/2021 2:44 PM, Stephan Ewen wrote:
> >>>>> Thanks for writing this up; this also reflects my understanding.
> >>>>>
> >>>>> I think a blog post would be nice, ideally with an explicit call
> >>>>> for feedback so we learn about user concerns.
> >>>>> A blog post has a lot more reach than an ML thread.
> >>>>>
> >>>>> Best,
> >>>>> Stephan
> >>>>>
> >>>>>
> >>>>> On Wed, Jun 23, 2021 at 12:23 PM Timo Walther <twal...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> I'm sending this email to make sure everyone is on the same page
> >>>>>> about slowly deprecating the DataSet API.
> >>>>>>
> >>>>>> There have been a few thoughts mentioned in presentations, offline
> >>>>>> discussions, and JIRA issues. However, I have observed that there
> >>>>>> are still some concerns or different opinions on what steps are
> >>>>>> necessary to implement this change.
> >>>>>>
> >>>>>> Let me summarize some of the steps and assumptions, and let's have
> >>>>>> a discussion about them:
> >>>>>>
> >>>>>> Step 1: Introduce a batch mode for Table API (FLIP-32)
> >>>>>> [DONE in 1.9]
> >>>>>>
> >>>>>> Step 2: Introduce a batch mode for DataStream API (FLIP-134)
> >>>>>> [DONE in 1.12]
> >>>
> >>>
> >>> I've been using the DataSet API, and I tested migrating to
> >>> DataStream + batch mode.
> >>>
> >>> I opened this (1) ticket regarding the support of aggregations in
> >>> batch mode for DataStream API.
> >>> It seems that the join operation (at least) does not work in batch
> >>> mode, even though I managed to implement a join using the low-level
> >>> KeyedCoProcessFunction (thanks, Seth, for the pointer!).
> >>>
> >>> => Should it be considered a blocker? Is there a plan to solve it
> >>> before the actual drop of the DataSet API? Maybe in step 6?
> >>>
> >>> [1] https://issues.apache.org/jira/browse/FLINK-22587
> >>>
> >>>
> >>>>>>
> >>>>>> Step 3: Soft deprecate DataSet API (FLIP-131)
> >>>>>> [DONE in 1.12]
> >>>>>>
> >>>>>> We updated the documentation recently to make this deprecation
> >>>>>> even more visible. There is a dedicated `(Legacy)` label right
> >>>>>> next to the menu item now.
> >>>>>>
> >>>>>> We won't deprecate concrete classes of the API with a @Deprecated
> >>>>>> annotation to avoid extensive warnings in logs until then.
> >>>>>>
> >>>>>> Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
> >>>>>> [DONE in 1.14]
> >>>>>>
> >>>>>> We dropped code for ORC, Parquet, and HBase formats that were only
> >>>>>> used by DataSet API users. The removed classes had no
> >>>>>> documentation and were not annotated with one of our API stability
> >>>>>> annotations.
> >>>>>>
> >>>>>> The old functionality should be available through the new sources
> >>>>>> and sinks for Table API and DataStream API. If not, we should
> >>>>>> bring them into a shape where they can be a full replacement.
> >>>>>>
> >>>>>> DataSet users are encouraged to either upgrade the API or use
> >>>>>> Flink 1.13. Users can either just stay on Flink 1.13 or copy only
> >>>>>> the format's code to a newer Flink version. We aim to keep the
> >>>>>> core interfaces (i.e. InputFormat and OutputFormat) stable until
> >>>>>> the next major version.
> >>>>>>
> >>>>>> We will maintain/allow important contributions to dropped
> >>>>>> connectors in 1.13. So 1.13 could be considered as kind of a
> >>>>>> DataSet API LTS release.
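The hand-rolled join with KeyedCoProcessFunction mentioned above can be sketched roughly as follows. This is not Etienne's actual code, just a minimal one-to-many inner-join sketch with hypothetical `Order`/`Customer` POJOs, buffering one side in keyed state until the other arrives:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class OrderCustomerJoin
        extends KeyedCoProcessFunction<String, Order, Customer, Tuple2<Order, Customer>> {

    private ListState<Order> pendingOrders;   // orders seen before their customer
    private ValueState<Customer> customer;    // at most one customer per key

    @Override
    public void open(Configuration parameters) {
        pendingOrders = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending-orders", Order.class));
        customer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("customer", Customer.class));
    }

    @Override
    public void processElement1(Order order, Context ctx,
            Collector<Tuple2<Order, Customer>> out) throws Exception {
        Customer c = customer.value();
        if (c != null) {
            out.collect(Tuple2.of(order, c)); // other side already seen: emit match
        } else {
            pendingOrders.add(order);         // buffer until the customer arrives
        }
    }

    @Override
    public void processElement2(Customer c, Context ctx,
            Collector<Tuple2<Order, Customer>> out) throws Exception {
        customer.update(c);
        for (Order o : pendingOrders.get()) { // flush everything buffered so far
            out.collect(Tuple2.of(o, c));
        }
        pendingOrders.clear();
    }
}
```

In batch execution mode the keyed inputs are processed key by key, so this pattern behaves like a batch join rather than an unbounded streaming one.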
> >>>
> >>>
> >>> I added several bug fixes and enhancements (Avro support, automatic
> >>> schema, etc.) to the Parquet DataSet connector. After discussing with
> >>> Jingsong and Arvid, we agreed to merge them into 1.13, given that
> >>> 1.13 is an LTS release receiving maintenance changes, as you
> >>> mentioned here.
> >>>
> >>> => Is it needed to port these Avro enhancements to the new DataStream
> >>> connectors (add a new equivalent of ParquetColumnarRowInputFormat but
> >>> for Avro)? IMHO it is an important feature that users will need. So,
> >>> if I understand the plan correctly, we have until the release of 2.0
> >>> to implement it, right?
> >>>
> >>> => Also, there are Parquet bugs still open on the deprecated Parquet
> >>> connector: https://issues.apache.org/jira/browse/FLINK-21520 and
> >>> https://issues.apache.org/jira/browse/FLINK-21468. I think the same
> >>> applies; we should fix them on 1.13, right?
> >>>
> >>>>>>
> >>>>>> Step 5: Drop the legacy SQL planner (FLINK-14437)
> >>>>>> [DONE in 1.14]
> >>>>>>
> >>>>>> This included dropping support of DataSet API with SQL.
> >>>
> >>>
> >>> That is a major point! I've been doing TPCDS benchmarks with Flink
> >>> lately by coding query3 with a DataSet pipeline, a DataStream
> >>> pipeline and a SQL pipeline. What I can tell is that when I migrated
> >>> from the legacy SQL planner to the Blink SQL planner, I got 2 major
> >>> improvements:
> >>>
> >>> 1. around 25% gain in run times on a 1TB input dataset (even if the
> >>> memory conf was slightly different between runs of the 2 planners)
> >>>
> >>> 2. global order support: with the legacy planner based on DataSet,
> >>> only local partition ordering was supported. As a consequence, a SQL
> >>> query with an ORDER BY clause actually produced wrong results. With
> >>> the Blink planner based on DataStream, global order is supported and
> >>> now the query results are correct!
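To illustrate point 2 above with invented data (not from the benchmark): sorting each partition independently and concatenating the results, as the legacy DataSet-based planner effectively did, is not the same as a global ORDER BY.

```java
import java.util.List;
import java.util.stream.Collectors;

public class OrderByDemo {

    // Legacy behavior: each partition is sorted on its own, then the
    // partitions are concatenated -- the overall output is not in order.
    static List<Integer> localSort(List<List<Integer>> partitions) {
        return partitions.stream()
                .flatMap(p -> p.stream().sorted())
                .collect(Collectors.toList());
    }

    // Global ORDER BY: a total order across all partitions.
    static List<Integer> globalSort(List<List<Integer>> partitions) {
        return partitions.stream()
                .flatMap(List::stream)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = List.of(List.of(5, 1), List.of(4, 2));
        System.out.println(localSort(partitions));  // [1, 5, 2, 4] -- wrong for ORDER BY
        System.out.println(globalSort(partitions)); // [1, 2, 4, 5]
    }
}
```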
> >>>
> >>> => congrats to everyone involved in these big SQL improvements!
> >>>
> >>>>>>
> >>>>>> Step 6: Connect both Table and DataStream API in batch mode
> >>>>>> (FLINK-20897)
> >>>>>> [PLANNED in 1.14]
> >>>>>>
> >>>>>> Step 7: Reach feature parity of Table API/DataStream API with
> >>>>>> DataSet API
> >>>>>> [PLANNED for 1.14++]
> >>>>>>
> >>>>>> We need to identify blockers when migrating from DataSet API to
> >>>>>> Table API/DataStream API. Here we need to establish a good
> >>>>>> feedback pipeline to include DataSet users in the roadmap
> >>>>>> planning.
> >>>>>>
> >>>>>> Step 8: Drop the Gelly library
> >>>>>>
> >>>>>> No concrete plan yet. Latest would be the next major Flink
> >>>>>> version, aka Flink 2.0.
> >>>>>>
> >>>>>> Step 9: Drop DataSet API
> >>>>>>
> >>>>>> Planned for the next major Flink version, aka Flink 2.0.
> >>>>>>
> >>>>>> Please let me know if this matches your thoughts. We can also
> >>>>>> convert this into a blog post or mention it in the next release
> >>>>>> notes.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Timo
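For reference, the DataStream batch mode from step 2 (FLIP-134) that this whole migration plan relies on is switched on per job. A minimal sketch, assuming Flink 1.12+:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// FLIP-134: execute this DataStream job with batch scheduling and
// blocking shuffles instead of unbounded streaming execution.
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
```

The same switch is available on the command line via `-Dexecution.runtime-mode=BATCH`, which keeps the pipeline code identical for both modes.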