Happy to help clarify, and certainly, I don't think anyone would reject the contribution if you wanted to add that functionality. The important point we want to communicate is that the Table API has matured to a level where we do not see it as supplementary to DataStream but as an equal first-class interface. Both optimize towards making different use cases easier (declarative definition of pipelines vs. fine-grained control), and neither needs to support every operation natively because interoperating between them should be seamless.
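To make the interop point concrete, here is a minimal sketch (not from the original thread; it assumes Flink 1.13+ on the classpath, an already registered `Orders` catalog table, and a hypothetical `Order` POJO — all names are illustrative):

```java
// Sketch only, assuming Flink 1.13+. "Orders" is assumed to be a
// registered catalog table; `Order` is a hypothetical POJO.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class InteropSketch {

    // Hypothetical POJO; toDataStream maps columns onto public fields.
    public static class Order {
        public Long id;
        public Double amount;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Declarative side: define the pipeline in SQL / Table API.
        Table filtered = tableEnv.sqlQuery(
            "SELECT id, amount FROM Orders WHERE amount > 10");

        // Fine-grained side: toDataStream can target a concrete Java type
        // (not just Row), so low-level DataStream operators can take over.
        DataStream<Order> stream =
            tableEnv.toDataStream(filtered, Order.class);
        stream.keyBy(o -> o.id).print();

        env.execute("interop-sketch");
    }
}
```

This is the "seamless interop" direction discussed in the thread: start declaratively, then drop down to DataStream only where fine-grained control is actually needed.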
Seth

On Fri, Jul 9, 2021 at 10:35 AM Etienne Chauchot <echauc...@apache.org> wrote:

> Hi Seth,
>
> Thanks for your comment.
>
> That is perfect then!
>
> If the community agrees to give precedence to the Table API connectors
> over the DataStream connectors as discussed here, all the features are
> already available.
>
> I will not need to port my latest additions to table connectors then.
>
> Best
>
> Etienne
>
> On 08/07/2021 17:28, Seth Wiesman wrote:
> > Hi Etienne,
> >
> > The `toDataStream` method supports converting to concrete Java types,
> > not just Row, which can include your Avro specific records. See
> > example 2:
> >
> > https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream
> >
> > On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot <echauc...@apache.org>
> > wrote:
> >
> >> Hi Timo,
> >>
> >> Thanks for your answers, no problem with the delay, I was on vacation
> >> too last week :)
> >>
> >> My comments are inline.
> >>
> >> Best,
> >>
> >> Etienne
> >>
> >> On 07/07/2021 16:48, Timo Walther wrote:
> >>> Hi Etienne,
> >>>
> >>> sorry for the late reply due to my vacation last week.
> >>>
> >>> Regarding: "support of aggregations in batch mode for DataStream API
> >>> [...] is there a plan to solve it before the actual drop of DataSet
> >>> API"
> >>>
> >>> Just to clarify it again: we will not drop the DataSet API any time
> >>> soon, so users will have enough time to update their pipelines. There
> >>> are a couple of features missing to fully switch to DataStream API in
> >>> batch mode. Thanks for opening an issue, this is helpful for us to
> >>> gradually remove those barriers. They don't need to have a "Blocker"
> >>> priority in JIRA for now.
> >>
> >> Ok, I thought the drop was sooner, no problem then.
> >>
> >>> But aggregations are a good example where we should discuss if it
> >>> would be easier to simply switch to Table API for that.
> >>> Table API has a lot of aggregation optimizations and can work on
> >>> binary data. Also, joins should be easier in Table API. DataStream
> >>> API can be a very low-level API in the near future and most use
> >>> cases (esp. the batch ones) should be possible in Table API.
> >>
> >> Yes, sure. As a matter of fact, my point was to use the low-level
> >> DataStream API in a benchmark to compare with Table API, but I guess
> >> it is not a common user behavior.
> >>
> >>> Regarding: "Is it needed to port these Avro enhancements to new
> >>> DataStream connectors (add a new equivalent of
> >>> ParquetColumnarRowInputFormat but for Avro)"
> >>>
> >>> We should definitely not lose functionality. The same functionality
> >>> should be present in the new connectors. The question is rather
> >>> whether we need to offer a DataStream API connector or if a Table API
> >>> connector would be nicer to use (also nicely integrated with
> >>> catalogs).
> >>>
> >>> So a user can use a simple CREATE TABLE statement to configure the
> >>> connector; an easier abstraction is almost not possible. With
> >>> `tableEnv.toDataStream(table)` you can then continue in DataStream
> >>> API if there is still a need for it.
> >>
> >> Yes, I agree, there is no easier connector setup than CREATE TABLE,
> >> and with tableEnv.toDataStream(table), if one wanted to stay with
> >> DataStream (in my bench for example), it is still possible. And by
> >> the way, the doc for parquet [1] for example only mentions using the
> >> Table API connector.
> >>
> >> So I guess the table connector would be nicer for the user indeed. But
> >> if we use tableEnv.toDataStream(table), we would be able to produce
> >> only usual types like Row, Tuples or POJOs; we still need to add Avro
> >> support, right?
> >>
> >> [1]
> >> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/
> >>
> >>> Regarding: "there are parquet bugs still open on deprecated parquet
> >>> connector"
> >>>
> >>> Yes, bugs should still be fixed in 1.13.
> >>
> >> OK
> >>
> >>> Regarding: "I've been doing TPC-DS benchmarks with Flink lately"
> >>>
> >>> Great to hear that :-)
> >>
> >> And congrats again on Blink's performance!
> >>
> >>> Did you also see the recent discussion? A TPC-DS benchmark can
> >>> further be improved by providing statistics. Maybe this is helpful
> >>> to you:
> >>>
> >>> https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E
> >>
> >> No, I missed that thread, thanks for the pointer. I'll read it and
> >> comment if I have something to add.
> >>
> >>> I will prepare a blog post shortly.
> >>
> >> Good to hear :)
> >>
> >>> Regards,
> >>> Timo
> >>>
> >>> On 06.07.21 15:05, Etienne Chauchot wrote:
> >>>> Hi all,
> >>>>
> >>>> Any comments?
> >>>>
> >>>> cheers,
> >>>>
> >>>> Etienne
> >>>>
> >>>> On 25/06/2021 15:09, Etienne Chauchot wrote:
> >>>>> Hi everyone,
> >>>>>
> >>>>> @Timo, my comments are inline for steps 2, 4 and 5, please tell me
> >>>>> what you think.
> >>>>>
> >>>>> Best
> >>>>>
> >>>>> Etienne
> >>>>>
> >>>>> On 23/06/2021 15:27, Chesnay Schepler wrote:
> >>>>>> If we want to publicize this plan more, shouldn't we have a rough
> >>>>>> timeline for when 2.0 is on the table?
> >>>>>>
> >>>>>> On 6/23/2021 2:44 PM, Stephan Ewen wrote:
> >>>>>>> Thanks for writing this up, this also reflects my understanding.
> >>>>>>>
> >>>>>>> I think a blog post would be nice, ideally with an explicit call
> >>>>>>> for feedback so we learn about user concerns.
> >>>>>>> A blog post has a lot more reach than an ML thread.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Stephan
> >>>>>>>
> >>>>>>> On Wed, Jun 23, 2021 at 12:23 PM Timo Walther <twal...@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I'm sending this email to make sure everyone is on the same page
> >>>>>>>> about slowly deprecating the DataSet API.
> >>>>>>>>
> >>>>>>>> There have been a few thoughts mentioned in presentations,
> >>>>>>>> offline discussions, and JIRA issues. However, I have observed
> >>>>>>>> that there are still some concerns or different opinions on what
> >>>>>>>> steps are necessary to implement this change.
> >>>>>>>>
> >>>>>>>> Let me summarize some of the steps and assumptions and let's
> >>>>>>>> have a discussion about them:
> >>>>>>>>
> >>>>>>>> Step 1: Introduce a batch mode for Table API (FLIP-32)
> >>>>>>>> [DONE in 1.9]
> >>>>>>>>
> >>>>>>>> Step 2: Introduce a batch mode for DataStream API (FLIP-134)
> >>>>>>>> [DONE in 1.12]
> >>>>>
> >>>>> I've been using the DataSet API and I tested migrating to
> >>>>> DataStream + batch mode.
> >>>>>
> >>>>> I opened this ticket [1] regarding the support of aggregations in
> >>>>> batch mode for DataStream API. It seems that the join operation (at
> >>>>> least) does not work in batch mode, even though I managed to
> >>>>> implement a join using a low-level KeyedCoProcessFunction (thanks,
> >>>>> Seth, for the pointer!).
> >>>>>
> >>>>> => Should it be considered a blocker? Is there a plan to solve it
> >>>>> before the actual drop of the DataSet API? Maybe in step 6?
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-22587
> >>>>>
> >>>>>>>> Step 3: Soft deprecate DataSet API (FLIP-131)
> >>>>>>>> [DONE in 1.12]
> >>>>>>>>
> >>>>>>>> We updated the documentation recently to make this deprecation
> >>>>>>>> even more visible. There is a dedicated `(Legacy)` label right
> >>>>>>>> next to the menu item now.
> >>>>>>>>
> >>>>>>>> We won't deprecate concrete classes of the API with a
> >>>>>>>> @Deprecated annotation until then, to avoid extensive warnings
> >>>>>>>> in logs.
> >>>>>>>>
> >>>>>>>> Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
> >>>>>>>> [DONE in 1.14]
> >>>>>>>>
> >>>>>>>> We dropped code for ORC, Parquet, and HBase formats that were
> >>>>>>>> only used by DataSet API users. The removed classes had no
> >>>>>>>> documentation and were not annotated with one of our API
> >>>>>>>> stability annotations.
> >>>>>>>>
> >>>>>>>> The old functionality should be available through the new
> >>>>>>>> sources and sinks for Table API and DataStream API. If not, we
> >>>>>>>> should bring them into a shape that they can be a full
> >>>>>>>> replacement.
> >>>>>>>>
> >>>>>>>> DataSet users are encouraged to either upgrade the API or use
> >>>>>>>> Flink 1.13. Users can either just stay at Flink 1.13 or copy
> >>>>>>>> only the format's code to a newer Flink version. We aim to keep
> >>>>>>>> the core interfaces (i.e. InputFormat and OutputFormat) stable
> >>>>>>>> until the next major version.
> >>>>>>>>
> >>>>>>>> We will maintain/allow important contributions to dropped
> >>>>>>>> connectors in 1.13. So 1.13 could be considered as kind of a
> >>>>>>>> DataSet API LTS release.
> >>>>>
> >>>>> I added several bug fixes and enhancements (Avro support, automatic
> >>>>> schema, etc.) to the parquet DataSet connector. After discussing
> >>>>> with Jingsong and Arvid, we agreed to merge them to 1.13, given
> >>>>> that 1.13 is an LTS release receiving maintenance changes, as you
> >>>>> mentioned here.
> >>>>>
> >>>>> => Is it needed to port these Avro enhancements to the new
> >>>>> DataStream connectors (add a new equivalent of
> >>>>> ParquetColumnarRowInputFormat but for Avro)? IMHO it is an
> >>>>> important feature that the users will need.
> >>>>> So, if I understand the plan correctly, we have until the release
> >>>>> of 2.0 to implement it, right?
> >>>>>
> >>>>> => Also, there are parquet bugs still open on the deprecated
> >>>>> parquet connector:
> >>>>> https://issues.apache.org/jira/browse/FLINK-21520,
> >>>>> https://issues.apache.org/jira/browse/FLINK-21468. I think the
> >>>>> same applies: we should fix them on 1.13, right?
> >>>>>
> >>>>>>>> Step 5: Drop the legacy SQL planner (FLINK-14437)
> >>>>>>>> [DONE in 1.14]
> >>>>>>>>
> >>>>>>>> This included dropping support of DataSet API with SQL.
> >>>>>
> >>>>> That is a major point! I've been doing TPC-DS benchmarks with Flink
> >>>>> lately by coding query3 with a DataSet pipeline, a DataStream
> >>>>> pipeline and a SQL pipeline. What I can tell is that when I
> >>>>> migrated from the legacy SQL planner to the Blink SQL planner, I
> >>>>> got 2 major improvements:
> >>>>>
> >>>>> 1. around 25% gain in run times on a 1TB input dataset (even if the
> >>>>> memory conf was slightly different between runs of the 2 planners)
> >>>>>
> >>>>> 2. global order support: with the legacy planner based on DataSet,
> >>>>> only local partition ordering was supported. As a consequence, a
> >>>>> SQL query with an ORDER BY clause actually produced wrong results.
> >>>>> With the Blink planner based on DataStream, global order is
> >>>>> supported and now the query results are correct!
> >>>>>
> >>>>> => congrats to everyone involved in these big SQL improvements!
> >>>>>
> >>>>>>>> Step 6: Connect both Table and DataStream API in batch mode
> >>>>>>>> (FLINK-20897)
> >>>>>>>> [PLANNED in 1.14]
> >>>>>>>>
> >>>>>>>> Step 7: Reach feature parity of Table API/DataStream API with
> >>>>>>>> DataSet API
> >>>>>>>> [PLANNED for 1.14++]
> >>>>>>>>
> >>>>>>>> We need to identify blockers when migrating from DataSet API to
> >>>>>>>> Table API/DataStream API. Here we need to establish a good
> >>>>>>>> feedback pipeline to include DataSet users in the roadmap
> >>>>>>>> planning.
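As an aside on the global-order point discussed above: the behavior difference Etienne describes shows up with any batch ORDER BY query. A sketch of such a query (not from the original thread; it assumes Flink 1.13+ with the Blink planner, and the TPC-DS-style table and column names are illustrative — a real run would first register `store_sales` against the input files):

```java
// Sketch only, assuming Flink 1.13+ with the Blink planner (the default
// since 1.11). In batch mode, ORDER BY yields a globally sorted result;
// the legacy DataSet-based planner only ordered within partitions,
// which made such queries return wrong results.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class GlobalOrderSketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inBatchMode().build());

        // "store_sales" would have to be registered beforehand, e.g. via
        // a CREATE TABLE statement pointing at the TPC-DS input files.
        Table top = tableEnv.sqlQuery(
            "SELECT ss_item_sk, SUM(ss_ext_sales_price) AS revenue "
            + "FROM store_sales "
            + "GROUP BY ss_item_sk "
            + "ORDER BY revenue DESC "
            + "LIMIT 100");
    }
}
```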
> >>>>>>>>
> >>>>>>>> Step 8: Drop the Gelly library
> >>>>>>>>
> >>>>>>>> No concrete plan yet. Latest would be the next major Flink
> >>>>>>>> version, aka Flink 2.0.
> >>>>>>>>
> >>>>>>>> Step 9: Drop DataSet API
> >>>>>>>>
> >>>>>>>> Planned for the next major Flink version, aka Flink 2.0.
> >>>>>>>>
> >>>>>>>> Please let me know if this matches your thoughts. We can also
> >>>>>>>> convert this into a blog post or mention it in the next release
> >>>>>>>> notes.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo