Hi Etienne,

The `toDataStream` method supports converting to concrete Java types, not
just Row, which can include your Avro specific records. See example 2:
https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream

On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot <echauc...@apache.org> wrote:

> Hi Timo,
>
> Thanks for your answers, no problem with the delay, I was on vacation
> too last week :)
>
> My comments are inline.
>
> Best,
>
> Etienne
>
> On 07/07/2021 16:48, Timo Walther wrote:
> > Hi Etienne,
> >
> > sorry for the late reply due to my vacation last week.
> >
> > Regarding: "support of aggregations in batch mode for DataStream API
> > [...] is there a plan to solve it before the actual drop of DataSet API"
> >
> > Just to clarify it again: we will not drop the DataSet API any time
> > soon, so users will have enough time to update their pipelines. There
> > are a couple of features missing to fully switch to DataStream API in
> > batch mode. Thanks for opening an issue; this is helpful for us to
> > gradually remove those barriers. They don't need to have a "Blocker"
> > priority in JIRA for now.
>
> OK, I thought the drop was sooner; no problem then.
>
> > But aggregations are a good example of where we should discuss whether
> > it would be easier to simply switch to Table API for that. Table API
> > has a lot of aggregation optimizations and can work on binary data.
> > Joins should also be easier in Table API. DataStream API can become a
> > very low-level API in the near future, and most use cases (esp. the
> > batch ones) should be possible in Table API.
>
> Yes, sure. As a matter of fact, my point was to use the low-level
> DataStream API in a benchmark to compare with Table API, but I guess
> that is not a common user behavior.
>
> > Regarding: "Is it needed to port these Avro enhancements to new
> > DataStream connectors (add a new equivalent of
> > ParquetColumnarRowInputFormat but for Avro)"
> >
> > We should definitely not lose functionality. The same functionality
> > should be present in the new connectors.
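For illustration, the conversion mentioned at the top of this reply might look roughly like this. This is only a sketch, assuming Flink 1.13+; `User` is a hypothetical Avro-generated specific record class and `UserTable` a hypothetical registered table:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

// "UserTable" is a hypothetical table registered elsewhere in the catalog.
Table users = tableEnv.sqlQuery("SELECT name, age FROM UserTable");

// Passing the target class maps rows onto the Avro-generated specific
// record instead of the generic Row type.
DataStream<User> stream = tableEnv.toDataStream(users, User.class);
```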
> > The question is rather whether we need to offer a DataStream API
> > connector or whether a Table API connector would be nicer to use (also
> > nicely integrated with catalogs).
> >
> > So a user can use a simple CREATE TABLE statement to configure the
> > connector; an easier abstraction is almost not possible. With
> > `tableEnv.toDataStream(table)` you can then continue in DataStream API
> > if there is still a need for it.
>
> Yes, I agree, there is no easier connector setup than CREATE TABLE, and
> with tableEnv.toDataStream(table), if one wanted to stay with DataStream
> (in my bench, for example), it would still be possible. And by the way,
> the doc for Parquet (1) for example only mentions using the Table API
> connector.
>
> So I guess the table connector would be nicer for the user indeed. But
> if we use tableEnv.toDataStream(table), we would only be able to produce
> the usual types like Row, Tuples or POJOs; we would still need to add
> Avro support, right?
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/
>
> > Regarding: "there are parquet bugs still open on the deprecated
> > parquet connector"
> >
> > Yes, bugs should still be fixed in 1.13.
>
> OK
>
> > Regarding: "I've been doing TPCDS benchmarks with Flink lately"
> >
> > Great to hear that :-)
>
> And congrats again on Blink's performance!
>
> > Did you also see the recent discussion? A TPC-DS benchmark can further
> > be improved by providing statistics. Maybe this is helpful to you:
> >
> > https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E
>
> No, I missed that thread; thanks for the pointer. I'll read it and
> comment if I have something to add.
>
> > I will prepare a blog post shortly.
>
> Good to hear :)
>
> > Regards,
> > Timo
> >
> >
> > On 06.07.21 15:05, Etienne Chauchot wrote:
> >> Hi all,
> >>
> >> Any comments?
> >>
> >> cheers,
> >>
> >> Etienne
> >>
> >> On 25/06/2021 15:09, Etienne Chauchot wrote:
> >>> Hi everyone,
> >>>
> >>> @Timo, my comments are inline for steps 2, 4 and 5; please tell me
> >>> what you think.
> >>>
> >>> Best
> >>>
> >>> Etienne
> >>>
> >>> On 23/06/2021 15:27, Chesnay Schepler wrote:
> >>>> If we want to publicize this plan more, shouldn't we have a rough
> >>>> timeline for when 2.0 is on the table?
> >>>>
> >>>> On 6/23/2021 2:44 PM, Stephan Ewen wrote:
> >>>>> Thanks for writing this up; this also reflects my understanding.
> >>>>>
> >>>>> I think a blog post would be nice, ideally with an explicit call
> >>>>> for feedback so we learn about user concerns.
> >>>>> A blog post has a lot more reach than an ML thread.
> >>>>>
> >>>>> Best,
> >>>>> Stephan
> >>>>>
> >>>>>
> >>>>> On Wed, Jun 23, 2021 at 12:23 PM Timo Walther <twal...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> I'm sending this email to make sure everyone is on the same page
> >>>>>> about slowly deprecating the DataSet API.
> >>>>>>
> >>>>>> There have been a few thoughts mentioned in presentations, offline
> >>>>>> discussions, and JIRA issues. However, I have observed that there
> >>>>>> are still some concerns or different opinions on what steps are
> >>>>>> necessary to implement this change.
> >>>>>>
> >>>>>> Let me summarize some of the steps and assumptions, and let's have
> >>>>>> a discussion about them:
> >>>>>>
> >>>>>> Step 1: Introduce a batch mode for Table API (FLIP-32)
> >>>>>> [DONE in 1.9]
> >>>>>>
> >>>>>> Step 2: Introduce a batch mode for DataStream API (FLIP-134)
> >>>>>> [DONE in 1.12]
> >>>
> >>>
> >>> I've been using the DataSet API, and I tested migrating to
> >>> DataStream + batch mode.
> >>>
> >>> I opened this (1) ticket regarding the support of aggregations in
> >>> batch mode for DataStream API.
> >>> It seems that the join operation (at least) does not work in batch
> >>> mode, even though I managed to implement a join using the low-level
> >>> KeyedCoProcessFunction (thanks, Seth, for the pointer!).
> >>>
> >>> => Should it be considered a blocker? Is there a plan to solve it
> >>> before the actual drop of the DataSet API? Maybe in step 6?
> >>>
> >>> [1] https://issues.apache.org/jira/browse/FLINK-22587
> >>>
> >>>
> >>>>>>
> >>>>>> Step 3: Soft deprecate DataSet API (FLIP-131)
> >>>>>> [DONE in 1.12]
> >>>>>>
> >>>>>> We updated the documentation recently to make this deprecation
> >>>>>> even more visible. There is a dedicated `(Legacy)` label right
> >>>>>> next to the menu item now.
> >>>>>>
> >>>>>> We won't deprecate concrete classes of the API with a @Deprecated
> >>>>>> annotation to avoid extensive warnings in logs until then.
> >>>>>>
> >>>>>> Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
> >>>>>> [DONE in 1.14]
> >>>>>>
> >>>>>> We dropped code for ORC, Parquet, and HBase formats that were only
> >>>>>> used by DataSet API users. The removed classes had no
> >>>>>> documentation and were not annotated with one of our API stability
> >>>>>> annotations.
> >>>>>>
> >>>>>> The old functionality should be available through the new sources
> >>>>>> and sinks for Table API and DataStream API. If not, we should
> >>>>>> bring them into a shape where they can be a full replacement.
> >>>>>>
> >>>>>> DataSet users are encouraged to either upgrade the API or use
> >>>>>> Flink 1.13. Users can either just stay on Flink 1.13 or copy only
> >>>>>> the format's code to a newer Flink version. We aim to keep the
> >>>>>> core interfaces (i.e. InputFormat and OutputFormat) stable until
> >>>>>> the next major version.
> >>>>>>
> >>>>>> We will maintain/allow important contributions to dropped
> >>>>>> connectors in 1.13. So 1.13 could be considered as kind of a
> >>>>>> DataSet API LTS release.
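The hand-rolled join with KeyedCoProcessFunction mentioned above can be sketched roughly as follows. This is not Etienne's actual code, just a minimal one-to-many inner-join sketch with hypothetical `Order`/`Customer` POJOs, buffering one side in keyed state until the other arrives:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class OrderCustomerJoin
        extends KeyedCoProcessFunction<String, Order, Customer, Tuple2<Order, Customer>> {

    private ListState<Order> pendingOrders;   // orders seen before their customer
    private ValueState<Customer> customer;    // at most one customer per key

    @Override
    public void open(Configuration parameters) {
        pendingOrders = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending-orders", Order.class));
        customer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("customer", Customer.class));
    }

    @Override
    public void processElement1(Order order, Context ctx,
            Collector<Tuple2<Order, Customer>> out) throws Exception {
        Customer c = customer.value();
        if (c != null) {
            out.collect(Tuple2.of(order, c)); // other side already seen: emit match
        } else {
            pendingOrders.add(order);         // buffer until the customer arrives
        }
    }

    @Override
    public void processElement2(Customer c, Context ctx,
            Collector<Tuple2<Order, Customer>> out) throws Exception {
        customer.update(c);
        for (Order o : pendingOrders.get()) { // flush everything buffered so far
            out.collect(Tuple2.of(o, c));
        }
        pendingOrders.clear();
    }
}
```

In batch execution mode the keyed inputs are processed key by key, so this pattern behaves like a batch join rather than an unbounded streaming one.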
> >>>
> >>>
> >>> I added several bug fixes and enhancements (Avro support, automatic
> >>> schema, etc.) to the Parquet DataSet connector. After discussing with
> >>> Jingsong and Arvid, we agreed to merge them into 1.13, given that
> >>> 1.13 is an LTS release receiving maintenance changes, as you
> >>> mentioned here.
> >>>
> >>> => Is it needed to port these Avro enhancements to the new DataStream
> >>> connectors (add a new equivalent of ParquetColumnarRowInputFormat but
> >>> for Avro)? IMHO it is an important feature that users will need. So,
> >>> if I understand the plan correctly, we have until the release of 2.0
> >>> to implement it, right?
> >>>
> >>> => Also, there are Parquet bugs still open on the deprecated Parquet
> >>> connector: https://issues.apache.org/jira/browse/FLINK-21520 and
> >>> https://issues.apache.org/jira/browse/FLINK-21468. I think the same
> >>> applies; we should fix them on 1.13, right?
> >>>
> >>>>>>
> >>>>>> Step 5: Drop the legacy SQL planner (FLINK-14437)
> >>>>>> [DONE in 1.14]
> >>>>>>
> >>>>>> This included dropping support of DataSet API with SQL.
> >>>
> >>>
> >>> That is a major point! I've been doing TPCDS benchmarks with Flink
> >>> lately by coding query3 with a DataSet pipeline, a DataStream
> >>> pipeline and a SQL pipeline. What I can tell is that when I migrated
> >>> from the legacy SQL planner to the Blink SQL planner, I got 2 major
> >>> improvements:
> >>>
> >>> 1. around 25% gain in run times on a 1TB input dataset (even if the
> >>> memory conf was slightly different between runs of the 2 planners)
> >>>
> >>> 2. global order support: with the legacy planner based on DataSet,
> >>> only local partition ordering was supported. As a consequence, a SQL
> >>> query with an ORDER BY clause actually produced wrong results. With
> >>> the Blink planner based on DataStream, global order is supported and
> >>> now the query results are correct!
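To illustrate point 2 above with invented data (not from the benchmark): sorting each partition independently and concatenating the results, as the legacy DataSet-based planner effectively did, is not the same as a global ORDER BY.

```java
import java.util.List;
import java.util.stream.Collectors;

public class OrderByDemo {

    // Legacy behavior: each partition is sorted on its own, then the
    // partitions are concatenated -- the overall output is not in order.
    static List<Integer> localSort(List<List<Integer>> partitions) {
        return partitions.stream()
                .flatMap(p -> p.stream().sorted())
                .collect(Collectors.toList());
    }

    // Global ORDER BY: a total order across all partitions.
    static List<Integer> globalSort(List<List<Integer>> partitions) {
        return partitions.stream()
                .flatMap(List::stream)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = List.of(List.of(5, 1), List.of(4, 2));
        System.out.println(localSort(partitions));  // [1, 5, 2, 4] -- wrong for ORDER BY
        System.out.println(globalSort(partitions)); // [1, 2, 4, 5]
    }
}
```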
> >>>
> >>> => congrats to everyone involved in these big SQL improvements!
> >>>
> >>>>>>
> >>>>>> Step 6: Connect both Table and DataStream API in batch mode
> >>>>>> (FLINK-20897)
> >>>>>> [PLANNED in 1.14]
> >>>>>>
> >>>>>> Step 7: Reach feature parity of Table API/DataStream API with
> >>>>>> DataSet API
> >>>>>> [PLANNED for 1.14++]
> >>>>>>
> >>>>>> We need to identify blockers when migrating from DataSet API to
> >>>>>> Table API/DataStream API. Here we need to establish a good
> >>>>>> feedback pipeline to include DataSet users in the roadmap
> >>>>>> planning.
> >>>>>>
> >>>>>> Step 8: Drop the Gelly library
> >>>>>>
> >>>>>> No concrete plan yet. Latest would be the next major Flink
> >>>>>> version, aka Flink 2.0.
> >>>>>>
> >>>>>> Step 9: Drop DataSet API
> >>>>>>
> >>>>>> Planned for the next major Flink version, aka Flink 2.0.
> >>>>>>
> >>>>>> Please let me know if this matches your thoughts. We can also
> >>>>>> convert this into a blog post or mention it in the next release
> >>>>>> notes.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Timo
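For reference, the DataStream batch mode from step 2 (FLIP-134) that this whole migration plan relies on is switched on per job. A minimal sketch, assuming Flink 1.12+:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// FLIP-134: execute this DataStream job with batch scheduling and
// blocking shuffles instead of unbounded streaming execution.
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
```

The same switch is available on the command line via `-Dexecution.runtime-mode=BATCH`, which keeps the pipeline code identical for both modes.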