Happy to help clarify, and certainly, I don't think anyone would reject the contribution if you wanted to add that functionality. The important point we want to communicate is that the Table API has matured to a level where we do not see it as supplementary to DataStream but as an equal first-class interface. Both optimize towards making different use cases easier (declarative definition of pipelines vs. fine-grained control), and neither needs to support every operation natively because interoperating between them should be seamless.
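To make the interop point concrete, here is a minimal sketch (not from the original thread; it assumes Flink 1.13+ on the classpath, an already registered `Orders` catalog table, and a hypothetical `Order` POJO — all names are illustrative):

```java
// Sketch only, assuming Flink 1.13+. "Orders" is assumed to be a
// registered catalog table; `Order` is a hypothetical POJO.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class InteropSketch {

    // Hypothetical POJO; toDataStream maps columns onto public fields.
    public static class Order {
        public Long id;
        public Double amount;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Declarative side: define the pipeline in SQL / Table API.
        Table filtered = tableEnv.sqlQuery(
            "SELECT id, amount FROM Orders WHERE amount > 10");

        // Fine-grained side: toDataStream can target a concrete Java type
        // (not just Row), so low-level DataStream operators can take over.
        DataStream<Order> stream =
            tableEnv.toDataStream(filtered, Order.class);
        stream.keyBy(o -> o.id).print();

        env.execute("interop-sketch");
    }
}
```

This is the "seamless interop" direction discussed in the thread: start declaratively, then drop down to DataStream only where fine-grained control is actually needed.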
Seth

On Fri, Jul 9, 2021 at 10:35 AM Etienne Chauchot <echauc...@apache.org> wrote:

> Hi Seth,
>
> Thanks for your comment.
>
> That is perfect then!
>
> If the community agrees to give precedence to the Table API connectors
> over the DataStream connectors as discussed here, all the features are
> already available.
>
> I will not need to port my latest additions to table connectors then.
>
> Best
>
> Etienne
>
> On 08/07/2021 17:28, Seth Wiesman wrote:
> > Hi Etienne,
> >
> > The `toDataStream` method supports converting to concrete Java types,
> > not just Row, which can include your Avro specific records. See
> > example 2:
> >
> > https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/data_stream_api/#examples-for-todatastream
> >
> > On Thu, Jul 8, 2021 at 5:11 AM Etienne Chauchot <echauc...@apache.org>
> > wrote:
> >
> >> Hi Timo,
> >>
> >> Thanks for your answers, no problem with the delay, I was on vacation
> >> too last week :)
> >>
> >> My comments are inline.
> >>
> >> Best,
> >>
> >> Etienne
> >>
> >> On 07/07/2021 16:48, Timo Walther wrote:
> >>> Hi Etienne,
> >>>
> >>> sorry for the late reply due to my vacation last week.
> >>>
> >>> Regarding: "support of aggregations in batch mode for DataStream API
> >>> [...] is there a plan to solve it before the actual drop of DataSet
> >>> API"
> >>>
> >>> Just to clarify it again: we will not drop the DataSet API any time
> >>> soon, so users will have enough time to update their pipelines. There
> >>> are a couple of features missing to fully switch to DataStream API in
> >>> batch mode. Thanks for opening an issue, this is helpful for us to
> >>> gradually remove those barriers. They don't need to have a "Blocker"
> >>> priority in JIRA for now.
> >>
> >> Ok, I thought the drop was sooner, no problem then.
> >>
> >>> But aggregations are a good example where we should discuss if it
> >>> would be easier to simply switch to Table API for that.
> >>> Table API has a lot of aggregation optimizations and can work on
> >>> binary data. Also, joins should be easier in Table API. DataStream
> >>> API can be a very low-level API in the near future and most use
> >>> cases (esp. the batch ones) should be possible in Table API.
> >>
> >> Yes, sure. As a matter of fact, my point was to use the low-level
> >> DataStream API in a benchmark to compare with Table API, but I guess
> >> it is not a common user behavior.
> >>
> >>> Regarding: "Is it needed to port these Avro enhancements to new
> >>> DataStream connectors (add a new equivalent of
> >>> ParquetColumnarRowInputFormat but for Avro)"
> >>>
> >>> We should definitely not lose functionality. The same functionality
> >>> should be present in the new connectors. The question is rather
> >>> whether we need to offer a DataStream API connector or if a Table API
> >>> connector would be nicer to use (also nicely integrated with
> >>> catalogs).
> >>>
> >>> So a user can use a simple CREATE TABLE statement to configure the
> >>> connector; an easier abstraction is almost not possible. With
> >>> `tableEnv.toDataStream(table)` you can then continue in DataStream
> >>> API if there is still a need for it.
> >>
> >> Yes, I agree, there is no easier connector setup than CREATE TABLE,
> >> and with tableEnv.toDataStream(table), if one wanted to stay with
> >> DataStream (in my bench for example), it is still possible. And by
> >> the way, the doc for parquet [1] for example only mentions using the
> >> Table API connector.
> >>
> >> So I guess the table connector would be nicer for the user indeed. But
> >> if we use tableEnv.toDataStream(table), we would be able to produce
> >> only usual types like Row, Tuples or POJOs; we still need to add Avro
> >> support, right?
> >>
> >> [1]
> >> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/
> >>
> >>> Regarding: "there are parquet bugs still open on deprecated parquet
> >>> connector"
> >>>
> >>> Yes, bugs should still be fixed in 1.13.
> >>
> >> OK
> >>
> >>> Regarding: "I've been doing TPC-DS benchmarks with Flink lately"
> >>>
> >>> Great to hear that :-)
> >>
> >> And congrats again on Blink's performance!
> >>
> >>> Did you also see the recent discussion? A TPC-DS benchmark can
> >>> further be improved by providing statistics. Maybe this is helpful
> >>> to you:
> >>>
> >>> https://lists.apache.org/thread.html/ra383c23f230ab8e7fa16ec64b4f277c267d6358d55cc8a0edc77bb63%40%3Cuser.flink.apache.org%3E
> >>
> >> No, I missed that thread, thanks for the pointer. I'll read it and
> >> comment if I have something to add.
> >>
> >>> I will prepare a blog post shortly.
> >>
> >> Good to hear :)
> >>
> >>> Regards,
> >>> Timo
> >>>
> >>> On 06.07.21 15:05, Etienne Chauchot wrote:
> >>>> Hi all,
> >>>>
> >>>> Any comments?
> >>>>
> >>>> cheers,
> >>>>
> >>>> Etienne
> >>>>
> >>>> On 25/06/2021 15:09, Etienne Chauchot wrote:
> >>>>> Hi everyone,
> >>>>>
> >>>>> @Timo, my comments are inline for steps 2, 4 and 5, please tell me
> >>>>> what you think.
> >>>>>
> >>>>> Best
> >>>>>
> >>>>> Etienne
> >>>>>
> >>>>> On 23/06/2021 15:27, Chesnay Schepler wrote:
> >>>>>> If we want to publicize this plan more, shouldn't we have a rough
> >>>>>> timeline for when 2.0 is on the table?
> >>>>>>
> >>>>>> On 6/23/2021 2:44 PM, Stephan Ewen wrote:
> >>>>>>> Thanks for writing this up, this also reflects my understanding.
> >>>>>>>
> >>>>>>> I think a blog post would be nice, ideally with an explicit call
> >>>>>>> for feedback so we learn about user concerns.
> >>>>>>> A blog post has a lot more reach than an ML thread.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Stephan
> >>>>>>>
> >>>>>>> On Wed, Jun 23, 2021 at 12:23 PM Timo Walther <twal...@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi everyone,
> >>>>>>>>
> >>>>>>>> I'm sending this email to make sure everyone is on the same page
> >>>>>>>> about slowly deprecating the DataSet API.
> >>>>>>>>
> >>>>>>>> There have been a few thoughts mentioned in presentations,
> >>>>>>>> offline discussions, and JIRA issues. However, I have observed
> >>>>>>>> that there are still some concerns or different opinions on what
> >>>>>>>> steps are necessary to implement this change.
> >>>>>>>>
> >>>>>>>> Let me summarize some of the steps and assumptions and let's
> >>>>>>>> have a discussion about them:
> >>>>>>>>
> >>>>>>>> Step 1: Introduce a batch mode for Table API (FLIP-32)
> >>>>>>>> [DONE in 1.9]
> >>>>>>>>
> >>>>>>>> Step 2: Introduce a batch mode for DataStream API (FLIP-134)
> >>>>>>>> [DONE in 1.12]
> >>>>>
> >>>>> I've been using the DataSet API and I tested migrating to
> >>>>> DataStream + batch mode.
> >>>>>
> >>>>> I opened this ticket [1] regarding the support of aggregations in
> >>>>> batch mode for DataStream API. It seems that the join operation (at
> >>>>> least) does not work in batch mode, even though I managed to
> >>>>> implement a join using a low-level KeyedCoProcessFunction (thanks,
> >>>>> Seth, for the pointer!).
> >>>>>
> >>>>> => Should it be considered a blocker? Is there a plan to solve it
> >>>>> before the actual drop of the DataSet API? Maybe in step 6?
> >>>>>
> >>>>> [1] https://issues.apache.org/jira/browse/FLINK-22587
> >>>>>
> >>>>>>>> Step 3: Soft deprecate DataSet API (FLIP-131)
> >>>>>>>> [DONE in 1.12]
> >>>>>>>>
> >>>>>>>> We updated the documentation recently to make this deprecation
> >>>>>>>> even more visible. There is a dedicated `(Legacy)` label right
> >>>>>>>> next to the menu item now.
> >>>>>>>>
> >>>>>>>> We won't deprecate concrete classes of the API with a
> >>>>>>>> @Deprecated annotation until then, to avoid extensive warnings
> >>>>>>>> in logs.
> >>>>>>>>
> >>>>>>>> Step 4: Drop the legacy SQL connectors and formats (FLINK-14437)
> >>>>>>>> [DONE in 1.14]
> >>>>>>>>
> >>>>>>>> We dropped code for ORC, Parquet, and HBase formats that were
> >>>>>>>> only used by DataSet API users. The removed classes had no
> >>>>>>>> documentation and were not annotated with one of our API
> >>>>>>>> stability annotations.
> >>>>>>>>
> >>>>>>>> The old functionality should be available through the new
> >>>>>>>> sources and sinks for Table API and DataStream API. If not, we
> >>>>>>>> should bring them into a shape that they can be a full
> >>>>>>>> replacement.
> >>>>>>>>
> >>>>>>>> DataSet users are encouraged to either upgrade the API or use
> >>>>>>>> Flink 1.13. Users can either just stay at Flink 1.13 or copy
> >>>>>>>> only the format's code to a newer Flink version. We aim to keep
> >>>>>>>> the core interfaces (i.e. InputFormat and OutputFormat) stable
> >>>>>>>> until the next major version.
> >>>>>>>>
> >>>>>>>> We will maintain/allow important contributions to dropped
> >>>>>>>> connectors in 1.13. So 1.13 could be considered as kind of a
> >>>>>>>> DataSet API LTS release.
> >>>>>
> >>>>> I added several bug fixes and enhancements (Avro support, automatic
> >>>>> schema, etc.) to the parquet DataSet connector. After discussing
> >>>>> with Jingsong and Arvid, we agreed to merge them to 1.13, given
> >>>>> that 1.13 is an LTS release receiving maintenance changes, as you
> >>>>> mentioned here.
> >>>>>
> >>>>> => Is it needed to port these Avro enhancements to the new
> >>>>> DataStream connectors (add a new equivalent of
> >>>>> ParquetColumnarRowInputFormat but for Avro)? IMHO it is an
> >>>>> important feature that the users will need.
> >>>>> So, if I understand the plan correctly, we have until the release
> >>>>> of 2.0 to implement it, right?
> >>>>>
> >>>>> => Also, there are parquet bugs still open on the deprecated
> >>>>> parquet connector:
> >>>>> https://issues.apache.org/jira/browse/FLINK-21520,
> >>>>> https://issues.apache.org/jira/browse/FLINK-21468. I think the
> >>>>> same applies: we should fix them on 1.13, right?
> >>>>>
> >>>>>>>> Step 5: Drop the legacy SQL planner (FLINK-14437)
> >>>>>>>> [DONE in 1.14]
> >>>>>>>>
> >>>>>>>> This included dropping support of DataSet API with SQL.
> >>>>>
> >>>>> That is a major point! I've been doing TPC-DS benchmarks with Flink
> >>>>> lately by coding query3 with a DataSet pipeline, a DataStream
> >>>>> pipeline and a SQL pipeline. What I can tell is that when I
> >>>>> migrated from the legacy SQL planner to the Blink SQL planner, I
> >>>>> got 2 major improvements:
> >>>>>
> >>>>> 1. around 25% gain in run times on a 1TB input dataset (even if the
> >>>>> memory conf was slightly different between runs of the 2 planners)
> >>>>>
> >>>>> 2. global order support: with the legacy planner based on DataSet,
> >>>>> only local partition ordering was supported. As a consequence, a
> >>>>> SQL query with an ORDER BY clause actually produced wrong results.
> >>>>> With the Blink planner based on DataStream, global order is
> >>>>> supported and now the query results are correct!
> >>>>>
> >>>>> => congrats to everyone involved in these big SQL improvements!
> >>>>>
> >>>>>>>> Step 6: Connect both Table and DataStream API in batch mode
> >>>>>>>> (FLINK-20897)
> >>>>>>>> [PLANNED in 1.14]
> >>>>>>>>
> >>>>>>>> Step 7: Reach feature parity of Table API/DataStream API with
> >>>>>>>> DataSet API
> >>>>>>>> [PLANNED for 1.14++]
> >>>>>>>>
> >>>>>>>> We need to identify blockers when migrating from DataSet API to
> >>>>>>>> Table API/DataStream API. Here we need to establish a good
> >>>>>>>> feedback pipeline to include DataSet users in the roadmap
> >>>>>>>> planning.
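As an aside on the global-order point discussed above: the behavior difference Etienne describes shows up with any batch ORDER BY query. A sketch of such a query (not from the original thread; it assumes Flink 1.13+ with the Blink planner, and the TPC-DS-style table and column names are illustrative — a real run would first register `store_sales` against the input files):

```java
// Sketch only, assuming Flink 1.13+ with the Blink planner (the default
// since 1.11). In batch mode, ORDER BY yields a globally sorted result;
// the legacy DataSet-based planner only ordered within partitions,
// which made such queries return wrong results.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class GlobalOrderSketch {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inBatchMode().build());

        // "store_sales" would have to be registered beforehand, e.g. via
        // a CREATE TABLE statement pointing at the TPC-DS input files.
        Table top = tableEnv.sqlQuery(
            "SELECT ss_item_sk, SUM(ss_ext_sales_price) AS revenue "
            + "FROM store_sales "
            + "GROUP BY ss_item_sk "
            + "ORDER BY revenue DESC "
            + "LIMIT 100");
    }
}
```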
> >>>>>>>>
> >>>>>>>> Step 8: Drop the Gelly library
> >>>>>>>>
> >>>>>>>> No concrete plan yet. Latest would be the next major Flink
> >>>>>>>> version, aka Flink 2.0.
> >>>>>>>>
> >>>>>>>> Step 9: Drop DataSet API
> >>>>>>>>
> >>>>>>>> Planned for the next major Flink version, aka Flink 2.0.
> >>>>>>>>
> >>>>>>>> Please let me know if this matches your thoughts. We can also
> >>>>>>>> convert this into a blog post or mention it in the next release
> >>>>>>>> notes.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo