echauchot commented on a change in pull request #15156:
URL: https://github.com/apache/flink/pull/15156#discussion_r639567300
##########
File path: flink-formats/flink-parquet/pom.xml
##########
@@ -45,6 +45,12 @@ under the License.
<scope>provided</scope>
</dependency>
+ <dependency>
+ <groupId>org.apache.flink</groupId>
Review comment:
> > @AHeise The ParquetInputFormat base class was removed after I submitted
my PR, hence the compilation issues; commit
[ce3631a](https://github.com/apache/flink/commit/ce3631af7313855f675e29b8faa386f6e5a2d43c)
removed it. This commit mentions "Use the filesystem connector with a Parquet
format as a replacement". I guess it refers to
https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/filesystem/
which is SQL based. But what if our pipeline does not use SQL but the
DataSet API?
>
> It's a good and tough question. I spoke to @twalthr offline and for now
the plan is as follows:
>
> * Table API drops old planner and thus, most of the removed code in that
commit is dead code.
> * Table API will support everything running on DataStream that used to
work when it ran on DataSet.
> * 1.13 will be the last release with full DataSet support
(BatchTableEnvironment will be dropped)
>
> Since a few features of DataSet are still not supported in DataStream, we
expect users to stick to 1.13 for a longer time and probably skip 1.14. So,
we'd suggest to merge your 2 PRs to release-1.13 instead of master. Then, all
DataSet users would benefit from your contributions while we unblock future
developments that would break DataSet as of 1.13.
>
> If it turns out that the community wants to have these features in 1.14
for some reasons (Table API not as far as it should), we can still re-add
`ParquetInputFormat` and forward port your PR before feature freeze in 3 months.
>
> Note 1: It might still be possible to have a Flink 1.14 DataSet
application using 1.13 formats. In general, it's always possible to copy the
old code into your own project.
> Note 2: If you are missing combinable aggregations in DataStream, maybe
it would be better to move to Table API to begin with. At this point, no-one
really knows how much of DataSet will be ported to DataStream. It doesn't
really make sense to have 2 APIs with high-level primitives like joins. The
main idea is to use Table API by default with a rich user experience and go
down to DataStream only when needed (timer, user state, ...).
Hi @AHeise, thanks for your detailed comment.
Merging my 2 PRs to 1.13 looks fine to me. I think backporting to 1.12 might
be considered as well, since Amazon EMR is still 1.12-based (latest is 1.12.1)
and likely to move to the latest 1.12 in upcoming EMR releases.
For context, what I'm doing is a benchmark comparing Flink
DataSet/DataStream/SQL against other technologies such as Apache Beam, so I'd
like to have working DataSet and DataStream pipelines in addition to my
existing working SQL one (using the Blink planner based on DataStream and
BatchTableEnvironment). And I guess some users will want to stick to
DataSet/DataStream if they have existing pipelines that they do not want to
re-code but only migrate to the next Flink version.
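For readers hitting the same removal: as far as I understand it, the
"filesystem connector with a Parquet format" replacement mentioned above is
declared via SQL DDL, roughly like this (table name, schema, and path are made
up for illustration):

```sql
-- Hypothetical sketch: reading Parquet files through the SQL/Table API
-- filesystem connector. Table name, columns, and path are illustrative only.
CREATE TABLE my_parquet_input (
  id BIGINT,
  name STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/input-data',
  'format' = 'parquet'
);

-- The table can then be queried like any other source, e.g.:
-- SELECT id, name FROM my_parquet_input;
```

This requires the flink-parquet format dependency on the classpath, and it
only covers the SQL/Table API path, which is exactly the gap for pure
DataSet/DataStream pipelines discussed above.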
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]