[GitHub] [flink] AHeise commented on pull request #15156: [FLINK-21393] [formats] Implement ParquetAvroInputFormat

GitBox Tue, 25 May 2021 01:14:38 -0700


AHeise commented on pull request #15156:
URL: https://github.com/apache/flink/pull/15156#issuecomment-847655919

> @AHeise ParquetInputFormat base class was removed since I submitted my PR
hence the compilation issues, commit
[ce3631a](https://github.com/apache/flink/commit/ce3631af7313855f675e29b8faa386f6e5a2d43c)
removed it. This commit mentions "Use the filesystem connector with a Parquet
format as a replacement". I guess it refers to
https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/filesystem/
which is SQL based. But what if our pipeline does not use SQL but the
DataSet API?

It's a good and tough question. I spoke to @twalthr offline and for now the
plan is as follows:
* Table API drops the old planner, so most of the code removed in that
commit is dead code.
* Table API will support everything running on DataStream that used to work
when it ran on DataSet.
* 1.13 will be the last release with full DataSet support
(BatchTableEnvironment will be dropped)

Since a few features of DataSet are still not supported in DataStream, we
expect users to stick to 1.13 for a longer time and probably skip 1.14. So,
we'd suggest merging your 2 PRs into release-1.13 instead of master. Then, all
DataSet users would benefit from your contributions while we unblock future
developments that would break DataSet as of 1.13.

If it turns out that the community wants to have these features in 1.14 for
some reason (e.g., Table API is not as far along as it should be), we can still
re-add `ParquetInputFormat` and forward-port your PR before feature freeze in
3 months.

Note 1: It might still be possible to have a Flink 1.14 DataSet application
using 1.13 formats. In general, it's always possible to copy the old code into
your own project.
Note 2: If you are missing combinable aggregations in DataStream, maybe it
would be better to move to Table API to begin with. At this point, no one
really knows how much of DataSet will be ported to DataStream. It doesn't
really make sense to have 2 APIs with high-level primitives like joins. The
main idea is to use Table API by default with a rich user experience and go
down to DataStream only when needed (timer, user state, ...).

