echauchot commented on a change in pull request #15156:
URL: https://github.com/apache/flink/pull/15156#discussion_r639567300
##########
File path: flink-formats/flink-parquet/pom.xml
##########
@@ -45,6 +45,12 @@ under the License.
<scope>provided</scope>
</dependency>
+ <dependency>
+ <groupId>org.apache.flink</groupId>
Review comment:
> > @AHeise The ParquetInputFormat base class was removed after I submitted
my PR, hence the compilation issues; commit
[ce3631a](https://github.com/apache/flink/commit/ce3631af7313855f675e29b8faa386f6e5a2d43c)
removed it. This commit mentions "Use the filesystem connector with a Parquet
format as a replacement". I guess it refers to
https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/filesystem/
which is SQL based. But what if our pipeline does not use SQL but the
DataSet API?
>
> It's a good and tough question. I spoke to @twalthr offline and for now
the plan is as follows:
>
> * Table API drops old planner and thus, most of the removed code in that
commit is dead code.
> * Table API will support everything running on DataStream that used to
work when it ran on DataSet.
> * 1.13 will be the last release with full DataSet support
(BatchTableEnvironment will be dropped)
>
> Since a few features of DataSet are still not supported in DataStream, we
expect users to stick to 1.13 for a longer time and probably skip 1.14. So,
we'd suggest to merge your 2 PRs to release-1.13 instead of master. Then, all
DataSet users would benefit from your contributions while we unblock future
developments that would break DataSet as of 1.13.
>
> If it turns out that the community wants to have these features in 1.14
for some reasons (Table API not as far as it should), we can still re-add
`ParquetInputFormat` and forward port your PR before feature freeze in 3 months.
>
> Note 1: It might still be possible to have a Flink 1.14 DataSet
application using 1.13 formats. In general, it's always possible to copy the
old code into your own project.
> Note 2: If you are missing combinable aggregations in DataStream, maybe
it would be better to move to Table API to begin with. At this point, no-one
really knows how much of DataSet will be ported to DataStream. It doesn't
really make sense to have 2 APIs with high-level primitives like joins. The
main idea is to use Table API by default with a rich user experience and go
down to DataStream only when needed (timer, user state, ...).
Hi @AHeise, thanks for your detailed comment.
Merging my 2 PRs to 1.13 looks fine to me. I think backporting to 1.12 might
be considered as well, since Amazon EMR is still 1.12-based (latest is 1.12.1)
and likely to move to the latest 1.12 in upcoming EMR releases.
For context, what I'm doing is a benchmark comparing Flink
DataSet/DataStream/SQL against other technologies such as Apache Beam, so I'd
like to have working DataSet and DataStream pipelines in addition to my
existing working SQL one (using the Blink planner based on DataStream and
BatchTableEnvironment). And I guess some users will want to stick to
DataSet/DataStream if they have existing pipelines that they do not want to
re-code but only migrate to the next Flink version.
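For readers hitting the same removal: as far as I understand it, the
"filesystem connector with a Parquet format" replacement mentioned above is
declared via SQL DDL, roughly like this (table name, schema, and path are made
up for illustration):

```sql
-- Hypothetical sketch: reading Parquet files through the SQL/Table API
-- filesystem connector. Table name, columns, and path are illustrative only.
CREATE TABLE my_parquet_input (
  id BIGINT,
  name STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 'file:///tmp/input-data',
  'format' = 'parquet'
);

-- The table can then be queried like any other source, e.g.:
-- SELECT id, name FROM my_parquet_input;
```

This requires the flink-parquet format dependency on the classpath, and it
only covers the SQL/Table API path, which is exactly the gap for pure
DataSet/DataStream pipelines discussed above.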
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]