Hi Beam devs,

I'd like to start a discussion about migrating the Flink runner to execute batch jobs with the DataStream API instead of the DataSet API.
Today, the Flink runner executes batch jobs with the DataSet API, which is semi-deprecated and will be removed in a future Flink release. The Flink DataStream API has been extended to replace the DataSet API for batch job execution, so we propose migrating the Beam Flink runner from DataSet to DataStream for batch job execution.

I have compiled a one-pager [1] that explains the motivation, interface change, migration plan, and proposed changes. We also have a PoC implementation of this migration [2], which passes the existing unit tests and runner validation tests. I would love to get your thoughts on this.

BTW, I am starting this discussion thread because I am not sure whether this counts as a large change [3]. If there are no concerns, I'll create the GitHub issues and start working on it.

I worked with Xinyu Liu on the PoC implementation, and Xinyu has agreed to help review the patches (thank you, Xinyu). It would be great if someone who has worked on the Flink runner before could also help with the PR reviews.

Thanks,

Jiangjie (Becket) Qin

[1] https://docs.google.com/document/d/1cjUJHOS1eEkH76hMNeBuc-kPhbIIc9w2gvjm8miIFS8/edit?usp=sharing
[2] https://github.com/becketqin/beam/tree/flink-batch-runner-migration
[3] https://github.com/apache/beam/blob/14e8de6e99a031ba7376bdb6837d471648878932/CONTRIBUTING.md