Hi Beam devs,

I'd like to start a discussion about migrating the Flink runner to execute
the batch jobs in DataStream API instead of DataSet API.

Today Flink runner executes batch jobs with DataSet API which is
semi-deprecated and will be removed sometime in future Flink releases.
Flink DataStream API has been extended to replace DataSet API for batch job
execution. So here we propose to migrate the Flink Beam runner from DataSet
to DataStream for batch job execution.

I have compiled this one pager[1] to explain the motivation, interface
change, migration plan and proposed changes. We also have a PoC
implementation of this migration[2] which has passed the existing unit
tests and runner validation tests.

Would love to get your thoughts on this.

BTW, I am starting this discussion thread as I am not sure whether this
change is considered as a large change[3] or not. If there is no concern
for the change, I'll just create the GitHub issues and start to work on it.

Also, I have worked with Xinyu Liu on the PoC implementation, and Xinyu has
agreed to help review the patches (thank you Xinyu). It would be great if
someone who has worked on Flink runner before can also help with the PR
reviews.

Thanks,

Jiangjie (Becket) Qin

[1]
https://docs.google.com/document/d/1cjUJHOS1eEkH76hMNeBuc-kPhbIIc9w2gvjm8miIFS8/edit?usp=sharing
[2] https://github.com/becketqin/beam/tree/flink-batch-runner-migration
[3]
https://github.com/apache/beam/blob/14e8de6e99a031ba7376bdb6837d471648878932/CONTRIBUTING.md

Reply via email to