Re: [DISCUSS] Migrate Flink runner to run batch jobs in DataStream API

Becket Qin Wed, 01 Feb 2023 15:33:38 -0800

Hi Robert,

Thanks for the feedback. This change will be transparent to user
applications in most cases. However, a few differences will still be
visible to users:


1. Configurations. DataStream and DataSet take different configurations.
2. Metrics. DataStream operators and DataSet operators may emit different
metrics.
3. Other potential behavior changes. The DataSet API currently goes
through a simple optimizer, while the DataStream API does not, and the
underlying operator implementations also differ. So users may find that
their job execution topology changes after switching to DataStream.
4. Resource consumption. Because the underlying operator implementations
are different, the resource consumption may be different.

So, in general I feel it is probably safer to keep the DataSet execution
path for some time before we remove it completely.
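
To make "keeping both paths" concrete, here is a minimal Python sketch of an
opt-in switch. It is only an illustration under stated assumptions: the
option name use_datastream_for_batch, the RunnerOptions class and the
select_execution_path function are all hypothetical, not existing Beam or
Flink settings, and the real runner is written in Java.

```python
from dataclasses import dataclass
from enum import Enum


class ExecutionPath(Enum):
    DATASET = "dataset"
    DATASTREAM = "datastream"


@dataclass
class RunnerOptions:
    # Hypothetical options for illustration only; not real Beam/Flink names.
    streaming: bool = False
    use_datastream_for_batch: bool = False


def select_execution_path(options: RunnerOptions) -> ExecutionPath:
    """Pick the translation path for a job.

    Streaming jobs always use DataStream. Batch jobs stay on DataSet
    unless the user explicitly opts in, so existing configurations,
    metrics and topologies do not change by surprise.
    """
    if options.streaming or options.use_datastream_for_batch:
        return ExecutionPath.DATASTREAM
    return ExecutionPath.DATASET


if __name__ == "__main__":
    print(select_execution_path(RunnerOptions()))
    print(select_execution_path(RunnerOptions(use_datastream_for_batch=True)))
```

Defaulting the switch to off would let users validate configuration, metrics
and resource usage on the new path before the DataSet path is removed.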

Thanks,

Jiangjie (Becket) Qin



On Thu, Feb 2, 2023 at 1:23 AM Robert Bradshaw via dev <[email protected]>
wrote:

> This sounds reasonable to me. One question I have is why a user would
> prefer to stick with the DataSet API if the DataStream API is
> available. Would there be any user-visible difference?
>
> On Wed, Feb 1, 2023 at 1:11 AM Becket Qin <[email protected]> wrote:
> >
> > Hi Beam devs,
> >
> > I'd like to start a discussion about migrating the Flink runner to
> execute the batch jobs in DataStream API instead of DataSet API.
> >
> > Today Flink runner executes batch jobs with DataSet API which is
> semi-deprecated and will be removed sometime in future Flink releases.
> Flink DataStream API has been extended to replace DataSet API for batch job
> execution. So here we propose to migrate the Flink Beam runner from DataSet
> to DataStream for batch job execution.
> >
> > I have compiled this one pager[1] to explain the motivation, interface
> change, migration plan and proposed changes. We also have a PoC
> implementation of this migration[2] which has passed the existing unit
> tests and runner validation tests.
> >
> > Would love to get your thoughts on this.
> >
> > BTW, I am starting this discussion thread as I am not sure whether this
> change is considered as a large change[3] or not. If there is no concern
> for the change, I'll just create the GitHub issues and start to work on it.
> >
> > Also, I have worked with Xinyu Liu on the PoC implementation, and Xinyu
> has agreed to help review the patches (thank you Xinyu). It would be great
> if someone who has worked on Flink runner before can also help with the PR
> reviews.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > [1]
> https://docs.google.com/document/d/1cjUJHOS1eEkH76hMNeBuc-kPhbIIc9w2gvjm8miIFS8/edit?usp=sharing
> > [2] https://github.com/becketqin/beam/tree/flink-batch-runner-migration
> > [3]
> https://github.com/apache/beam/blob/14e8de6e99a031ba7376bdb6837d471648878932/CONTRIBUTING.md
>
