Re: [DISCUSS] Migrate Flink runner to run batch jobs in DataStream API

Robert Bradshaw via dev Wed, 01 Feb 2023 16:40:44 -0800

Thanks. In that case keeping both in parallel, and tying the switch in
the default to a (possibly overridable) choice of Flink version, makes
a lot of sense.
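
As a rough illustration of that idea, here is a minimal sketch of picking the batch execution path from the Flink version, with an optional user override. Everything in it is an assumption made for illustration: the function names, the "DataSet"/"DataStream" labels, and especially the cutoff version, which the thread does not specify. It is not the Flink runner's actual option handling.

```python
from typing import Optional, Tuple

# Hypothetical cutoff: the thread does not say at which Flink version the
# default should flip, so this value is only a placeholder.
ASSUMED_DATASTREAM_DEFAULT_FROM: Tuple[int, int] = (1, 17)

VALID_APIS = ("DataSet", "DataStream")


def parse_version(version: str) -> Tuple[int, int]:
    """Turn a string such as '1.16.1' into (1, 16); only major/minor are used."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def choose_batch_api(
    flink_version: str,
    override: Optional[str] = None,
    cutoff: Tuple[int, int] = ASSUMED_DATASTREAM_DEFAULT_FROM,
) -> str:
    """Pick the batch execution path, keeping both paths available."""
    # An explicit user choice always wins over the version-based default.
    if override is not None:
        if override not in VALID_APIS:
            raise ValueError(f"unknown batch API: {override!r}")
        return override
    # Compare numeric tuples, not strings, so that 1.9 sorts before 1.17.
    return "DataStream" if parse_version(flink_version) >= cutoff else "DataSet"
```

With the placeholder cutoff, `choose_batch_api("1.16.1")` selects the DataSet path, `choose_batch_api("1.17.0")` selects DataStream, and passing `override="DataSet"` keeps a user on the old path regardless of version.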


On Wed, Feb 1, 2023 at 3:33 PM Becket Qin <[email protected]> wrote:
>
> Hi Robert,
>
> Thanks for the feedback. This change will be transparent to user
> applications in most cases. However, a few differences will still be
> visible to users.
>
> 1. Configurations. DataStream and DataSet take different configurations.
> 2. Metrics. DataStream operators and DataSet operators may emit different 
> metrics.
> 3. Other potential behavior changes. The DataSet API currently goes
> through a simple optimizer, while the DataStream API does not, and the
> underlying operator implementations also differ. So users may find that
> their job execution topology changes after switching to DataStream.
> 4. Resource consumption. Because the underlying operator implementations are 
> different, the resource consumption may be different.
>
> So, in general I feel it is probably safer to keep the DataSet execution path 
> for some time before we remove it completely.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Thu, Feb 2, 2023 at 1:23 AM Robert Bradshaw via dev <[email protected]> 
> wrote:
>>
>> This sounds reasonable to me. One question I have is why a user would
>> prefer to stick with the DataSet API if the DataStream API is
>> available. Would there be any user-visible difference?
>>
>> On Wed, Feb 1, 2023 at 1:11 AM Becket Qin <[email protected]> wrote:
>> >
>> > Hi Beam devs,
>> >
>> > I'd like to start a discussion about migrating the Flink runner to execute
>> > batch jobs with the DataStream API instead of the DataSet API.
>> >
>> > Today the Flink runner executes batch jobs with the DataSet API, which is
>> > semi-deprecated and will be removed in a future Flink release.
>> > The Flink DataStream API has been extended to replace the DataSet API for
>> > batch job execution. So here we propose to migrate the Beam Flink runner
>> > from DataSet to DataStream for batch job execution.
>> >
>> > I have compiled this one-pager [1] to explain the motivation, interface
>> > change, migration plan and proposed changes. We also have a PoC
>> > implementation of this migration [2], which has passed the existing unit
>> > tests and runner validation tests.
>> >
>> > Would love to get your thoughts on this.
>> >
>> > BTW, I am starting this discussion thread because I am not sure whether this
>> > change is considered a large change [3] or not. If there are no concerns
>> > about the change, I'll just create the GitHub issues and start working on it.
>> >
>> > Also, I have worked with Xinyu Liu on the PoC implementation, and Xinyu
>> > has agreed to help review the patches (thank you Xinyu). It would be great
>> > if someone who has worked on the Flink runner before could also help with
>> > the PR reviews.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > [1] 
>> > https://docs.google.com/document/d/1cjUJHOS1eEkH76hMNeBuc-kPhbIIc9w2gvjm8miIFS8/edit?usp=sharing
>> > [2] https://github.com/becketqin/beam/tree/flink-batch-runner-migration
>> > [3] 
>> > https://github.com/apache/beam/blob/14e8de6e99a031ba7376bdb6837d471648878932/CONTRIBUTING.md
