Re: [Discuss] Planning Flink 1.14

Johannes Moser Mon, 28 Jun 2021 00:56:08 -0700

Hi all,

We discussed the matter again in our latest release planning (see [1]). We see
a lot of valid points in this thread. As we were not able to come to a clear
conclusion in the meeting, and most of the arguments mentioned will still be
valid even if we extend the feature freeze by a month, we are keeping the
feature freeze at early August for now. I will collect all the inputs and talk
to some users to further improve the experience, also for those who have
extended Flink.


Best,
Joe


[1] https://cwiki.apache.org/confluence/display/FLINK/1.14+Release

> On 07.06.2021, at 05:30, Benchao Li <libenc...@apache.org> wrote:
> 
> Hi all,
> 
> Thanks Xintong for bringing this up.
> 
> I would like to share some experience of the usage of Flink in our company
> (ByteDance).
> 
> 1. We started building our SQL platform in mid 2019, using the v1.9 blink
> planner, and it's amazing.
> We also added many internal features which were still missing in this
> version, including DDL/Computed Column/
> a lot of internal formats and connectors, and some other planner changes.
> 
> 2. In early 2020, we planned to upgrade to v1.10. Before we finished
> cherry-picking internal commits to v1.10, we found
> that v1.11 was going to be released soon. Hence we decided to upgrade to
> v1.11.
> By late 2020, we had almost finished the internal feature cherry-picking
> work. (It took us so long because we were still adding new features
> to our online version v1.9 at the same time.)
> 
> 3. Now
> Although we did a lot of work to reduce the overhead for our users of
> upgrading from v1.9 to v1.11, this process is still slow, because:
> a) All the connector/format properties changed (although we have a tool
> that lets them upgrade in one click, there is still a lot of learning cost)
> b) Checkpoints cannot be upgraded
> 
> 4. Future
> We have 5000+ online SQL jobs and hundreds of commits, so we do not plan to
> do an upgrade in the short term.
> However, v1.11 still lacks a lot of features, for example:
> a) the new UDF type inference does not support aggregate functions
> b) the FLIP-27 new source interface cannot be used in SQL
> We may need to do a lot of cherry-picking to our v1.11.
> 
> So, from our point of view, a longer release cycle and more fully finished
> features may benefit us a lot.
> 
> 
> On Fri, Jun 4, 2021 at 6:02 PM JING ZHANG <beyond1...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> @Xintong Song
>> Thanks for reminding me, I will contact Jark to update the wiki page.
>> 
>> Besides, I'd like to provide more input by sharing our experience of
>> upgrading our internal version of Flink.
>> 
>> Flink has been widely used in the production environment since 2018 in our
>> company. Our internal version lags behind the community's latest stable
>> version by about 1 year. We upgraded the internal Flink version to
>> 1.10 in March last year, and we plan to upgrade directly to 1.13
>> next month (skipping 1.11 and 1.12). We wish to use the latest
>> version as soon as possible. However, in practice we catch up with the
>> community's latest stable release only about once a year because
>> upgrading to a new version is a time-consuming process.
>> 
>> The detailed work is as follows.
>> 
>> a. Before releasing a new internal version
>> 1) Required: Cherry-pick internal features to the new Flink branch. A few
>> features need to be redeveloped based on the new branch code base.
>>    BTW, the cost grows heavier and heavier as we maintain more and
>> more internal features in our internal version.
>> 2) Optional: Some internal connectors need to adapt to the new API
>> 3) Required: Surrounding products need to be updated based on the new API,
>> for example, the internal Flink SQL web development platform
>> 4) Required: Regression tests
>> 
>> b. After the release, encourage users to upgrade existing jobs (thousands
>> of jobs) to the new version. Users need some time to:
>> 1) Repackage the jar for DataStream jobs
>> 2) For critical jobs, run the jobs on both versions at the
>> same time for a while, and migrate to the new job only after comparing the
>> data carefully.
>> 3) Pure ETL SQL jobs are easy to bump up. But other Flink SQL jobs with
>> stateful operators need extra effort because Flink SQL jobs do not
>> support state compatibility yet.
>> 
>> Best regards,
>> JING ZHANG
>> 
>> On Fri, Jun 4, 2021 at 2:27 PM Prasanna kumar <prasannakumarram...@gmail.com> wrote:
>> 
>>> Hi all,
>>> 
>>> We are using Flink for our eventing system. Overall we are very happy with
>>> the tech, documentation and community support and quick replies in mails.
>>> 
>>> My experience with versions over the last year:
>>> 
>>> We were working on 1.10 initially during our research phase, then we
>>> stabilised with 1.11 as we moved on, but by the time we were about to get
>>> into production 1.12 was released. As with all software and products,
>>> there were bugs reported. So we waited till 1.12.2 was released and then
>>> upgraded. Within a month of us doing it, 1.13 got released.
>>> 
>>> But going by past experience, we waited till at least a couple of minor
>>> versions (fixing bugs) got released before we moved on to a newer version.
>>> Development happens at a rapid/good pace in Flink (which is good in
>>> terms of features), but adopting and moving the production code to a newer
>>> version 3/4 times a year is an onerous effort. For example, the memory
>>> model was changed in one of the releases (there is good documentation).
>>> But for a production user to adopt the newer version, at least a month of
>>> testing is required in a huge-scale environment. We also do not want to
>>> be more than 2 versions behind at any point in time.
>>> 
>>> I personally feel 2 major releases a year, or at most a release once every
>>> 5 months, is good.
>>> 
>>> Thanks
>>> Prasanna.
>>> 
>>> On Fri, Jun 4, 2021 at 9:38 AM Xintong Song <tonysong...@gmail.com> wrote:
>>> 
>>>> Thanks everyone for the feedback.
>>>> 
>>>> @Jing,
>>>> Thanks for the inputs. Could you please ask a committer who works together
>>>> with you on these items to fill them into the feature collecting wiki page
>>>> [1]? I assume Jark, who co-edited the FLIP wiki page, is working with you?
>>>> 
>>>> @Kurt, @Till and @Seth,
>>>> First of all, a few things that potentially demotivate users from
>>>> upgrading, observed from users that I've been in touch with.
>>>> 1. It takes time for Flink major releases to get stabilized. Many users
>>>> tend to wait for the bugfix releases (x.y.1/2, or even x.y.3/4) rather
>>>> than upgrading to x.y.0 immediately. This could take months, sometimes
>>>> even until after the next major release.
>>>> 2. Many users maintain an internal version of Flink, with customized
>>>> features for their specific businesses. For them, upgrading Flink requires
>>>> significant effort to rebase those customized features. On the other hand,
>>>> the more versions they fall behind, the harder it is to contribute those
>>>> features to the community, which becomes a vicious cycle.
>>>> 
>>>> I think the question to be answered is how we prioritize between
>>>> stabilizing a previous major release and casting a new major release. So
>>>> far, it feels like the new release takes priority. I recall that we waited
>>>> for weeks to release 1.11.3 because people were busy stabilizing 1.12.0.
>>>> What if more resources were allocated to the bugfix releases? We could
>>>> have a more explicit schedule for the bugfix releases. E.g., try to always
>>>> release the first bugfix release 2 weeks after the major release, the
>>>> second bugfix release 4 weeks after that, and release on demand starting
>>>> from the third bugfix release. Or some other rules like this. Would that
>>>> help speed up the stabilization of a release and give users more
>>>> confidence to upgrade earlier?
>>>> 
>>>> A related question is how we prioritize between casting a release and
>>>> motivating more contributors. In my experience, what Kurt
>>>> described, that committers cannot help contributors due to "planned
>>>> features", usually happens during the release testing period or right
>>>> before that (when people are struggling to make the feature freeze). This
>>>> probably indicates that currently casting a release on time is prioritized
>>>> over the contributor experience. Do we need to change that?
>>>> 
>>>> If extending the release period does not simply mean that more
>>>> features are pushed into each release, but rather allows a longer period
>>>> for the release to get stabilized while leaving more capacity for bugfix
>>>> releases and helping contributors, it might be a good idea. To be specific,
>>>> currently we have the 4-month period as 3 months of feature development +
>>>> 1 month of release testing. We might consider a 5-month period as 3 months
>>>> of feature development + 2 months of release testing.
>>>> 
>>>> To sum up, I'm leaning towards extending the overall release period a bit,
>>>> while keeping the period before feature freeze. WDYT?
>>>> 
>>>> Thank you~
>>>> 
>>>> Xintong Song
>>>> 
>>>> 
>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/1.14+Release
>>>> 
>>>> On Thu, Jun 3, 2021 at 9:00 PM Seth Wiesman <sjwies...@gmail.com> wrote:
>>>> 
>>>>> Hi Everyone,
>>>>> 
>>>>> +1 for the Release Managers. Thank you all for volunteering.
>>>>> 
>>>>> @Till Rohrmann <trohrm...@apache.org> A common sentiment that I have heard
>>>>> from many users is that upgrading off of 1.9 was very difficult. In
>>>>> particular, a lot of people struggled to understand the new memory model.
>>>>> Many users who required custom memory configurations in earlier versions
>>>>> assumed they should carry those configurations into later versions and
>>>>> then found themselves with OOM and instability issues. The good news is
>>>>> Flink did what it was supposed to do, and so for the majority dropping
>>>>> their custom configurations and just setting total process memory was the
>>>>> correct solution; this was not an issue of a buggy release. The problem is
>>>>> people did not read the release notes or fully understand the implications
>>>>> of the change. Back to Kurt's point, this transition seems to have left a
>>>>> bad taste in many mouths, slowing some users' adoption of newer versions.
>>>>> I don't know that I have a solution to this problem. I think it is more
>>>>> communication than engineering, but I'm open to continuing the discussion.
>>>>> 
>>>>> On Thu, Jun 3, 2021 at 5:04 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>> 
>>>>>> Thanks for volunteering as our release managers Xintong, Dawid and Joe!
>>>>>> 
>>>>>> Thanks for starting the discussion about the release date, Kurt.
>>>>>> Personally, I prefer in general shorter release cycles as it allows us to
>>>>>> deliver features faster and people feel less pressure to merge half-done
>>>>>> features last minute because they fear that they have to wait a long time
>>>>>> if they missed the train. Also, it forces us to make the release process
>>>>>> less of a stop-the-world event and cut down the costs of releases.
>>>>>> 
>>>>>> On the other hand, if our users don't upgrade Flink fast enough, then
>>>>>> releasing more often won't have the effect of shipping features to our
>>>>>> users and getting feedback from our users faster. What I believe we
>>>>>> should try to do is to understand why upgrading Flink is so difficult for
>>>>>> them. What are the things preventing a quick upgrade and how can we
>>>>>> improve the situation for our users? Are our APIs not stable enough? Does
>>>>>> Flink's behavior change too drastically between versions? Is the tooling
>>>>>> for upgrades lagging behind? Are they just cautious and don't want to use
>>>>>> bleeding edge software?
>>>>>> 
>>>>>> If the problem is that the majority of users are using an unsupported
>>>>>> version, then one solution could also be to extend the list of supported
>>>>>> Flink versions to the latest 3 versions, for example.
>>>>>> 
>>>>>> About your point 2), I am a bit skeptical. I think that we will simply
>>>>>> plan more features and end up in the same situation wrt external
>>>>>> contributions. If that weren't the case, then it would also work with
>>>>>> shorter release cycles by simply planning less feature work and including
>>>>>> the external contributions, which could not be done in the past release,
>>>>>> in the next release. So in the end it is about what we plan for a release
>>>>>> and not so much how much time we have (assuming that we plan less if we
>>>>>> have less time and vice versa).
>>>>>> 
>>>>>> Cheers,
>>>>>> Till
>>>>>> 
>>>>>>> On Thu, Jun 3, 2021 at 5:08 AM Kurt Young <ykt...@gmail.com> wrote:
>>>>>> 
>>>>>>> Thanks for bringing this up.
>>>>>>> 
>>>>>>> I have one thought about the release period. In short: shall we try
>>>>>>> to extend the release period by 1 month?
>>>>>>> 
>>>>>>> There are a couple of reasons why I want to bring up this proposal.
>>>>>>> 
>>>>>>> 1) I observed that lots of users are actually far behind the current
>>>>>>> Flink version. For example, we are now actively developing 1.14 but most
>>>>>>> users I know who have a migration or upgrade plan are planning to upgrade
>>>>>>> to 1.12. This means we need to backport bug fixes to 1.12 and 1.13. If we
>>>>>>> extend the release period by 1 month, I think there is some chance that
>>>>>>> users can have a proper time frame to upgrade to the previously released
>>>>>>> version. Then we can have a good development cycle which looks like
>>>>>>> "actively developing the current version and making the previous version
>>>>>>> stable, not 2 ~ 3 versions before". Always being far away from Flink's
>>>>>>> latest version also suppresses the motivation to contribute to Flink
>>>>>>> from the users' perspective.
>>>>>>> 
>>>>>>> 2) Increasing the release period also eases the workload of committers,
>>>>>>> which I think can improve the contributor experience.
>>>>>>> I have seen several times that when some contributors want to do some new
>>>>>>> features or improvements, we have to respond with "sorry we are right now
>>>>>>> focusing on implementing/stabilizing planned features for this version",
>>>>>>> and the contributions are most likely stalled and never brought up again.
>>>>>>> 
>>>>>>> BTW extending the release period also has downsides. It slows down the
>>>>>>> delivery speed of new features. And I'm also not
>>>>>>> sure how much it can improve the above 2 issues.
>>>>>>> 
>>>>>>> Looking forward to hearing some feedback from the community, both users
>>>>>>> and developers.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Kurt
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Jun 2, 2021 at 8:39 PM JING ZHANG <beyond1...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi Dawid, Joe & Xintong,
>>>>>>>> 
>>>>>>>> Thanks for starting the discussion.
>>>>>>>> 
>>>>>>>> I would like to polish Window TVFs[1][2], which is a popular feature in
>>>>>>>> SQL introduced in 1.13.
>>>>>>>> 
>>>>>>>> The detailed items are as follows.
>>>>>>>> 1. Add more computations based on Window TVF
>>>>>>>>    * Window Join (which is already merged in master branch)
>>>>>>>>    * Window Table Function
>>>>>>>>    * Window Deduplicate
>>>>>>>> 2. Finish related JIRA to improve user experience
>>>>>>>>   * Add offset support for TUMBLE, HOP, session window
>>>>>>>> 3. Complement the missing functions compared to the group window, which
>>>>>>>> is a precondition of deprecating the legacy Grouped Window Function in
>>>>>>>> the later versions.
>>>>>>>>   * Support Session windows
>>>>>>>>   * Support allow-lateness
>>>>>>>>   * Support retract input stream
>>>>>>>>   * Support window TVF in batch mode
>>>>>>>> 
>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-19604
>>>>>>>> [2]
>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
>>>>>>>> 
>>>>>>>> Best regards,
>>>>>>>> JING ZHANG
>>>>>>>> 
>>>>>>>> On Wed, Jun 2, 2021 at 6:45 PM Xintong Song <xts...@apache.org> wrote:
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> As 1.13 has been released for a while, I think it is a good time to
>>>>>>>>> start planning for the 1.14 release cycle.
>>>>>>>>> 
>>>>>>>>> - Release managers: This time we'd like to have a team of 3 release
>>>>>>>>> managers. Dawid, Joe and I would like to volunteer for it. What do you
>>>>>>>>> think about it?
>>>>>>>>> 
>>>>>>>>> - Timeline: According to our approximate 4 months release period, we
>>>>>>>>> propose to aim for a feature freeze roughly in early August (which
>>>>>>>>> could mean something like early September for the 1.14 release). Does
>>>>>>>>> it work for everyone?
>>>>>>>>> 
>>>>>>>>> - Collecting features: It would be helpful to have a rough overview of
>>>>>>>>> the new features that will likely be included in this release. We have
>>>>>>>>> created a wiki page [1] for collecting such information. We'd like to
>>>>>>>>> kindly ask all committers to fill in the page with features that they
>>>>>>>>> intend to work on.
>>>>>>>>> 
>>>>>>>>> We would also like to emphasize some aspects of the engineering
>>>>>>>>> process:
>>>>>>>>> 
>>>>>>>>> - Stability of master: This has been an issue during the 1.13 feature
>>>>>>>>> freeze phase and it is still going on. We encourage every committer to
>>>>>>>>> not merge PRs through the GitHub button, but to do this manually, with
>>>>>>>>> caution for commits merged after the CI was triggered. It would be
>>>>>>>>> appreciated if you always build the project before merging to master.
>>>>>>>>> 
>>>>>>>>> - Documentation: Please try to see documentation as an integrated part
>>>>>>>>> of the engineering process and don't push it to the feature freeze
>>>>>>>>> phase or even after. You might even think about going documentation
>>>>>>>>> first. We, as the Flink community, are adding great stuff that is
>>>>>>>>> pushing the limits of streaming data processors with every release. We
>>>>>>>>> should also make this stuff usable for our users by documenting it
>>>>>>>>> well.
>>>>>>>>> 
>>>>>>>>> - Promotion of 1.14: What applies to documentation also applies to all
>>>>>>>>> the activity around the release. We encourage every contributor to also
>>>>>>>>> think about, plan and prepare activities like blog posts and talks that
>>>>>>>>> will promote and spread the release once it is done.
>>>>>>>>> 
>>>>>>>>> Please let us know what you think.
>>>>>>>>> 
>>>>>>>>> Thank you~
>>>>>>>>> Dawid, Joe & Xintong
>>>>>>>>> 
>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/1.14+Release
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> 
> -- 
> 
> Best,
> Benchao Li
