Hi all,

We discussed the matter again in our latest release planning meeting (see [1]). We see a lot of valid points in this thread. As we were not able to come to a clear conclusion within the meeting, and most of the arguments mentioned would still be valid even if we extended the feature freeze by a month, we are keeping it at early August for now. I will collect all the inputs and talk to some users to further improve the experience, also for those who have extended Flink.

Best,
Joe

[1] https://cwiki.apache.org/confluence/display/FLINK/1.14+Release

> On 07.06.2021, at 05:30, Benchao Li <libenc...@apache.org> wrote:
>
> Hi all,
>
> Thanks Xintong for bringing this up.
>
> I would like to share some experience of using Flink in our company (ByteDance).
>
> 1. We started building our SQL platform in mid 2019, using the v1.9 blink planner, and it's amazing. We also added many internal features that were still missing in this version, including DDL, computed columns, a lot of internal formats and connectors, and some other planner changes.
>
> 2. In early 2020, we planned to upgrade to v1.10. Before we finished cherry-picking internal commits to v1.10, we found that v1.11 was going to be released soon. Hence we decided to upgrade to v1.11. By late 2020, we had almost finished the internal feature cherry-picking work. (It took us so long because we were still adding new features to our online version v1.9 at the same time.)
>
> 3. Now
> Although we did a lot of work to reduce the overhead for our users of upgrading from v1.9 to v1.11, this process is still slow, because:
> a) All the connector/format properties changed (although we have a tool for them to upgrade in one click, they still have a high learning cost)
> b) The checkpoints cannot be upgraded
>
> 4. Future
> We have 5000+ online SQL jobs and hundreds of commits, so we do not plan to do an upgrade in the short term.
> However, v1.11 still lacks a lot of features, for example:
> a) The new UDF type inference does not support aggregate functions
> b) The FLIP-27 new source interface cannot be used in SQL
> We may need to do a lot of cherry-picking to our v1.11.
>
> So, from our point of view, a longer release cycle and more fully finished features may benefit us a lot.
>
>
> On Fri, Jun 4, 2021 at 6:02 PM JING ZHANG <beyond1...@gmail.com> wrote:
>
>> Hi all,
>>
>> @Xintong Song
>> Thanks for reminding me, I will contact Jark to update the wiki page.
>>
>> Besides, I'd like to provide more input by sharing our experience of upgrading our internal version of Flink.
>>
>> Flink has been widely used in the production environment in our company since 2018. Our internal version is about 1 year behind the latest stable version of the community. We upgraded the internal Flink version to 1.10 in March last year, and we plan to upgrade directly to 1.13 next month (skipping 1.11 and 1.12). We wish to use the latest version as soon as possible. However, in practice we catch up with the community's latest stable release only about once a year, because upgrading to a new version is a time-consuming process.
>>
>> The detailed work is listed as follows.
>>
>> a. Before releasing a new internal version
>> 1) Required: Cherry-pick internal features to the new Flink branch. A few features need to be redeveloped based on the new branch code base. BTW, the cost gets heavier and heavier since we maintain more and more internal features in our internal version.
>> 2) Optional: Some internal connectors need to adapt to the new API
>> 3) Required: Surrounding products need to be updated based on the new API, for example, the internal Flink SQL web development platform
>> 4) Required: Regression tests
>>
>> b. After the release, encourage users to upgrade existing jobs (thousands of jobs) to the new version. Users need some time to:
>> 1) Repackage the jar for DataStream jobs
>> 2) For critical jobs, run the jobs on both versions at the same time for a while, and migrate to the new job only after comparing the data carefully.
>> 3) Pure ETL SQL jobs are easy to bump up. But other Flink SQL jobs with stateful operators need extra effort because Flink SQL jobs do not support state compatibility yet.
>>
>> Best regards,
>> JING ZHANG
>>
>> On Fri, Jun 4, 2021 at 2:27 PM Prasanna kumar <prasannakumarram...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> We are using Flink for our eventing system.
>>> Overall we are very happy with the tech, documentation and community support and quick replies in mails.
>>>
>>> My experience with versions over the last year:
>>>
>>> We were working on 1.10 initially during our research phase, then we stabilised on 1.11 as we moved on, but by the time we were about to get into production 1.12 was released. As with all software and products, there were bugs reported. So we waited till 1.12.2 was released and then upgraded. Within a month of us doing it, 1.13 got released.
>>>
>>> But based on past experience, we wait till at least a couple of minor versions (fixing bugs) get released before we move on to a newer version. Development happens at a rapid/good pace in Flink (which is good in terms of features), but adopting and moving the production code to a newer version 3 or 4 times a year is an onerous effort. For example, the memory model was changed in one of the releases (there is good documentation). But for a production user to adopt the newer version, at least a month of testing is required in a large-scale environment. We also do not want to be more than 2 versions behind at any point in time.
>>>
>>> I personally feel 2 major releases a year, or at most a release once every 5 months, is good.
>>>
>>> Thanks
>>> Prasanna.
>>>
>>> On Fri, Jun 4, 2021 at 9:38 AM Xintong Song <tonysong...@gmail.com> wrote:
>>>
>>>> Thanks everyone for the feedback.
>>>>
>>>> @Jing,
>>>> Thanks for the inputs. Could you please ask a committer who works together with you on these items to fill them into the feature collecting wiki page [1]? I assume Jark, who co-edited the FLIP wiki page, is working with you?
>>>>
>>>> @Kurt, @Till and @Seth,
>>>> First of all, a few things that potentially demotivate users from upgrading, observed from users that I've been in touch with.
>>>> 1. It takes time for Flink major releases to get stabilized.
>>>> Many users tend to wait for the bugfix releases (x.y.1/2, or even x.y.3/4) rather than upgrading to x.y.0 immediately. This could take months, sometimes even until after the next major release.
>>>> 2. Many users maintain an internal version of Flink, with customized features for their specific businesses. For them, upgrading Flink requires significant effort to rebase those customized features. On the other hand, the more versions they fall behind, the harder it is to contribute those features to the community, which becomes a vicious cycle.
>>>>
>>>> I think the question to be answered is how we prioritize between stabilizing a previous major release and cutting a new major release. So far, it feels like the new release takes priority. I recall that we waited for weeks to release 1.11.3 because people were busy stabilizing 1.12.0. What if more resources leaned towards the bugfix releases? We may have a more explicit schedule for the bugfix releases. E.g., try to always release the first bugfix release 2 weeks after the major release, the second bugfix release 4 weeks after that, and release on demand starting from the third bugfix release. Or some other rules like this. Would that help speed up the stabilization of a release and give users more confidence to upgrade earlier?
>>>>
>>>> A related question is how we prioritize between cutting a release and motivating more contributors. In my experience, what Kurt described, that committers cannot help contributors due to "planned features", usually happens during the release testing period or right before that (when people are struggling to catch the feature freeze). This probably indicates that currently cutting a release on time is prioritized over the contributor's experience. Do we need to change that?
>>>>
>>>> If extending the release period does not simply mean that more features are pushed into each release, but rather allows a longer period for the release to get stabilized while leaving more capacity for bugfix releases and helping contributors, it might be a good idea. To be specific, currently we have a 4-month period: 3 months of feature development + 1 month of release testing. We might consider a 5-month period: 3 months of feature development + 2 months of release testing.
>>>>
>>>> To sum up, I'm leaning towards extending the overall release period a bit, while keeping the period before feature freeze. WDYT?
>>>>
>>>> Thank you~
>>>>
>>>> Xintong Song
>>>>
>>>>
>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/1.14+Release
>>>>
>>>> On Thu, Jun 3, 2021 at 9:00 PM Seth Wiesman <sjwies...@gmail.com> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> +1 for the Release Managers. Thank you all for volunteering.
>>>>>
>>>>> @Till Rohrmann <trohrm...@apache.org> A common sentiment that I have heard from many users is that upgrading off of 1.9 was very difficult. In particular, a lot of people struggled to understand the new memory model. Many users who required custom memory configurations in earlier versions assumed they should carry those configurations into later versions and then found themselves with OOM and instability issues. The good news is Flink did what it was supposed to do, and so for the majority, dropping their custom configurations and just setting total process memory was the correct solution; this was not an issue of a buggy release. The problem is that people do not read the release notes or fully understand the implications of the change. Back to Kurt's point, this transition seems to have left a bad taste in many mouths, slowing some users' adoption of newer versions.
>>>>> I don't know that I have a solution to this problem. I think it is more communication than engineering, but I'm open to continuing the discussion.
>>>>>
>>>>> On Thu, Jun 3, 2021 at 5:04 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> Thanks for volunteering as our release managers Xintong, Dawid and Joe!
>>>>>>
>>>>>> Thanks for starting the discussion about the release date Kurt. Personally, I prefer in general shorter release cycles as it allows us to deliver features faster and people feel less pressure to merge half-done features last minute because they fear that they have to wait a long time if they missed the train. Also, it forces us to make the release process less of a stop-the-world event and cut down the costs of releases.
>>>>>>
>>>>>> On the other hand, if our users don't upgrade Flink fast enough, then releasing more often won't have the effect of shipping features to our users and getting feedback from our users faster. What I believe we should try to do is to understand why upgrading Flink is so difficult for them. What are the things preventing a quick upgrade and how can we improve the situation for our users? Are our APIs not stable enough? Does Flink's behavior change too drastically between versions? Is the tooling for upgrades lagging behind? Are they just cautious and don't want to use bleeding edge software?
>>>>>>
>>>>>> If there is a problem that the majority of users are using an unsupported version, then one solution could also be to extend the list of supported Flink versions to the latest 3 versions, for example.
>>>>>>
>>>>>> About your 2) point I am a bit skeptical. I think that we will simply plan more features and end up in the same situation wrt external contributions.
>>>>>> If it weren't the case, then it would also work with shorter release cycles by simply planning less feature work and including the external contribution, which could not be done in the past release, in the next release. So in the end it is about what we plan for a release and not so much how much time we have (assuming that we plan less if we have less time and vice versa).
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Thu, Jun 3, 2021 at 5:08 AM Kurt Young <ykt...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for bringing this up.
>>>>>>>
>>>>>>> I have one thought about the release period. In short: shall we try to extend the release period by 1 month?
>>>>>>>
>>>>>>> There are a couple of reasons why I want to bring up this proposal.
>>>>>>>
>>>>>>> 1) I observed that lots of users are actually far behind the current Flink version. For example, we are now actively developing 1.14 but most users I know who have a migration or upgrade plan are planning to upgrade to 1.12. This means we need to backport bug fixes to 1.12 and 1.13. If we extend the release period by 1 month, I think there is some chance that users can have a proper time frame to upgrade to the previously released version. Then we can have a good development cycle which looks like "actively developing the current version and making the previous version stable, not 2 ~ 3 versions before". Always being far away from Flink's latest version also suppresses users' motivation to contribute to Flink.
>>>>>>>
>>>>>>> 2) Increasing the release period also eases the workload of committers, which I think can improve the contributor experience.
>>>>>>> I have seen several times that when some contributors want to work on new features or improvements, we have to respond with "sorry, we are right now focusing on implementing/stabilizing planned features for this version", and the contributions are most likely stalled and never brought up again.
>>>>>>>
>>>>>>> BTW extending the release period also has downsides. It slows down the delivery of new features. And I'm also not sure how much it can improve the above 2 issues.
>>>>>>>
>>>>>>> Looking forward to hearing some feedback from the community, both users and developers.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kurt
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 2, 2021 at 8:39 PM JING ZHANG <beyond1...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Dawid, Joe & Xintong,
>>>>>>>>
>>>>>>>> Thanks for starting the discussion.
>>>>>>>>
>>>>>>>> I would like to polish Window TVFs [1][2], which is a popular SQL feature introduced in 1.13.
>>>>>>>>
>>>>>>>> The detailed items are as follows.
>>>>>>>> 1. Add more computations based on Window TVF
>>>>>>>> * Window Join (which is already merged in master branch)
>>>>>>>> * Window Table Function
>>>>>>>> * Window Deduplicate
>>>>>>>> 2. Finish related JIRA issues to improve user experience
>>>>>>>> * Add offset support for TUMBLE, HOP, and session windows
>>>>>>>> 3. Complement the functions that are missing compared to the group window, which is a precondition for deprecating the legacy Grouped Window Function in later versions.
>>>>>>>> * Support Session windows
>>>>>>>> * Support allow-lateness
>>>>>>>> * Support retract input stream
>>>>>>>> * Support window TVF in batch mode
>>>>>>>>
>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-19604
>>>>>>>> [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function#FLIP145:SupportSQLwindowingtablevaluedfunction-CumulatingWindows
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> JING ZHANG
>>>>>>>>
>>>>>>>> On Wed, Jun 2, 2021 at 6:45 PM Xintong Song <xts...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> As 1.13 has been released for a while, I think it is a good time to start planning for the 1.14 release cycle.
>>>>>>>>>
>>>>>>>>> - Release managers: This time we'd like to have a team of 3 release managers. Dawid, Joe and I would like to volunteer for it. What do you think about it?
>>>>>>>>>
>>>>>>>>> - Timeline: According to our approximate 4 months release period, we propose to aim for a feature freeze roughly in early August (which could mean something like early September for the 1.14 release). Does it work for everyone?
>>>>>>>>>
>>>>>>>>> - Collecting features: It would be helpful to have a rough overview of the new features that will likely be included in this release. We have created a wiki page [1] for collecting such information. We'd like to kindly ask all committers to fill in the page with features that they intend to work on.
>>>>>>>>>
>>>>>>>>> We would also like to emphasize some aspects of the engineering process:
>>>>>>>>>
>>>>>>>>> - Stability of master: This has been an issue during the 1.13 feature freeze phase and it is still going on.
>>>>>>>>> We encourage every committer to not merge PRs through the GitHub button, but do this manually, with caution for the commits merged after the CI was triggered. It would be appreciated to always build the project before merging to master.
>>>>>>>>>
>>>>>>>>> - Documentation: Please try to see documentation as an integrated part of the engineering process and don't push it to the feature freeze phase or even after. You might even think about going documentation first. We, as the Flink community, are adding great stuff that is pushing the limits of streaming data processors with every release. We should also make this stuff usable for our users by documenting it well.
>>>>>>>>>
>>>>>>>>> - Promotion of 1.14: What applies to documentation also applies to all the activity around the release. We encourage every contributor to also think about, plan and prepare activities like blog posts and talks that will promote and spread the release once it is done.
>>>>>>>>>
>>>>>>>>> Please let us know what you think.
>>>>>>>>>
>>>>>>>>> Thank you~
>>>>>>>>> Dawid, Joe & Xintong
>>>>>>>>>
>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/FLINK/1.14+Release
>
> --
>
> Best,
> Benchao Li