Thanks everyone for the feedback. I posted a status update for Flink 1.11.3 earlier, in its corresponding discussion thread [1].
From the looks of it, it seems like it makes sense to proceed with StateFun 2.2.1 without waiting for Flink 1.11.3. Since this is also the consensus we've reached here, I have proceeded to create RC1 for StateFun 2.2.1 [2].

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html
[2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-StateFun-hotfix-version-2-2-1-td46239.html

On Tue, Nov 3, 2020 at 10:42 PM Robert Metzger <rmetz...@apache.org> wrote:

> Hi Gordon,
> thanks a lot for this clarification.
>
> In this case I would vote for releasing StateFun 2.2.1 asap and not wait
> for 1.11.3.
>
> Thanks a lot for your efforts!
>
> On Tue, Nov 3, 2020 at 3:38 PM Tzu-Li (Gordon) Tai <tzuli...@apache.org>
> wrote:
>
>> Hi Robert,
>>
>> So far we've only seen a single user report the issue, but the severity
>> of FLINK-19692 is actually pretty huge.
>> TL;DR: If a checkpoint / savepoint that contains feedback events (which
>> is considered normal under typical StateFun operations) is attempted to be
>> restored from, the restore would always fail.
>>
>> That's why we came up with the discussion to potentially release a
>> "partial" solution with StateFun 2.2.1 already so that at least there is a
>> StateFun release available that works properly with failure recoveries,
>> and then after that release another follow-up StateFun hotfix release
>> 2.2.2, which would include Flink 1.11.3, to address the remaining part of
>> the problem.
>>
>> BR,
>> Gordon
>>
>> On Tue, Nov 3, 2020 at 9:33 PM Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Thanks a lot for starting this thread.
>>> How many users are affected by the problem? Is it somebody else besides
>>> the initial issue reporter?
>>> If it is just one person, I would suggest to rather help pushing the
>>> 1.11.3 release over the line or work on more StateFun features ;)
>>>
>>> On Tue, Nov 3, 2020 at 11:58 AM Igal Shilman <i...@ververica.com> wrote:
>>>
>>>> Hi Gordon,
>>>> Thanks for driving this discussion!
>>>>
>>>> I would go with the second suggestion - having two consecutive StateFun
>>>> releases 2.2.1 and 2.2.2, since the Flink-1.11.3 release might take a
>>>> while, and this hot-fix release is important enough to get out as early
>>>> as possible.
>>>>
>>>> Cheers,
>>>> Igal.
>>>>
>>>> On Mon, Nov 2, 2020 at 11:43 AM Tzu-Li (Gordon) Tai <tzuli...@apache.org>
>>>> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > We’re currently thinking about releasing StateFun 2.2.1, to address a
>>>> > critical bug that causes restores from checkpoints / savepoints to fail
>>>> > under certain circumstances [1].
>>>> >
>>>> > To provide a bit more context, the full fix for this issue is two-fold:
>>>> >
>>>> > 1. *Fix restoring from checkpoints / savepoints taken with the same
>>>> >    StateFun version:* this has already been fixed in StateFun, with
>>>> >    changes backported to `flink-statefun/release-2.2`.
>>>> > 2. *Allow restoring from older savepoints taken with StateFun <=
>>>> >    2.2.0:* this requires a few fixes to Flink around restoring
>>>> >    heap-based timers [2] and iterating through key groups in restored
>>>> >    raw keyed state streams [3]. These fixes will be included in Flink
>>>> >    1.11.3 [4], meaning that to fix this, StateFun will need to wait
>>>> >    until Flink 1.11.3 is out and upgrade its Flink dependency.
>>>> >
>>>> > The main discussion point here is whether or not it makes sense for
>>>> > StateFun 2.2.1 to wait for Flink 1.11.3, so that both parts of the
>>>> > problems 1) and 2) can be solved together in a single hotfix release.
>>>> >
>>>> > The other option is to release StateFun 2.2.1 already with fixes for
>>>> > problem 1) only, and have another follow-up hotfix release 2.2.2 after
>>>> > Flink 1.11.3 is available.
>>>> >
>>>> > I propose to keep a close eye on the progress of Flink 1.11.3 (you can
>>>> > track progress on the 1.11.3 discussion thread [4]), and *make a
>>>> > decision here mid-week on Wednesday, Nov. 4th*.
>>>> > If by then we decide to not let StateFun 2.2.1 wait for Flink 1.11.3
>>>> > because it could take a while, we can start with a StateFun 2.2.1 RC
>>>> > right away; otherwise, if Flink 1.11.3 seems to be just around the
>>>> > corner, we can wait for a few more days.
>>>> >
>>>> > What do you think?
>>>> >
>>>> > Cheers,
>>>> > Gordon
>>>> >
>>>> > [1] https://issues.apache.org/jira/browse/FLINK-19692
>>>> > [2] https://github.com/apache/flink/pull/13761
>>>> > [3] https://github.com/apache/flink/pull/13772
>>>> > [4] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-Apache-Flink-1-11-3-td45989.html