The issue FLINK-4904 (Add a limit for how much data may be spilled in checkpoint alignments) is done for master and I am currently backporting it. I hope to finish that this week...
Stephan

On Wed, Nov 2, 2016 at 5:03 PM, Till Rohrmann <till.rohrm...@gmail.com> wrote:

> It might make sense to backport
>
> - [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM
> side: https://github.com/apache/flink/pull/2742
>
> as well. This will allow us to activate the quarantine monitoring by
> default in 1.1.4 without risking killing all TMs in case of a JM failure.
>
> Cheers,
> Till
>
> On Wed, Nov 2, 2016 at 11:43 AM, Ufuk Celebi <u...@apache.org> wrote:
>
> > As a quick update: the "pending review" issues have all been resolved.
> >
> > The open issues are still open:
> >
> > - FLINK-4904: Add a limit for how much data may be spilled in
> > checkpoint alignments => fix pending
> > - FLINK-4910: Introduce safety net for closing file system streams
> >
> > Any updates here?
> >
> > – Ufuk
> >
> >
> > On Fri, Oct 28, 2016 at 5:45 PM, Stefan Richter
> > <s.rich...@data-artisans.com> wrote:
> > > Benefit of a backport, as I see it, is increased stability. The danger
> > > is potentially breaking some code that was casting FileSystems to subtypes
> > > like LocalFileSystem. I don’t know how common that would be in user code.
> > >
> > >> On 28.10.2016 at 14:27, Ufuk Celebi <u...@apache.org> wrote:
> > >>
> > >> Thanks for all your feedback.
> > >>
> > >> If there are no objections, I would like to stick to the mentioned
> > >> issues in this thread and create RC1 as soon as they are all
> > >> addressed. This will probably not be this week though, but it looks
> > >> good for next week.
> > >>
> > >> DONE
> > >> =====
> > >> - FLINK-4619: Answer client if savepoint restore fails
> > >> - FLINK-4715: Safety net for stuck task cancellation
> > >> - FLINK-4510: Always create CheckpointCoordinator
> > >> - FLINK-4894: Don't block on buffer request after broadcast event
> > >> - FLINK-4298: Add proper repository for Closure dependencies
> > >> - FLINK-4218: Do not fail checkpoints when state size cannot be
> > >> determined
> > >> - FLINK-3347: TaskManager (or its ActorSystem) need to restart in case
> > >> they notice quarantine
> > >> - FLINK-4875: Use correct operator name
> > >> - FLINK-4913: Include user jars in system class loader
> > >>
> > >> PENDING REVIEW
> > >> ===============
> > >> - FLINK-4445: Add option to ignore unmatched state when restoring from
> > >> savepoint => https://github.com/apache/flink/pull/2713
> > >> - FLINK-4932: Don't let ExecutionGraph fail when in state Restarting
> > >> => https://github.com/apache/flink/pull/2711
> > >> - FLINK-4933: ExecutionGraph.scheduleOrUpdateConsumers can fail the
> > >> ExecutionGraph => https://github.com/apache/flink/pull/2701
> > >>
> > >> OPEN
> > >> =====
> > >> - FLINK-4904: Add a limit for how much data may be spilled in
> > >> checkpoint alignments => fix pending
> > >> - FLINK-4910: Introduce safety net for closing file system streams =>
> > >> @Stephan, Stefan: What's the conclusion of your discussion whether to
> > >> backport this or not?
> > >>
> > >>
> > >> On Wed, Oct 26, 2016 at 9:57 PM, dan bress <danbr...@gmail.com> wrote:
> > >>> +1 for this release,
> > >>> also +1 to Chesnay's suggestion to include this: [FLINK-4875] [metrics]
> > >>> Use correct operator name
> > >>>
> > >>> Dan
> > >>>
> > >>> On Wed, Oct 26, 2016 at 5:06 AM Till Rohrmann <trohrm...@apache.org> wrote:
> > >>>
> > >>>> I'll work on FLINK-3347.
> > >>>> Additionally I would like to get in
> > >>>>
> > >>>> - https://issues.apache.org/jira/browse/FLINK-4932: Don't let
> > >>>> ExecutionGraph fail when in state Restarting
> > >>>> - https://issues.apache.org/jira/browse/FLINK-4933:
> > >>>> ExecutionGraph.scheduleOrUpdateConsumers
> > >>>> can fail the ExecutionGraph
> > >>>>
> > >>>> Cheers,
> > >>>> Till
> > >>>>
> > >>>> On Wed, Oct 26, 2016 at 1:02 PM, Stephan Ewen <se...@apache.org> wrote:
> > >>>>
> > >>>>> Concerning backporting the "I/O streams safety net" - we need to make sure
> > >>>>> that this does not change any behavior that users may implicitly expect.
> > >>>>>
> > >>>>>
> > >>>>> On Wed, Oct 26, 2016 at 11:21 AM, Maximilian Michels <m...@apache.org> wrote:
> > >>>>>
> > >>>>>> +1 for a 1.1.4 release
> > >>>>>>
> > >>>>>> We could backport putting user jars into the system class loader for
> > >>>>>> per-job Yarn clusters: https://github.com/apache/flink/pull/2692
> > >>>>>> Arguably, this is somewhat a new feature but it gets rid of duplicate
> > >>>>>> class loading issues users experienced in practice.
> > >>>>>>
> > >>>>>> We already have the following commits on the release-1.1 branch:
> > >>>>>>
> > >>>>>> 05a5f46 [FLINK-4862] fix Timer register in ContinuousEventTimeTrigger
> > >>>>>> 5731672 [FLINK-4581] [table] Fix Table API throwing "No suitable driver
> > >>>>>> found for jdbc:calcite"
> > >>>>>> 9c87f92 [FLINK-4586] [core] Broken AverageAccumulator
> > >>>>>> 210230c [FLINK-4829] snapshot accumulators on a best-effort basis
> > >>>>>> c1d6b24 [FLINK-4829] protect user accumulators against concurrent updates
> > >>>>>> fe464b4 [FLINK-4709] [core] Fix resource leak in InputStreamFSInputWrapper
> > >>>>>> 9f72698 [FLINK-4108] [scala] Respect ResultTypeQueryable for InputFormats.
> > >>>>>> 9591d50 [FLINK-4506] [DataSet] Fix documentation of CsvOutputFormat about
> > >>>>>> incorrect default of allowNullValues
> > >>>>>> c9433bf [FLINK-3706] Fix YARN test instability
> > >>>>>> 2203f74 [FLINK-4778] [docs] Fix WordCount parameters in CLI examples.
> > >>>>>>
> > >>>>>> -Max
> > >>>>>>
> > >>>>>>
> > >>>>>> On Wed, Oct 26, 2016 at 7:05 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >>>>>>> +1
> > >>>>>>>
> > >>>>>>> Looking forward to this release!
> > >>>>>>>
> > >>>>>>> Regards
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Oct 25, 2016, at 14:43, Robert Metzger <rmetz...@apache.org> wrote:
> > >>>>>>>> +1 for a bugfix release soon.
> > >>>>>>>>
> > >>>>>>>> On Tue, Oct 25, 2016 at 10:53 AM, Stephan Ewen <se...@apache.org> wrote:
> > >>>>>>>>
> > >>>>>>>>> Thanks for starting this, Ufuk.
> > >>>>>>>>>
> > >>>>>>>>> I would like to add the following issues to 1.1.4:
> > >>>>>>>>>
> > >>>>>>>>> Build errors due to Storm dependencies *(fix pending)*
> > >>>>>>>>>   - [FLINK-4298] [storm compatibility] Add proper repository for Closure
> > >>>>>>>>> dependencies.
> > >>>>>>>>>
> > >>>>>>>>> Stability on S3 considering eventual consistency *(fix pending)*
> > >>>>>>>>>   - [FLINK-4218] [checkpoints] Do not fail checkpoints when state size
> > >>>>>>>>> cannot be determined
> > >>>>>>>>>
> > >>>>>>>>> Avoiding Zombie TaskManagers *(still needs to be done)*
> > >>>>>>>>>   - [FLINK-3347] [akka] TaskManager (or its ActorSystem) need to restart
> > >>>>>>>>> in case they notice quarantine
> > >>>>>>>>>
> > >>>>>>>>> Adding a limit to the amount of data spilled during checkpoint alignments
> > >>>>>>>>> *(fix is work in progress)*
> > >>>>>>>>>   - [FLINK-4904] [checkpoints] Add a limit for how much data may be
> > >>>>>>>>> spilled in checkpoint alignments
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I can push the first two fixes to the 1.1.4 branch in a bit, the fourth one
> > >>>>>>>>> later today.
> > >>>>>>>>> The third one (akka) is still pending.
> > >>>>>>>>>
> > >>>>>>>>> Best,
> > >>>>>>>>> Stephan
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Oct 24, 2016 at 3:32 PM, Ufuk Celebi <u...@apache.org> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hey all,
> > >>>>>>>>>>
> > >>>>>>>>>> I would like to start the discussion for kicking off the next bug fix
> > >>>>>>>>>> release, Flink 1.1.4. What do you think about aiming for an RC by end
> > >>>>>>>>>> of this week?
> > >>>>>>>>>>
> > >>>>>>>>>> Users reported some instabilities/inconveniences that would be good to
> > >>>>>>>>>> fix.
> > >>>>>>>>>>
> > >>>>>>>>>> Personally, I would like to backport the following fixes:
> > >>>>>>>>>>
> > >>>>>>>>>> (1) https://issues.apache.org/jira/browse/FLINK-4619: Answer client if
> > >>>>>>>>>> savepoint restore fails (Already merged for master, needs minimal
> > >>>>>>>>>> adjustment for 1.1)
> > >>>>>>>>>> (2) https://issues.apache.org/jira/browse/FLINK-4715: Safety net for
> > >>>>>>>>>> stuck task cancellation (Already reviewed for master, waiting for
> > >>>>>>>>>> tests of the backport to finish)
> > >>>>>>>>>> (3) https://issues.apache.org/jira/browse/FLINK-4510: Always create
> > >>>>>>>>>> CheckpointCoordinator (Already merged for master, needs minimal
> > >>>>>>>>>> adjustments for 1.1)
> > >>>>>>>>>>
> > >>>>>>>>>> Furthermore, I would like to address the following:
> > >>>>>>>>>>
> > >>>>>>>>>> (4) https://issues.apache.org/jira/browse/FLINK-4445: Add option to
> > >>>>>>>>>> ignore unmatched state when restoring from savepoint
> > >>>>>>>>>> (5) https://issues.apache.org/jira/browse/FLINK-4894: Don't block on
> > >>>>>>>>>> buffer request after broadcast event
> > >>>>>>>>>>
> > >>>>>>>>>> Strictly speaking, (4) is not a bug fix. But given that it would
> > >>>>>>>>>> only add an optional flag to savepoint restoring and should have been
> > >>>>>>>>>> addressed for 1.1.0 already, I would like to get it in.