This is really huge. Highly appreciated, Stephan and the rest :) The unfair input handling you mentioned was perhaps responsible for the incredibly slow processing of records through not-so-active feedback edges (e.g. in ML projects).
cheers
Paris

> On 2 Dec 2016, at 14:48, Stephan Ewen <se...@apache.org> wrote:
>
> Hi fellow Flink enthusiasts!
>
> Over the past weeks, several community members have been working hard on a
> thread of issues to make Flink more stable and scalable for some demanding
> large scale deployments.
>
> It was quite a lot of work digging through the masses of logs, setting up
> clusters to reproduce issues, debugging, and so on. Because of that
> intensive time involvement, some of us spent considerably less time on Pull
> Requests - as a result we have quite a backlog of pull requests.
>
> We are nearing the end of this effort, and I believe the pull requests can
> expect to get more attention again in the near future.
>
>
> The *good news* is that this effort really pushed Flink a lot further. We
> addressed a lot of issues that now make Flink run a lot smoother in various
> setups.
>
> Here is a sample of issues we addressed as part of that work:
>
> - The Network Stack had a quite serious issue with (un)fairness of stream
> transport under backpressure.
>
> - Checkpoints are more robust now - they decline better when they cannot
> complete and can have a limit for how much data may be buffered/spilled
> during alignment
>
> - Cleanup of old checkpoint state happens more reliably and scalable
>
> - Robustness of HA recovery is increased, in the presence of ZooKeeper
> problems and inconsistent state in ZooKeeper
>
> - Improving the cancellation behavior of checkpoints and connectors.
> Adding a safeguard against tasks and libraries that block the TaskManagers
> during cancellation.
>
> - We were diagnosing a RocksDB bug and updating to fixed version
>
> - Deployment scalability: Getting rid of some RPC messages, handling a
> large RPC volume on deployment better
>
> - Memory leaks in the JobManager in the presence of many restarts
>
> - Some bugs in the life cycle of tasks and the execution graph to prevent
> issues with reliable fault tolerance.
>
>
> I think we'll all enjoy the results of this effort, especially those that
> have to deploy and maintain large setups.
>
>
> Greetings,
> Stephan