Re: [ANNOUNCE] Stability/Scalability effort and Pull Requests

Paris Carbone Fri, 02 Dec 2016 07:15:39 -0800

This is really huge. Highly appreciated Stephan and the rest :)

The unfair input handling you mentioned was maybe responsible of the incredibly 
slow processing of records through not-so-active feedback edges (e.g. in ML 
projects).


cheers
Paris 

> On 2 Dec 2016, at 14:48, Stephan Ewen <se...@apache.org> wrote:
> 
> Hi fellow Flink enthusiasts!
> 
> Over the past weeks, several community members have been working hard on a
> thread of issues to make Flink more stable and scalable for some demanding
> large scale deployments.
> 
> It was quite a lot of work digging through the masses of logs, setting up
> clusters to reproduce issues, debugging, and so on. Because of that
> intensive time involvement, some of us spent considerably less time on Pull
> Requests - as a result we have quite a backlog of pull requests.
> 
> We are nearing the end of this effort, and I believe the pull requests can
> expect to get more attention again in the near future.
> 
> 
> The *good news* is that this effort really pushed Flink a lot further. We
> addressed a lot of issues that now make Flink run a lot smoother in various
> setups.
> 
> Here is a sample of issues we addressed as part of that work:
> 
>  - The Network Stack had a quite serious issue with (un)fairness of stream
> transport under backpressure.
> 
>  - Checkpoints are more robust now - they decline better when they cannot
> complete and can have a limit for how much data may be buffered/spilled
> during alignment
> 
>  - Cleanup of old checkpoint state happens more reliably and scalable
> 
>  - Robustness of HA recovery is increased, in the presence of ZooKeeper
> problems and inconsistent state in ZooKeeper
> 
>  - Improving the cancellation behavior of checkpoints and connectors.
> Adding a safeguard against tasks and libraries that block the TaskManagers
> during cancellation.
> 
>  - We were diagnosing a RocksDB bug and updating to fixed version
> 
>  - Deployment scalability: Getting rid of some RPC messages, handling a
> large RPC volume on deployment better
> 
>  - Memory leaks in the JobManager in the presence of many restarts
> 
>  - Some bugs in the life cycle of tasks and the execution graph to prevent
> issues with reliable fault tolerance.
> 
> 
> I think we'll all enjoy the results of this effort, especially those that
> have to deploy and maintain large setups.
> 
> 
> Greetings,
> Stephan

Re: [ANNOUNCE] Stability/Scalability effort and Pull Requests

Reply via email to