I see the problem with very large jobs. Maybe we could solve it a bit differently by deploying tasks in topological order when using `EAGER` scheduling.
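To make the idea concrete, here is a minimal, self-contained sketch of deploying tasks in topological order (Kahn's algorithm), so that producers are always deployed before their consumers. The `TaskVertex` class and `deploy()` method are hypothetical stand-ins and not Flink's actual `ExecutionGraph` API.

```java
import java.util.*;

public class TopologicalDeployment {

    /** A task node with edges to its downstream consumers (hypothetical, not Flink's API). */
    static final class TaskVertex {
        final String name;
        final List<TaskVertex> downstream = new ArrayList<>();
        TaskVertex(String name) { this.name = name; }
        void deploy() { System.out.println("Deploying " + name); }
    }

    /** Deploys producers before their consumers using Kahn's algorithm. */
    static void deployInTopologicalOrder(Collection<TaskVertex> vertices) {
        // Count incoming edges for every vertex.
        Map<TaskVertex, Integer> inDegree = new HashMap<>();
        for (TaskVertex v : vertices) inDegree.putIfAbsent(v, 0);
        for (TaskVertex v : vertices) {
            for (TaskVertex c : v.downstream) inDegree.merge(c, 1, Integer::sum);
        }

        // Start with the sources (no incoming edges).
        Deque<TaskVertex> ready = new ArrayDeque<>();
        inDegree.forEach((v, d) -> { if (d == 0) ready.add(v); });

        // Deploy a vertex only once all of its producers have been deployed.
        while (!ready.isEmpty()) {
            TaskVertex v = ready.poll();
            v.deploy();
            for (TaskVertex c : v.downstream) {
                if (inDegree.merge(c, -1, Integer::sum) == 0) ready.add(c);
            }
        }
    }

    public static void main(String[] args) {
        TaskVertex source = new TaskVertex("source");
        TaskVertex map = new TaskVertex("map");
        TaskVertex sink = new TaskVertex("sink");
        source.downstream.add(map);
        map.downstream.add(sink);
        deployInTopologicalOrder(Arrays.asList(source, map, sink));
    }
}
```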
Concerning your answer to my second question: what if the producer partition gets disposed (e.g. due to a failover which does not necessarily restart the downstream operators)? At the moment an upstream task failure will always fail the downstream consumers. However, this can change in the future, and the more assumptions we bake in (e.g. that downstream operators will be failed if upstream operators fail), the harder it gets to change this behaviour.

Moreover, I think it is always a good idea to make the components as self-contained as possible. This also entails that the failover behaviour should ideally not depend on other things happening. Therefore, I'm a bit hesitant to change the existing behaviour.
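As a rough illustration of the "self-contained" point, the sketch below shows a consumer that reacts on its own when the producer partition it requests has been disposed, instead of relying on the upstream failure also failing the consumer. All names (`SelfContainedConsumer`, `PartitionLookup`, `failSelf`) are hypothetical and not Flink's actual network stack API.

```java
public class SelfContainedConsumer {

    /** Result of asking the producer for a partition (hypothetical). */
    enum PartitionState { AVAILABLE, DISPOSED }

    interface PartitionLookup {
        PartitionState lookup(String partitionId);
    }

    private final PartitionLookup lookup;

    SelfContainedConsumer(PartitionLookup lookup) {
        this.lookup = lookup;
    }

    /**
     * Requests the producer partition; if it was disposed (e.g. by an upstream
     * failover), the consumer fails itself rather than waiting to be failed
     * by some other component.
     */
    void requestPartition(String partitionId) {
        if (lookup.lookup(partitionId) == PartitionState.DISPOSED) {
            failSelf("Producer partition " + partitionId + " is no longer available");
            return;
        }
        System.out.println("Consuming partition " + partitionId);
    }

    private void failSelf(String reason) {
        // In a real runtime this would transition the task to FAILED and
        // let the failover strategy decide how to restart it.
        System.out.println("Failing consumer: " + reason);
    }

    public static void main(String[] args) {
        // Simulate a producer whose partition has already been disposed.
        SelfContainedConsumer consumer =
                new SelfContainedConsumer(id -> PartitionState.DISPOSED);
        consumer.requestPartition("partition-1");
    }
}
```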