Github user StephanEwen commented on the issue:
https://github.com/apache/flink/pull/6103
That all depends on why the failure happens in the first place. It seems to
happen if the receiver of a channel starts much faster than the sender. The
longest part of the deployment is library distribution, which happens only
once. After one failure / recovery, the library should be cached and the next
attempt to start the task should be very fast.
