Stephan and I wrote the following document on how to handle task failures, how to attribute each failure to its correct root cause, and how to suppress follow-up failures. It defines the behaviour that should be followed for the different kinds of task failures.
https://cwiki.apache.org/confluence/display/FLINK/Task+Failures+and+Error+Handling

Feel free to comment. If there are no objections, I will open issues for the individual items.

– Ufuk