@yanghua @tillrohrmann I think a user would normally expect this count to apply globally, but please also consider the case of an intermittent failure (such as an S3 rate limit, or the storage backend being unavailable for some other reason). In a large job, that could cause many subtasks to fail in parallel. While this could be addressed by setting a correspondingly high threshold, that would in turn mean that a problem isolated to a single task would not hit the threshold until much later, leaving the job flip-flopping instead of failing.
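To make the trade-off concrete, here is a minimal sketch of the two counting strategies. It is not Flink's actual implementation: the names `FailureCounter`, `GlobalCounter`, `PerSubtaskCounter`, and the "fail once the count exceeds the threshold" rule are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types exist in Flink. */
public class FailureThresholdSketch {

    /** Returns true once the job should be failed (assumed rule: count > threshold). */
    interface FailureCounter {
        boolean recordFailure(int subtaskIndex);
    }

    /** Counts every reported failure against one job-wide threshold. */
    static final class GlobalCounter implements FailureCounter {
        private final int threshold;
        private int failures;

        GlobalCounter(int threshold) {
            this.threshold = threshold;
        }

        @Override
        public boolean recordFailure(int subtaskIndex) {
            failures++;
            return failures > threshold;
        }
    }

    /** Counts failures separately for each subtask. */
    static final class PerSubtaskCounter implements FailureCounter {
        private final int threshold;
        private final Map<Integer, Integer> failures = new HashMap<>();

        PerSubtaskCounter(int threshold) {
            this.threshold = threshold;
        }

        @Override
        public boolean recordFailure(int subtaskIndex) {
            int count = failures.merge(subtaskIndex, 1, Integer::sum);
            return count > threshold;
        }
    }

    /** Intermittent outage: every subtask fails exactly once. */
    static boolean tripsOnBurst(FailureCounter counter, int parallelism) {
        boolean tripped = false;
        for (int subtask = 0; subtask < parallelism; subtask++) {
            tripped |= counter.recordFailure(subtask);
        }
        return tripped;
    }

    /** Isolated problem: one subtask keeps failing. Returns failures needed to trip, or -1. */
    static int failuresUntilTrip(FailureCounter counter, int subtaskIndex, int limit) {
        for (int attempt = 1; attempt <= limit; attempt++) {
            if (counter.recordFailure(subtaskIndex)) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int parallelism = 500;

        // A low global threshold fails the job on a single transient outage.
        System.out.println(tripsOnBurst(new GlobalCounter(3), parallelism));
        // Raising the global threshold to survive the outage delays detection
        // of a single persistently failing subtask.
        System.out.println(failuresUntilTrip(new GlobalCounter(parallelism), 7, 10_000));
        // Per-subtask counting, shown only for contrast.
        System.out.println(tripsOnBurst(new PerSubtaskCounter(3), parallelism));
        System.out.println(failuresUntilTrip(new PerSubtaskCounter(3), 7, 10_000));
    }
}
```

With a parallelism of 500 in this sketch, a global threshold of 3 fails the job on one outage, while a global threshold of 500 needs 501 failures from the single bad subtask before the job fails. Per-subtask counting is shown only as a point of comparison, not as a proposal.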
[ Full content available at: https://github.com/apache/flink/pull/6567 ]
