[GitHub] [flink] tweise commented on issue #6567: [FLINK-10074] Allowable number of checkpoint failures

GitHub Mon, 17 Sep 2018 04:11:21 -0700

@yanghua @tillrohrmann I think a user would normally expect this count to apply 
globally, but please also consider the case of an intermittent failure (such as 
an S3 rate limit, or the storage backend being unavailable for some other 
reason). In a large job, that could cause many subtasks to fail in parallel. 
While this could be addressed by setting a correspondingly high threshold, that 
would in turn mean that a problem isolated to a single task would not hit the 
threshold until much later, leaving the job flip-flopping instead of failing.
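
To make the trade-off concrete, here is a rough sketch of the arithmetic. It is not Flink code: the function name, the parallelism of 1000, and the threshold are all made up for illustration, and it assumes a single global counter that adds one per failed subtask and is never reset.

```python
# Hypothetical illustration only; none of these names exist in Flink.

def checkpoints_until_threshold(threshold, failing_subtasks_per_checkpoint):
    """Count how many checkpoint attempts pass before one global counter of
    subtask-level failures reaches `threshold` (assumed: counter never resets)."""
    if failing_subtasks_per_checkpoint <= 0:
        raise ValueError("at least one subtask must fail per checkpoint")
    count = 0
    checkpoints = 0
    while count < threshold:
        checkpoints += 1
        count += failing_subtasks_per_checkpoint
    return checkpoints


PARALLELISM = 1000            # assumed size of a "large job"
THRESHOLD = 3 * PARALLELISM   # sized to ride out 3 checkpoints of a full outage

# Storage outage: every subtask fails in the same checkpoint.
outage = checkpoints_until_threshold(THRESHOLD, PARALLELISM)

# Isolated problem: one subtask fails in every checkpoint.
isolated = checkpoints_until_threshold(THRESHOLD, 1)

print(outage, isolated)
```

With these assumed numbers, a threshold high enough to tolerate three checkpoints of a storage outage lets a single broken subtask fail 3000 checkpoints in a row before the job is failed.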


[ Full content available at: https://github.com/apache/flink/pull/6567 ]
This message was relayed via gitbox.apache.org for [email protected]
