tillrohrmann edited a comment on issue #7356: [FLINK-10868][flink-yarn] Enforce maximum failed TMs in YarnResourceManager
URL: https://github.com/apache/flink/pull/7356#issuecomment-457619260

2. What about introducing a failure rate instead of a total number of failures? We could say that we allow a certain number of container failures per hour. If this number is exceeded, then we fail all pending slot requests with a container failure rate exception (see the sketch after this list).
3. With a failure rate it might be ok not to account for failures per job, because the cluster can still recover after some time and won't be useless for future job submissions. At the moment there is no way to find out the mapping between containers and jobs anyway. The idea is that a free TaskManager from one job can be used to serve slot requests of another job. Ideally, only unexpected container completions (e.g. a container dying) would count towards the failure rate. If a TM merely disconnects, we should not increase the failure counter because the container might still be alive.
4. Yes, temporarily we should let `MaximumFailedTaskManagerExceedingException` extend `SuppressRestartException`. As a follow-up, I think we should extend the `RestartStrategy` so that it knows about the failure cause. That way, it would be possible to implement a custom `RestartStrategy` which treats failures differently (e.g. `MaximumFailedTaskManagerExceedingException`); a sketch follows below.
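
To make item 2 concrete, here is a minimal sketch of how a per-hour container failure rate could be tracked with a sliding window. The class name, constructor parameters, and the idea of invoking it from the container-completion callback are assumptions for illustration only; they are not the existing `YarnResourceManager` API.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sliding-window counter for unexpected container failures.
 * Names and integration points are assumptions, not actual Flink classes.
 */
public class ContainerFailureRateTracker {

    private final int maxFailuresPerInterval;   // e.g. 10
    private final Duration interval;            // e.g. Duration.ofHours(1)
    private final Clock clock;
    private final Deque<Instant> failureTimestamps = new ArrayDeque<>();

    public ContainerFailureRateTracker(int maxFailuresPerInterval, Duration interval, Clock clock) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.interval = interval;
        this.clock = clock;
    }

    /** Records an unexpected container completion (not a plain TM disconnect). */
    public void recordContainerFailure() {
        failureTimestamps.addLast(clock.instant());
        pruneOldFailures();
    }

    /** True if more failures than allowed occurred within the interval. */
    public boolean isFailureRateExceeded() {
        pruneOldFailures();
        return failureTimestamps.size() > maxFailuresPerInterval;
    }

    private void pruneOldFailures() {
        final Instant cutoff = clock.instant().minus(interval);
        while (!failureTimestamps.isEmpty() && failureTimestamps.peekFirst().isBefore(cutoff)) {
            failureTimestamps.removeFirst();
        }
    }
}
```

The intended use would be to call `recordContainerFailure()` only for unexpected container completions and, once `isFailureRateExceeded()` returns true, fail all pending slot requests with the proposed container failure rate exception.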
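
For item 4, the following sketch illustrates the proposed relationship: the new exception extends the suppress-restart exception so the current restart handling backs off, and a future, cause-aware `RestartStrategy` could decide per failure type. The base classes below are simplified stand-ins for Flink's own types, not their real signatures.

```java
/**
 * Sketch of the exception hierarchy and a cause-aware restart strategy
 * discussed in items 2 and 4. All types here are simplified stand-ins.
 */
public class FailureHandlingSketch {

    /** Stand-in for Flink's suppress-restart exception type. */
    public static class SuppressRestartException extends RuntimeException {
        public SuppressRestartException(String message) {
            super(message);
        }
    }

    /** Thrown when too many TaskManager containers failed within the interval. */
    public static class MaximumFailedTaskManagerExceedingException extends SuppressRestartException {
        public MaximumFailedTaskManagerExceedingException(String message) {
            super(message);
        }
    }

    /** A RestartStrategy variant that is told about the failure cause. */
    public interface CauseAwareRestartStrategy {
        boolean canRestart(Throwable failureCause);
    }

    /** Example strategy: never restart on container failure-rate violations. */
    public static class IgnoreContainerFailureRateStrategy implements CauseAwareRestartStrategy {
        @Override
        public boolean canRestart(Throwable failureCause) {
            return !(failureCause instanceof MaximumFailedTaskManagerExceedingException);
        }
    }
}
```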
