tillrohrmann edited a comment on issue #7356: [FLINK-10868][flink-yarn] Enforce maximum failed TMs in YarnResourceManager
URL: https://github.com/apache/flink/pull/7356#issuecomment-457619260
 
 
   2. What about introducing a failure rate instead of a total number of failures? We could say that we allow a certain number of container failures per hour. If this rate is exceeded, then we fail all pending slot requests with a container failure rate exception (see the first sketch after this list).
   3. With a failure rate it might be ok not to account for failures per job individually, because the cluster can still recover after some time and won't become useless for future job submissions. At the moment there is no way to find out the mapping between containers and jobs. The idea is that a free TaskManager from one job can be used to serve slot requests of another job. It would actually be good if only unexpected container completions (e.g. if a container dies) counted towards the failure rate. If a TM merely disconnects, we should not increase the failure counter because the container might still be there.
   4. Yes, for the time being we should let `MaximumFailedTaskManagerExceedingException` extend from `SuppressRestartException`. As a follow-up, I think we should extend the `RestartStrategy` so that it knows about the failure reason. That way, it would be possible to implement a custom `RestartStrategy` which treats certain failures differently (e.g. `MaximumFailedTaskManagerExceedingException`); see the second sketch after this list.
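
For point 2, here is a minimal sketch of the sliding-window idea. The class `ContainerFailureRateTracker` and its methods are hypothetical names chosen for illustration, not existing Flink API; the `YarnResourceManager` could record a failure for every unexpected container completion (per point 3, not for plain TM disconnects) and fail pending slot requests once the rate is exceeded.

```java
import java.time.Clock;
import java.time.Duration;
import java.util.ArrayDeque;

/**
 * Hypothetical helper illustrating the failure-rate idea: count container
 * failures within a sliding time window and report when the configured
 * maximum rate is exceeded.
 */
public class ContainerFailureRateTracker {

    private final int maxFailuresPerInterval; // e.g. allowed failures per hour
    private final Duration interval;          // e.g. Duration.ofHours(1)
    private final Clock clock;
    private final ArrayDeque<Long> failureTimestamps = new ArrayDeque<>();

    public ContainerFailureRateTracker(int maxFailuresPerInterval, Duration interval, Clock clock) {
        this.maxFailuresPerInterval = maxFailuresPerInterval;
        this.interval = interval;
        this.clock = clock;
    }

    /** Record an unexpected container completion (not a plain TM disconnect). */
    public void recordFailure() {
        failureTimestamps.addLast(clock.millis());
    }

    /** True if more than the allowed number of failures happened within the last interval. */
    public boolean isFailureRateExceeded() {
        long cutoff = clock.millis() - interval.toMillis();
        // Drop failures that have fallen out of the sliding window.
        while (!failureTimestamps.isEmpty() && failureTimestamps.peekFirst() < cutoff) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailuresPerInterval;
    }
}
```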
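
For point 4, a sketch of the follow-up idea: a restart strategy that sees the failure cause and suppresses restarts only for `MaximumFailedTaskManagerExceedingException`. The interfaces and classes below are simplified stand-ins (the exception extends `RuntimeException` here to stay self-contained; in the actual change it would extend `SuppressRestartException`), not Flink's real `RestartStrategy` API.

```java
import java.util.Optional;

/** Simplified stand-in for a restart strategy that is aware of the failure cause. */
interface CauseAwareRestartStrategy {
    boolean canRestart(Throwable failureCause);
}

class TaskManagerFailureAwareRestartStrategy implements CauseAwareRestartStrategy {

    @Override
    public boolean canRestart(Throwable failureCause) {
        // Suppress restarts only if the failure was caused by exceeding the maximum
        // number of failed TaskManagers; allow restarts for all other failures.
        return !findThrowable(failureCause, MaximumFailedTaskManagerExceedingException.class).isPresent();
    }

    /** Walk the cause chain looking for an exception of the given type. */
    private static <T extends Throwable> Optional<T> findThrowable(Throwable throwable, Class<T> type) {
        Throwable current = throwable;
        while (current != null) {
            if (type.isInstance(current)) {
                return Optional.of(type.cast(current));
            }
            current = current.getCause();
        }
        return Optional.empty();
    }
}

/** Placeholder for the exception discussed in point 4. */
class MaximumFailedTaskManagerExceedingException extends RuntimeException {
    MaximumFailedTaskManagerExceedingException(String message) {
        super(message);
    }
}
```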
