[
https://issues.apache.org/jira/browse/FLINK-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steven Zhen Wu updated FLINK-8043:
----------------------------------
Description:
Fine-grained recovery publishes fullRestarts as a gauge, which is not suitable
for threshold-based alerting. Usually we would alert on a rule like
"fullRestarts > 0 happens 10 times in the last 15 minutes".
In comparison, "task_failures" is published as a counter.
was: When fine-grained recovery fails (e.g. because there are not enough
TaskManager slots when a replacement TaskManager node doesn't come back in
time), Flink reverts to a full job restart. In this case, it should also
increment the "job restart" metric.
Summary: change fullRestarts (for fine-grained recovery) from gauge to
counter (was: increment job restart metric when fine-grained recovery reverted
to full job restart)
> change fullRestarts (for fine-grained recovery) from gauge to counter
> ---------------------------------------------------------------------
>
> Key: FLINK-8043
> URL: https://issues.apache.org/jira/browse/FLINK-8043
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager
> Affects Versions: 1.3.2
> Reporter: Steven Zhen Wu
>
> Fine-grained recovery publishes fullRestarts as a gauge, which is not
> suitable for threshold-based alerting. Usually we would alert on a rule like
> "fullRestarts > 0 happens 10 times in the last 15 minutes".
> In comparison, "task_failures" is published as a counter.
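
To illustrate the alerting concern, here is a minimal sketch in plain Java with
no Flink dependency. It assumes the monitoring backend samples a cumulative
restart total once per minute; the class name, the method names, and the sample
data are all hypothetical and are not taken from Flink or from any reporter. A
gauge-style rule that only looks at the latest value keeps firing long after
the last restart, while a counter-style rule that looks at the increase over a
window stops firing once restarts stop.

```java
class FullRestartsAlertSketch {

    /** Gauge-style check: looks only at the latest sampled value. */
    static boolean gaugeCheck(long[] samples, int now) {
        return samples[now] > 0;
    }

    /** Counter-style check: looks at the increase over the last `window` samples. */
    static boolean counterCheck(long[] samples, int now, int window) {
        int start = Math.max(0, now - window);
        return samples[now] - samples[start] > 0;
    }

    public static void main(String[] args) {
        // Hypothetical per-minute samples of a cumulative restart total:
        // one full restart happens at minute 2, then nothing afterwards.
        long[] samples = {0, 0, 1, 1, 1, 1, 1, 1};
        int now = samples.length - 1;

        // The latest value is still 1, so a "> 0" rule on the raw value fires.
        System.out.println("gauge-style fires:   " + gaugeCheck(samples, now));
        // No increase over the last 3 samples, so the windowed rule stays quiet.
        System.out.println("counter-style fires: " + counterCheck(samples, now, 3));
    }
}
```

How a real backend treats counters (for example, converting them to rates) varies
by reporter, so this only shows the general idea behind the request.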