[
https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936639#comment-16936639
]
Till Rohrmann commented on FLINK-14164:
---------------------------------------
I think you are right that with the new fine-grained recovery strategy the
{{fullRestart}} metric does not make much sense anymore. One question that
comes to mind is whether we want to distinguish between partial and full
recoveries when using fine-grained recovery. From the perspective of the
restart strategy alone, that feels a bit odd because a restart is a restart.
However, in order to maintain backwards compatibility, I think we cannot
simply remove {{fullRestart}}. So maybe we could say that {{fullRestart}} is
actually {{numberOfRestarts}} and let the fine-grained failover strategy
increment {{fullRestart}} on every restart. Moreover, we could deprecate this
metric and introduce {{numberOfRestarts}}. That way we could remove
{{fullRestart}} with Flink {{1.11}}.
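
A rough sketch of the aliasing idea, in plain Java rather than Flink's metric
API. The class {{RestartMetrics}} and its method names are hypothetical and
only illustrate one counter being exposed under both the legacy and the new
name:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical holder for restart metrics; NOT Flink's actual classes.
final class RestartMetrics {
    // Single source of truth for both metric names.
    private final AtomicLong restarts = new AtomicLong();

    // Assumed to be called by the failover strategy for every restart,
    // whether it restarts a single region or the whole graph.
    void onRestart() {
        restarts.incrementAndGet();
    }

    // New metric name proposed in the discussion.
    long numberOfRestarts() {
        return restarts.get();
    }

    // Legacy name kept for backwards compatibility; reports the same value.
    @Deprecated
    long fullRestart() {
        return numberOfRestarts();
    }
}

final class RestartMetricsDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        RestartMetrics metrics = new RestartMetrics();
        metrics.onRestart(); // e.g. a partial (region) restart
        metrics.onRestart(); // e.g. a full restart
        System.out.println("numberOfRestarts=" + metrics.numberOfRestarts());
        System.out.println("fullRestart=" + metrics.fullRestart());
    }
}
```

With this shape, existing alerts on {{fullRestart}} keep working until the
deprecated name is dropped, while new alerts can target {{numberOfRestarts}}.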
A meter view would have a slightly larger resource footprint. However, I could
see the benefits outweighing the costs.
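
To make the footprint difference concrete, here is a minimal sketch of a
rate-reporting meter in plain Java. It is an illustration under assumptions,
not Flink's implementation: {{SimpleMeter}}, its fields, and the idea that a
background task calls {{update()}} once per interval are all hypothetical.
Compared with a plain counter (one long), it keeps a small array of counter
snapshots and needs a periodic update:

```java
// Hypothetical sketch of a meter; Flink's own meter view may differ in detail.
// Not thread-safe; assumes timeSpanSeconds is a multiple of intervalSeconds.
final class SimpleMeter {
    private final long[] samples;      // ring buffer of counter snapshots
    private final int intervalSeconds; // assumed spacing between update() calls
    private int index;                 // slot holding the latest snapshot
    private long count;                // total number of events so far
    private double rate;               // events per second over the time span

    SimpleMeter(int timeSpanSeconds, int intervalSeconds) {
        this.intervalSeconds = intervalSeconds;
        // One extra slot so the newest and oldest snapshots are a full span apart.
        this.samples = new long[timeSpanSeconds / intervalSeconds + 1];
    }

    void markEvent() {
        count++;
    }

    long getCount() {
        return count;
    }

    // Expected to be called once per interval, e.g. by a scheduled task.
    void update() {
        index = (index + 1) % samples.length;
        long oldest = samples[index]; // the slot about to be overwritten is the oldest
        samples[index] = count;
        rate = (double) (count - oldest) / ((samples.length - 1) * intervalSeconds);
    }

    double getRate() {
        return rate;
    }
}

final class SimpleMeterDemo {
    public static void main(String[] args) {
        SimpleMeter restartsPerSecond = new SimpleMeter(60, 5);
        restartsPerSecond.markEvent(); // one restart happened
        restartsPerSecond.update();    // first periodic tick
        System.out.println("count=" + restartsPerSecond.getCount()
                + " rate=" + restartsPerSecond.getRate());
    }
}
```

The extra cost is the snapshot array plus the periodic {{update()}} call; in
return, users get both the total count and a recent restart rate to alert on.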
> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
> Key: FLINK-14164
> URL: https://issues.apache.org/jira/browse/FLINK-14164
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination, Runtime / Metrics
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Zhu Zhu
> Priority: Major
> Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from
> failures, and the metric "fullRestart" showed the count of failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted; it
> does not include the number of fine-grained failure recoveries.
> As many users want to build their job alerting based on failovers, I'd
> propose to add a new metric {{numberOfFailures}}/{{numberOfRestarts}}
> which also accounts for fine-grained recoveries.