[jira] [Commented] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery

Till Rohrmann (Jira) Tue, 24 Sep 2019 03:03:03 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936639#comment-16936639
 ]


Till Rohrmann commented on FLINK-14164:
---------------------------------------

I think you are right that with the new fine-grained recovery strategy the 
{{fullRestart}} metric no longer makes much sense. One question that comes to 
mind is whether we want to distinguish between partial and full recoveries 
when using fine-grained recovery. Looking at the restart strategy alone, it 
feels a bit odd, because a restart is a restart. However, to maintain 
backwards compatibility, I think we cannot simply remove {{fullRestart}}. So 
maybe we could say that {{fullRestart}} is actually {{numberOfRestarts}} and 
let the fine-grained failover strategy increment {{fullRestart}} whenever it 
performs a restart. Moreover, we could deprecate this metric and introduce 
{{numberOfRestarts}}. That way we could remove {{fullRestart}} with Flink 
{{1.11}}.
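
To make the aliasing idea concrete, here is a minimal sketch in plain Java. 
It does not use Flink's metric API; the class {{RestartMetricsSketch}}, its 
methods, and the map-based registry are hypothetical stand-ins. It only 
illustrates one counter being exposed under both the deprecated name and the 
new name, so that partial and full restarts are counted alike.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a metric group; not a Flink class. */
class RestartMetricsSketch {
    // Single source of truth: every restart, partial or full, increments this.
    private final AtomicLong restarts = new AtomicLong();

    // Metric name -> supplier that reads the shared counter.
    private final Map<String, LongSupplier> gauges = new LinkedHashMap<>();

    RestartMetricsSketch() {
        gauges.put("numberOfRestarts", restarts::get);
        // Deprecated alias, kept only for backwards compatibility.
        gauges.put("fullRestart", restarts::get);
    }

    /** Assumed to be called by the failover strategy on every restart. */
    void onRestart() {
        restarts.incrementAndGet();
    }

    long read(String metricName) {
        return gauges.get(metricName).getAsLong();
    }

    public static void main(String[] args) {
        RestartMetricsSketch metrics = new RestartMetricsSketch();
        metrics.onRestart(); // e.g. a fine-grained (region) restart
        metrics.onRestart(); // e.g. a restart of the whole graph
        // Both names report the same value because they share one counter.
        System.out.println(metrics.read("numberOfRestarts") + " "
                + metrics.read("fullRestart"));
    }
}
```

Under this assumption, dropping {{fullRestart}} later would only mean 
deleting the alias registration; the counting logic stays untouched.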

A meter view would have a slightly larger resource footprint. However, I could 
see the benefits outweighing the costs.
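
As a rough illustration of where that extra footprint could come from, the 
sketch below shows a windowed meter in plain Java. It is an assumption about 
how such a meter is typically built, not Flink's actual implementation; the 
class {{RestartMeterSketch}} and its parameters are hypothetical. Compared 
with a bare counter, it keeps a small buffer of past samples and needs a 
periodic {{update()}} call.

```java
/** Hypothetical windowed meter: a counter plus a ring buffer of samples. */
class RestartMeterSketch {
    private final long[] samples; // extra memory compared with a bare counter
    private final int timeSpanSeconds;
    private long count;
    private int pos;
    private double rate;

    RestartMeterSketch(int timeSpanSeconds, int updateIntervalSeconds) {
        this.timeSpanSeconds = timeSpanSeconds;
        // One slot per update interval, plus one for the current sample.
        this.samples = new long[timeSpanSeconds / updateIntervalSeconds + 1];
    }

    void markRestart() {
        count++;
    }

    long getCount() {
        return count;
    }

    /** Restarts per second, averaged over the configured time span. */
    double getRate() {
        return rate;
    }

    /** Assumed to be invoked once per update interval by a timer. */
    void update() {
        pos = (pos + 1) % samples.length;
        samples[pos] = count;
        // The slot after the current one holds the oldest sample in the window.
        long oldest = samples[(pos + 1) % samples.length];
        rate = (double) (samples[pos] - oldest) / timeSpanSeconds;
    }
}
```

With a 60 second span and a 5 second interval, the buffer in this sketch 
holds 13 long values per meter, which is small but not free. In exchange, 
alerting could key off a restart rate instead of diffing an absolute count.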

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from 
> failures, and the metric "fullRestart" showed the count of failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart" 
> metric only reveals how many times the entire graph has been restarted; it 
> does not include fine-grained failure recoveries.
> As many users want to build their job alerting on failovers, I propose 
> adding a new metric, {{numberOfFailures}}/{{numberOfRestarts}}, that also 
> accounts for fine-grained recoveries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
