[jira] [Comment Edited] (FLINK-14164) Add a metric to show failover count regarding fine grained recovery

Zhu Zhu (Jira) Wed, 25 Sep 2019 05:05:14 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-14164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16937655#comment-16937655
 ]


Zhu Zhu edited comment on FLINK-14164 at 9/25/19 12:04 PM:
-----------------------------------------------------------

Hi [~wind_ljy], 

The partial restarts actually refer to the restarts conducted by the
fine-grained recovery strategy. If a user is not using the "full" failover
strategy, there should be few full restarts, since task failures will be
recovered via fine-grained recovery.
And for many streaming jobs with all-to-all edges, fine-grained recovery
would actually restart all the vertices. But this is still not a full restart.

In my view, a metric that includes all restarts (full and partial) should
help in most cases.

Could you share some cases in which you need to distinguish full restarts
from partial restarts? That would be helpful.


was (Author: zhuzh):
Hi [~wind_ljy], 

The partial restarts actually refer to the restarts conducted by the
fine-grained recovery strategy. If a user is not using the "full" failover
strategy, there should be few full restarts, since task failures will be
recovered via fine-grained recovery.
And for many streaming jobs with all-to-all edges, fine-grained recovery
would actually restart all the vertices.

In my view, a metric that includes all restarts (full and partial) should
help in most cases.

Could you share some cases in which you need to distinguish full restarts
from partial restarts? That would be helpful.

> Add a metric to show failover count regarding fine grained recovery
> -------------------------------------------------------------------
>
>                 Key: FLINK-14164
>                 URL: https://issues.apache.org/jira/browse/FLINK-14164
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination, Runtime / Metrics
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Previously, Flink used the restart-all strategy to recover jobs from
> failures, and the metric "fullRestart" was used to show the count of
> failovers.
> However, with fine-grained recovery introduced in 1.9.0, the "fullRestart"
> metric only reveals how many times the entire graph has been restarted; it
> does not include the number of fine-grained failure recoveries.
> As many users want to build their job alerting on failovers, I propose
> adding a new metric, {{numberOfFailures}}/{{numberOfRestarts}}, that also
> counts fine-grained recoveries.
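
To make the difference between the two counts concrete, here is a minimal
sketch in plain Java. It is not Flink's implementation and uses no Flink API:
the RestartMetrics class, its method names, and the separate partial-restart
counter are all hypothetical. It only assumes that the scheduler can report
whether a recovery restarted the whole graph or was done by fine-grained
recovery.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch, not Flink code: counts full and fine-grained (partial)
 * restarts separately and exposes their sum as the proposed metric.
 */
final class RestartMetrics {
    // Corresponds to the existing "fullRestart" metric: whole-graph restarts only.
    private final AtomicLong fullRestarts = new AtomicLong();
    // Hypothetical counter for restarts conducted by fine-grained recovery.
    // A fine-grained recovery that happens to restart every vertex
    // (e.g. with all-to-all edges) is still recorded here, not as a full restart.
    private final AtomicLong partialRestarts = new AtomicLong();

    void onFullRestart() {
        fullRestarts.incrementAndGet();
    }

    void onPartialRestart() {
        partialRestarts.incrementAndGet();
    }

    long getFullRestarts() {
        return fullRestarts.get();
    }

    /** Proposed "numberOfRestarts": every restart, full or partial. */
    long getNumberOfRestarts() {
        return fullRestarts.get() + partialRestarts.get();
    }
}

final class RestartMetricsDemo {
    public static void main(String[] args) {
        RestartMetrics metrics = new RestartMetrics();
        metrics.onPartialRestart();
        metrics.onPartialRestart();
        metrics.onFullRestart();
        System.out.println("fullRestart=" + metrics.getFullRestarts());
        System.out.println("numberOfRestarts=" + metrics.getNumberOfRestarts());
    }
}
```

With two fine-grained recoveries and one full restart, the sketch reports
fullRestart=1 and numberOfRestarts=3, so an alert based only on the first
value would miss two of the three failovers.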



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
