zhuzhurk commented on a change in pull request #10082: [FLINK-14164][runtime]
Add a meter ‘numberOfRestarts’ to show number of restarts as well as its rate
URL: https://github.com/apache/flink/pull/10082#discussion_r342911786
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java
##########
@@ -193,6 +197,11 @@ public SchedulerBase(
  this.failoverTopology = executionGraph.getFailoverTopology();
  this.inputsLocationsRetriever = new ExecutionGraphToInputsLocationsRetrieverAdapter(executionGraph);
+
+ // Use the counter from execution graph to avoid modifying execution graph interfaces
+ // Can be a new SimpleCounter created here after the legacy scheduler is removed.
+ this.numberOfRestartsCounter = executionGraph.getNumberOfRestartsCounter();
+ jobManagerJobMetricGroup.meter(NUMBER_OF_RESTARTS, new MeterView(numberOfRestartsCounter));
Review comment:
Yes, the rate is awkward when the event happens at a very low frequency.
I think a counter `numberOfRestarts` is needed so that users can build
alerts in a more flexible way.
The remaining question is whether to also introduce a meter
`numberOfRestartsPerSecond`.
- Pros: The meter lets users build alerts on restarts even if their
monitoring system cannot compute changes in a value over time.
- Cons: The integral of the rate is not accurate, so users cannot build
reliable alerts on it other than ">0". The accuracy is limited by the time
span Flink uses to sample the metric, as well as by the sampling interval of
the external metric collecting system.
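
To illustrate the "Cons" point, here is a minimal sketch in plain Java. It is a simplified model and not Flink's `MeterView`: the class `RestartRateSketch`, its methods, the 60-second window, and the 45-second scrape interval are all assumptions chosen for illustration. It reports a windowed rate and then lets a hypothetical external collector estimate the restart count by summing `rate * scrapeInterval`.

```java
public class RestartRateSketch {

    /**
     * Simplified windowed rate (assumption, not Flink's implementation):
     * number of events in the last windowSeconds, divided by windowSeconds.
     */
    public static double rateAt(long[] eventTimes, long now, long windowSeconds) {
        int inWindow = 0;
        for (long t : eventTimes) {
            if (t <= now && t > now - windowSeconds) {
                inWindow++;
            }
        }
        return (double) inWindow / windowSeconds;
    }

    /**
     * What a hypothetical external collector would compute if it tried to
     * recover the event count by integrating the scraped rate over time.
     */
    public static double integrateRate(
            long[] eventTimes, long windowSeconds, long scrapeInterval, long firstScrape, long end) {
        double estimate = 0.0;
        for (long now = firstScrape; now <= end; now += scrapeInterval) {
            estimate += rateAt(eventTimes, now, windowSeconds) * scrapeInterval;
        }
        return estimate;
    }

    public static void main(String[] args) {
        long[] restarts = {100}; // exactly one restart, at t = 100 s

        // Same restart, same intervals, only the phase of the scrapes differs.
        double estimateA = integrateRate(restarts, 60, 45, 0, 300);
        double estimateB = integrateRate(restarts, 60, 45, 20, 300);

        System.out.println("true count = 1");
        System.out.println("estimate with scrapes starting at t=0:  " + estimateA);
        System.out.println("estimate with scrapes starting at t=20: " + estimateB);
    }
}
```

Under these assumptions, a single restart is estimated as about 0.75 or 1.5 depending only on when the scrapes happen to fall, whereas a check for a rate ">0" fires in both cases. A plain counter avoids the problem because the collector reads the exact count.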