GJL commented on a change in pull request #10082: [FLINK-14164][runtime] Add a
meter ‘numberOfRestarts’ to show number of restarts as well as its rate
URL: https://github.com/apache/flink/pull/10082#discussion_r342643086
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java
##########
@@ -193,6 +197,11 @@ public SchedulerBase(
this.failoverTopology = executionGraph.getFailoverTopology();
this.inputsLocationsRetriever = new ExecutionGraphToInputsLocationsRetrieverAdapter(executionGraph);
+
+ // Use the counter from execution graph to avoid modifying execution graph interfaces
+ // Can be a new SimpleCounter created here after the legacy scheduler is removed.
+ this.numberOfRestartsCounter = executionGraph.getNumberOfRestartsCounter();
+ jobManagerJobMetricGroup.meter(NUMBER_OF_RESTARTS, new MeterView(numberOfRestartsCounter));
Review comment:
Now that I think about it, I find _restarts per second_ to be an awkward
unit because:
1. It will normally be very small (by default < 1/60).
2. It is hard to come up with reasonable alerting thresholds other than
_"> 0"_. For example, alerting on _number of restarts > 10 in the past hour_ is
impossible.
If a user had a time series database such as InfluxDB in place, the total
number of restarts would suffice because the database can calculate the
difference over any time window. I know that the requirement to introduce a meter [comes from the
user mailing
list](http://mail-archives.apache.org/mod_mbox/flink-dev/201909.mbox/%3cCAOmjRb2ti9MXOD2jFy0XzWViwoNM6tvU4DB5hSnG_=zbvec...@mail.gmail.com%3e).
I don't see a good solution at the moment.
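
To illustrate the point about a time series database computing the difference, here is a minimal sketch of how a windowed restart count could be derived from samples of the total counter alone. `RestartWindow`, `Sample`, and `restartsInWindow` are hypothetical names for illustration, not part of Flink or of any database API, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not Flink API): windowed restart counts from total-count samples. */
public class RestartWindow {

    /** One scraped sample of the cumulative restart counter. */
    public static final class Sample {
        public final long timestampMillis;
        public final long totalRestarts;

        public Sample(long timestampMillis, long totalRestarts) {
            this.timestampMillis = timestampMillis;
            this.totalRestarts = totalRestarts;
        }
    }

    /**
     * Returns the increase of the counter between the last sample at or before
     * (now - window) and the last sample at or before now. Samples must be
     * sorted by timestamp. If no sample precedes the window start, the first
     * sample serves as the baseline.
     */
    public static long restartsInWindow(List<Sample> samples, long nowMillis, long windowMillis) {
        if (samples.isEmpty()) {
            return 0L;
        }
        long windowStart = nowMillis - windowMillis;
        long baseline = samples.get(0).totalRestarts;
        long latest = baseline;
        for (Sample s : samples) {
            if (s.timestampMillis <= windowStart) {
                baseline = s.totalRestarts;
            }
            if (s.timestampMillis <= nowMillis) {
                latest = s.totalRestarts;
            }
        }
        return latest - baseline;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        List<Sample> samples = new ArrayList<>();
        samples.add(new Sample(0 * minute, 2));   // made-up values
        samples.add(new Sample(30 * minute, 5));
        samples.add(new Sample(60 * minute, 9));
        samples.add(new Sample(90 * minute, 14));

        long lastHour = restartsInWindow(samples, 90 * minute, 60 * minute);
        // An alert such as "more than 10 restarts in the past hour" becomes a simple comparison.
        System.out.println("restarts in past hour: " + lastHour);
        System.out.println("alert: " + (lastHour > 10));
        // The same window expressed as an average per-second rate is much harder to read.
        System.out.println("average restarts per second: " + (lastHour / 3600.0));
    }
}
```

With only the per-second rate exported, the same alert would have to be reconstructed from very small fractional values, which is the awkwardness described above.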
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services