Thanks Xintong!

Just so I understand correctly, do you suggest adding a metric for
delta(Time) / delta(Count) since the last reporting ?
<Collector>.TimePerGc or <Collector>.AverageTime would make sense.
AverageTime may be a bit nicer :)

My only concern is how useful this will be in reality. If there are only
(or several) long pauses then the msPerSec metrics will show it already,
and if there is a single long pause that may not be shown at all if there
are several shorter pauses as well with this metric.

Gyula

On Wed, Sep 6, 2023 at 8:46 AM Xintong Song <tonysong...@gmail.com> wrote:

> Thanks for bringing this up, Gyula.
>
> The proposed changes make sense to me. +1 for them.
>
> In addition to the proposed changes, I wonder if we should also add
> something like timePerGc? This would help understand whether there are long
> pauses, due to GC STW, that may lead to rpc unresponsiveness and heartbeat
> timeouts. Ideally, we'd like to understand the max pause time per STW in a
> recent time window. However, I don't see an easy way to separate the pause
> time of each STW. Deriving the overall time per GC from the existing
> metrics (time-increment / count-increment) seems to be a good alternative.
> WDYT?
>
> Best,
>
> Xintong
>
>
>
> On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Thanks for the clarification!
> >
> > By default the meterview measures for 1 minute sounds good to me!
> >
> > +1 for this proposal.
> >
> > Best,
> > Rui
> >
> > On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
> >
> > > Thanks for the feedback Rui,
> > >
> > > The rates would be computed using the MeterView class (like for any
> other
> > > rate metric), just because we report the value per second it doesn't
> mean
> > > that we measure in a second granularity.
> > > By default the meterview measures for 1 minute and then we calculate
> the
> > > per second rates, but we can increase the timespan if necessary.
> > >
> > > So I don't think we run into this problem in practice and we can keep
> the
> > > metric aligned with other time rate metrics like busyTimeMsPerSec etc.
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com> wrote:
> > >
> > > > Hi Gyula,
> > > >
> > > > +1 for this proposal. The current GC metric is really unfriendly.
> > > >
> > > > I have a concern with your proposed rate metric: the rate is
> perSecond
> > > > instead of per minute. I'm unsure whether it's suitable for GC
> metric.
> > > >
> > > > There are two reasons why I suspect perSecond may not be well
> > > > compatible with GC metric:
> > > >
> > > > 1. GCs are usually infrequent and may only occur for a small number
> > > > of time periods within a minute.
> > > >
> > > > Metrics are collected periodically, for example, reported every
> minute.
> > > > If the result reported by the GC metric is 1s/perSecond, it does not
> > > > mean that the GC of the TM is serious, because there may be no GC
> > > > in the remaining 59s.
> > > >
> > > > On the contrary, the GC metric reports 0s/perSecond, which does not
> > > > mean that the GC of the TM is not serious, and the GC may be very
> > > > serious in the remaining 59s.
> > > >
> > > > 2. Stop-the-world may cause the metric to fail(delay) to report
> > > >
> > > > The TM will stop the world during GC, especially full GC. It means
> > > > the metric cannot be collected or reported during full GC.
> > > >
> > > > So the collected GC metric may never be 1s/perSecond. This metric
> > > > may always be good because the metric will only be reported when
> > > > the GC is not severe.
> > > >
> > > >
> > > > If these concerns make sense, how about updating the GC rate
> > > > at minute level?
> > > >
> > > > We can define the type to Gauge for TimeMsPerMiunte, and updating
> > > > this Gauge every second, it is:
> > > > GC Total.Time of current time - GC total time of one miunte ago.
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels <m...@apache.org>
> > > wrote:
> > > >
> > > > > Hi Gyula,
> > > > >
> > > > > +1 The proposed changes make sense and are in line with what is
> > > > > available for other metrics, e.g. number of records processed.
> > > > >
> > > > > -Max
> > > > >
> > > > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra <gyula.f...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > Hi Devs,
> > > > > >
> > > > > > I would like to start a discussion on FLIP-361: Improve GC
> Metrics
> > > [1].
> > > > > >
> > > > > > The current Flink GC metrics [2] are not very useful for
> monitoring
> > > > > > purposes as they require post processing logic that is also
> > dependent
> > > > on
> > > > > > the current runtime environment.
> > > > > >
> > > > > > Problems:
> > > > > >  - Total time is not very relevant for long running applications,
> > > only
> > > > > the
> > > > > > rate of change (msPerSec)
> > > > > >  - In most cases it's best to simply aggregate the time/count
> > across
> > > > the
> > > > > > different GabrageCollectors, however the specific collectors are
> > > > > dependent
> > > > > > on the current Java runtime
> > > > > >
> > > > > > We propose to improve the current situation by:
> > > > > >  - Exposing rate metrics per GarbageCollector
> > > > > >  - Exposing aggregated Total time/count/rate metrics
> > > > > >
> > > > > > These new metrics are all derived from the existing ones with
> > minimal
> > > > > > overhead.
> > > > > >
> > > > > > Looking forward to your feedback.
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > > > [1]
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-361%3A+Improve+GC+Metrics
> > > > > > [2]
> > > > > >
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/#garbagecollection
> > > > >
> > > >
> > >
> >
>

Reply via email to