Hi Gyula,

Thanks for driving this FLIP.

The proposal looks good to me. Only one minor suggestion I have is, can we
also include the % GC time spent wrt the overall CPU time especially useful
in the cases of TM which helps in easily identifying issues related to GC.
Thoughts?

Regards
Venkata krishnan


On Wed, Sep 13, 2023 at 6:13 AM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Thanks for all the feedback, I will start the vote on this.
>
> Gyula
>
> On Wed, Sep 6, 2023 at 11:11 AM Xintong Song <tonysong...@gmail.com>
> wrote:
>
> > >
> > > I added the average time metric to the FLIP document. I also included
> it
> > > for the aggregate (total) across all collectors. But maybe it doesn't
> > make
> > > too much sense as collection times usually differ greatly depending on
> > the
> > > collector.
> > >
> >
> > LGTM
> >
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Sep 6, 2023 at 4:31 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
> >
> > > I added the average time metric to the FLIP document. I also included
> it
> > > for the aggregate (total) across all collectors. But maybe it doesn't
> > make
> > > too much sense as collection times usually differ greatly depending on
> > the
> > > collector.
> > >
> > > Gyula
> > >
> > > On Wed, Sep 6, 2023 at 10:21 AM Xintong Song <tonysong...@gmail.com>
> > > wrote:
> > >
> > > > Thank you :)
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Wed, Sep 6, 2023 at 4:17 PM Gyula Fóra <gyula.f...@gmail.com>
> > wrote:
> > > >
> > > > > Makes sense Xintong, I am happy to extend the proposal with the
> > average
> > > > gc
> > > > > time metric +1
> > > > >
> > > > > Gyula
> > > > >
> > > > > On Wed, Sep 6, 2023 at 10:09 AM Xintong Song <
> tonysong...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > >
> > > > > > > Just so I understand correctly, do you suggest adding a metric
> > for
> > > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > > <Collector>.TimePerGc or <Collector>.AverageTime would make
> > sense.
> > > > > > > AverageTime may be a bit nicer :)
> > > > > > >
> > > > > >
> > > > > > Yes, that's what I mean.
> > > > > >
> > > > > > My only concern is how useful this will be in reality. If there
> are
> > > > only
> > > > > > > (or several) long pauses then the msPerSec metrics will show it
> > > > > already,
> > > > > > > and if there is a single long pause that may not be shown at
> all
> > if
> > > > > there
> > > > > > > are several shorter pauses as well with this metric.
> > > > > >
> > > > > >
> > > > > > Let's say we measure this for every minute and see a 900 msPerSec
> > > > (which
> > > > > > means 54s within the minute are spent on GC). This may come from
> a
> > > > single
> > > > > > GC that lasts for 54s, or 2 GCs each lasting for ~27s, or more
> GCs
> > > with
> > > > > > less time each. As the default heartbeat timeout is 50s, the
> former
> > > > means
> > > > > > there's likely a heartbeat timeout due to the GC pause, while the
> > > > latter
> > > > > > means otherwise.
> > > > > >
> > > > > >
> > > > > > Mathematically, it is possible that there's 1 long pause together
> > > with
> > > > > > several short pauses within the same measurement period, making
> the
> > > > long
> > > > > > pause not observable with AverageTime. However, from my
> experience,
> > > > such
> > > > > a
> > > > > > pattern is not normal in reality. My observation is that GCs
> happen
> > > at
> > > > a
> > > > > > similar time usually take a similar length of time. Admittedly,
> > this
> > > is
> > > > > not
> > > > > > a hard guarantee.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Xintong
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 6, 2023 at 3:59 PM Gyula Fóra <gyula.f...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Matt Wang,
> > > > > > >
> > > > > > > I think the currently exposed info is all that is available
> > through
> > > > > > > GarbageCollectorMXBean. This FLIP does not aim to introduce a
> new
> > > > more
> > > > > > > granular way of reporting the per collector metrics, that would
> > > > > require a
> > > > > > > new mechanism and may be a breaking change.
> > > > > > >
> > > > > > > We basically want to simply extend the current reporting here
> > with
> > > > the
> > > > > > rate
> > > > > > > metrics and the total metrics.
> > > > > > >
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Wed, Sep 6, 2023 at 9:24 AM Matt Wang <wang...@163.com>
> > wrote:
> > > > > > >
> > > > > > > > Hi Gyula,
> > > > > > > >
> > > > > > > > +1 for this proposal.
> > > > > > > >
> > > > > > > > Do we need to add a metric to record the count of different
> > > > > > > > collectors? Now there is only a total count. For example,
> > > > > > > > for G1, there is no way to distinguish whether it is the
> > > > > > > > young generation or the old generation.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Matt Wang
> > > > > > > >
> > > > > > > >
> > > > > > > > ---- Replied Message ----
> > > > > > > > | From | Gyula Fóra<gyula.f...@gmail.com> |
> > > > > > > > | Date | 09/6/2023 15:03 |
> > > > > > > > | To | <dev@flink.apache.org> |
> > > > > > > > | Subject | Re: [DISCUSS] FLIP-361: Improve GC Metrics |
> > > > > > > > Thanks Xintong!
> > > > > > > >
> > > > > > > > Just so I understand correctly, do you suggest adding a
> metric
> > > for
> > > > > > > > delta(Time) / delta(Count) since the last reporting ?
> > > > > > > > <Collector>.TimePerGc or <Collector>.AverageTime would make
> > > sense.
> > > > > > > > AverageTime may be a bit nicer :)
> > > > > > > >
> > > > > > > > My only concern is how useful this will be in reality. If
> there
> > > are
> > > > > > only
> > > > > > > > (or several) long pauses then the msPerSec metrics will show
> it
> > > > > > already,
> > > > > > > > and if there is a single long pause that may not be shown at
> > all
> > > if
> > > > > > there
> > > > > > > > are several shorter pauses as well with this metric.
> > > > > > > >
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Wed, Sep 6, 2023 at 8:46 AM Xintong Song <
> > > tonysong...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Thanks for bringing this up, Gyula.
> > > > > > > >
> > > > > > > > The proposed changes make sense to me. +1 for them.
> > > > > > > >
> > > > > > > > In addition to the proposed changes, I wonder if we should
> also
> > > add
> > > > > > > > something like timePerGc? This would help understand whether
> > > there
> > > > > are
> > > > > > > long
> > > > > > > > pauses, due to GC STW, that may lead to rpc unresponsiveness
> > and
> > > > > > > heartbeat
> > > > > > > > timeouts. Ideally, we'd like to understand the max pause time
> > per
> > > > STW
> > > > > > in
> > > > > > > a
> > > > > > > > recent time window. However, I don't see an easy way to
> > separate
> > > > the
> > > > > > > pause
> > > > > > > > time of each STW. Deriving the overall time per GC from the
> > > > existing
> > > > > > > > metrics (time-increment / count-increment) seems to be a good
> > > > > > > alternative.
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > >
> > > > > > > > Xintong
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Sep 6, 2023 at 2:16 PM Rui Fan <1996fan...@gmail.com
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > Thanks for the clarification!
> > > > > > > >
> > > > > > > > By default the meterview measures for 1 minute sounds good to
> > me!
> > > > > > > >
> > > > > > > > +1 for this proposal.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Rui
> > > > > > > >
> > > > > > > > On Wed, Sep 6, 2023 at 1:27 PM Gyula Fóra <
> > gyula.f...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > Thanks for the feedback Rui,
> > > > > > > >
> > > > > > > > The rates would be computed using the MeterView class (like
> for
> > > any
> > > > > > > > other
> > > > > > > > rate metric), just because we report the value per second it
> > > > doesn't
> > > > > > > > mean
> > > > > > > > that we measure in a second granularity.
> > > > > > > > By default the meterview measures for 1 minute and then we
> > > > calculate
> > > > > > > > the
> > > > > > > > per second rates, but we can increase the timespan if
> > necessary.
> > > > > > > >
> > > > > > > > So I don't think we run into this problem in practice and we
> > can
> > > > keep
> > > > > > > > the
> > > > > > > > metric aligned with other time rate metrics like
> > busyTimeMsPerSec
> > > > > etc.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > On Wed, Sep 6, 2023 at 4:55 AM Rui Fan <1996fan...@gmail.com
> >
> > > > wrote:
> > > > > > > >
> > > > > > > > Hi Gyula,
> > > > > > > >
> > > > > > > > +1 for this proposal. The current GC metric is really
> > unfriendly.
> > > > > > > >
> > > > > > > > I have a concern with your proposed rate metric: the rate is
> > > > > > > > perSecond
> > > > > > > > instead of per minute. I'm unsure whether it's suitable for
> GC
> > > > > > > > metric.
> > > > > > > >
> > > > > > > > There are two reasons why I suspect perSecond may not be well
> > > > > > > > compatible with GC metric:
> > > > > > > >
> > > > > > > > 1. GCs are usually infrequent and may only occur for a small
> > > number
> > > > > > > > of time periods within a minute.
> > > > > > > >
> > > > > > > > Metrics are collected periodically, for example, reported
> every
> > > > > > > > minute.
> > > > > > > > If the result reported by the GC metric is 1s/perSecond, it
> > does
> > > > not
> > > > > > > > mean that the GC of the TM is serious, because there may be
> no
> > GC
> > > > > > > > in the remaining 59s.
> > > > > > > >
> > > > > > > > On the contrary, the GC metric reports 0s/perSecond, which
> does
> > > not
> > > > > > > > mean that the GC of the TM is not serious, and the GC may be
> > very
> > > > > > > > serious in the remaining 59s.
> > > > > > > >
> > > > > > > > 2. Stop-the-world may cause the metric to fail(delay) to
> report
> > > > > > > >
> > > > > > > > The TM will stop the world during GC, especially full GC. It
> > > means
> > > > > > > > the metric cannot be collected or reported during full GC.
> > > > > > > >
> > > > > > > > So the collected GC metric may never be 1s/perSecond. This
> > metric
> > > > > > > > may always be good because the metric will only be reported
> > when
> > > > > > > > the GC is not severe.
> > > > > > > >
> > > > > > > >
> > > > > > > > If these concerns make sense, how about updating the GC rate
> > > > > > > > at minute level?
> > > > > > > >
> > > > > > > > We can define the type to Gauge for TimeMsPerMiunte, and
> > updating
> > > > > > > > this Gauge every second, it is:
> > > > > > > > GC Total.Time of current time - GC total time of one miunte
> > ago.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Rui
> > > > > > > >
> > > > > > > > On Tue, Sep 5, 2023 at 11:05 PM Maximilian Michels <
> > > m...@apache.org
> > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Gyula,
> > > > > > > >
> > > > > > > > +1 The proposed changes make sense and are in line with what
> is
> > > > > > > > available for other metrics, e.g. number of records
> processed.
> > > > > > > >
> > > > > > > > -Max
> > > > > > > >
> > > > > > > > On Tue, Sep 5, 2023 at 2:43 PM Gyula Fóra <
> > gyula.f...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Devs,
> > > > > > > >
> > > > > > > > I would like to start a discussion on FLIP-361: Improve GC
> > > > > > > > Metrics
> > > > > > > > [1].
> > > > > > > >
> > > > > > > > The current Flink GC metrics [2] are not very useful for
> > > > > > > > monitoring
> > > > > > > > purposes as they require post processing logic that is also
> > > > > > > > dependent
> > > > > > > > on
> > > > > > > > the current runtime environment.
> > > > > > > >
> > > > > > > > Problems:
> > > > > > > > - Total time is not very relevant for long running
> > applications,
> > > > > > > > only
> > > > > > > > the
> > > > > > > > rate of change (msPerSec)
> > > > > > > > - In most cases it's best to simply aggregate the time/count
> > > > > > > > across
> > > > > > > > the
> > > > > > > > different GabrageCollectors, however the specific collectors
> > are
> > > > > > > > dependent
> > > > > > > > on the current Java runtime
> > > > > > > >
> > > > > > > > We propose to improve the current situation by:
> > > > > > > > - Exposing rate metrics per GarbageCollector
> > > > > > > > - Exposing aggregated Total time/count/rate metrics
> > > > > > > >
> > > > > > > > These new metrics are all derived from the existing ones with
> > > > > > > > minimal
> > > > > > > > overhead.
> > > > > > > >
> > > > > > > > Looking forward to your feedback.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gyula
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/FLINK/FLIP-361*3A*Improve*GC*Metrics__;JSsrKw!!IKRxdwAv5BmarQ!cw2nH8ALLzxCrHdoE6Uw58Nk5dcPvUM-Y_hcjiIszqPGjDAKi-pJUhwEnQ5uBjC_m73XC-yu6183qznWNCA$
> > > > > > > > [2]
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-docs-master/docs/ops/metrics/*garbagecollection__;Iw!!IKRxdwAv5BmarQ!cw2nH8ALLzxCrHdoE6Uw58Nk5dcPvUM-Y_hcjiIszqPGjDAKi-pJUhwEnQ5uBjC_m73XC-yu61837rPCNDo$
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to