Re: Partition map exchange metrics

Pavel Kovalenko Tue, 23 Jul 2019 06:03:09 -0700

Nikita,

I agree with the total blocking duration metric, but
I still don't understand why the instant value indicating that operations are
blocked should be a boolean.
The time elapsed since blocking started looks more appropriate and useful.
It gives more information while the semantics stay the same.
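
A minimal sketch of the idea, assuming a hypothetical BlockingDurationTracker class and an injectable millisecond clock (none of these names come from the Ignite code base). The instant value is the time since blocking started and is 0 when nothing is blocked, so a non-zero value implies that operations are blocked; the total is accumulated when blocking ends:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical tracker for illustration only; not Ignite's implementation. */
final class BlockingDurationTracker {
    /** Millisecond clock, injectable so the logic can be tested. */
    private final LongSupplier clockMillis;

    /** Start time of the current blocking period, or -1 when not blocked. */
    private final AtomicLong blockedSince = new AtomicLong(-1);

    /** Sum of all finished blocking periods, in milliseconds. */
    private final AtomicLong totalBlockedMillis = new AtomicLong();

    BlockingDurationTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Called when cache operations become blocked. */
    void onBlockingStarted() {
        // Keep the earliest start if blocking is already in progress.
        blockedSince.compareAndSet(-1, clockMillis.getAsLong());
    }

    /** Called when cache operations are resumed. */
    void onBlockingFinished() {
        long start = blockedSince.getAndSet(-1);
        if (start >= 0)
            totalBlockedMillis.addAndGet(clockMillis.getAsLong() - start);
    }

    /** Instant value: time since blocking started, 0 if not blocked. */
    long currentBlockingDuration() {
        long start = blockedSince.get();
        return start < 0 ? 0 : clockMillis.getAsLong() - start;
    }

    /** Accumulated value, updated only when a blocking period ends. */
    long totalBlockingDuration() {
        return totalBlockedMillis.get();
    }

    /** The boolean view carries strictly less information than the duration. */
    boolean isBlocked() {
        return blockedSince.get() >= 0;
    }
}
```

In production the clock would simply be System::currentTimeMillis; it is a parameter here only to keep the sketch testable.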




Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelc...@gmail.com>:

> Folks,
>
> All previous suggestions have some disadvantages. There can be several
> exchanges between two metric updates, and a fast exchange can overwrite
> the value of a previous long exchange.
>
> We can introduce a total blocking duration metric that is
> accumulated at the end of each exchange. So users will get accurate
> information about how long operations were blocked. The cluster metric
> will be the maximum of the local node metrics. We also need a boolean
> metric that indicates the real-time status. It is needed because the
> duration metric is updated only at the end of the exchange.
>
> So I propose to change the current metric, which is not released yet, to the
> totalCacheOperationsBlockingDuration metric and to add the
> isCacheOperationsBlocked metric.
>
> WDYT?
>
> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <a...@apache.org>:
> >
> > Nikolay,
> >
> > Still see no reason to replace boolean with long.
> >
> > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
> >
> > > Anton.
> > >
> > > 1. The value is exported according to the SPI settings, not at the
> > > moment it changes.
> > >
> > > 2. Clock synchronisation: if we export the start time, we should also
> > > export the node-local timestamp.
> > >
> > > Mon, 22 Jul 2019, 8:33 Anton Vinogradov <a...@apache.org>:
> > >
> > > > Folks,
> > > >
> > > > What's the reason for counting the duration?
> > > > AFAIU, counting durations is a monitoring system feature.
> > > > Since the monitoring system checks metrics periodically, it will know the
> > > > duration from its own log.
> > > >
> > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com> wrote:
> > > >
> > > > > Nikita,
> > > > >
> > > > > Yes, I mean duration, not timestamp. For the metric name, I suggest
> > > > > "cacheOperationsBlockingDuration"; I think it represents more clearly
> > > > > what is blocked during PME.
> > > > > We can also combine both the timestamp "cacheOperationsBlockingStartTs"
> > > > > and the duration to better correlate when cache operations were blocked
> > > > > and how much time it took.
> > > > > For an instant view (like in a JMX bean), a calculated value as you
> > > > > mentioned can be used.
> > > > > For metrics that are exported to some backend (IEP-35), a counter can be
> > > > > used. The counter is incremented by the blocking time after blocking has
> > > > > ended.
> > > > >
> > > > > Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com>:
> > > > >
> > > > >> Pavel,
> > > > >>
> > > > >> The main purpose of this metric is
> > > > >> >> how much time we wait for resuming cache operations
> > > > >>
> > > > >> Seems I misunderstood you. Do you mean timestamp or duration here?
> > > > > >> >> What do you think if we change the boolean value of metric to a long
> > > > > >> >> value that represents time in milliseconds when operations were blocked?
> > > > >>
> > > > >> This time can be calculated as (currentTime -
> > > > >> timeSinceOperationsBlocked) in case of timestamp.
> > > > >>
> > > > >> Duration will be more understandable. It'll be something like
> > > > >> getCurrentBlockingPmeDuration. But I haven't come up with a better
> > > > >> name yet.
> > > > >>
> > > > > >> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com>:
> > > > >> >
> > > > >> > Nikita,
> > > > >> >
> > > > > >> > I think getCurrentPmeDuration doesn't show useful information. The
> > > > > >> > main PME side effect for end users is blocking cache operations. Not
> > > > > >> > all PME time blocks them.
> > > > > >> > What information does the "timeSinceOperationsBlocked" timestamp give
> > > > > >> > to an end user? For what analysis can it be used, and how?
> > > > >> >
> > > > > >> > Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com>:
> > > > >> >>
> > > > >> >> Hi Pavel,
> > > > >> >>
> > > > > >> >> This time can already be obtained from the getCurrentPmeDuration and
> > > > > >> >> new isOperationsBlockedByPme metrics.
> > > > > >> >>
> > > > > >> >> As an alternative solution, I can rework the recently added
> > > > > >> >> getCurrentPmeDuration metric (not released yet). It seems useless
> > > > > >> >> for users in the case of non-blocking PME.
> > > > > >> >> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
> > > > > >> >> blocking started (the minimal value across cluster nodes) and 0 if
> > > > > >> >> blocking has ended (there is no running PME).
> > > > >> >>
> > > > >> >> WDYT?
> > > > >> >>
> > > > > >> >> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com>:
> > > > >> >> >
> > > > >> >> > Hi Nikita,
> > > > >> >> >
> > > > > >> >> > Thank you for working on this. What do you think if we change the
> > > > > >> >> > boolean value of metric to a long value that represents time in
> > > > > >> >> > milliseconds when operations were blocked?
> > > > > >> >> > Since we have not only JMX, and metrics are now periodically exported
> > > > > >> >> > to some backend, it can give a clearer picture of how much time we
> > > > > >> >> > wait for resuming cache operations than an instant boolean indicator.
> > > > >> >> >
> > > > > >> >> > Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com>:
> > > > >> >> >
> > > > >> >> > > Anton, Nikolay,
> > > > >> >> > >
> > > > >> >> > > Thanks for the support.
> > > > >> >> > >
> > > > > >> >> > > For now, we have the getCurrentPmeDuration() metric, which does not
> > > > > >> >> > > correctly show the influence on the cluster. PME can happen without
> > > > > >> >> > > blocking operations, for example on client node join/leave events.
> > > > > >> >> > >
> > > > > >> >> > > I suggest adding a new metric, isOperationsBlockedByPme(). Together,
> > > > > >> >> > > these metrics will show the influence of PME on the cluster and user
> > > > > >> >> > > operations.
> > > > > >> >> > >
> > > > > >> >> > > I have prepared a PR for this (Bot visa is green). [1] Can anyone
> > > > > >> >> > > take a look?
> > > > >> >> > >
> > > > >> >> > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > >> >> > >
> > > > > >> >> > > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org>:
> > > > >> >> > >
> > > > >> >> > > >
> > > > > >> >> > > > I think the administrator of an Ignite cluster should be able to
> > > > > >> >> > > > monitor all Ignite processes, including non-blocking PME.
> > > > >> >> > > >
> > > > > >> >> > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > >> >> > > > > BTW,
> > > > > >> >> > > > > Found a PME metric: getCurrentPmeDuration().
> > > > > >> >> > > > > It seems to show exactly the PME time and is not so useful because of
> > > > > >> >> > > > > this. The goal is to show exactly the blocking period.
> > > > > >> >> > > > > When PME causes no blocking, it's a good PME and I see no reason to
> > > > > >> >> > > > > have monitoring related to it :)
> > > > >> >> > > > >
> > > > > >> >> > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > >> >> > > > >
> > > > >> >> > > > > > Anton.
> > > > >> >> > > > > >
> > > > > >> >> > > > > > Why do we need to postpone the implementation of these metrics?
> > > > > >> >> > > > > > For now, implementing a new metric is very simple.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > I think we can implement these metrics as a single contribution.
> > > > >> >> > > > > >
> > > > > >> >> > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > >> >> > > > > > > Nikita,
> > > > >> >> > > > > > >
> > > > > >> >> > > > > > > Looks like all we need now is one simple metric: are operations blocked?
> > > > > >> >> > > > > > > Just true or false.
> > > > > >> >> > > > > > > Let's start from this.
> > > > > >> >> > > > > > > All other metrics can be extracted from logs now and can be implemented
> > > > > >> >> > > > > > > later.
> > > > >> >> > > > > > >
> > > > > >> >> > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > > +1.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Nikita, please, go ahead.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > >
> > > > > >> >> > > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com>:
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > > Hello, Igniters.
> > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > I suggest adding some useful metrics about the partition map exchange
> > > > > >> >> > > > > > > > > (PME). For now, the duration of PME stages is available only in log files
> > > > > >> >> > > > > > > > > and cannot be obtained using JMX or other external tools. [1]
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > I made a list of local node metrics that help to understand the
> > > > > >> >> > > > > > > > > actual status of the current PME:
> > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > >> >> > > > > > > > > 2. initTime. Time when PME started.
> > > > > >> >> > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > >> >> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > >> >> > > > > > > > > updates and translations on a previous topology.
> > > > > >> >> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > >> >> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > >> >> > > > > > > > > 7. finishTime. Time when PME ended.
> > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > >> >> > > > > > > > >
> > > > > >> >> > > > > > > > > These metrics help to understand:
> > > > > >> >> > > > > > > > > - how long PME took (current or previous).
> > > > > >> >> > > > > > > > > - how long the wait for all updates to complete was.
> > > > > >> >> > > > > > > > > - which node blocks PME (didn't send a single message).
> > > > > >> >> > > > > > > > > - what triggered PME.
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > Thoughts?
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > [1]
> > > > https://issues.apache.org/jira/browse/IGNITE-11961
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > --
> > > > >> >> > > > > > > > > Best wishes,
> > > > >> >> > > > > > > > > Amelchev Nikita
> > > > >> >> > > > > > > > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > --
> > > > >> >> > > Best wishes,
> > > > >> >> > > Amelchev Nikita
> > > > >> >> > >
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> --
> > > > >> >> Best wishes,
> > > > >> >> Amelchev Nikita
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Best wishes,
> > > > >> Amelchev Nikita
> > > > >>
> > > > >
> > > >
> > >
>
>
>
> --
> Best wishes,
> Amelchev Nikita
>
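
The cluster-wide aggregation from the quoted proposal (the cluster metric is the maximum of the local node metrics) could look roughly like the sketch below. ClusterBlockingMetric and clusterValue are hypothetical names used only for illustration, and how the local values are collected from the nodes is left out:

```java
import java.util.Collection;

/** Hypothetical helper for illustration only; not part of the Ignite API. */
final class ClusterBlockingMetric {
    private ClusterBlockingMetric() {
        // Static helper, no instances.
    }

    /**
     * Returns the cluster value as the maximum of the local node values,
     * or 0 when no node has reported anything.
     */
    static long clusterValue(Collection<Long> localNodeValues) {
        return localNodeValues.stream()
            .mapToLong(Long::longValue)
            .max()
            .orElse(0L);
    }
}
```

Taking the maximum means that a single slow node determines the reported cluster value, which matches the intent of showing how long operations were blocked at worst.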
