Re: Re[2]: Partition map exchange metrics

Nikita Amelchev Wed, 24 Jul 2019 02:45:50 -0700

Igniters, thanks for the comments.

From the discussion it is clear that we need only two metrics for now:
- cacheOperationsBlockedDuration (long)
- totalCacheOperationsBlockedDuration (long)


I will prepare a PR shortly.
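
For illustration only, here is a minimal sketch of how the two metrics could
behave. It is not the actual Ignite implementation: the class name, the
onBlockingStarted/onBlockingFinished hooks and the injected clock are
assumptions made for this example. The current metric is the time since
blocking started (0 when nothing is blocked), and the total is accumulated
when a blocking period ends.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of the two metrics; not the actual Ignite code.
 * All durations are in milliseconds.
 */
class BlockedDurationMetrics {
    /** Time source, injectable so the behaviour can be checked without sleeping. */
    private final LongSupplier clock;

    /** Start of the current blocking period, or -1 if operations are not blocked. */
    private long blockedStartTs = -1;

    /** Sum of all finished blocking periods. */
    private long totalBlocked;

    BlockedDurationMetrics(LongSupplier clock) {
        this.clock = clock;
    }

    /** Call when cache operations become blocked (assumed hook). */
    synchronized void onBlockingStarted() {
        if (blockedStartTs < 0)
            blockedStartTs = clock.getAsLong();
    }

    /** Call when cache operations resume (assumed hook). */
    synchronized void onBlockingFinished() {
        if (blockedStartTs >= 0) {
            totalBlocked += clock.getAsLong() - blockedStartTs;
            blockedStartTs = -1;
        }
    }

    /** Duration of the current blocking period, or 0 if nothing is blocked. */
    synchronized long cacheOperationsBlockedDuration() {
        return blockedStartTs < 0 ? 0 : clock.getAsLong() - blockedStartTs;
    }

    /** Accumulated duration of finished blocking periods. */
    synchronized long totalCacheOperationsBlockedDuration() {
        return totalBlocked;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockedDurationMetrics m = new BlockedDurationMetrics(System::currentTimeMillis);

        m.onBlockingStarted();
        Thread.sleep(50);
        System.out.println("current: " + m.cacheOperationsBlockedDuration());

        m.onBlockingFinished();
        System.out.println("total: " + m.totalCacheOperationsBlockedDuration());
    }
}
```

With this shape, a non-zero current value plays the role of the boolean
"are we blocked right now" indicator, and the growth of the total over a
period gives the blocked time for that period.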

Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <[email protected]>:
>
> +1 to Anton's decisions.
>
>
> >Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <[email protected]>:
> >
> >Folks,
> >
> >It looks like we're trying to implement "extended debug" instead of
> >"monitoring".
> >It should not matter to a real admin which phase of PME is in
> >progress and so on.
> >The interesting metrics are
> >- total blocked time (will be used for real SLA counting)
> >- are we blocked right now (shows we have an SLA degradation right now)
> >The duration of the current blocking period can easily be derived by any
> >modern monitoring tool through regular checks.
> >The initial true will mean "period start"; the precision will be a result of
> >the check frequency.
> >Anyway, I'm OK with having the current metric presented as a long, where the
> >long is a duration. I see no reason for it, but OK :)
> >
> >All other features you mentioned are useful for improving code or
> >deployment and can (should) be taken from logs at the analysis
> >phase.
> >
> >On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < [email protected] > wrote:
> >
> >> Folks, let me step in.
> >>
> >> Nikita, thanks for your suggestions!
> >>
> >> > 1. initialVersion. Topology version that initiates the exchange.
> >> > 2. initTime. Time PME was started.
> >> > 3. initEvent. Event that triggered PME.
> >> > 4. partitionReleaseTime. Time when a node has finished waiting for all
> >> > updates and translations on a previous topology.
> >> > 5. sendSingleMessageTime. Time when a node sent a single message.
> >> > 6. recieveFullMessageTime. Time when a node received a full message.
> >> > 7. finishTime. Time PME was ended.
> >> >
> >> > When a new PME starts, all these metrics are reset.
> >> Every metric from Nikita's list looks useful and simple to implement.
> >> I think that it would be better to change the format of metrics 4, 5, 6 and
> >> 7 a bit: we can keep only the difference between the time of the previous
> >> event and the time of the corresponding event. Such metrics would be easier
> >> to perceive: they answer specific questions such as "how much time did
> >> partition release take?" or "how much time did awaiting the end of the
> >> distributed phase take?".
> >> Also, if the results of 4, 5, 6 and 7 are exported to a monitoring system,
> >> graphs will show how the durations of different stages change from one PME
> >> to another.
> >>
> >> > When PME cause no blocking, it's a good PME and I see no reason to have
> >> > monitoring related to it
> >> Agree with Anton here. These metrics should be measured only for true
> >> distributed exchange. Saving results for client leave/join PMEs will
> >> just complicate monitoring.
> >>
> >> > I agree with the total blocking duration metric, but
> >> > I still don't understand why the instant value indicating that operations are
> >> > blocked should be boolean.
> >> > A duration since blocking started looks more appropriate and useful.
> >> > It gives more information while the semantics stay the same.
> >> Totally agree with Pavel here. Both "accumulated block time" and
> >> "current PME block time" metrics are useful. Growth of the accumulated
> >> metric over a specific period of time (should be easy to check via a
> >> monitoring system graph) will show for how long business operations were
> >> blocked in total, and a non-zero current metric will show that we are
> >> experiencing issues right now. A boolean metric "are we blocked right now"
> >> is not needed, as it can obviously be inferred from "current PME block
> >> time".
> >>
> >> Best Regards,
> >> Ivan Rakov
> >>
> >> On 23.07.2019 16:02, Pavel Kovalenko wrote:
> >> > Nikita,
> >> >
> >> > I agree with the total blocking duration metric, but
> >> > I still don't understand why the instant value indicating that operations are
> >> > blocked should be boolean.
> >> > A duration since blocking started looks more appropriate and useful.
> >> > It gives more information while the semantics stay the same.
> >> >
> >> >
> >> >
> >> > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev < [email protected] >:
> >> >
> >> >> Folks,
> >> >>
> >> >> All previous suggestions have some disadvantages. There can be several
> >> >> exchanges between two metric updates, and a fast exchange can overwrite
> >> >> a previous long exchange.
> >> >>
> >> >> We can introduce a metric of total blocking duration that will
> >> >> accumulate at the end of the exchange. So, users will get actual
> >> >> information about how long operations were blocked. The cluster metric
> >> >> will be the maximum of the local node metrics. We also need a boolean
> >> >> metric that will indicate the real-time status. It is needed because the
> >> >> duration metric is updated only at the end of the exchange.
> >> >>
> >> >> So I propose to change the current, not yet released metric to the
> >> >> totalCacheOperationsBlockingDuration metric and to add the
> >> >> isCacheOperationsBlocked metric.
> >> >>
> >> >> WDYT?
> >> >>
> >> >> Mon, 22 Jul 2019 at 09:27, Anton Vinogradov < [email protected] >:
> >> >>> Nikolay,
> >> >>>
> >> >>> Still see no reason to replace boolean with long.
> >> >>>
> >> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov < [email protected] > wrote:
> >> >>>> Anton.
> >> >>>>
> >> >>>> 1. The value is exported based on SPI settings, not at the moment it changes.
> >> >>>>
> >> >>>> 2. Clock synchronisation - if we export the start time, we should also
> >> >>>> export the node-local timestamp.
> >> >>>>
> >> >>>> Mon, 22 Jul 2019, 8:33 Anton Vinogradov < [email protected] >:
> >> >>>>
> >> >>>>> Folks,
> >> >>>>>
> >> >>>>> What's the reason for duration counting?
> >> >>>>> AFAIU, counting durations is a monitoring system feature.
> >> >>>>> Since a monitoring system checks metrics periodically, it will know the
> >> >>>>> duration from its own log.
> >> >>>>>
> >> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko < [email protected] > wrote:
> >> >>>>>
> >> >>>>>> Nikita,
> >> >>>>>>
> >> >>>>>> Yes, I mean duration, not timestamp. For the metric name, I suggest
> >> >>>>>> "cacheOperationsBlockingDuration"; I think it more clearly represents
> >> >>>>>> what is blocked during PME.
> >> >>>>>> We can also combine both the timestamp "cacheOperationsBlockingStartTs"
> >> >>>>>> and the duration to better correlate when cache operations were blocked
> >> >>>>>> and how much time it took.
> >> >>>>>> For an instant view (as in a JMX bean), a calculated value as you
> >> >>>>>> mentioned can be used.
> >> >>>>>> For metrics that are exported to some backend (IEP-35), a counter can be
> >> >>>>>> used. The counter is incremented by the blocking time after blocking has
> >> >>>>>> ended.
> >> >>>>>> Fri, 19 Jul 2019 at 19:10, Nikita Amelchev < [email protected] >:
> >> >>>>>>> Pavel,
> >> >>>>>>>
> >> >>>>>>> The main purpose of this metric is
> >> >>>>>>>>> how much time we wait for resuming cache operations
> >> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
> >> >>>>>>>>> What do you think if we change the boolean value of metric to a
> >> >>>>>>>>> long value that represents time in milliseconds when operations
> >> >>>>>>>>> were blocked?
> >> >>>>>>> This time can be calculated as (currentTime -
> >> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
> >> >>>>>>>
> >> >>>>>>> Duration will be more understandable. It'll be something like
> >> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
> >> >>>>>>> name yet.
> >> >>>>>>>
> >> >>>>>>> Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko < [email protected] >:
> >> >>>>>>>> Nikita,
> >> >>>>>>>>
> >> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information. The
> >> >>>>>>>> main PME side effect for end users is blocking cache operations. Not
> >> >>>>>>>> all PME time blocks them.
> >> >>>>>>>> What information does the "timeSinceOperationsBlocked" timestamp give
> >> >>>>>>>> to an end user? For what analysis can it be used, and how?
> >> >>>>>>>> Fri, 19 Jul 2019 at 17:48, Nikita Amelchev < [email protected] >:
> >> >>>>>>>>> Hi Pavel,
> >> >>>>>>>>>
> >> >>>>>>>>> This time can already be obtained from the getCurrentPmeDuration and
> >> >>>>>>>>> the new isOperationsBlockedByPme metrics.
> >> >>>>>>>>>
> >> >>>>>>>>> As an alternative solution, I can rework the recently added
> >> >>>>>>>>> getCurrentPmeDuration metric (not released yet). It seems useless for
> >> >>>>>>>>> users in the case of a non-blocking PME.
> >> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
> >> >>>>>>>>> blocking started (the minimal value across cluster nodes) and 0 once
> >> >>>>>>>>> blocking has ended (there is no running PME).
> >> >>>>>>>>>
> >> >>>>>>>>> WDYT?
> >> >>>>>>>>>
> >> >>>>>>>>> Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko < [email protected] >:
> >> >>>>>>>>>> Hi Nikita,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thank you for working on this. What do you think if we change the
> >> >>>>>>>>>> boolean value of metric to a long value that represents time in
> >> >>>>>>>>>> milliseconds when operations were blocked?
> >> >>>>>>>>>> Since we have not only JMX, and metrics are now periodically exported
> >> >>>>>>>>>> to some backend, it can give a clearer picture of how much time we
> >> >>>>>>>>>> wait for resuming cache operations than an instant boolean indicator.
> >> >>>>>>>>>> Fri, 19 Jul 2019 at 14:41, Nikita Amelchev < [email protected] >:
> >> >>>>>>>>>>> Anton, Nikolay,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks for the support.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric, which does not
> >> >>>>>>>>>>> correctly show the influence on the cluster. A PME can happen without
> >> >>>>>>>>>>> blocking operations, for example on client node join/leave events.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I suggest adding a new metric, isOperationsBlockedByPme(). Together,
> >> >>>>>>>>>>> these metrics will show the influence of the PME on cluster and user
> >> >>>>>>>>>>> operations.
> >> >>>>>>>>>>> I have prepared a PR for this (Bot visa is green). [1] Can anyone take
> >> >>>>>>>>>>> a look?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> [1]  https://issues.apache.org/jira/browse/IGNITE-11961
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov < [email protected] >:
> >> >>>>>>>>>>>> I think the administrator of an Ignite cluster should be able to
> >> >>>>>>>>>>>> monitor all Ignite processes, including non-blocking PME.
> >> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> >> >>>>>>>>>>>>> BTW,
> >> >>>>>>>>>>>>> Found a PME metric - getCurrentPmeDuration().
> >> >>>>>>>>>>>>> It seems to show exactly the PME time and is not so useful because of
> >> >>>>>>>>>>>>> this.
> >> >>>>>>>>>>>>> The goal is to show exactly the blocking period.
> >> >>>>>>>>>>>>> When PME cause no blocking, it's a good PME and I see no reason to
> >> >>>>>>>>>>>>> have monitoring related to it :)
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov < [email protected] > wrote:
> >> >>>>>>>>>>>>>> Anton.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Why do we need to postpone the implementation of these metrics?
> >> >>>>>>>>>>>>>> For now, implementing a new metric is very simple.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
> >> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> >> >>>>>>>>>>>>>>> Nikita,
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Looks like all we need now is one simple metric: are operations
> >> >>>>>>>>>>>>>>> blocked?
> >> >>>>>>>>>>>>>>> Just a true or false.
> >> >>>>>>>>>>>>>>> Let's start from this.
> >> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be
> >> >>>>>>>>>>>>>>> implemented later.
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov < [email protected] > wrote:
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> +1.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Tue, 16 Jul 2019, 11:45 Nikita Amelchev < [email protected] >:
> >> >>>>>>>>>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> I suggest adding some useful metrics about the partition map
> >> >>>>>>>>>>>>>>>>> exchange (PME). For now, the duration of PME stages is available
> >> >>>>>>>>>>>>>>>>> only in log files and cannot be obtained using JMX or other
> >> >>>>>>>>>>>>>>>>> external tools. [1]
> >> >>>>>>>>>>>>>>>>> I made a list of local node metrics that help to understand the
> >> >>>>>>>>>>>>>>>>> actual status of the current PME:
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
> >> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
> >> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
> >> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all
> >> >>>>>>>>>>>>>>>>> updates and translations on a previous topology.
> >> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
> >> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
> >> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> When a new PME starts, all these metrics are reset.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> These metrics help to understand:
> >> >>>>>>>>>>>>>>>>> - how long the PME was (current or previous).
> >> >>>>>>>>>>>>>>>>> - how long waiting for all updates to complete took.
> >> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single message)
> >> >>>>>>>>>>>>>>>>> - what triggered PME.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Thoughts?
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >> >>>>>>>>>>>>>>>>> --
> >> >>>>>>>>>>>>>>>>> Best wishes,
> >> >>>>>>>>>>>>>>>>> Amelchev Nikita
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> --
> >> >>>>>>>>>>> Best wishes,
> >> >>>>>>>>>>> Amelchev Nikita
> >> >>>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> --
> >> >>>>>>>>> Best wishes,
> >> >>>>>>>>> Amelchev Nikita
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>> Best wishes,
> >> >>>>>>> Amelchev Nikita
> >> >>>>>>>
> >> >>
> >> >>
> >> >> --
> >> >> Best wishes,
> >> >> Amelchev Nikita
> >> >>
> >>
>
>
> --
> Zhenya Stanilovsky



-- 
Best wishes,
Amelchev Nikita
