Igniters,

Everyone wants the cacheOperationsBlockedDuration metric, which will
show the current blocking duration, or 0 if nothing is blocked right now.

Do we also need the following metrics? It seems one of them will be
superfluous.
1. The totalCacheOperationsBlockedDuration metric, which will accumulate
all blocking durations since the node started.
2. A blocking duration histogram, based on the HistogramMetric class.
Users will be able to configure its bounds.
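
To make the comparison concrete, here is a minimal Java sketch of how the
three candidates could relate to each other. It is an illustration only, not
the Ignite implementation: the class BlockedDurationMetrics, its method names
and the injected millisecond clock are all hypothetical, and the histogram is
a hand-rolled stand-in rather than the HistogramMetric class. It assumes that
both the total and the histogram are updated when blocking ends.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the discussed metrics; not Ignite code. */
final class BlockedDurationMetrics {
    /** Millisecond clock, injected so the sketch can be tested deterministically. */
    private final LongSupplier clock;

    /** Ascending upper bounds of the histogram buckets, in milliseconds. */
    private final long[] bounds;

    /** One counter per bound plus a final overflow bucket. */
    private final AtomicLongArray buckets;

    /** Sum of all finished blocking periods, in milliseconds. */
    private final AtomicLong total = new AtomicLong();

    /** Start of the current blocking period, or -1 if nothing is blocked. */
    private volatile long blockedSince = -1L;

    BlockedDurationMetrics(long[] bounds, LongSupplier clock) {
        this.bounds = bounds.clone();
        this.buckets = new AtomicLongArray(bounds.length + 1);
        this.clock = clock;
    }

    /** Assumed to be called by a single thread when operations become blocked. */
    void onBlockingStarted() {
        blockedSince = clock.getAsLong();
    }

    /** Assumed to be called by the same thread when operations are unblocked. */
    void onBlockingFinished() {
        long start = blockedSince;
        if (start < 0)
            return; // Nothing was blocked.

        blockedSince = -1L;
        long duration = clock.getAsLong() - start;

        total.addAndGet(duration);

        // Find the first bucket whose upper bound is not exceeded.
        int bucket = 0;
        while (bucket < bounds.length && duration > bounds[bucket])
            bucket++;

        buckets.incrementAndGet(bucket);
    }

    /** Current blocking duration, or 0 if there is no blocking right now. */
    long cacheOperationsBlockedDuration() {
        long start = blockedSince;
        return start < 0 ? 0 : clock.getAsLong() - start;
    }

    /** Accumulated duration of all finished blocking periods. */
    long totalCacheOperationsBlockedDuration() {
        return total.get();
    }

    /** Snapshot of the bucket counters. */
    long[] histogram() {
        long[] res = new long[buckets.length()];
        for (int i = 0; i < res.length; i++)
            res[i] = buckets.get(i);
        return res;
    }
}
```

With bounds {100, 1000}, a single 300 ms blocking period lands in the second
bucket and adds 300 to the total. Note that bucket counts alone do not give
the exact accumulated time, and the total alone does not show how the
durations are distributed.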

On Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <nizhi...@apache.org> wrote:
>
> Guys.
>
> I think we should go with the 2 metrics
>
>         * current PME duration (resets on finish)
>
>                 This metric is required for alerting (or automatic actions)
>                 on long PME.
>
>         * PME duration histogram (value added to metrics on PME finish)
>                 This metric is required for:
>                         * Quick PME trend analysis
>                         * Quick PME history analysis
>
>
> On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > Nikita and Maxim,
> >
> > > What if we just update the behaviour of the current
> > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > Keep it as a long value and rename it to
> > > getCacheOperationsBlockedDuration.
> > >
> > > No other changes will be required.
> > >
> > > WDYT?
> >
> > I agree with these two metrics. I also think that the current
> > getCurrentPmeDuration will become redundant.
> >
> > Anton,
> >
> > > It looks like we're trying to implement "extended debug" instead of
> > > "monitoring".
> > > It should not be interesting to a real admin what phase of PME is in
> > > progress and so on.
> >
> > PME is a mission-critical cluster process. I agree that there's a fine
> > line between monitoring and debug here. However, it's not good to add
> > monitoring capabilities only for the scenario when everything is alright.
> > If PME really hangs, a *real admin* will be extremely interested in how
> > to return the cluster to a working state. Metrics about stage completion
> > times may really help here: e.g. if one specific node hasn't completed
> > stage X while the rest of the cluster has, it can be a signal that this
> > node should be killed.
> >
> > Of course, it's possible to build a monitoring system that extracts this
> > information from logs, but:
> > - It's more resource-intensive, as it requires parsing logs all the time
> > - It's less reliable, as log messages may change
> >
> > Best Regards,
> > Ivan Rakov
> >
> > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > Folks,
> > >
> > > +1 to Anton's post.
> > >
> > > What if we just update the behaviour of the current
> > > getCurrentPmeDuration metric to show durations only for blocking PMEs?
> > > Keep it as a long value and rename it to
> > > getCacheOperationsBlockedDuration.
> > >
> > > No other changes will be required.
> > >
> > > WDYT?
> > >
> > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelc...@gmail.com> 
> > > wrote:
> > > > Nikolay,
> > > >
> > > > The cacheOperationsBlockedDuration metric will show the current
> > > > blocking duration, or 0 if there is no blocking right now.
> > > >
> > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > blocking durations that happen after the node starts.
> > > >
> > > > On Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > Nikita
> > > > >
> > > > > What is the difference between those two metrics?
> > > > >
> > > > > On Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > >
> > > > > > Igniters, thanks for the comments.
> > > > > >
> > > > > > From the discussion, it is clear that we need only two metrics
> > > > > > for now:
> > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > >
> > > > > > I will prepare a PR shortly.
> > > > > >
> > > > > > On Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky
> > > > > > <arzamas...@mail.ru.invalid> wrote:
> > > > > > >
> > > > > > > +1 to Anton's decisions.
> > > > > > >
> > > > > > >
> > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00, from Anton Vinogradov
> > > > > > > > <a...@apache.org>:
> > > > > > > >
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > > It looks like we're trying to implement "extended debug"
> > > > > > > > > instead of "monitoring".
> > > > > > > > > It should not be interesting to a real admin what phase of
> > > > > > > > > PME is in progress and so on.
> > > > > > > > > The interesting metrics are
> > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > - are we blocked right now (shows we have an SLA degradation
> > > > > > > > > right now)
> > > > > > > > > The duration of the current blocking period can easily be
> > > > > > > > > derived by any modern monitoring tool through regular checks.
> > > > > > > > > The first "true" will mean "period start"; the precision will
> > > > > > > > > depend on the check frequency.
> > > > > > > > > Anyway, I'm OK with presenting the current metric as a long,
> > > > > > > > > where the long is a duration. I see no reason for it, but OK :)
> > > > > > > > >
> > > > > > > > > All other features you mentioned are useful for improving code
> > > > > > > > > or deployment and can (should) be taken from logs at the
> > > > > > > > > analysis phase.
> > > > > > > >
> > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glu...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > Folks, let me step in.
> > > > > > > > >
> > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > >
> > > > > > > > > > 1. initialVersion. Topology version that initiates the 
> > > > > > > > > > exchange.
> > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished
> > > > > > > > > > > waiting for all updates and translations on a previous
> > > > > > > > > > > topology.
> > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single 
> > > > > > > > > > message.
> > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full 
> > > > > > > > > > message.
> > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > >
> > > > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > >
> > > > > > > > > > Every metric from Nikita's list looks useful and simple to
> > > > > > > > > > implement.
> > > > > > > > > > I think it would be better to change the format of metrics
> > > > > > > > > > 4, 5, 6 and 7 a bit: we can keep only the difference between
> > > > > > > > > > the time of the previous event and the time of the
> > > > > > > > > > corresponding event. Such metrics would be easier to
> > > > > > > > > > interpret: they answer specific questions such as "how much
> > > > > > > > > > time did partition release take?" or "how much time did
> > > > > > > > > > awaiting the end of the distributed phase take?".
> > > > > > > > > > Also, if the results of 4, 5, 6 and 7 are exported to a
> > > > > > > > > > monitoring system, graphs will show how the times of
> > > > > > > > > > different stages change from one PME to another.
> > > > > > > > > >
> > > > > > > > > > > When PME causes no blocking, it's a good PME and I see no
> > > > > > > > > > > reason to have monitoring related to it
> > > > > > > > >
> > > > > > > > > > I agree with Anton here. These metrics should be measured
> > > > > > > > > > only for a true distributed exchange. Saving results for
> > > > > > > > > > client leave/join PMEs will just complicate monitoring.
> > > > > > > > >
> > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > I still don't understand why the instant value indicating
> > > > > > > > > > > that operations are blocked should be boolean.
> > > > > > > > > > > The duration since blocking started looks more appropriate
> > > > > > > > > > > and useful.
> > > > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > >
> > > > > > > > > > Totally agree with Pavel here. Both the "accumulated block
> > > > > > > > > > time" and "current PME block time" metrics are useful. Growth
> > > > > > > > > > of the accumulated metric over a specific period of time
> > > > > > > > > > (should be easy to check via a monitoring system graph) will
> > > > > > > > > > show for how long business operations were blocked in total,
> > > > > > > > > > and a non-zero current metric will show that we are
> > > > > > > > > > experiencing issues right now. The boolean metric "are we
> > > > > > > > > > blocked right now" is not needed, as it can obviously be
> > > > > > > > > > inferred from "current PME block time".
> > > > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Ivan Rakov
> > > > > > > > >
> > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > Nikita,
> > > > > > > > > >
> > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > I still don't understand why the instant value indicating
> > > > > > > > > > > that operations are blocked should be boolean.
> > > > > > > > > > > The duration since blocking started looks more appropriate
> > > > > > > > > > > and useful.
> > > > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Tue, 23 Jul 2019 at 11:42, Nikita Amelchev
> > > > > > > > > > > <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > > All previous suggestions have some disadvantages. There can
> > > > > > > > > > > > be several exchanges between two metric updates, and a fast
> > > > > > > > > > > > exchange can overwrite the previous long exchange.
> > > > > > > > > > > >
> > > > > > > > > > > > We can introduce a metric of total blocking duration that
> > > > > > > > > > > > will accumulate at the end of the exchange. So, users will
> > > > > > > > > > > > get actual information about how long operations were
> > > > > > > > > > > > blocked. The cluster metric will be the maximum of the local
> > > > > > > > > > > > node metrics. And we need a boolean metric that will
> > > > > > > > > > > > indicate the real-time status. It is needed because the
> > > > > > > > > > > > duration metric updates at the end of the exchange.
> > > > > > > > > > > >
> > > > > > > > > > > > So I propose to change the current metric, which is not
> > > > > > > > > > > > released yet, to the
> > > > > > > > > > > > totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > >
> > > > > > > > > > > WDYT?
> > > > > > > > > > >
> > > > > > > > > > > > On Mon, 22 Jul 2019 at 09:27, Anton Vinogradov
> > > > > > > > > > > > <a...@apache.org> wrote:
> > > > > > > > > > > > Nikolay,
> > > > > > > > > > > >
> > > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov
> > > > > > > > > > > > > <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > Anton.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. The value is exported based on SPI settings, not at the
> > > > > > > > > > > > > > moment it changes.
> > > > > > > > > > > > > > 2. Clock synchronisation: if we export the start time, we
> > > > > > > > > > > > > > should also export the node-local timestamp.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, 22 Jul 2019, 8:33 Anton Vinogradov
> > > > > > > > > > > > > > <a...@apache.org> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count the
> > > > > > > > > > > > > > > durations.
> > > > > > > > > > > > > > > Since the monitoring system checks metrics periodically,
> > > > > > > > > > > > > > > it will know the duration from its own log.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko
> > > > > > > > > > > > > > > <jokse...@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, I mean duration, not timestamp. For the metric name,
> > > > > > > > > > > > > > > > I suggest "cacheOperationsBlockingDuration"; I think it
> > > > > > > > > > > > > > > > more clearly represents what is blocked during PME.
> > > > > > > > > > > > > > > > We can also combine both the timestamp
> > > > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and the duration to have
> > > > > > > > > > > > > > > > better correlation of when cache operations were blocked
> > > > > > > > > > > > > > > > and how much time it took.
> > > > > > > > > > > > > > > > For an instant view (like in a JMX bean), a calculated
> > > > > > > > > > > > > > > > value as you mentioned can be used.
> > > > > > > > > > > > > > > > For metrics that are exported to some backend (IEP-35), a
> > > > > > > > > > > > > > > > counter can be used.
> > > > > > > > > > > > > > > > The counter is incremented by the blocking time after
> > > > > > > > > > > > > > > > blocking has ended.
> > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 19:10, Nikita Amelchev
> > > > > > > > > > > > > > > > <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > > > > how much time we wait for resuming cache operations
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > It seems I misunderstood you. Do you mean timestamp or
> > > > > > > > > > > > > > > > > duration here?
> > > > > > > > > > > > > > > > > > > > What do you think if we change the boolean value of
> > > > > > > > > > > > > > > > > > > > metric to a long value that represents time in
> > > > > > > > > > > > > > > > > > > > milliseconds when operations were blocked?
> > > > > > > > > > > > > > > > > This time can be calculated as (currentTime -
> > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in the case of a timestamp.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Duration will be more understandable. It'll be something
> > > > > > > > > > > > > > > > > like getCurrentBlockingPmeDuration. But I haven't come up
> > > > > > > > > > > > > > > > > with a better name yet.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko
> > > > > > > > > > > > > > > > > <jokse...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't show useful
> > > > > > > > > > > > > > > > > > information. The main PME side effect for end-users is
> > > > > > > > > > > > > > > > > > blocking cache operations. Not all PME time blocks them.
> > > > > > > > > > > > > > > > > > What information does the "timeSinceOperationsBlocked"
> > > > > > > > > > > > > > > > > > timestamp give to an end-user? For what analysis can it
> > > > > > > > > > > > > > > > > > be used, and how?
> > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 17:48, Nikita Amelchev
> > > > > > > > > > > > > > > > > > <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > This time can already be obtained from the
> > > > > > > > > > > > > > > > > > > getCurrentPmeDuration and new isOperationsBlockedByPme
> > > > > > > > > > > > > > > > > > > metrics.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > As an alternative solution, I can rework the recently
> > > > > > > > > > > > > > > > > > > added getCurrentPmeDuration metric (not released yet).
> > > > > > > > > > > > > > > > > > > It seems useless for users in the case of a non-blocking
> > > > > > > > > > > > > > > > > > > PME.
> > > > > > > > > > > > > > > > > > > Let's name it timeSinceOperationsBlocked. It'll be the
> > > > > > > > > > > > > > > > > > > timestamp when blocking started (minimal value across
> > > > > > > > > > > > > > > > > > > cluster nodes) and 0 if blocking has ended (there is no
> > > > > > > > > > > > > > > > > > > running PME).
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko
> > > > > > > > > > > > > > > > > > > <jokse...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thank you for working on this. What do you think if we
> > > > > > > > > > > > > > > > > > > > change the boolean value of the metric to a long value
> > > > > > > > > > > > > > > > > > > > that represents the time in milliseconds when operations
> > > > > > > > > > > > > > > > > > > > were blocked?
> > > > > > > > > > > > > > > > > > > > Since we have not only JMX, and metrics are now
> > > > > > > > > > > > > > > > > > > > periodically exported to some backend, it can give a
> > > > > > > > > > > > > > > > > > > > clearer picture of how much time we wait for resuming
> > > > > > > > > > > > > > > > > > > > cache operations than an instant boolean indicator.
> > > > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 14:41, Nikita Amelchev
> > > > > > > > > > > > > > > > > > > > <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > For now, we have the getCurrentPmeDuration() metric,
> > > > > > > > > > > > > > > > > > > > > which does not show the influence on the cluster
> > > > > > > > > > > > > > > > > > > > > correctly. PME can happen without blocking operations,
> > > > > > > > > > > > > > > > > > > > > for example on client node join/leave events.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I suggest adding a new metric: isOperationsBlockedByPme().
> > > > > > > > > > > > > > > > > > > > > Together, these metrics will show the influence of the
> > > > > > > > > > > > > > > > > > > > > PME on the cluster and user operations.
> > > > > > > > > > > > > > > > > > > > > I have prepared a PR for this (Bot visa is green). [1]
> > > > > > > > > > > > > > > > > > > > > Can anyone take a look?
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov
> > > > > > > > > > > > > > > > > > > > > <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > I think the administrator of an Ignite cluster should be
> > > > > > > > > > > > > > > > > > > > > > able to monitor all Ignite processes, including
> > > > > > > > > > > > > > > > > > > > > > non-blocking PME.
> > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > > > Found a PME metric: getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > > > It seems to show exactly the PME time and is not so
> > > > > > > > > > > > > > > > > > > > > > > useful because of this.
> > > > > > > > > > > > > > > > > > > > > > > The goal is to show exactly the blocking period.
> > > > > > > > > > > > > > > > > > > > > > > When PME causes no blocking, it's a good PME and I see
> > > > > > > > > > > > > > > > > > > > > > > no reason to have monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov
> > > > > > > > > > > > > > > > > > > > > > > <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone the implementation of these
> > > > > > > > > > > > > > > > > > > > > > > > metrics?
> > > > > > > > > > > > > > > > > > > > > > > > For now, implementing a new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > I think we can implement these metrics as a single
> > > > > > > > > > > > > > > > > > > > > > > > contribution.
> > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is 1 simple metric: are
> > > > > > > > > > > > > > > > > > > > > > > > > operations blocked?
> > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > > > Let's start from this.
> > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can be extracted from logs now and
> > > > > > > > > > > > > > > > > > > > > > > > > can be implemented later.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov
> > > > > > > > > > > > > > > > > > > > > > > > > <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16 Jul 2019, 11:45 Nikita Amelchev
> > > > > > > > > > > > > > > > > > > > > > > > > > <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest adding some useful metrics about the
> > > > > > > > > > > > > > > > > > > > > > > > > > > partition map exchange (PME). For now, the duration of
> > > > > > > > > > > > > > > > > > > > > > > > > > > PME stages is available only in log files and cannot be
> > > > > > > > > > > > > > > > > > > > > > > > > > > obtained using JMX or other external tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > > > I made a list of local node metrics that help to
> > > > > > > > > > > > > > > > > > > > > > > > > > > understand the actual status of the current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the
> > > > > > > > > > > > > > > > > > > > > > > > > > > exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished
> > > > > > > > > > > > > > > > > > > > > > > > > > > waiting for all updates and translations on a previous
> > > > > > > > > > > > > > > > > > > > > > > > > > > topology.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a
> > > > > > > > > > > > > > > > > > > > > > > > > > > single message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a
> > > > > > > > > > > > > > > > > > > > > > > > > > > full message.
> > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > > > - how long awaiting all updates took.
> > > > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks PME (didn't send a single message)
> > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Best wishes,
> > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Zhenya Stanilovsky
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best wishes,
> > > > > > Amelchev Nikita
> > > > > >
> > > >
> > > >
> > > > --
> > > > Best wishes,
> > > > Amelchev Nikita



-- 
Best wishes,
Amelchev Nikita
