Re: Partition map exchange metrics

Nikolay Izhikov Wed, 24 Jul 2019 08:27:05 -0700

Guys.

I think we should go with two metrics:

        * current PME duration (resets on finish)
                This metric is required for alerting (or automatic actions)
                on a long PME.

        * PME duration histogram (value added to the metric on PME finish)
                This metric is required for:
                        * Quick PME trend analysis
                        * Quick PME history analysis
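
A rough Java sketch of how these two metrics could fit together. The class and method names are hypothetical (this is not Ignite's actual metrics API), and the bucket bounds are whatever the caller passes in:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical sketch of the two proposed PME metrics; not Ignite's actual API. */
class PmeMetricsSketch {
    /** Marker meaning "no PME is running". */
    private static final long NOT_RUNNING = -1;

    /** Upper bounds of the histogram buckets in milliseconds (caller-supplied). */
    private final long[] bounds;

    /** One counter per bound plus a final overflow bucket. */
    private final AtomicLongArray buckets;

    /** Start timestamp of the running PME, or NOT_RUNNING. */
    private final AtomicLong startTs = new AtomicLong(NOT_RUNNING);

    PmeMetricsSketch(long... bounds) {
        this.bounds = bounds.clone();
        this.buckets = new AtomicLongArray(bounds.length + 1);
    }

    /** Called when a PME starts. */
    void onPmeStart(long nowMs) {
        startTs.set(nowMs);
    }

    /** Called when a PME finishes: resets the current duration, updates the histogram. */
    void onPmeFinish(long nowMs) {
        long start = startTs.getAndSet(NOT_RUNNING);

        if (start == NOT_RUNNING)
            return;

        long duration = nowMs - start;
        int idx = 0;

        // Find the first bucket whose upper bound covers the duration.
        while (idx < bounds.length && duration > bounds[idx])
            idx++;

        buckets.incrementAndGet(idx);
    }

    /** Metric 1: duration of the running PME, or 0 if none is running. */
    long currentPmeDuration(long nowMs) {
        long start = startTs.get();

        return start == NOT_RUNNING ? 0 : nowMs - start;
    }

    /** Metric 2: snapshot of the PME duration histogram. */
    long[] pmeDurationHistogram() {
        long[] res = new long[buckets.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = buckets.get(i);

        return res;
    }
}
```

In this sketch an alerting rule would poll currentPmeDuration(), while the histogram changes only when a PME finishes.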


On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> Nikita and Maxim,
> 
> > What if we just update the behaviour of the current getCurrentPmeDuration
> > metric to show durations only for blocking PMEs?
> > Keep it as a long value and rename it to 
> > getCacheOperationsBlockedDuration.
> > 
> > No other changes will be required.
> > 
> > WDYT?
> 
> I agree with these two metrics. I also think that the current 
> getCurrentPmeDuration will become redundant.
> 
> Anton,
> 
> > It looks like we're trying to implement "extended debug" instead of
> > "monitoring".
> > It should not be interesting for real admin what phase of PME is in
> > progress and so on.
> 
> PME is a mission-critical cluster process. I agree that there's a fine 
> line between monitoring and debug here. However, it's not good to add 
> monitoring capabilities only for the scenario where everything is all right.
> If a PME really hangs, a *real admin* will be extremely interested in how 
> to return the cluster to a working state. Metrics about stage completion 
> times may really help here: e.g. if one specific node hasn't completed 
> stage X while the rest of the cluster has, it can be a signal that this node 
> should be killed.
> 
> Of course, it's possible to build a monitoring system that extracts this 
> information from logs, but:
> - It's more resource intensive as it requires parsing logs all the time
> - It's less reliable as log messages may change
> 
> Best Regards,
> Ivan Rakov
> 
> On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > Folks,
> > 
> > +1 to Anton's post.
> > 
> > What if we just update the behaviour of the current getCurrentPmeDuration
> > metric to show durations only for blocking PMEs?
> > Keep it as a long value and rename it to 
> > getCacheOperationsBlockedDuration.
> > 
> > No other changes will be required.
> > 
> > WDYT?
> > 
> > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <[email protected]> wrote:
> > > Nikolay,
> > > 
> > > The cacheOperationsBlockedDuration metric will show the current blocking
> > > duration, or 0 if there is no blocking right now.
> > > 
> > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > blocking durations since the node started.
> > > 
> > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <[email protected]>:
> > > > Nikita
> > > > 
> > > > What is the difference between those two metrics?
> > > > 
> > > > Wed, 24 Jul 2019, 12:45 Nikita Amelchev <[email protected]>:
> > > > 
> > > > > Igniters, thanks for the comments.
> > > > > 
> > > > > From the discussion it can be seen that we need only two metrics for now:
> > > > > - cacheOperationsBlockedDuration (long)
> > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > 
> > > > > I will prepare a PR shortly.
> > > > > 
> > > > > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <[email protected]>:
> > > > > > 
> > > > > > +1 to Anton's decisions.
> > > > > > 
> > > > > > 
> > > > > > > Wednesday, 24 July 2019, 8:44 +03:00, from Anton Vinogradov <[email protected]>:
> > > > > > > 
> > > > > > > Folks,
> > > > > > > 
> > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > "monitoring".
> > > > > > > It should not be interesting for a real admin what phase of PME is in
> > > > > > > progress and so on.
> > > > > > > The metrics of interest are
> > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > The duration of the current blocking period can easily be presented by
> > > > > > > any modern monitoring tool using regular checks.
> > > > > > > The initial true will mean "period start"; precision will be a result of
> > > > > > > the check frequency.
> > > > > > > Anyway, I'm ok to have the current metric presented as a long, where the
> > > > > > > long is a duration; I see no reason, but ok :)
> > > > > > > 
> > > > > > > All other features you mentioned are useful for improving code or
> > > > > > > deployment and can (should) be taken from logs at the analysis phase.
> > > > > > > 
> > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <[email protected]> wrote:
> > > > > > > > Folks, let me step in.
> > > > > > > > 
> > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > 
> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for
> > > > > > > > > all updates and transactions on a previous topology.
> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > 
> > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > 
> > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > I think that it would be better to change the format of metrics 4, 5, 6
> > > > > > > > and 7 a bit: we can keep only the difference between the time of the
> > > > > > > > previous event and the time of the corresponding event. Such metrics
> > > > > > > > would be easier to perceive: they answer specific questions like "how
> > > > > > > > much time did partition release take?" or "how much time did awaiting
> > > > > > > > the end of the distributed phase take?".
> > > > > > > > Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
> > > > > > > > graphs will show how the times of different stages change from one PME
> > > > > > > > to another.
> > > > > > > > 
> > > > > > > > > When PME causes no blocking, it's a good PME and I see no reason to
> > > > > > > > > have monitoring related to it
> > > > > > > > 
> > > > > > > > Agree with Anton here. These metrics should be measured only for a true
> > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > just complicate monitoring.
> > > > > > > > 
> > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > I still don't understand why the instant value indicating that
> > > > > > > > > operations are blocked should be a boolean.
> > > > > > > > > The duration since blocking started looks more appropriate and
> > > > > > > > > useful.
> > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > 
> > > > > > > > Totally agree with Pavel here. Both "accumulated block time" and
> > > > > > > > "current PME block time" metrics are useful. Growth of the accumulated
> > > > > > > > metric over a specific period of time (should be easy to check via a
> > > > > > > > monitoring system graph) will show for how long business operations
> > > > > > > > were blocked in total, and a non-zero current metric will show that we
> > > > > > > > are experiencing issues right now. The boolean metric "are we blocked
> > > > > > > > right now" is not needed as it can obviously be inferred from "current
> > > > > > > > PME block time".
> > > > > > > > 
> > > > > > > > Best Regards,
> > > > > > > > Ivan Rakov
> > > > > > > > 
> > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > Nikita,
> > > > > > > > > 
> > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > I still don't understand why the instant value indicating that
> > > > > > > > > operations are blocked should be a boolean.
> > > > > > > > > The duration since blocking started looks more appropriate and
> > > > > > > > > useful.
> > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <[email protected]>:
> > > > > > > > > > Folks,
> > > > > > > > > > 
> > > > > > > > > > All previous suggestions have some disadvantages. There can be several
> > > > > > > > > > exchanges between two metric updates, and a fast exchange can overwrite
> > > > > > > > > > a previous long exchange.
> > > > > > > > > > 
> > > > > > > > > > We can introduce a metric of total blocking duration that is
> > > > > > > > > > accumulated at the end of each exchange. So, users will get actual
> > > > > > > > > > information about how long operations were blocked. The cluster metric
> > > > > > > > > > will be the maximum of the local node metrics. And we need a boolean
> > > > > > > > > > metric that will indicate real-time status. It is needed because the
> > > > > > > > > > duration metric is updated only at the end of the exchange.
> > > > > > > > > > 
> > > > > > > > > > So I propose to change the current metric, which has not been released
> > > > > > > > > > yet, to the totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > 
> > > > > > > > > > WDYT?
> > > > > > > > > > 
> > > > > > > > > > Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <[email protected]>:
> > > > > > > > > > > Nikolay,
> > > > > > > > > > > 
> > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <[email protected]> wrote:
> > > > > > > > > > > > Anton.
> > > > > > > > > > > > 
> > > > > > > > > > > > 1. The value is exported based on SPI settings, not at the moment it
> > > > > > > > > > > > changes.
> > > > > > > > > > > > 2. Clock synchronisation - if we export the start time, we should also
> > > > > > > > > > > > export the node-local timestamp.
> > > > > > > > > > > > 
> > > > > > > > > > > > Mon, 22 Jul 2019, 8:33 Anton Vinogradov <[email protected]>:
> > > > > > > > > > > > 
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count the durations.
> > > > > > > > > > > > > Since the monitoring system checks metrics periodically, it will know
> > > > > > > > > > > > > the duration from its own log.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <[email protected]> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Yes, I mean duration, not timestamp. For the metric name, I suggest
> > > > > > > > > > > > > > "cacheOperationsBlockingDuration"; I think it represents more clearly
> > > > > > > > > > > > > > what is blocked during PME.
> > > > > > > > > > > > > > We can also combine both the timestamp "cacheOperationsBlockingStartTs"
> > > > > > > > > > > > > > and the duration to have a better correlation of when cache operations
> > > > > > > > > > > > > > were blocked and how much time it took.
> > > > > > > > > > > > > > For an instant view (like in a JMX bean), a calculated value as you
> > > > > > > > > > > > > > mentioned can be used.
> > > > > > > > > > > > > > For metrics that are exported to some backend (IEP-35), a counter can
> > > > > > > > > > > > > > be used.
> > > > > > > > > > > > > > The counter is incremented by the blocking time after blocking has
> > > > > > > > > > > > > > ended.
> > > > > > > > > > > > > > Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <[email protected]>:
> > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > how much time we wait for resuming cache operations
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean timestamp or duration here?
> > > > > > > > > > > > > > > > > What do you think if we change the boolean value of metric to a
> > > > > > > > > > > > > > > > > long value that represents time in milliseconds when operations
> > > > > > > > > > > > > > > > > were blocked?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This time can be calculated as (currentTime -
> > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of timestamp.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Duration will be more understandable. It'll be something like
> > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I haven't come up with a better
> > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <[email protected]>:
> > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't show useful information. The main
> > > > > > > > > > > > > > > > PME side effect for end-users is blocking cache operations. Not all PME
> > > > > > > > > > > > > > > > time blocks it.
> > > > > > > > > > > > > > > > What information does the "timeSinceOperationsBlocked" timestamp give
> > > > > > > > > > > > > > > > to an end-user? For what analysis can it be used, and how?
> > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <[email protected]>:
> > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > This time can already be obtained from the getCurrentPmeDuration and
> > > > > > > > > > > > > > > > > the new isOperationsBlockedByPme metrics.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > As an alternative solution, I can rework the recently added
> > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not released yet). It seems to be
> > > > > > > > > > > > > > > > > useless for users in case of a non-blocking PME.
> > > > > > > > > > > > > > > > > Let's name it timeSinceOperationsBlocked. It'll be the timestamp when
> > > > > > > > > > > > > > > > > blocking started (minimal value across cluster nodes) and 0 if
> > > > > > > > > > > > > > > > > blocking has ended (there is no running PME).
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <[email protected]>:
> > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Thank you for working on this. What do you think if we change the
> > > > > > > > > > > > > > > > > > boolean value of the metric to a long value that represents the time
> > > > > > > > > > > > > > > > > > in milliseconds when operations were blocked?
> > > > > > > > > > > > > > > > > > Since we have not only JMX, and metrics are now periodically exported
> > > > > > > > > > > > > > > > > > to some backend, it can give a clearer picture of how much time we
> > > > > > > > > > > > > > > > > > wait for resuming cache operations than an instant boolean indicator.
> > > > > > > > > > > > > > > > > > Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <[email protected]>:
> > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > For now, we have the getCurrentPmeDuration() metric, which does not
> > > > > > > > > > > > > > > > > > > correctly show the influence on the cluster. A PME can occur without
> > > > > > > > > > > > > > > > > > > blocking operations, for example on client node join/leave events.
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > I suggest adding a new metric - isOperationsBlockedByPme(). Together,
> > > > > > > > > > > > > > > > > > > these metrics will show the influence of the PME on the cluster and
> > > > > > > > > > > > > > > > > > > user operations.
> > > > > > > > > > > > > > > > > > > I have prepared a PR for this (Bot visa is green). [1] Can anyone
> > > > > > > > > > > > > > > > > > > take a look?
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <[email protected]>:
> > > > > > > > > > > > > > > > > > > > I think the administrator of an Ignite cluster should be able to
> > > > > > > > > > > > > > > > > > > > monitor all Ignite processes, including non-blocking PME.
> > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > Found a PME metric - getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > It seems it shows exactly the PME time and is not so useful because
> > > > > > > > > > > > > > > > > > > > > of this.
> > > > > > > > > > > > > > > > > > > > > The goal is to show exactly the blocking period.
> > > > > > > > > > > > > > > > > > > > > When PME causes no blocking, it's a good PME and I see no reason to
> > > > > > > > > > > > > > > > > > > > > have monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone the implementation of these metrics?
> > > > > > > > > > > > > > > > > > > > > > For now, the implementation of a new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > I think we can implement these metrics as a single contribution.
> > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is one simple metric: are operations
> > > > > > > > > > > > > > > > > > > > > > > blocked?
> > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > Let's start from this.
> > > > > > > > > > > > > > > > > > > > > > > All other metrics can be extracted from logs now and can be
> > > > > > > > > > > > > > > > > > > > > > > implemented later.
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <[email protected]> wrote:
> > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead.
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > Tue, 16 Jul 2019, 11:45 Nikita Amelchev <[email protected]>:
> > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > I suggest adding some useful metrics about the partition map
> > > > > > > > > > > > > > > > > > > > > > > > > exchange (PME). For now, the duration of PME stages is available
> > > > > > > > > > > > > > > > > > > > > > > > > only in log files and cannot be obtained using JMX or other
> > > > > > > > > > > > > > > > > > > > > > > > > external tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > > I made a list of local node metrics that help to understand the
> > > > > > > > > > > > > > > > > > > > > > > > > actual status of the current PME:
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting
> > > > > > > > > > > > > > > > > > > > > > > > > for all updates and transactions on a previous topology.
> > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full
> > > > > > > > > > > > > > > > > > > > > > > > > message.
> > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > - how long the PME was (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > - how long the wait for all updates to complete was.
> > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks the PME (didn't send a single message)
> > > > > > > > > > > > > > > > > > > > > > > > > - what triggered the PME.
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > Best wishes,
> > > > > > > > > > Amelchev Nikita
> > > > > > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Zhenya Stanilovsky
> > > > > 
> > > > > 
> > > > > --
> > > > > Best wishes,
> > > > > Amelchev Nikita
> > > > > 
> > > 
> > > 
> > > --
> > > Best wishes,
> > > Amelchev Nikita
