I think the exact time should be obtained from logs, shouldn't it?

Thu, 25 Jul 2019, 20:00 Pavel Kovalenko <[email protected]>:

> Nikolay,
>
> Yes, I had a chance to look at HistogramMetric and, moreover, reviewed it :)
> My question was mostly about what exactly we will track in the histogram.
> If we use a histogram, do you know how we can find the exact time, e.g. when
> a PME that took longer than 1s happened?
>
> Thu, 25 Jul 2019 at 19:24, Nikolay Izhikov <[email protected]>:
>
> > Pavel
> >
> > Have you had a chance to look at the HistogramMetric source?
> > It is in master now.
> > Looking at the source would be better than my explanation :)
> >
> > We should count PME processes that block operations for some amount of
> > time. For example: [less than 50, less than 250, less than 1000, more than
> > 1000] millis.
> >
> > Thu, 25 Jul 2019, 18:55 Pavel Kovalenko <[email protected]>:
> >
> > > Nikolay,
> > >
> > > Could you please explain in more detail what the structure of the PME
> > > histogram will be?
> > >
> > > Thu, 25 Jul 2019 at 11:56, Nikolay Izhikov <[email protected]>:
> > >
> > > > Hello, Nikita.
> > > >
> > > > I think
> > > >
> > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > all blocking durations that happen after node starts.
> > > >
> > > > No, we don't need it.
> > > >
> > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > >
> > > > Yes, we need it.
> > > >
> > > > On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> > > > > Igniters,
> > > > >
> > > > > Everyone wants to see the cacheOperationsBlockedDuration metric that will
> > > > > show current blocking duration or 0 if there is no blocking right now.
> > > > >
> > > > > Do we need the following metrics? It seems one of them will be superfluous.
> > > > > 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> > > > > all blocking durations that happen after node starts.
> > > > > 2. Blocking duration histogram. Based on the HistogramMetric class.
> > > > > User will be able to configure bounds.
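
To make the histogram proposal quoted above concrete, here is a minimal Java sketch of a bucketed counter for blocking durations, assuming the bounds from Nikolay's example (50, 250, 1000 ms). `BlockedDurationHistogram` and its methods are hypothetical names used only for illustration. This is not Ignite's `HistogramMetric` class, and the choice that a value equal to a bound falls into the lower bucket is an assumption. The sketch also shows why bucket counts alone cannot answer Pavel's question about when a long PME happened: only counters are kept, no timestamps.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical illustration only; not Ignite's HistogramMetric. */
final class BlockedDurationHistogram {
    /** Upper bounds of the buckets in milliseconds, sorted ascending. */
    private final long[] bounds;

    /** One counter per bound plus a final overflow bucket. */
    private final AtomicLongArray counts;

    BlockedDurationHistogram(long... bounds) {
        this.bounds = bounds.clone();
        Arrays.sort(this.bounds);
        this.counts = new AtomicLongArray(this.bounds.length + 1);
    }

    /** Records one finished blocking period. */
    void record(long durationMillis) {
        int idx = 0;

        // Find the first bucket whose upper bound is >= the duration (assumed semantics).
        while (idx < bounds.length && durationMillis > bounds[idx])
            idx++;

        counts.incrementAndGet(idx);
    }

    /** Returns a copy of the bucket counters. */
    long[] snapshot() {
        long[] res = new long[counts.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = counts.get(i);

        return res;
    }

    public static void main(String[] args) {
        BlockedDurationHistogram h = new BlockedDurationHistogram(50, 250, 1000);

        h.record(20);
        h.record(300);
        h.record(4000);

        // Buckets: <= 50, <= 250, <= 1000, > 1000.
        System.out.println(Arrays.toString(h.snapshot()));
    }
}
```

A monitoring system that scrapes such counters periodically can see in which scrape interval the "> 1000" bucket grew, which narrows the time down to the scrape period; the exact moment would still have to come from logs.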
> > > > >
> > > > > Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <[email protected]>:
> > > > > >
> > > > > > Guys.
> > > > > >
> > > > > > I think we should go with these 2 metrics:
> > > > > >
> > > > > > * current PME duration (resets on finish)
> > > > > >
> > > > > >   This metric is required for alerting (or automatic actions) on long PME.
> > > > > >
> > > > > > * PME duration histogram (value added to the metric on PME finish)
> > > > > >
> > > > > >   This metric is required for:
> > > > > >     * Quick PME trend analysis
> > > > > >     * Quick PME history analysis
> > > > > >
> > > > > > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > > > > > Nikita and Maxim,
> > > > > > >
> > > > > > > > What if we just update current metric getCurrentPmeDuration behaviour
> > > > > > > > to show durations only for blocking PMEs?
> > > > > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > > > > >
> > > > > > > > No other changes will be required.
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > >
> > > > > > > I agree with these two metrics. I also think that the current
> > > > > > > getCurrentPmeDuration will become redundant.
> > > > > > >
> > > > > > > Anton,
> > > > > > >
> > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > "monitoring".
> > > > > > > > It should not be interesting for a real admin what phase of PME is in
> > > > > > > > progress and so on.
> > > > > > >
> > > > > > > PME is a mission-critical cluster process. I agree that there's a fine
> > > > > > > line between monitoring and debug here. However, it's not good to add
> > > > > > > monitoring capabilities only for the scenario when everything is alright.
> > > > > > > If PME really hangs, a *real admin* will be extremely interested in how
> > > > > > > to return the cluster to a working state. Metrics about stage completion
> > > > > > > times may really help here: e.g. if one specific node hasn't completed
> > > > > > > stage X while the rest of the cluster has, it can be a signal that this
> > > > > > > node should be killed.
> > > > > > >
> > > > > > > Of course, it's possible to build a monitoring system that extracts this
> > > > > > > information from logs, but:
> > > > > > > - It's more resource intensive, as it requires parsing logs all the time
> > > > > > > - It's less reliable, as log messages may change
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Ivan Rakov
> > > > > > >
> > > > > > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > +1 to Anton's post.
> > > > > > > >
> > > > > > > > What if we just update current metric getCurrentPmeDuration behaviour
> > > > > > > > to show durations only for blocking PMEs?
> > > > > > > > Keep it as a long value and rename it to getCacheOperationsBlockedDuration.
> > > > > > > >
> > > > > > > > No other changes will be required.
> > > > > > > >
> > > > > > > > WDYT?
> > > > > > > >
> > > > > > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <[email protected]> wrote:
> > > > > > > > > Nikolay,
> > > > > > > > >
> > > > > > > > > The cacheOperationsBlockedDuration metric will show current blocking
> > > > > > > > > duration or 0 if there is no blocking right now.
> > > > > > > > >
> > > > > > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > > > > > blocking durations that happen after node starts.
> > > > > > > > >
> > > > > > > > > Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <[email protected]>:
> > > > > > > > > > Nikita
> > > > > > > > > >
> > > > > > > > > > What is the difference between those two metrics?
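
The difference between the two metrics that Nikolay asks about can be illustrated with a small Java sketch. It assumes the semantics Nikita describes: the "current" value is non-zero only while operations are blocked, and the "total" value grows when a blocking period ends. `BlockedDurationTracker`, `onBlockStart` and `onBlockEnd` are hypothetical names, and the injectable clock exists only to make the example deterministic; none of this is taken from the Ignite code base.

```java
import java.util.function.LongSupplier;

/** Hypothetical tracker; metric names mirror the discussion, not Ignite code. */
final class BlockedDurationTracker {
    /** Source of the current time in milliseconds (injectable for testing). */
    private final LongSupplier clockMillis;

    /** Whether cache operations are blocked right now. */
    private boolean blocked;

    /** Start time of the current blocking period; meaningful only while blocked. */
    private long blockStart;

    /** Sum of all finished blocking periods since start. */
    private long total;

    BlockedDurationTracker(LongSupplier clockMillis) {
        this.clockMillis = clockMillis;
    }

    /** Called when cache operations become blocked. */
    synchronized void onBlockStart() {
        if (!blocked) {
            blocked = true;
            blockStart = clockMillis.getAsLong();
        }
    }

    /** Called when cache operations resume; accumulates the finished period. */
    synchronized void onBlockEnd() {
        if (blocked) {
            total += clockMillis.getAsLong() - blockStart;
            blocked = false;
        }
    }

    /** Current blocking duration, or 0 if there is no blocking right now. */
    synchronized long cacheOperationsBlockedDuration() {
        return blocked ? clockMillis.getAsLong() - blockStart : 0;
    }

    /** All blocking durations accumulated since the tracker was created. */
    synchronized long totalCacheOperationsBlockedDuration() {
        return total;
    }

    public static void main(String[] args) {
        long[] now = {1000};
        BlockedDurationTracker t = new BlockedDurationTracker(() -> now[0]);

        t.onBlockStart();
        now[0] = 1300;
        System.out.println("current=" + t.cacheOperationsBlockedDuration()
            + " total=" + t.totalCacheOperationsBlockedDuration());

        t.onBlockEnd();
        System.out.println("current=" + t.cacheOperationsBlockedDuration()
            + " total=" + t.totalCacheOperationsBlockedDuration());
    }
}
```

In this sketch the current value is suited to alerting on a blocking period that is still in progress, while the total is only updated once a period has finished.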
> > > > > > > > > >
> > > > > > > > > > Wed, 24 Jul 2019, 12:45 Nikita Amelchev <[email protected]>:
> > > > > > > > > >
> > > > > > > > > > > Igniters, thanks for the comments.
> > > > > > > > > > >
> > > > > > > > > > > From the discussion it can be seen that we need only two metrics for now:
> > > > > > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > > > > > >
> > > > > > > > > > > I will prepare a PR shortly.
> > > > > > > > > > >
> > > > > > > > > > > Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <[email protected]>:
> > > > > > > > > > > >
> > > > > > > > > > > > +1 to Anton's decisions.
> > > > > > > > > > > >
> > > > > > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <[email protected]>:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > > > > > > > "monitoring".
> > > > > > > > > > > > > It should not be interesting for a real admin what phase of PME is in
> > > > > > > > > > > > > progress and so on.
> > > > > > > > > > > > > The interesting metrics are
> > > > > > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > > > > > > > Duration of the current blocking period can be easily presented using any
> > > > > > > > > > > > > modern monitoring tool by regular checks.
> > > > > > > > > > > > > The initial true will mean "period start"; precision will be a result of
> > > > > > > > > > > > > the check frequency.
> > > > > > > > > > > > > Anyway, I'm ok with having the current metric presented as a long, where
> > > > > > > > > > > > > the long is a duration. I see no reason for it, but ok :)
> > > > > > > > > > > > >
> > > > > > > > > > > > > All other features you mentioned are useful for code or
> > > > > > > > > > > > > deployment improvement and can (should) be taken from logs at the analysis
> > > > > > > > > > > > > phase.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <[email protected]> wrote:
> > > > > > > > > > > > > > Folks, let me step in.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > > > > > > > updates and transactions on a previous topology.
> > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When a new PME starts, all these metrics are reset.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > > > > > > > I think that it would be better to change the format of metrics 4, 5, 6 and
> > > > > > > > > > > > > > 7 a bit: we can keep only the difference between the time of the previous
> > > > > > > > > > > > > > event and the time of the corresponding event. Such metrics would be easier
> > > > > > > > > > > > > > to perceive: they answer specific questions such as "how much time did
> > > > > > > > > > > > > > partition release take?" or "how much time did awaiting of distributed
> > > > > > > > > > > > > > phase end take?".
> > > > > > > > > > > > > > Also, if the results of 4, 5, 6, 7 are exported to a monitoring system,
> > > > > > > > > > > > > > graphs will show how the times of different stages change from one PME to
> > > > > > > > > > > > > > another.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I see no reason to have
> > > > > > > > > > > > > > > monitoring related to it
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > > > > > > > just complicate monitoring.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > > > > > I still don't understand why the instant value indicating that operations
> > > > > > > > > > > > > > > are blocked should be a boolean.
> > > > > > > > > > > > > > > The duration since blocking started looks more appropriate and useful.
> > > > > > > > > > > > > > > It gives more information while the semantics stay the same.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Totally agree with Pavel here. Both "accumulated block time" and
> > > > > > > > > > > > > > "current PME block time" metrics are useful. Growth of the accumulated
> > > > > > > > > > > > > > metric over a specific period of time (should be easy to check via a
> > > > > > > > > > > > > > monitoring system graph) will show for how long business operations were
> > > > > > > > > > > > > > blocked in total, and a non-zero current metric will show that we are
> > > > > > > > > > > > > > experiencing issues right now. A boolean metric "are we blocked right now"
> > > > > > > > > > > > > > is not needed, as it obviously can be inferred from "current PME block
> > > > > > > > > > > > > > time".
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best Regards,
> > > > > > > > > > > > > > Ivan Rakov
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I agree with the total blocking duration metric, but
> > > > > > > > > > > > > > > I still don't understand why the instant value indicating that operations
> > > > > > > > > > > > > > > are blocked should be a boolean.
> > > > > > > > > > > > > > > The duration since blocking started looks more appropriate and useful.
> > > > > > > > > > > > > > > It gives more information while the semantics stay the same.
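
Ivan's suggestion in the quoted message above, to report stages 4 to 7 as differences between consecutive events rather than as absolute timestamps, can be sketched in Java as follows. The class name, the method and the `...Duration` keys are hypothetical and chosen only for this example; the parameter names follow Nikita's proposed list, with `recieveFullMessageTime` spelled `receiveFullMessageTime` here. All timestamps are assumed to come from the same node-local clock.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; converts PME stage timestamps into per-stage durations. */
final class PmeStageDurations {
    private PmeStageDurations() {
        // Static helper only.
    }

    /**
     * Each duration is the difference between a stage's timestamp and the
     * timestamp of the previous stage, all in milliseconds.
     */
    static Map<String, Long> fromTimestamps(
        long initTime,
        long partitionReleaseTime,
        long sendSingleMessageTime,
        long receiveFullMessageTime,
        long finishTime
    ) {
        // LinkedHashMap keeps the stages in the order they happen.
        Map<String, Long> res = new LinkedHashMap<>();

        res.put("partitionReleaseDuration", partitionReleaseTime - initTime);
        res.put("sendSingleMessageDuration", sendSingleMessageTime - partitionReleaseTime);
        res.put("receiveFullMessageDuration", receiveFullMessageTime - sendSingleMessageTime);
        res.put("finishDuration", finishTime - receiveFullMessageTime);

        return res;
    }

    public static void main(String[] args) {
        // Made-up timestamps for illustration.
        Map<String, Long> d = fromTimestamps(1000, 1400, 1450, 2450, 2500);

        d.forEach((stage, millis) -> System.out.println(stage + " = " + millis + " ms"));
    }
}
```

With values of this shape, a graph per stage shows directly which stage grew from one PME to the next, without the reader having to subtract timestamps.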
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <[email protected]>:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Folks,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > All previous suggestions have some disadvantages. There can be several
> > > > > > > > > > > > > > > > exchanges between two metric updates, and a fast exchange can overwrite
> > > > > > > > > > > > > > > > a previous long exchange.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We can introduce a metric of total blocking duration that will
> > > > > > > > > > > > > > > > accumulate at the end of the exchange. So, users will get actual
> > > > > > > > > > > > > > > > information about how long operations were blocked. The cluster metric
> > > > > > > > > > > > > > > > will be the maximum of the local node metrics. And we need a boolean
> > > > > > > > > > > > > > > > metric that will indicate real-time status. It is needed because the
> > > > > > > > > > > > > > > > duration metric updates only at the end of the exchange.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > So I propose to change the current metric, which is not released yet, to
> > > > > > > > > > > > > > > > the totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <[email protected]>:
> > > > > > > > > > > > > > > > > Nikolay,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay > > > Izhikov < > > > > > > > > > > > > > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > Anton. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. Value exported based on SPI settings, > > not > > > > in the moment it > > > > > > > > > > > > > > > > > > > > > > changed. > > > > > > > > > > > > > > > > > > 2. Clock synchronisation - if we export > > start > > > > time, we should > > > > > > > > > > > > > > > > > > > > > > also > > > > > > > > > > > > > > > > export > > > > > > > > > > > > > > > > > > node local timestamp. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > пн, 22 июля 2019 г., 8:33 Anton > Vinogradov > > < > > > > [email protected] >: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Folks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > What's the reason for duration > counting? > > > > > > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature > > to > > > > count the durations. > > > > > > > > > > > > > > > > > > > Sine monitoring system checks metrics > > > > periodically it will know > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > duration by its own log. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel > > > > Kovalenko < > > > > > > > > > > > > > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp. 
> For > > > > the metric name, I > > > > > > > > > > > > > > > > > > > > > > suggest > > > > > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I > > > think > > > > it cleaner > > > > > > > > > > > > > > > > > > > > > > represents > > > > > > > > > > > > > > > > what > > > > > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > > > > > > blocked during PME. > > > > > > > > > > > > > > > > > > > > We can also combine both timestamp > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and > > > > > > > > > > > > > > > > > > > > duration to have better correlation > > when > > > > cache operations were > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > blocked > > > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > > > how much time it's taken. > > > > > > > > > > > > > > > > > > > > For instant view (like in JMX bean) a > > > > calculated value as you > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > mentioned > > > > > > > > > > > > > > > > > > > > can be used. > > > > > > > > > > > > > > > > > > > > For metrics are exported to some > > backend > > > > (IEP-35) a counter > > > > > > > > > > > > > > > > > > > > > > can be > > > > > > > > > > > > > > > > > > used. > > > > > > > > > > > > > > > > > > > > The counter is incremented by > blocking > > > > time after blocking has > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ended. > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. 
в 19:10, Nikita > > > > Amelchev < > > > > > > > > > > > > > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > Pavel, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The main purpose of this metric is > > > > > > > > > > > > > > > > > > > > > > > how much time we wait for > > resuming > > > > cache operations > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you > > mean > > > > timestamp or duration > > > > > > > > > > > > > > > > > > > > > > here? > > > > > > > > > > > > > > > > > > > > > > > What do you think if we change > > the > > > > boolean value of metric > > > > > > > > > > > > > > > > > > > > > > to a > > > > > > > > > > > > > > > > > > long > > > > > > > > > > > > > > > > > > > > > value that represents time in > > > > milliseconds when operations > > > > > > > > > > > > > > > > > > > > > > were > > > > > > > > > > > > > > > > > > blocked? > > > > > > > > > > > > > > > > > > > > > This time can be calculated as > > > > (currentTime - > > > > > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case > > of > > > > timestamp. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Duration will be more > understandable. > > > > It'll be something like > > > > > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But > I > > > > haven't come up with a > > > > > > > > > > > > > > > > > > > > > > better > > > > > > > > > > > > > > > > > > > > > name yet. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. 
в 18:30, Pavel > > > > Kovalenko < > > > > > > > > > > > > > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > > Nikita, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration > > doesn't > > > > show useful > > > > > > > > > > > > > > > > > > > > > > information. > > > > > > > > > > > > > > > > The > > > > > > > > > > > > > > > > > > > main > > > > > > > > > > > > > > > > > > > > > PME side effect for end-users is > > > > blocking cache operations. > > > > > > > > > > > > > > > > > > > > > > Not > > > > > > > > > > > > > > > > all > > > > > > > > > > > > > > > > > > PME > > > > > > > > > > > > > > > > > > > > > time blocks it. > > > > > > > > > > > > > > > > > > > > > > What information gives to an > > end-user > > > > timestamp of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For > > what > > > > analysis it can be > > > > > > > > > > > > > > > > > > > > > > used and > > > > > > > > > > > > > > > > > > how? > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 17:48, > Nikita > > > > Amelchev < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > > > Hi Pavel, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This time already can be > obtained > > > > from the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration > > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme > > > metrics. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As an alternative solution, I > can > > > > rework recently added > > > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric > (not > > > > released yet). Seems for > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > users it > > > > > > > > > > > > > > > > > > > > > > > useless in case of non-blocking > > > PME. > > > > > > > > > > > > > > > > > > > > > > > Lets name it > > > > timeSinceOperationsBlocked. It'll be timestamp > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > when > > > > > > > > > > > > > > > > > > > > > > > blocking started (minimal value > > of > > > > cluster nodes) and 0 if > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > blocking > > > > > > > > > > > > > > > > > > > > > > > ends (there is no running PME). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > WDYT? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 15:56, > > Pavel > > > > Kovalenko < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [email protected] >: > > > > > > > > > > > > > > > > > > > > > > > > Hi Nikita, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for working on > this. > > > > What do you think if we > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > change the > > > > > > > > > > > > > > > > > > > > > boolean > > > > > > > > > > > > > > > > > > > > > > > > value of metric to a long > value > > > > that represents time in > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > milliseconds > > > > > > > > > > > > > > > > > > > > > when > > > > > > > > > > > > > > > > > > > > > > > > operations were blocked? 
> > > > > > > > > > > > > > > > > > > > > > > > Since we have not only JMX > and > > > now > > > > metrics are periodically > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > exported > > > > > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > > > > > some backend it can give a > more > > > > clear picture of how much > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > time we > > > > > > > > > > > > > > > > > > > > > wait for > > > > > > > > > > > > > > > > > > > > > > > > resuming cache operations > > instead > > > > of instant boolean > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 14:41, > > > > Nikita Amelchev < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the support. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For now, we have the > > > > getCurrentPmeDuration() metric that > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > does > > > > > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > > > > > > > show > > > > > > > > > > > > > > > > > > > > > > > > > influence on the cluster > > > > correctly. PME can be without > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > blocking > > > > > > > > > > > > > > > > > > > > > > > > > operations. For example, > > client > > > > node join/leave events. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric - > > > > isOperationsBlockedByPme(). 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Together, > > > > > > > > > > > > > > > > > > > > > these > > > > > > > > > > > > > > > > > > > > > > > > > metrics will show influence > > of > > > > the PME on cluster and user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > operations. > > > > > > > > > > > > > > > > > > > > > > > > > I have prepared PR for this > > > (Bot > > > > visa is green). [1] Can > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > anyone > > > > > > > > > > > > > > > > > > > > > take a > > > > > > > > > > > > > > > > > > > > > > > > > look? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] > > > > https://issues.apache.org/jira/browse/IGNITE-11961 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июл. 2019 г. в > 14:58, > > > > Nikolay Izhikov < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > > > > > > I think administator of > > > Ignite > > > > cluster should be able to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > monitor > > > > > > > > > > > > > > > > > > > > > all > > > > > > > > > > > > > > > > > > > > > > > > > Ignite process, including > non > > > > blocking PME. > > > > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 14:57 > > > > +0300, Anton Vinogradov пишет: > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW, > > > > > > > > > > > > > > > > > > > > > > > > > > > Found PME metric - > > > > getCurrentPmeDuration(). 
> > > > > > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly > > PME > > > > time and not so useful > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > because > > > > > > > > > > > > > > > > > > of > > > > > > > > > > > > > > > > > > > > > this. > > > > > > > > > > > > > > > > > > > > > > > > > > > The goal it so show > > exactly > > > > blocking period. > > > > > > > > > > > > > > > > > > > > > > > > > > > When PME cause no > > blocking, > > > > it's a good PME and I see > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > no > > > > > > > > > > > > > > > > > > > > > reason to have > > > > > > > > > > > > > > > > > > > > > > > > > > > monitoring related to > it > > :) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at > > > 2:50 > > > > PM Nikolay Izhikov < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [email protected] > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > Anton. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to > > > postpone > > > > implementation of this > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > metrics? > > > > > > > > > > > > > > > > > > > > > > > > > > > > For now, > implementation > > > of > > > > new metric is very simple. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think we can > > implement > > > > this metrics as a single > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > contribution. 
> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > Nikita,
> >
> > Looks like all we need now is one simple metric: are operations blocked?
> > Just a true or false.
> > Let's start with this.
> > All other metrics can be extracted from logs now and can be implemented
> > later.
> >
> > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <[email protected]>
> > wrote:
> >
> > > +1.
> > >
> > > Nikita, please, go ahead.
> > >
> > > Tue, 16 July 2019, 11:45 Nikita Amelchev <[email protected]>:
> > >
> > > > Hello, Igniters.
> > > >
> > > > I suggest adding some useful metrics about the partition map
> > > > exchange (PME). For now, the duration of PME stages is available
> > > > only in log files and cannot be obtained using JMX or other
> > > > external tools. [1]
> > > > I made a list of local node metrics that help to understand the
> > > > actual status of the current PME:
> > > >
> > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > 2. initTime. Time when PME started.
> > > > 3. initEvent. Event that triggered PME.
> > > > 4. partitionReleaseTime. Time when a node finished waiting for all
> > > > updates and transactions on the previous topology.
> > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > 7. finishTime. Time when PME ended.
> > > >
> > > > When a new PME starts, all these metrics are reset.
> > > >
> > > > These metrics help to understand:
> > > > - how long the PME took (current or previous).
> > > > - how long the node waited for all updates to complete.
> > > > - which node blocks PME (did not send a single message).
> > > > - what triggered PME.
> > > >
> > > > Thoughts?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > >
> > > > --
> > > > Best wishes,
> > > > Amelchev Nikita
> --
> Best wishes,
> Amelchev Nikita
>
> --
> Zhenya Stanilovsky
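
For illustration only, the seven local-node metrics proposed in the original message, plus the reset-on-new-PME rule, could be sketched in Java roughly as below. This is a minimal sketch under stated assumptions, not Ignite code: the class `PmeMetricsSketch`, its method names, and the event label strings are all hypothetical, timestamps are passed in as plain milliseconds so the example stays deterministic, and "exchange in progress" is used only as a rough stand-in for the "are operations blocked?" flag discussed in the thread.

```java
/** Hypothetical holder for the proposed local-node PME metrics; not an Ignite API. */
final class PmeMetricsSketch {
    // A value of 0 means "this stage has not been reached yet".
    private long initialVersion;
    private String initEvent;
    private long initTime;
    private long partitionReleaseTime;
    private long sendSingleMessageTime;
    private long recieveFullMessageTime; // spelling kept from the proposal
    private long finishTime;

    /** A new PME starts: record its origin and reset every stage timestamp. */
    synchronized void onExchangeInit(long topVer, String evt, long now) {
        initialVersion = topVer;
        initEvent = evt;
        initTime = now;
        partitionReleaseTime = 0;
        sendSingleMessageTime = 0;
        recieveFullMessageTime = 0;
        finishTime = 0;
    }

    synchronized void onPartitionRelease(long now) { partitionReleaseTime = now; }

    synchronized void onSingleMessageSent(long now) { sendSingleMessageTime = now; }

    synchronized void onFullMessageReceived(long now) { recieveFullMessageTime = now; }

    synchronized void onExchangeFinish(long now) { finishTime = now; }

    synchronized long initialVersion() { return initialVersion; }

    synchronized String initEvent() { return initEvent; }

    /** True between init and finish; an assumed approximation of "operations blocked". */
    synchronized boolean isExchangeInProgress() {
        return initTime != 0 && finishTime == 0;
    }

    /** Duration of the current (still running) or previous (finished) PME. */
    synchronized long exchangeDuration(long now) {
        if (initTime == 0)
            return 0;
        return (finishTime != 0 ? finishTime : now) - initTime;
    }

    /** Time spent waiting for updates on the previous topology to complete. */
    synchronized long partitionReleaseWait(long now) {
        if (initTime == 0)
            return 0;
        return (partitionReleaseTime != 0 ? partitionReleaseTime : now) - initTime;
    }

    /** True if this node has not yet sent its single message for the current PME. */
    synchronized boolean isSingleMessagePending() {
        return isExchangeInProgress() && sendSingleMessageTime == 0;
    }
}
```

A caller would invoke `onExchangeInit(...)` when an exchange begins and the other `on...` methods as each stage completes; the derived getters then answer the questions listed in the message (how long the PME took, how long the wait for updates was, whether this node still owes a single message, and what triggered the exchange).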
