Hello, Nikita.

I think

> 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> all blocking durations that happen after node starts.

No, we don't need it.

> 2. Blocking duration histogram. Based on the HistogramMetric class.

Yes, we need it.

On Thu, 25/07/2019 at 11:50 +0300, Nikita Amelchev wrote:
> Igniters,
>
> All want to see the cacheOperationsBlockedDuration metric that will
> show current blocking duration or 0 if there is no blocking right now.
>
> Do we need the following metrics? It seems one of them will be superfluous.
> 1. The totalCacheOperationsBlockedDuration metric that will accumulate
> all blocking durations that happen after node starts.
> 2. Blocking duration histogram. Based on the HistogramMetric class.
> User will be able to configure bounds.
>
> On Wed, 24 Jul 2019 at 18:26, Nikolay Izhikov <nizhi...@apache.org> wrote:
> >
> > Guys.
> >
> > I think we should go with the 2 metrics
> >
> > * current PME duration (resets on finish)
> >
> > This metric is required for alerting (or automatic actions) on
> > long PME.
> >
> > * PME duration histogram (value added to metrics on PME finish)
> > This metric is required for:
> > * Quick PME trend analysis
> > * Quick PME history analysis
> >
> >
> > On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> > > Nikita and Maxim,
> > >
> > > > What if we just update current metric getCurrentPmeDuration behaviour
> > > > to show durations only for blocking PMEs?
> > > > Remain it as a long value and rename it to
> > > > getCacheOperationsBlockedDuration.
> > > >
> > > > No other changes will require.
> > > >
> > > > WDYT?
> > >
> > > I agree with these two metrics. I also think that current
> > > getCurrentPmeDuration will become redundant.
> > >
> > > Anton,
> > >
> > > > It looks like we're trying to implement "extended debug" instead of
> > > > "monitoring".
> > > > It should not be interesting for real admin what phase of PME is in
> > > > progress and so on.
> > >
> > > PME is mission critical cluster process.
> > > I agree that there's a fine
> > > line between monitoring and debug here. However, it's not good to add
> > > monitoring capabilities only for scenario when everything is alright.
> > > If PME will really hang, *real admin* will be extremely interested how
> > > to return cluster back to working state. Metrics about stages completion
> > > time may really help here: e.g. if one specific node hasn't completed
> > > stage X while rest of the cluster has, it can be a signal that this node
> > > should be killed.
> > >
> > > Of course, it's possible to build monitoring system that extract this
> > > information from logs, but:
> > > - It's more resource intensive as it requires parsing logs for all the
> > > time
> > > - It's less reliable as log messages may change
> > >
> > > Best Regards,
> > > Ivan Rakov
> > >
> > > On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > > > Folks,
> > > >
> > > > +1 with Anton post.
> > > >
> > > > What if we just update current metric getCurrentPmeDuration behaviour
> > > > to show durations only for blocking PMEs?
> > > > Remain it as a long value and rename it to
> > > > getCacheOperationsBlockedDuration.
> > > >
> > > > No other changes will require.
> > > >
> > > > WDYT?
> > > >
> > > > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelc...@gmail.com>
> > > > wrote:
> > > > > Nikolay,
> > > > >
> > > > > The cacheOperationsBlockedDuration metric will show current blocking
> > > > > duration or 0 if there is no blocking right now.
> > > > >
> > > > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > > > blocking durations that happen after node starts.
> > > > >
> > > > > On Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > Nikita
> > > > > >
> > > > > > What is the difference between those two metrics?
> > > > > >
> > > > > > On Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > >
> > > > > > > Igniters, thanks for comments.
> > > > > > >
> > > > > > > From the discussion it can be seen that we need only two metrics
> > > > > > > for now:
> > > > > > > - cacheOperationsBlockedDuration (long)
> > > > > > > - totalCacheOperationsBlockedDuration (long)
> > > > > > >
> > > > > > > I will prepare PR at the nearest time.
> > > > > > >
> > > > > > > On Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky
> > > > > > > <arzamas...@mail.ru.invalid> wrote:
> > > > > > > >
> > > > > > > > +1 with Anton decisions.
> > > > > > > >
> > > > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov
> > > > > > > > > <a...@apache.org>:
> > > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > It looks like we're trying to implement "extended debug"
> > > > > > > > > instead of "monitoring".
> > > > > > > > > It should not be interesting for real admin what phase of PME
> > > > > > > > > is in progress and so on.
> > > > > > > > > Interested metrics are
> > > > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > > > - are we blocked right now (shows we have an SLA degradation
> > > > > > > > > right now)
> > > > > > > > > Duration of the current blocking period can be easily
> > > > > > > > > presented using any modern monitoring tool by regular checks.
> > > > > > > > > Initial true will mean "period start", precision will be a
> > > > > > > > > result of checks frequency.
> > > > > > > > > Anyway, I'm ok to have current metric presented with long,
> > > > > > > > > where long is a duration, see no reason, but ok :)
> > > > > > > > >
> > > > > > > > > All other features you mentioned are useful for code or
> > > > > > > > > deployment improving and can (should) be taken from logs at
> > > > > > > > > the analysis phase.
> > > > > > > > > > > > > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < > > > > > > > > > ivan.glu...@gmail.com > > > > > > > > > > > > > > > wrote: > > > > > > > > > > Folks, let me step in. > > > > > > > > > > > > > > > > > > > > Nikita, thanks for your suggestions! > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the > > > > > > > > > > > exchange. > > > > > > > > > > > 2. initTime. Time PME was started. > > > > > > > > > > > 3. initEvent. Event that triggered PME. > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished > > > > > > > > > > > waiting for > > > > > > > > > > > > > > all > > > > > > > > > > > updates and translations on a previous topology. > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single > > > > > > > > > > > message. > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a > > > > > > > > > > > full message. > > > > > > > > > > > 7. finishTime. Time PME was ended. > > > > > > > > > > > > > > > > > > > > > > When new PME started all these metrics resets. > > > > > > > > > > > > > > > > > > > > Every metric from Nikita's list looks useful and simple to > > > > > > > > > > implement. > > > > > > > > > > I think that it would be better to change format of metrics > > > > > > > > > > 4, 5, 6 > > > > > > > > > > > > > > and > > > > > > > > > > 7 a bit: we can keep only difference between time of > > > > > > > > > > previous event > > > > > > > > > > > > > > and > > > > > > > > > > time of corresponding event. Such metrics would be easier > > > > > > > > > > to perceive: > > > > > > > > > > they answer to specific questions "how much time did > > > > > > > > > > partition release > > > > > > > > > > take?" or "how much time did awaiting of distributed phase > > > > > > > > > > end take?". 
> > > > > > > > > > Also, if results of 4, 5, 6, 7 will be exported to > > > > > > > > > > monitoring system, > > > > > > > > > > graphs will show how different stages times change from one > > > > > > > > > > PME to > > > > > > > > > > > > > > another. > > > > > > > > > > > When PME cause no blocking, it's a good PME and I see no > > > > > > > > > > > reason to > > > > > > > > > > > > > > have > > > > > > > > > > > monitoring related to it > > > > > > > > > > > > > > > > > > > > Agree with Anton here. These metrics should be measured > > > > > > > > > > only for true > > > > > > > > > > distributed exchange. Saving results for client leave/join > > > > > > > > > > PMEs will > > > > > > > > > > just complicate monitoring. > > > > > > > > > > > > > > > > > > > > > I agree with total blocking duration metric but > > > > > > > > > > > I still don't understand why instant value indicating that > > > > > > > > > > > > > > operations are > > > > > > > > > > > blocked should be boolean. > > > > > > > > > > > Duration time since blocking has started looks more > > > > > > > > > > > appropriate and > > > > > > > > > > > > > > > > > > > > useful. > > > > > > > > > > > It gives more information while semantic is left the same. > > > > > > > > > > > > > > > > > > > > Totally agree with Pavel here. Both "accumulated block > > > > > > > > > > time" and > > > > > > > > > > "current PME block time" metrics are useful. Growth of > > > > > > > > > > accumulated > > > > > > > > > > metric for specific period of time (should be easy to check > > > > > > > > > > via > > > > > > > > > > monitoring system graph) will show for how much business > > > > > > > > > > operations > > > > > > > > > > > > > > were > > > > > > > > > > blocked in total, and non-zero current metric will show > > > > > > > > > > that we are > > > > > > > > > > experiencing issues right now. 
Boolean metric "are we > > > > > > > > > > blocked right > > > > > > > > > > > > > > now" > > > > > > > > > > is not needed as it's obviously can be inferred from > > > > > > > > > > "current PME > > > > > > > > > > > > > > block > > > > > > > > > > time". > > > > > > > > > > > > > > > > > > > > Best Regards, > > > > > > > > > > Ivan Rakov > > > > > > > > > > > > > > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote: > > > > > > > > > > > Nikita, > > > > > > > > > > > > > > > > > > > > > > I agree with total blocking duration metric but > > > > > > > > > > > I still don't understand why instant value indicating that > > > > > > > > > > > > > > operations are > > > > > > > > > > > blocked should be boolean. > > > > > > > > > > > Duration time since blocking has started looks more > > > > > > > > > > > appropriate and > > > > > > > > > > > > > > > > > > > > useful. > > > > > > > > > > > It gives more information while semantic is left the same. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 23 июл. 2019 г. в 11:42, Nikita Amelchev < > > > > > > > > > > > nsamelc...@gmail.com > > > > > > > > > > > > > > > > : > > > > > > > > > > > > Folks, > > > > > > > > > > > > > > > > > > > > > > > > All previous suggestions have some disadvantages. It > > > > > > > > > > > > can be several > > > > > > > > > > > > exchanges between two metric updates and fast exchange > > > > > > > > > > > > can rewrite > > > > > > > > > > > > previous long exchange. > > > > > > > > > > > > > > > > > > > > > > > > We can introduce a metric of total blocking duration > > > > > > > > > > > > that will > > > > > > > > > > > > accumulate at the end of the exchange. So, users will > > > > > > > > > > > > get actual > > > > > > > > > > > > information about how long operations were blocked. > > > > > > > > > > > > Cluster metric > > > > > > > > > > > > will be a maximum of local nodes metrics. 
And we need a > > > > > > > > > > > > boolean > > > > > > > > > > > > > > metric > > > > > > > > > > > > that will indicate realtime status. It needs because of > > > > > > > > > > > > duration > > > > > > > > > > > > metric updates at the end of the exchange. > > > > > > > > > > > > > > > > > > > > > > > > So I propose to change the current metric that not > > > > > > > > > > > > released to the > > > > > > > > > > > > totalCacheOperationsBlockingDuration metric and to add > > > > > > > > > > > > the > > > > > > > > > > > > isCacheOperationsBlocked metric. > > > > > > > > > > > > > > > > > > > > > > > > WDYT? > > > > > > > > > > > > > > > > > > > > > > > > пн, 22 июл. 2019 г. в 09:27, Anton Vinogradov < > > > > > > > > > > > > a...@apache.org >: > > > > > > > > > > > > > Nikolay, > > > > > > > > > > > > > > > > > > > > > > > > > > Still see no reason to replace boolean with long. > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov < > > > > > > > > > > > > > > nizhi...@apache.org > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > Anton. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. Value exported based on SPI settings, not in the > > > > > > > > > > > > > > moment it > > > > > > > > > > > > > > changed. > > > > > > > > > > > > > > 2. Clock synchronisation - if we export start time, > > > > > > > > > > > > > > we should > > > > > > > > > > > > > > also > > > > > > > > > > > > export > > > > > > > > > > > > > > node local timestamp. > > > > > > > > > > > > > > > > > > > > > > > > > > > > пн, 22 июля 2019 г., 8:33 Anton Vinogradov < > > > > > > > > > > > > > > a...@apache.org >: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Folks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > What's the reason for duration counting? > > > > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count > > > > > > > > > > > > > > > the durations. 
> > > > > > > > > > > > > > > Sine monitoring system checks metrics > > > > > > > > > > > > > > > periodically it will know > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > duration by its own log. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko < > > > > > > > > > > > > > > jokse...@gmail.com > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, I mean duration not timestamp. For the > > > > > > > > > > > > > > > > metric name, I > > > > > > > > > > > > > > suggest > > > > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I think it > > > > > > > > > > > > > > > > cleaner > > > > > > > > > > > > > > represents > > > > > > > > > > > > what > > > > > > > > > > > > > > is > > > > > > > > > > > > > > > > blocked during PME. > > > > > > > > > > > > > > > > We can also combine both timestamp > > > > > > > > > > > > > > > > > > > > > > > > "cacheOperationsBlockingStartTs" and > > > > > > > > > > > > > > > > duration to have better correlation when cache > > > > > > > > > > > > > > > > operations were > > > > > > > > > > > > > > > > > > > > > > > > blocked > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > how much time it's taken. > > > > > > > > > > > > > > > > For instant view (like in JMX bean) a > > > > > > > > > > > > > > > > calculated value as you > > > > > > > > > > > > > > > > > > > > > > > > mentioned > > > > > > > > > > > > > > > > can be used. > > > > > > > > > > > > > > > > For metrics are exported to some backend > > > > > > > > > > > > > > > > (IEP-35) a counter > > > > > > > > > > > > > > can be > > > > > > > > > > > > > > used. > > > > > > > > > > > > > > > > The counter is incremented by blocking time > > > > > > > > > > > > > > > > after blocking has > > > > > > > > > > > > > > > > > > > > > > > > ended. 
> > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 19:10, Nikita Amelchev < > > > > > > > > > > > > > > nsamelc...@gmail.com > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > Pavel, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The main purpose of this metric is > > > > > > > > > > > > > > > > > > > how much time we wait for resuming cache > > > > > > > > > > > > > > > > > > > operations > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean > > > > > > > > > > > > > > > > > timestamp or duration > > > > > > > > > > > > > > here? > > > > > > > > > > > > > > > > > > > What do you think if we change the > > > > > > > > > > > > > > > > > > > boolean value of metric > > > > > > > > > > > > > > to a > > > > > > > > > > > > > > long > > > > > > > > > > > > > > > > > value that represents time in milliseconds > > > > > > > > > > > > > > > > > when operations > > > > > > > > > > > > > > were > > > > > > > > > > > > > > blocked? > > > > > > > > > > > > > > > > > This time can be calculated as (currentTime - > > > > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of > > > > > > > > > > > > > > > > > timestamp. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Duration will be more understandable. It'll > > > > > > > > > > > > > > > > > be something like > > > > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I haven't > > > > > > > > > > > > > > > > > come up with a > > > > > > > > > > > > > > better > > > > > > > > > > > > > > > > > name yet. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. 
в 18:30, Pavel Kovalenko < > > > > > > > > > > > > > > jokse...@gmail.com > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > Nikita, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't show > > > > > > > > > > > > > > > > > > useful > > > > > > > > > > > > > > information. > > > > > > > > > > > > The > > > > > > > > > > > > > > > main > > > > > > > > > > > > > > > > > PME side effect for end-users is blocking > > > > > > > > > > > > > > > > > cache operations. > > > > > > > > > > > > > > Not > > > > > > > > > > > > all > > > > > > > > > > > > > > PME > > > > > > > > > > > > > > > > > time blocks it. > > > > > > > > > > > > > > > > > > What information gives to an end-user > > > > > > > > > > > > > > > > > > timestamp of > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For what > > > > > > > > > > > > > > > > > analysis it can be > > > > > > > > > > > > > > used and > > > > > > > > > > > > > > how? > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 17:48, Nikita > > > > > > > > > > > > > > > > > > Amelchev < > > > > > > > > > > > > > > > > > > > > > > > > nsamelc...@gmail.com > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > Hi Pavel, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This time already can be obtained from the > > > > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration > > > > > > > > > > > > > > and > > > > > > > > > > > > > > > > > > > new isOperationsBlockedByPme metrics. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > As an alternative solution, I can rework > > > > > > > > > > > > > > > > > > > recently added > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not > > > > > > > > > > > > > > > > > > > released yet). 
Seems for > > > > > > > > > > > > > > > > > > > > > > > > users it > > > > > > > > > > > > > > > > > > > useless in case of non-blocking PME. > > > > > > > > > > > > > > > > > > > Lets name it timeSinceOperationsBlocked. > > > > > > > > > > > > > > > > > > > It'll be timestamp > > > > > > > > > > > > > > > > > > > > > > > > when > > > > > > > > > > > > > > > > > > > blocking started (minimal value of > > > > > > > > > > > > > > > > > > > cluster nodes) and 0 if > > > > > > > > > > > > > > > > > > > > > > > > blocking > > > > > > > > > > > > > > > > > > > ends (there is no running PME). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > WDYT? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 15:56, Pavel > > > > > > > > > > > > > > > > > > > Kovalenko < > > > > > > > > > > > > > > > > > > > > > > > > jokse...@gmail.com >: > > > > > > > > > > > > > > > > > > > > Hi Nikita, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for working on this. What do > > > > > > > > > > > > > > > > > > > > you think if we > > > > > > > > > > > > > > > > > > > > > > > > change the > > > > > > > > > > > > > > > > > boolean > > > > > > > > > > > > > > > > > > > > value of metric to a long value that > > > > > > > > > > > > > > > > > > > > represents time in > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > milliseconds > > > > > > > > > > > > > > > > > when > > > > > > > > > > > > > > > > > > > > operations were blocked? 
> > > > > > > > > > > > > > > > > > > > Since we have not only JMX and now > > > > > > > > > > > > > > > > > > > > metrics are periodically > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > exported > > > > > > > > > > > > > > > > > to > > > > > > > > > > > > > > > > > > > > some backend it can give a more clear > > > > > > > > > > > > > > > > > > > > picture of how much > > > > > > > > > > > > > > > > > > > > > > > > time we > > > > > > > > > > > > > > > > > wait for > > > > > > > > > > > > > > > > > > > > resuming cache operations instead of > > > > > > > > > > > > > > > > > > > > instant boolean > > > > > > > > > > > > > > > > > > > > > > > > indicator. > > > > > > > > > > > > > > > > > > > > пт, 19 июл. 2019 г. в 14:41, Nikita > > > > > > > > > > > > > > > > > > > > Amelchev < > > > > > > > > > > > > > > > > > > > > > > > > > > > > nsamelc...@gmail.com > > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > Anton, Nikolay, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the support. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For now, we have the > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration() metric that > > > > > > > > > > > > > > > > > > > > > > > > does > > > > > > > > > > > > > > not > > > > > > > > > > > > > > > > > show > > > > > > > > > > > > > > > > > > > > > influence on the cluster correctly. > > > > > > > > > > > > > > > > > > > > > PME can be without > > > > > > > > > > > > > > > > > > > > > > > > blocking > > > > > > > > > > > > > > > > > > > > > operations. For example, client node > > > > > > > > > > > > > > > > > > > > > join/leave events. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest add new metric - > > > > > > > > > > > > > > > > > > > > > isOperationsBlockedByPme(). 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Together, > > > > > > > > > > > > > > > > > these > > > > > > > > > > > > > > > > > > > > > metrics will show influence of the > > > > > > > > > > > > > > > > > > > > > PME on cluster and user > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > operations. > > > > > > > > > > > > > > > > > > > > > I have prepared PR for this (Bot visa > > > > > > > > > > > > > > > > > > > > > is green). [1] Can > > > > > > > > > > > > > > > > > > > > > > > > anyone > > > > > > > > > > > > > > > > > take a > > > > > > > > > > > > > > > > > > > > > look? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-11961 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июл. 2019 г. в 14:58, Nikolay > > > > > > > > > > > > > > > > > > > > > Izhikov < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > nizhi...@apache.org > > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > > I think administator of Ignite > > > > > > > > > > > > > > > > > > > > > > cluster should be able to > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > monitor > > > > > > > > > > > > > > > > > all > > > > > > > > > > > > > > > > > > > > > Ignite process, including non > > > > > > > > > > > > > > > > > > > > > blocking PME. > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 14:57 +0300, > > > > > > > > > > > > > > > > > > > > > > Anton Vinogradov пишет: > > > > > > > > > > > > > > > > > > > > > > > BTW, > > > > > > > > > > > > > > > > > > > > > > > Found PME metric - > > > > > > > > > > > > > > > > > > > > > > > getCurrentPmeDuration(). 
> > > > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME time > > > > > > > > > > > > > > > > > > > > > > > and not so useful > > > > > > > > > > > > > > > > > > > > > > > > because > > > > > > > > > > > > > > of > > > > > > > > > > > > > > > > > this. > > > > > > > > > > > > > > > > > > > > > > > The goal it so show exactly > > > > > > > > > > > > > > > > > > > > > > > blocking period. > > > > > > > > > > > > > > > > > > > > > > > When PME cause no blocking, it's > > > > > > > > > > > > > > > > > > > > > > > a good PME and I see > > > > > > > > > > > > > > > > > > > > > > > > no > > > > > > > > > > > > > > > > > reason to have > > > > > > > > > > > > > > > > > > > > > > > monitoring related to it :) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50 PM > > > > > > > > > > > > > > > > > > > > > > > Nikolay Izhikov < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > nizhi...@apache.org > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > Anton. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone > > > > > > > > > > > > > > > > > > > > > > > > implementation of this > > > > > > > > > > > > > > > > > > > > > > > > > > > > metrics? > > > > > > > > > > > > > > > > > > > > > > > > For now, implementation of new > > > > > > > > > > > > > > > > > > > > > > > > metric is very simple. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think we can implement this > > > > > > > > > > > > > > > > > > > > > > > > metrics as a single > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > contribution. 
> > > > > > > > > > > > > > > > > > > > > > > > В Вт, 16/07/2019 в 13:47 +0300, > > > > > > > > > > > > > > > > > > > > > > > > Anton Vinogradov > > > > > > > > > > > > > > > > > > > > > > > > пишет: > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is > > > > > > > > > > > > > > > > > > > > > > > > > a 1 simple metric: > > > > > > > > > > > > > > > > > > > > > > > > are > > > > > > > > > > > > > > > > > operations > > > > > > > > > > > > > > > > > > > > > blocked? > > > > > > > > > > > > > > > > > > > > > > > > > Just a true or false. > > > > > > > > > > > > > > > > > > > > > > > > > Lest start from this. > > > > > > > > > > > > > > > > > > > > > > > > > All other metrics can be > > > > > > > > > > > > > > > > > > > > > > > > > extracted from logs now > > > > > > > > > > > > > > > > > > > > > > > > and > > > > > > > > > > > > > > can > > > > > > > > > > > > > > > be > > > > > > > > > > > > > > > > > > > > > implemented > > > > > > > > > > > > > > > > > > > > > > > > > later. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 > > > > > > > > > > > > > > > > > > > > > > > > > PM Nikolay Izhikov < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > nizhi...@apache.org > > > > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > +1. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > вт, 16 июля 2019 г., 11:45 > > > > > > > > > > > > > > > > > > > > > > > > > > Nikita Amelchev < > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > nsamelc...@gmail.com > > > > > > > > > > > > > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add some > > > > > > > > > > > > > > > > > > > > > > > > > > > useful metrics about the > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > partition map > > > > > > > > > > > > > > > > > > > > > exchange > > > > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now, the > > > > > > > > > > > > > > > > > > > > > > > > > > > duration of PME stages > > > > > > > > > > > > > > > > > > > > > > > > > > > > available > > > > > > > > > > > > > > > > > only in > > > > > > > > > > > > > > > > > > > > > log > > > > > > > > > > > > > > > > > > > > > > > > files > > > > > > > > > > > > > > > > > > > > > > > > > > > and cannot be obtained > > > > > > > > > > > > > > > > > > > > > > > > > > > using JMX or other > > > > > > > > > > > > > > > > > > > > > > > > external > > > > > > > > > > > > > > > > > tools. [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > I made the list of local > > > > > > > > > > > > > > > > > > > > > > > > > > > node metrics that > > > > > > > > > > > > > > > > > > > > > > > > help to > > > > > > > > > > > > > > > > > understand > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > > > > > > > actual status of current > > > > > > > > > > > > > > > > > > > > > > > > > > > PME: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > Topology version that > > > > > > > > > > > > > > > > > > > > > > > > initiates > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > > > exchange. > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was > > > > > > > > > > > > > > > > > > > > > > > > > > > started. > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that > > > > > > > > > > > > > > > > > > > > > > > > > > > triggered PME. > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. > > > > > > > > > > > > > > > > > > > > > > > > > > > Time when a node has > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > finished > > > > > > > > > > > > > > > > > waiting > > > > > > > > > > > > > > > > > > > > > for > > > > > > > > > > > > > > > > > > > > > > > > all > > > > > > > > > > > > > > > > > > > > > > > > > > > updates and translations > > > > > > > > > > > > > > > > > > > > > > > > > > > on a previous > > > > > > > > > > > > > > > > > > > > > > > > topology. > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. > > > > > > > > > > > > > > > > > > > > > > > > > > > Time when a node > > > > > > > > > > > > > > > > > > > > > > > > sent a > > > > > > > > > > > > > > > > > single > > > > > > > > > > > > > > > > > > > > > message. > > > > > > > > > > > > > > > > > > > > > > > > > > > 6. > > > > > > > > > > > > > > > > > > > > > > > > > > > recieveFullMessageTime. > > > > > > > > > > > > > > > > > > > > > > > > > > > Time when a node > > > > > > > > > > > > > > > > > > > > > > > > > > > > received > > > > > > > > > > > > > > > a > > > > > > > > > > > > > > > > > full > > > > > > > > > > > > > > > > > > > > > message. > > > > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME > > > > > > > > > > > > > > > > > > > > > > > > > > > was ended. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > When new PME started all > > > > > > > > > > > > > > > > > > > > > > > > > > > these metrics resets. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to > > > > > > > > > > > > > > > > > > > > > > > > > > > understand: > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was > > > > > > > > > > > > > > > > > > > > > > > > > > > (current or previous). > > > > > > > > > > > > > > > > > > > > > > > > > > > - how long awaited for > > > > > > > > > > > > > > > > > > > > > > > > > > > all updates was > > > > > > > > > > > > > > > > > > > > > > > > completed. > > > > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks PME > > > > > > > > > > > > > > > > > > > > > > > > > > > (didn't send a single > > > > > > > > > > > > > > > > > > > > > > > > > > > > message) > > > > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thoughts? 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://issues.apache.org/jira/browse/IGNITE-11961 > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > > > > > > > > Best wishes, > > > > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > > > Best wishes, > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > > > Best wishes, > > > > > > > > > > > > > > > > > > > Amelchev Nikita > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > > > > > > Best wishes, > > > > > > > > > > > > > > > > > Amelchev Nikita > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Best wishes, > > > > > > > > > > > > Amelchev Nikita > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Zhenya Stanilovsky > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Best wishes, > > > > > > > Amelchev Nikita > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Best wishes, > > > > > Amelchev Nikita > > >
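
For reference, the two metrics settled on at the top of this thread (a current blocking duration that reads 0 when nothing is blocked, plus a blocking duration histogram with user-configurable bounds that is updated when blocking ends) could look roughly like the sketch below. It is only an illustration under stated assumptions: BlockedDurationTracker, its method names and the injected millisecond clock are hypothetical, it does not use Ignite's HistogramMetric class, and it is not the actual IGNITE-11961 patch.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the two metrics; not Ignite's implementation. */
class BlockedDurationTracker {
    /** Millisecond clock, injected so the sketch can be exercised deterministically. */
    private final LongSupplier clock;

    /** Ascending upper bounds of the histogram buckets, in milliseconds. */
    private final long[] bounds;

    /** One counter per bound plus a final overflow bucket. */
    private final AtomicLongArray buckets;

    /** Start time of the current blocking period, or -1 when nothing is blocked. */
    private volatile long blockedSince = -1;

    BlockedDurationTracker(long[] bounds, LongSupplier clock) {
        this.bounds = bounds.clone();
        this.buckets = new AtomicLongArray(bounds.length + 1);
        this.clock = clock;
    }

    /** Assumed to be called by a single exchange thread when cache operations get blocked. */
    void onBlockingStarted() {
        blockedSince = clock.getAsLong();
    }

    /** Records the finished blocking period in the histogram and resets the current value. */
    void onBlockingFinished() {
        long start = blockedSince;

        if (start < 0)
            return; // Nothing was blocked.

        blockedSince = -1;

        long duration = clock.getAsLong() - start;

        int idx = 0;

        // Find the first bucket whose upper bound is not exceeded.
        while (idx < bounds.length && duration > bounds[idx])
            idx++;

        buckets.incrementAndGet(idx);
    }

    /** Current blocking duration in milliseconds, or 0 if there is no blocking right now. */
    long currentBlockedDuration() {
        long start = blockedSince;

        return start < 0 ? 0 : clock.getAsLong() - start;
    }

    /** Snapshot of the bucket counters. */
    long[] histogram() {
        long[] res = new long[buckets.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = buckets.get(i);

        return res;
    }
}
```

With bounds of {100, 1000} ms, a blocking period of 250 ms would land in the second bucket, and the current duration would read 0 again once the blocking ends. An alerting rule could then watch the current value, while the histogram covers trend and history analysis.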