Guys,

I think we should go with these two metrics:

* Current PME duration (resets on finish).
  This metric is required for alerting (or automatic actions) on a long PME.
* PME duration histogram (a value is added to the metric on PME finish).
  This metric is required for:
    * Quick PME trend analysis
    * Quick PME history analysis

On Wed, 24/07/2019 at 15:01 +0300, Ivan Rakov wrote:
> Nikita and Maxim,
>
> > What if we just update current metric getCurrentPmeDuration behaviour
> > to show durations only for blocking PMEs?
> > Remain it as a long value and rename it to
> > getCacheOperationsBlockedDuration.
> >
> > No other changes will require.
> >
> > WDYT?
>
> I agree with these two metrics. I also think that current
> getCurrentPmeDuration will become redundant.
>
> Anton,
>
> > It looks like we're trying to implement "extended debug" instead of
> > "monitoring".
> > It should not be interesting for real admin what phase of PME is in
> > progress and so on.
>
> PME is mission critical cluster process. I agree that there's a fine
> line between monitoring and debug here. However, it's not good to add
> monitoring capabilities only for scenario when everything is alright.
> If PME will really hang, *real admin* will be extremely interested how
> to return cluster back to working state. Metrics about stages completion
> time may really help here: e.g. if one specific node hasn't completed
> stage X while rest of the cluster has, it can be a signal that this node
> should be killed.
>
> Of course, it's possible to build monitoring system that extract this
> information from logs, but:
> - It's more resource intensive as it requires parsing logs for all the time
> - It's less reliable as log messages may change
>
> Best Regards,
> Ivan Rakov
>
> On 24.07.2019 14:57, Maxim Muzafarov wrote:
> > Folks,
> >
> > +1 with Anton post.
> >
> > What if we just update current metric getCurrentPmeDuration behaviour
> > to show durations only for blocking PMEs?
> > Remain it as a long value and rename it to
> > getCacheOperationsBlockedDuration.
> >
> > No other changes will require.
> >
> > WDYT?
> >
> > On Wed, 24 Jul 2019 at 14:02, Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > Nikolay,
> > >
> > > The cacheOperationsBlockedDuration metric will show current blocking
> > > duration or 0 if there is no blocking right now.
> > >
> > > The totalCacheOperationsBlockedDuration metric will accumulate all
> > > blocking durations that happen after node starts.
> > >
> > > On Wed, 24 Jul 2019 at 13:35, Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > Nikita
> > > >
> > > > What is the difference between those two metrics?
> > > >
> > > > On Wed, 24 Jul 2019, 12:45 Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > Igniters, thanks for comments.
> > > > >
> > > > > From the discussion it can be seen that we need only two metrics for now:
> > > > > - cacheOperationsBlockedDuration (long)
> > > > > - totalCacheOperationsBlockedDuration (long)
> > > > >
> > > > > I will prepare PR at the nearest time.
> > > > >
> > > > > On Wed, 24 Jul 2019 at 09:11, Zhenya Stanilovsky <arzamas...@mail.ru.invalid> wrote:
> > > > > > +1 with Anton decisions.
> > > > > >
> > > > > > > Wednesday, 24 July 2019, 8:44 +03:00 from Anton Vinogradov <a...@apache.org>:
> > > > > > >
> > > > > > > Folks,
> > > > > > >
> > > > > > > It looks like we're trying to implement "extended debug" instead of
> > > > > > > "monitoring".
> > > > > > > It should not be interesting for real admin what phase of PME is in
> > > > > > > progress and so on.
> > > > > > > Interesting metrics are
> > > > > > > - total blocked time (will be used for real SLA counting)
> > > > > > > - are we blocked right now (shows we have an SLA degradation right now)
> > > > > > > Duration of the current blocking period can be easily presented using any
> > > > > > > modern monitoring tool by regular checks.
> > > > > > > Initial true will mean "period start", precision will be a result of
> > > > > > > checks frequency.
> > > > > > > Anyway, I'm ok to have current metric presented with long, where long is a
> > > > > > > duration, see no reason, but ok :)
> > > > > > >
> > > > > > > All other features you mentioned are useful for code or
> > > > > > > deployment improving and can (should) be taken from logs at the analysis
> > > > > > > phase.
> > > > > > >
> > > > > > > On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov <ivan.glu...@gmail.com> wrote:
> > > > > > > > Folks, let me step in.
> > > > > > > >
> > > > > > > > Nikita, thanks for your suggestions!
> > > > > > > >
> > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > >
> > > > > > > > > When new PME started all these metrics resets.
> > > > > > > >
> > > > > > > > Every metric from Nikita's list looks useful and simple to implement.
> > > > > > > > I think that it would be better to change format of metrics 4, 5, 6 and
> > > > > > > > 7 a bit: we can keep only difference between time of previous event and
> > > > > > > > time of corresponding event.
> > > > > > > > Such metrics would be easier to perceive:
> > > > > > > > they answer to specific questions "how much time did partition release
> > > > > > > > take?" or "how much time did awaiting of distributed phase end take?".
> > > > > > > > Also, if results of 4, 5, 6, 7 will be exported to monitoring system,
> > > > > > > > graphs will show how different stages times change from one PME to another.
> > > > > > > >
> > > > > > > > > When PME cause no blocking, it's a good PME and I see no reason to have
> > > > > > > > > monitoring related to it
> > > > > > > >
> > > > > > > > Agree with Anton here. These metrics should be measured only for true
> > > > > > > > distributed exchange. Saving results for client leave/join PMEs will
> > > > > > > > just complicate monitoring.
> > > > > > > >
> > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > I still don't understand why instant value indicating that operations are
> > > > > > > > > blocked should be boolean.
> > > > > > > > > Duration time since blocking has started looks more appropriate and useful.
> > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > >
> > > > > > > > Totally agree with Pavel here. Both "accumulated block time" and
> > > > > > > > "current PME block time" metrics are useful. Growth of accumulated
> > > > > > > > metric for specific period of time (should be easy to check via
> > > > > > > > monitoring system graph) will show for how much business operations were
> > > > > > > > blocked in total, and non-zero current metric will show that we are
> > > > > > > > experiencing issues right now.
> > > > > > > > Boolean metric "are we blocked right now" is not needed as it
> > > > > > > > obviously can be inferred from "current PME block time".
> > > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > Ivan Rakov
> > > > > > > >
> > > > > > > > On 23.07.2019 16:02, Pavel Kovalenko wrote:
> > > > > > > > > Nikita,
> > > > > > > > >
> > > > > > > > > I agree with total blocking duration metric but
> > > > > > > > > I still don't understand why instant value indicating that operations are
> > > > > > > > > blocked should be boolean.
> > > > > > > > > Duration time since blocking has started looks more appropriate and useful.
> > > > > > > > > It gives more information while semantic is left the same.
> > > > > > > > >
> > > > > > > > > On Tue, 23 Jul 2019 at 11:42, Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > > > > > > Folks,
> > > > > > > > > >
> > > > > > > > > > All previous suggestions have some disadvantages. It can be several
> > > > > > > > > > exchanges between two metric updates and fast exchange can rewrite
> > > > > > > > > > previous long exchange.
> > > > > > > > > >
> > > > > > > > > > We can introduce a metric of total blocking duration that will
> > > > > > > > > > accumulate at the end of the exchange. So, users will get actual
> > > > > > > > > > information about how long operations were blocked. Cluster metric
> > > > > > > > > > will be a maximum of local nodes metrics. And we need a boolean metric
> > > > > > > > > > that will indicate realtime status. It is needed because the duration
> > > > > > > > > > metric updates at the end of the exchange.
> > > > > > > > > >
> > > > > > > > > > So I propose to change the current metric that not released to the
> > > > > > > > > > totalCacheOperationsBlockingDuration metric and to add the
> > > > > > > > > > isCacheOperationsBlocked metric.
> > > > > > > > > >
> > > > > > > > > > WDYT?
> > > > > > > > > >
> > > > > > > > > > On Mon, 22 Jul 2019 at 09:27, Anton Vinogradov <a...@apache.org> wrote:
> > > > > > > > > > > Nikolay,
> > > > > > > > > > >
> > > > > > > > > > > Still see no reason to replace boolean with long.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > > > > > > > Anton.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Value exported based on SPI settings, not in the moment it changed.
> > > > > > > > > > > > 2. Clock synchronisation - if we export start time, we should also export
> > > > > > > > > > > > node local timestamp.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 22 Jul 2019, 8:33 Anton Vinogradov <a...@apache.org> wrote:
> > > > > > > > > > > > > Folks,
> > > > > > > > > > > > >
> > > > > > > > > > > > > What's the reason for duration counting?
> > > > > > > > > > > > > AFAIU, it's a monitoring system feature to count the durations.
> > > > > > > > > > > > > Since monitoring system checks metrics periodically it will know the
> > > > > > > > > > > > > duration by its own log.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko <jokse...@gmail.com> wrote:
> > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, I mean duration not timestamp. For the metric name, I suggest
> > > > > > > > > > > > > > "cacheOperationsBlockingDuration", I think it cleaner represents what is
> > > > > > > > > > > > > > blocked during PME.
> > > > > > > > > > > > > > We can also combine both timestamp "cacheOperationsBlockingStartTs" and
> > > > > > > > > > > > > > duration to have better correlation when cache operations were blocked and
> > > > > > > > > > > > > > how much time it's taken.
> > > > > > > > > > > > > > For instant view (like in JMX bean) a calculated value as you mentioned
> > > > > > > > > > > > > > can be used.
> > > > > > > > > > > > > > For metrics are exported to some backend (IEP-35) a counter can be used.
> > > > > > > > > > > > > > The counter is incremented by blocking time after blocking has ended.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 19:10, Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > Pavel,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The main purpose of this metric is
> > > > > > > > > > > > > > > > > how much time we wait for resuming cache operations
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Seems I misunderstood you. Do you mean timestamp or duration here?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > What do you think if we change the boolean value of metric to a long
> > > > > > > > > > > > > > > > > value that represents time in milliseconds when operations were blocked?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This time can be calculated as (currentTime -
> > > > > > > > > > > > > > > timeSinceOperationsBlocked) in case of timestamp.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Duration will be more understandable. It'll be something like
> > > > > > > > > > > > > > > getCurrentBlockingPmeDuration. But I haven't come up with a better
> > > > > > > > > > > > > > > name yet.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 18:30, Pavel Kovalenko <jokse...@gmail.com> wrote:
> > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think getCurrentPmeDuration doesn't show useful information. The main
> > > > > > > > > > > > > > > > PME side effect for end-users is blocking cache operations.
> > > > > > > > > > > > > > > > Not all PME time blocks it.
> > > > > > > > > > > > > > > > What information gives to an end-user timestamp of
> > > > > > > > > > > > > > > > "timeSinceOperationsBlocked"? For what analysis it can be used and how?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 17:48, Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > Hi Pavel,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This time already can be obtained from the getCurrentPmeDuration and
> > > > > > > > > > > > > > > > > new isOperationsBlockedByPme metrics.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > As an alternative solution, I can rework recently added
> > > > > > > > > > > > > > > > > getCurrentPmeDuration metric (not released yet). Seems for users it
> > > > > > > > > > > > > > > > > useless in case of non-blocking PME.
> > > > > > > > > > > > > > > > > Let's name it timeSinceOperationsBlocked. It'll be timestamp when
> > > > > > > > > > > > > > > > > blocking started (minimal value of cluster nodes) and 0 if blocking
> > > > > > > > > > > > > > > > > ends (there is no running PME).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > WDYT?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 15:56, Pavel Kovalenko <jokse...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > Hi Nikita,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thank you for working on this. What do you think if we change the boolean
> > > > > > > > > > > > > > > > > > value of metric to a long value that represents time in milliseconds when
> > > > > > > > > > > > > > > > > > operations were blocked?
> > > > > > > > > > > > > > > > > > Since we have not only JMX and now metrics are periodically exported to
> > > > > > > > > > > > > > > > > > some backend it can give a more clear picture of how much time we wait for
> > > > > > > > > > > > > > > > > > resuming cache operations instead of instant boolean indicator.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, 19 Jul 2019 at 14:41, Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > Anton, Nikolay,
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for the support.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > For now, we have the getCurrentPmeDuration() metric that does not show
> > > > > > > > > > > > > > > > > > > influence on the cluster correctly. PME can be without blocking
> > > > > > > > > > > > > > > > > > > operations. For example, client node join/leave events.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > I suggest add new metric - isOperationsBlockedByPme(). Together, these
> > > > > > > > > > > > > > > > > > > metrics will show influence of the PME on cluster and user operations.
> > > > > > > > > > > > > > > > > > > I have prepared PR for this (Bot visa is green). [1] Can anyone take a
> > > > > > > > > > > > > > > > > > > look?
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Tue, 16 Jul 2019 at 14:58, Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > I think administrator of Ignite cluster should be able to monitor all
> > > > > > > > > > > > > > > > > > > > Ignite process, including non blocking PME.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > BTW,
> > > > > > > > > > > > > > > > > > > > > Found PME metric - getCurrentPmeDuration().
> > > > > > > > > > > > > > > > > > > > > Seems, it shows exactly PME time and not so useful because of this.
> > > > > > > > > > > > > > > > > > > > > The goal is to show exactly blocking period.
> > > > > > > > > > > > > > > > > > > > > When PME cause no blocking, it's a good PME and I see no reason to have
> > > > > > > > > > > > > > > > > > > > > monitoring related to it :)
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > Anton.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > Why do we need to postpone implementation of this metrics?
> > > > > > > > > > > > > > > > > > > > > > For now, implementation of new metric is very simple.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > I think we can implement this metrics as a single contribution.
> > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> > > > > > > > > > > > > > > > > > > > > > > Nikita,
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > Looks like all we need now is a 1 simple metric: are operations blocked?
> > > > > > > > > > > > > > > > > > > > > > > Just a true or false.
> > > > > > > > > > > > > > > > > > > > > > > Let's start from this.
> > > > > > > > > > > > > > > > > > > > > > > All other metrics can be extracted from logs now and can be implemented
> > > > > > > > > > > > > > > > > > > > > > > later.
> > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > +1.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > Nikita, please, go ahead.
> > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > On Tue, 16 Jul 2019, 11:45 Nikita Amelchev <nsamelc...@gmail.com> wrote:
> > > > > > > > > > > > > > > > > > > > > > > > > Hello, Igniters.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > I suggest to add some useful metrics about the partition map exchange
> > > > > > > > > > > > > > > > > > > > > > > > > (PME). For now, the duration of PME stages available only in log files
> > > > > > > > > > > > > > > > > > > > > > > > > and cannot be obtained using JMX or other external tools. [1]
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > I made the list of local node metrics that help to understand the
> > > > > > > > > > > > > > > > > > > > > > > > > actual status of current PME:
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > 1. initialVersion. Topology version that initiates the exchange.
> > > > > > > > > > > > > > > > > > > > > > > > > 2. initTime. Time PME was started.
> > > > > > > > > > > > > > > > > > > > > > > > > 3. initEvent. Event that triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > > 4. partitionReleaseTime. Time when a node has finished waiting for all
> > > > > > > > > > > > > > > > > > > > > > > > > updates and translations on a previous topology.
> > > > > > > > > > > > > > > > > > > > > > > > > 5. sendSingleMessageTime. Time when a node sent a single message.
> > > > > > > > > > > > > > > > > > > > > > > > > 6. recieveFullMessageTime. Time when a node received a full message.
> > > > > > > > > > > > > > > > > > > > > > > > > 7. finishTime. Time PME was ended.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > When new PME started all these metrics resets.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > These metrics help to understand:
> > > > > > > > > > > > > > > > > > > > > > > > > - how long PME was (current or previous).
> > > > > > > > > > > > > > > > > > > > > > > > > - how long awaited for all updates was completed.
> > > > > > > > > > > > > > > > > > > > > > > > > - what node blocks PME (didn't send a single message)
> > > > > > > > > > > > > > > > > > > > > > > > > - what triggered PME.
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > Thoughts?
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-11961
> > > > > > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best wishes,
> > > > > > > > > > > > > > > Amelchev Nikita
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best wishes,
> > > > > > > > > > Amelchev Nikita
> > > > > >
> > > > > > --
> > > > > > Zhenya Stanilovsky
> > > > >
> > > > > --
> > > > > Best wishes,
> > > > > Amelchev Nikita
> > >
> > > --
> > > Best wishes,
> > > Amelchev Nikita
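
For illustration only, here is a rough sketch of how the two metrics proposed at the top of this message (a current PME duration that resets on finish, and a histogram that receives one value per finished PME) might be kept together. The class name `PmeDurationMetrics`, its methods and the bucket bounds are all made up for this sketch; this is not Ignite's metrics API and not the code of any PR mentioned in the thread.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical holder for the two PME metrics; names are illustrative, not Ignite's API. */
final class PmeDurationMetrics {
    /** Upper bounds of the histogram buckets in milliseconds (chosen by the caller). */
    private final long[] bounds;

    /** One counter per bucket plus a final overflow bucket. */
    private final AtomicLongArray buckets;

    /** Start time of the running PME, or 0 when no PME is in progress (assumes positive timestamps). */
    private final AtomicLong startMs = new AtomicLong();

    PmeDurationMetrics(long[] bounds) {
        this.bounds = bounds.clone();
        this.buckets = new AtomicLongArray(bounds.length + 1);
    }

    /** Called when a PME begins. */
    void onPmeStart(long nowMs) {
        startMs.set(nowMs);
    }

    /** Called when a PME ends: resets the current duration and adds the value to the histogram. */
    void onPmeFinish(long nowMs) {
        long start = startMs.getAndSet(0);

        if (start == 0)
            return; // No PME was running.

        long duration = nowMs - start;
        int idx = 0;

        // Find the first bucket whose upper bound is not exceeded.
        while (idx < bounds.length && duration > bounds[idx])
            idx++;

        buckets.incrementAndGet(idx);
    }

    /** Current PME duration in milliseconds, or 0 if no PME is running. */
    long currentPmeDurationMs(long nowMs) {
        long start = startMs.get();

        return start == 0 ? 0 : nowMs - start;
    }

    /** Snapshot of the histogram counters. */
    long[] histogram() {
        long[] res = new long[buckets.length()];

        for (int i = 0; i < res.length; i++)
            res[i] = buckets.get(i);

        return res;
    }
}
```

An alert on a long PME would poll `currentPmeDurationMs`, while trend and history analysis would read the `histogram()` snapshot exported after each finished PME.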