Igniters, thanks for the comments.

From the discussion it can be seen that we need only two metrics for now:
- cacheOperationsBlockedDuration (long)
- totalCacheOperationsBlockedDuration (long)

I will prepare a PR shortly.

Wed, Jul 24, 2019 at 09:11, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:
>
> +1 with Anton's decisions.
>
>
> >Wednesday, July 24, 2019, 8:44 +03:00 from Anton Vinogradov <a...@apache.org>:
> >
> >Folks,
> >
> >It looks like we're trying to implement "extended debug" instead of
> >"monitoring".
> >It should not be interesting for real admin what phase of PME is in
> >progress and so on.
> >Interesting metrics are
> >- total blocked time (will be used for real SLA counting)
> >- are we blocked right now (shows we have an SLA degradation right now)
> >Duration of the current blocking period can be easily presented using any
> >modern monitoring tool by regular checks.
> >Initial true will mean "period start", precision will be a result of
> >checks frequency.
> >Anyway, I'm ok to have current metric presented with long, where long is a
> >duration, see no reason, but ok :)
> >
> >All other features you mentioned are useful for code or
> >deployment improving and can (should) be taken from logs at the analysis
> >phase.
> >
> >On Tue, Jul 23, 2019 at 7:22 PM Ivan Rakov < ivan.glu...@gmail.com > wrote:
> >
> >> Folks, let me step in.
> >>
> >> Nikita, thanks for your suggestions!
> >>
> >> > 1. initialVersion. Topology version that initiates the exchange.
> >> > 2. initTime. Time PME was started.
> >> > 3. initEvent. Event that triggered PME.
> >> > 4. partitionReleaseTime. Time when a node has finished waiting for all
> >> > updates and translations on a previous topology.
> >> > 5. sendSingleMessageTime. Time when a node sent a single message.
> >> > 6. recieveFullMessageTime. Time when a node received a full message.
> >> > 7. finishTime. Time PME was ended.
> >> >
> >> > When new PME started all these metrics resets.
> >> Every metric from Nikita's list looks useful and simple to implement.
> >> I think that it would be better to change format of metrics 4, 5, 6 and
> >> 7 a bit: we can keep only difference between time of previous event and
> >> time of corresponding event. Such metrics would be easier to perceive:
> >> they answer specific questions "how much time did partition release
> >> take?" or "how much time did awaiting of distributed phase end take?".
> >> Also, if results of 4, 5, 6, 7 are exported to monitoring system,
> >> graphs will show how different stages times change from one PME to another.
> >>
> >> > When PME cause no blocking, it's a good PME and I see no reason to have
> >> > monitoring related to it
> >> Agree with Anton here. These metrics should be measured only for true
> >> distributed exchange. Saving results for client leave/join PMEs will
> >> just complicate monitoring.
> >>
> >> > I agree with total blocking duration metric but
> >> > I still don't understand why instant value indicating that operations are
> >> > blocked should be boolean.
> >> > Duration time since blocking has started looks more appropriate and useful.
> >> > It gives more information while semantic is left the same.
> >> Totally agree with Pavel here. Both "accumulated block time" and
> >> "current PME block time" metrics are useful. Growth of accumulated
> >> metric for specific period of time (should be easy to check via
> >> monitoring system graph) will show for how much business operations were
> >> blocked in total, and non-zero current metric will show that we are
> >> experiencing issues right now. Boolean metric "are we blocked right now"
> >> is not needed as it obviously can be inferred from "current PME block
> >> time".
> >>
> >> Best Regards,
> >> Ivan Rakov
> >>
> >> On 23.07.2019 16:02, Pavel Kovalenko wrote:
> >> > Nikita,
> >> >
> >> > I agree with total blocking duration metric but
> >> > I still don't understand why instant value indicating that operations are
> >> > blocked should be boolean.
> >> > Duration time since blocking has started looks more appropriate and useful.
> >> > It gives more information while semantic is left the same.
> >> >
> >> >
> >> >
> >> > Tue, Jul 23, 2019 at 11:42, Nikita Amelchev < nsamelc...@gmail.com >:
> >> >
> >> >> Folks,
> >> >>
> >> >> All previous suggestions have some disadvantages. There can be several
> >> >> exchanges between two metric updates and fast exchange can rewrite
> >> >> previous long exchange.
> >> >>
> >> >> We can introduce a metric of total blocking duration that will
> >> >> accumulate at the end of the exchange. So, users will get actual
> >> >> information about how long operations were blocked. Cluster metric
> >> >> will be a maximum of local nodes metrics. And we need a boolean metric
> >> >> that will indicate realtime status. It is needed because the duration
> >> >> metric updates at the end of the exchange.
> >> >>
> >> >> So I propose to change the current metric that is not released yet to the
> >> >> totalCacheOperationsBlockingDuration metric and to add the
> >> >> isCacheOperationsBlocked metric.
> >> >>
> >> >> WDYT?
> >> >>
> >> >> Mon, Jul 22, 2019 at 09:27, Anton Vinogradov < a...@apache.org >:
> >> >>> Nikolay,
> >> >>>
> >> >>> Still see no reason to replace boolean with long.
> >> >>>
> >> >>> On Mon, Jul 22, 2019 at 9:19 AM Nikolay Izhikov < nizhi...@apache.org > wrote:
> >> >>>> Anton.
> >> >>>>
> >> >>>> 1. Value exported based on SPI settings, not in the moment it changed.
> >> >>>>
> >> >>>> 2. Clock synchronisation - if we export start time, we should also export
> >> >>>> node local timestamp.
> >> >>>>
> >> >>>> Mon, Jul 22, 2019, 8:33 Anton Vinogradov < a...@apache.org >:
> >> >>>>
> >> >>>>> Folks,
> >> >>>>>
> >> >>>>> What's the reason for duration counting?
> >> >>>>> AFAIU, it's a monitoring system feature to count the durations.
> >> >>>>> Since monitoring system checks metrics periodically it will know the
> >> >>>>> duration by its own log.
> >> >>>>>
> >> >>>>> On Fri, Jul 19, 2019 at 7:32 PM Pavel Kovalenko < jokse...@gmail.com > wrote:
> >> >>>>>
> >> >>>>>> Nikita,
> >> >>>>>>
> >> >>>>>> Yes, I mean duration not timestamp. For the metric name, I suggest
> >> >>>>>> "cacheOperationsBlockingDuration", I think it more clearly represents what is
> >> >>>>>> blocked during PME.
> >> >>>>>> We can also combine both timestamp "cacheOperationsBlockingStartTs" and
> >> >>>>>> duration to have better correlation when cache operations were blocked and
> >> >>>>>> how much time it's taken.
> >> >>>>>> For instant view (like in JMX bean) a calculated value as you mentioned
> >> >>>>>> can be used.
> >> >>>>>> For metrics that are exported to some backend (IEP-35) a counter can be used.
> >> >>>>>> The counter is incremented by blocking time after blocking has ended.
> >> >>>>>> Fri, Jul 19, 2019 at 19:10, Nikita Amelchev < nsamelc...@gmail.com >:
> >> >>>>>>> Pavel,
> >> >>>>>>>
> >> >>>>>>> The main purpose of this metric is
> >> >>>>>>>>> how much time we wait for resuming cache operations
> >> >>>>>>> Seems I misunderstood you. Do you mean timestamp or duration here?
> >> >>>>>>>>> What do you think if we change the boolean value of metric to a long
> >> >>>>>>>>> value that represents time in milliseconds when operations were blocked?
> >> >>>>>>> This time can be calculated as (currentTime -
> >> >>>>>>> timeSinceOperationsBlocked) in case of timestamp.
> >> >>>>>>>
> >> >>>>>>> Duration will be more understandable. It'll be something like
> >> >>>>>>> getCurrentBlockingPmeDuration. But I haven't come up with a better
> >> >>>>>>> name yet.
> >> >>>>>>>
> >> >>>>>>> Fri, Jul 19, 2019 at 18:30, Pavel Kovalenko < jokse...@gmail.com >:
> >> >>>>>>>> Nikita,
> >> >>>>>>>>
> >> >>>>>>>> I think getCurrentPmeDuration doesn't show useful information.
> >> >>>>>>>> The main PME side effect for end-users is blocking cache operations.
> >> >>>>>>>> Not all PME time blocks it.
> >> >>>>>>>> What information gives to an end-user timestamp of
> >> >>>>>>>> "timeSinceOperationsBlocked"? For what analysis it can be used and how?
> >> >>>>>>>> Fri, Jul 19, 2019 at 17:48, Nikita Amelchev < nsamelc...@gmail.com >:
> >> >>>>>>>>> Hi Pavel,
> >> >>>>>>>>>
> >> >>>>>>>>> This time already can be obtained from the getCurrentPmeDuration and
> >> >>>>>>>>> new isOperationsBlockedByPme metrics.
> >> >>>>>>>>>
> >> >>>>>>>>> As an alternative solution, I can rework recently added
> >> >>>>>>>>> getCurrentPmeDuration metric (not released yet). Seems for users it
> >> >>>>>>>>> is useless in case of non-blocking PME.
> >> >>>>>>>>> Let's name it timeSinceOperationsBlocked. It'll be timestamp when
> >> >>>>>>>>> blocking started (minimal value of cluster nodes) and 0 if blocking
> >> >>>>>>>>> ends (there is no running PME).
> >> >>>>>>>>>
> >> >>>>>>>>> WDYT?
> >> >>>>>>>>>
> >> >>>>>>>>> Fri, Jul 19, 2019 at 15:56, Pavel Kovalenko < jokse...@gmail.com >:
> >> >>>>>>>>>> Hi Nikita,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thank you for working on this. What do you think if we change the boolean
> >> >>>>>>>>>> value of metric to a long value that represents time in milliseconds when
> >> >>>>>>>>>> operations were blocked?
> >> >>>>>>>>>> Since we have not only JMX and now metrics are periodically exported to
> >> >>>>>>>>>> some backend it can give a more clear picture of how much time we wait for
> >> >>>>>>>>>> resuming cache operations instead of instant boolean indicator.
> >> >>>>>>>>>> Fri, Jul 19, 2019 at 14:41, Nikita Amelchev < nsamelc...@gmail.com >:
> >> >>>>>>>>>>> Anton, Nikolay,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Thanks for the support.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> For now, we have the getCurrentPmeDuration() metric that does not show
> >> >>>>>>>>>>> influence on the cluster correctly. PME can be without blocking
> >> >>>>>>>>>>> operations. For example, client node join/leave events.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I suggest adding a new metric - isOperationsBlockedByPme(). Together, these
> >> >>>>>>>>>>> metrics will show influence of the PME on cluster and user operations.
> >> >>>>>>>>>>> I have prepared PR for this (Bot visa is green). [1] Can anyone take a
> >> >>>>>>>>>>> look?
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Tue, Jul 16, 2019 at 14:58, Nikolay Izhikov < nizhi...@apache.org >:
> >> >>>>>>>>>>>> I think administrator of Ignite cluster should be able to monitor all
> >> >>>>>>>>>>>> Ignite process, including non blocking PME.
> >> >>>>>>>>>>>> On Tue, 16/07/2019 at 14:57 +0300, Anton Vinogradov wrote:
> >> >>>>>>>>>>>>> BTW,
> >> >>>>>>>>>>>>> Found PME metric - getCurrentPmeDuration().
> >> >>>>>>>>>>>>> Seems, it shows exactly PME time and not so useful because of this.
> >> >>>>>>>>>>>>> The goal is to show exactly blocking period.
> >> >>>>>>>>>>>>> When PME cause no blocking, it's a good PME and I see no reason to have
> >> >>>>>>>>>>>>> monitoring related to it :)
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>> On Tue, Jul 16, 2019 at 2:50 PM Nikolay Izhikov < nizhi...@apache.org > wrote:
> >> >>>>>>>>>>>>>> Anton.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> Why do we need to postpone implementation of these metrics?
> >> >>>>>>>>>>>>>> For now, implementation of new metric is very simple.
> >> >>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>> I think we can implement these metrics as a single contribution.
> >> >>>>>>>>>>>>>> On Tue, 16/07/2019 at 13:47 +0300, Anton Vinogradov wrote:
> >> >>>>>>>>>>>>>>> Nikita,
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> Looks like all we need now is a 1 simple metric: are operations blocked?
> >> >>>>>>>>>>>>>>> Just a true or false.
> >> >>>>>>>>>>>>>>> Let's start from this.
> >> >>>>>>>>>>>>>>> All other metrics can be extracted from logs now and can be implemented
> >> >>>>>>>>>>>>>>> later.
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>> On Tue, Jul 16, 2019 at 12:46 PM Nikolay Izhikov < nizhi...@apache.org > wrote:
> >> >>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> +1.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Nikita, please, go ahead.
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>> Tue, Jul 16, 2019, 11:45 Nikita Amelchev < nsamelc...@gmail.com >:
> >> >>>>>>>>>>>>>>>>> Hello, Igniters.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> I suggest to add some useful metrics about the partition map exchange
> >> >>>>>>>>>>>>>>>>> (PME). For now, the duration of PME stages is available only in log files
> >> >>>>>>>>>>>>>>>>> and cannot be obtained using JMX or other external tools. [1]
> >> >>>>>>>>>>>>>>>>> I made the list of local node metrics that help to understand the
> >> >>>>>>>>>>>>>>>>> actual status of current PME:
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> 1. initialVersion. Topology version that initiates the exchange.
> >> >>>>>>>>>>>>>>>>> 2. initTime. Time PME was started.
> >> >>>>>>>>>>>>>>>>> 3. initEvent. Event that triggered PME.
> >> >>>>>>>>>>>>>>>>> 4. partitionReleaseTime. Time when a node has finished waiting for all
> >> >>>>>>>>>>>>>>>>> updates and translations on a previous topology.
> >> >>>>>>>>>>>>>>>>> 5. sendSingleMessageTime. Time when a node sent a single message.
> >> >>>>>>>>>>>>>>>>> 6. recieveFullMessageTime. Time when a node received a full message.
> >> >>>>>>>>>>>>>>>>> 7. finishTime. Time PME was ended.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> When new PME started all these metrics resets.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> These metrics help to understand:
> >> >>>>>>>>>>>>>>>>> - how long PME was (current or previous).
> >> >>>>>>>>>>>>>>>>> - how long awaited for all updates was completed.
> >> >>>>>>>>>>>>>>>>> - what node blocks PME (didn't send a single message)
> >> >>>>>>>>>>>>>>>>> - what triggered PME.
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> Thoughts?
> >> >>>>>>>>>>>>>>>>>
> >> >>>>>>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-11961
> >> >>>>>>>>>>>>>>>>> --
> >> >>>>>>>>>>>>>>>>> Best wishes,
> >> >>>>>>>>>>>>>>>>> Amelchev Nikita
>
> --
> Zhenya Stanilovsky

--
Best wishes,
Amelchev Nikita
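
For reference, the two metrics agreed on at the top of this thread could be sketched roughly as below. This is only an illustration of the semantics discussed here, not the code from the PR: the class name BlockedDurationTracker, the onBlockingStarted/onBlockingEnded hooks and the injected clock are all hypothetical. It assumes the current metric is 0 when operations are not blocked and that the total is increased when blocking ends, as proposed earlier in the thread.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch of the two metrics from this thread.
 * Not Ignite's actual implementation; names other than the
 * two metric names are made up for illustration.
 */
final class BlockedDurationTracker {
    /** Marker meaning "cache operations are not blocked right now". */
    private static final long NOT_BLOCKED = -1L;

    /** Time source in milliseconds (assumption: injected so it can be faked in tests). */
    private final LongSupplier clock;

    /** Timestamp when the current blocking period started, or NOT_BLOCKED. */
    private final AtomicLong blockStartMs = new AtomicLong(NOT_BLOCKED);

    /** Sum of all finished blocking periods, in milliseconds. */
    private final AtomicLong totalBlockedMs = new AtomicLong();

    BlockedDurationTracker(LongSupplier clock) {
        this.clock = clock;
    }

    /** Called when a PME starts blocking cache operations. */
    void onBlockingStarted() {
        // Keep the earliest start if blocking is already in progress.
        blockStartMs.compareAndSet(NOT_BLOCKED, clock.getAsLong());
    }

    /** Called when cache operations are resumed. */
    void onBlockingEnded() {
        long start = blockStartMs.getAndSet(NOT_BLOCKED);

        if (start != NOT_BLOCKED)
            totalBlockedMs.addAndGet(clock.getAsLong() - start);
    }

    /** Duration of the current blocking period, 0 if not blocked. */
    long cacheOperationsBlockedDuration() {
        long start = blockStartMs.get();

        return start == NOT_BLOCKED ? 0 : clock.getAsLong() - start;
    }

    /** Accumulated duration of all finished blocking periods. */
    long totalCacheOperationsBlockedDuration() {
        return totalBlockedMs.get();
    }
}
```

With this shape a non-zero cacheOperationsBlockedDuration means "we are blocked right now", so no separate boolean is needed, and growth of totalCacheOperationsBlockedDuration over a period gives the total blocked time for SLA counting. In production the clock would be something like System::currentTimeMillis.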