Re: [IEP-35] GridJobProcessorMetrics migration
Hi, Nikolay

Yes, sure. I've left some comments on GitHub.

On Mon, 24 Jun 2019 at 19:15, Nikolay Izhikov wrote:
> Hello, Alex.
>
> Based on our private discussion I've additionally migrated
> `totalExecutionTime` and `totalWaitingTime` counters.
> Can you review the PR [1]?
>
> [1] https://github.com/apache/ignite/pull/6622
Re: [IEP-35] GridJobProcessorMetrics migration
Hello, Alex.

Based on our private discussion I've additionally migrated `totalExecutionTime` and `totalWaitingTime` counters.
Can you review the PR [1]?

[1] https://github.com/apache/ignite/pull/6622

On Mon, 24/06/2019 at 15:14 +0300, Nikolay Izhikov wrote:
> 1. I, actually, don't understand your proposal :)
> Can you write it down?
> What numbers should be additionally migrated in this PR?
> Or it's OK for now?
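As an illustration of how the two migrated counters could be consumed outside Ignite, here is a minimal sketch that derives an average execution time from two consecutive scrapes. It assumes that `totalExecutionTime` accumulates the execution time of finished jobs in milliseconds and that it is scraped together with a finished-jobs counter. The class and method names are hypothetical and not part of Ignite.

```java
/** Hypothetical helper for an external monitoring system; not part of Ignite. */
final class ScrapeMath {
    /**
     * Average execution time of the jobs that finished between two scrapes.
     * Assumes both counters only grow and were read at the same two moments.
     */
    static double avgExecutionMillis(long totalExecPrev, long finishedPrev,
                                     long totalExecCur, long finishedCur) {
        long finishedJobs = finishedCur - finishedPrev;

        // No job finished between the scrapes, so the average is undefined.
        if (finishedJobs <= 0)
            return Double.NaN;

        return (double) (totalExecCur - totalExecPrev) / finishedJobs;
    }
}
```

With two plain counters the aggregation stays in the monitoring system, and the node only has to increment long values.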
Re: [IEP-35] GridJobProcessorMetrics migration
Hello, Ivan.

On Mon, 24/06/2019 at 14:43 +0300, Ivan Pavlukhin wrote:
> Ignite is a cluster which almost every
> time assumes an external monitoring for a production use.

+1.

> 1. Are we going to preserve a compatibility with metrics present
> before? Or are we going to keep only those making sense today?

1. Backward compatibility is preserved.
2. Deprecated metrics (and metric APIs) will be removed in Ignite 3.
3. We should decide which numbers make sense and which don't.

> 2. Can we configure which supported metrics are calculated/exposed? Or
> do we calculate/expose everything every time?

1. You can configure a filter for the exposed metrics. Only the required subset of the metrics will be exported.
2. For now, all metrics (not lists!) will be calculated.

Please note that every metric is a simple long (double) counter.
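The export filter mentioned above can be pictured as a predicate over metric names. The sketch below is not Ignite's API: the class `MetricExportFilter`, the method `export`, and the metric names used with it are hypothetical and only show the idea that every metric is still calculated while only the selected subset is exported.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical illustration of a name-based export filter; not part of Ignite. */
final class MetricExportFilter {
    /** Returns only the metrics whose names pass the filter, keeping the original order. */
    static Map<String, Long> export(Map<String, Long> allMetrics, Predicate<String> nameFilter) {
        Map<String, Long> exported = new LinkedHashMap<>();

        for (Map.Entry<String, Long> e : allMetrics.entrySet()) {
            if (nameFilter.test(e.getKey()))
                exported.put(e.getKey(), e.getValue());
        }

        return exported;
    }
}
```

For example, a filter such as `name -> name.startsWith("compute.jobs.")` would export the job counters and skip everything else.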
Re: [IEP-35] GridJobProcessorMetrics migration
Hello, Alex.

Thanks for the answer.

1. I actually don't understand your proposal :)
Can you write it down?
What numbers should be additionally migrated in this PR?
Or is it OK for now?

On Mon, 24/06/2019 at 12:46 +0300, Alex Plehanov wrote:
> I think "idle time" is a useful metric

I think the "usefulness" or "uselessness" of a specific metric depends on the questions we can answer with it.
What questions about an Ignite instance can we answer with the "idle time" metric?

> About execution and waiting time, it's not the right way to calculate it
> using a jobs list.

Same question here.
What questions can we answer with the current numbers?

> Will jobs list contain only active jobs?

All jobs that are scheduled for execution on the node (active + waiting) should be in the list.
I'll try to put more details here to expose my way of thinking about metrics and lists:

If you have some issues with the jobs on the node, they can be of 2 kinds:

1. You are waiting for the results of some job and want to know why it doesn't execute.

   In this case, you should query the "jobs list" from Ignite.
   You can get answers to:
   * What jobs are currently executing?
   * How long has your job been waiting to be executed?

   You can also check the graphs of the "activeJobs" and "waitingJobs" metrics to see how the jobs queue changes over time.
   It seems you can predict the start of your job from these numbers.

2. You want to understand the lifecycle of some finished (failed) job.

   In this case, you should analyze the log of the node.
   It should contain the times when:
   * the node received the job information
   * the job was added to the queue
   * the job started execution
   * the job finished (failed) execution.

I don't see questions we can't answer from these sources.
Do we have such?
How can the numbers from the current GridJobMetrics help with these questions?

> But, what if a user doesn't use any
> external monitoring system and wants to know the health of Ignite instance?

It depends on how we define "health".
And it's not a trivial question :)

> Do we have any plans to implement some simple aggregator and ship it with
> Ignite?

I think NO.
We shouldn't do it.

> Do we have plans to provide some presets for Ignite monitoring for
> popular monitoring systems?

I think we shouldn't do it,
because monitoring presets heavily depend on the usage scenario,
and usage scenarios can vary heavily for Ignite.
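The first kind of issue above (how long a job has been waiting) could be answered from a jobs list roughly as follows. This is a sketch under assumptions: the `JobView` entries, their fields, and the `waitingMillis` helper are hypothetical and do not describe the actual list exposed by Ignite.

```java
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical query over a jobs list; not part of Ignite. */
final class JobListQuery {
    enum State { ACTIVE, WAITING }

    /** One entry of the jobs list (active + waiting jobs on the node). */
    static final class JobView {
        final String id;
        final State state;
        final long queuedAtMillis;

        JobView(String id, State state, long queuedAtMillis) {
            this.id = id;
            this.state = state;
            this.queuedAtMillis = queuedAtMillis;
        }
    }

    /** Time the given job has spent in the queue so far, or empty if it is not waiting. */
    static OptionalLong waitingMillis(List<JobView> jobs, String jobId, long nowMillis) {
        for (JobView job : jobs) {
            if (job.id.equals(jobId) && job.state == State.WAITING)
                return OptionalLong.of(nowMillis - job.queuedAtMillis);
        }

        return OptionalLong.empty();
    }
}
```

The lookup only covers jobs still present in the list. The lifecycle of finished (failed) jobs would come from the node log, as described in the second case.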
Re: [IEP-35] GridJobProcessorMetrics migration
Hi Nikolay, Alex,

A couple of my humble comments.

> Aggregation should be done with the metric collect system(Prometheus,
> Graphite, etc.).

I like that statement very much!

> But, what if a user doesn't use any external monitoring system and wants to
> know the health of Ignite instance?

I think that we can add more capabilities if real user demand appears in the future. Generally, Ignite is a cluster which almost always assumes external monitoring for production use.

And a couple of general questions regarding monitoring. If they are answered in the IEP you can simply redirect me there.
1. Are we going to preserve compatibility with the metrics present before? Or are we going to keep only those making sense today?
2. Can we configure which supported metrics are calculated/exposed? Or do we calculate/expose everything every time?

--
Best regards,
Ivan Pavlukhin
Re: [IEP-35] GridJobProcessorMetrics migration
Hi Nikolay,

I think "idle time" is a useful metric, but it can be calculated outside of Ignite using an external monitoring system.

About execution and waiting time: calculating them from a jobs list is not the right way. Will the jobs list contain only active jobs? In this case, you can't calculate these metrics at all, since you don't know the times of finished jobs. If the list contains all jobs (will it be unlimited?), iterating over it will be resource consuming. In any case, it's much simpler (and sometimes the only possible way) for an external monitoring system to just get a scalar metric than to iterate over a list with some condition.

About aggregation: yes, in an ideal world aggregation should be done by the external monitoring system. But what if a user doesn't use any external monitoring system and wants to know the health of an Ignite instance? Do we have any plans to implement some simple aggregator and ship it with Ignite? Do we have plans to provide some presets for Ignite monitoring for popular monitoring systems? (These questions are not related to this PR, but to the IEP as a whole.)

Also, some aggregation metrics ("max", for example) can't be calculated effectively by an external system: you would have to iterate over the jobs list again, and the precision of such a calculation would still be no better than the time between probes.
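To illustrate the last point: inside the node, an exact maximum can be maintained with a single atomic update per finished job, while an external system only sees whatever is visible at probe time. A minimal sketch, assuming a hypothetical `JobTimeStats` holder that is not part of Ignite:

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical in-process holder for scalar job time metrics; not part of Ignite. */
final class JobTimeStats {
    /** Sum of execution times of all finished jobs. */
    private final LongAdder totalExecMillis = new LongAdder();

    /** Largest execution time seen so far, updated atomically. */
    private final LongAccumulator maxExecMillis = new LongAccumulator(Long::max, 0L);

    /** Called once per finished job with its measured execution time. */
    void onJobFinished(long execMillis) {
        totalExecMillis.add(execMillis);
        maxExecMillis.accumulate(execMillis);
    }

    long totalExecutionTime() {
        return totalExecMillis.sum();
    }

    long maxExecutionTime() {
        return maxExecMillis.get();
    }
}
```

Both values are plain scalars, so a monitoring system can read them without iterating over any list, and the maximum does not depend on the probe interval.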
[IEP-35] GridJobProcessorMetrics migration
Hello, Igniters. Especially, Ignite veterans.

I've prepared PR [1] for the ticket IGNITE-11926 [2].

I found that we don't have any tests for the current GridJobMetrics implementation.
So I added basic tests for the current implementation in the PR.

Guys, do we have real-world usages of numbers from these metrics?

Back to my PR:

I think we should migrate only a few of the existing GridJobProcessor metrics.
Here is why:

1. We shouldn't migrate aggregate metrics: max*, avg*.
   Aggregation should be done with the metric collect system (Prometheus, Graphite, etc.).

2. We shouldn't migrate `cpuLoadAvg`.
   Metrics for CPU should be available from separate sources (OS sensors or similar).

3. We shouldn't migrate `curidleTime`, `totalIdleTime`.
   Idle metrics don't make sense to me.
   They can be obtained from a regularly scraped `activeJobs` value.
   It seems they can't be used in the real world.
   Imagine a 32 CPU server with only one active job. Idle time will be 0 for this scenario.

4. Execution (waiting) time should be available per job in the job list.

So my PR contains counters for the following numbers.
All the code that belongs to the GridJobProcessor becomes deprecated.

Can someone do the review?

```
/** Number of started jobs. */
final LongMetricImpl startedJobsMetric;

/** Number of active jobs currently executing. */
final LongMetricImpl activeJobsMetric;

/** Number of currently queued jobs waiting to be executed. */
final LongMetricImpl waitingJobsMetric;

/** Number of cancelled jobs that are still running. */
final LongMetricImpl canceledJobsMetric;

/** Number of jobs rejected after more recent collision resolution operation. */
final LongMetricImpl rejectedJobsMetric;

/** Number of finished jobs. */
final LongMetricImpl finishedJobsMetric;
```

[1] https://github.com/apache/ignite/pull/6622
[2] https://issues.apache.org/jira/browse/IGNITE-11926
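Point 3 above says idle time can be obtained from a regularly scraped `activeJobs` value. A minimal sketch of such an external estimate follows. It assumes the monitoring system stores timestamped samples of the metric; the `IdleTimeEstimator` class and its `Sample` type are hypothetical and not part of Ignite.

```java
import java.util.List;

/** Hypothetical external estimator of idle time; not part of Ignite. */
final class IdleTimeEstimator {
    /** One scrape of the activeJobs metric. */
    static final class Sample {
        final long timestampMillis;
        final long activeJobs;

        Sample(long timestampMillis, long activeJobs) {
            this.timestampMillis = timestampMillis;
            this.activeJobs = activeJobs;
        }
    }

    /** Sums the scrape intervals whose both endpoints reported zero active jobs. */
    static long estimateIdleMillis(List<Sample> samples) {
        long idleMillis = 0;

        for (int i = 1; i < samples.size(); i++) {
            Sample prev = samples.get(i - 1);
            Sample cur = samples.get(i);

            if (prev.activeJobs == 0 && cur.activeJobs == 0)
                idleMillis += cur.timestampMillis - prev.timestampMillis;
        }

        return idleMillis;
    }
}
```

The result is only an estimate: a job that starts and finishes between two scrapes is invisible here, so the precision is bounded by the scrape interval.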