Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-25 Thread Alex Plehanov
Hi Nikolay,

Yes, sure. I've left some comments on GitHub.

Mon, 24 Jun 2019 at 19:15, Nikolay Izhikov:

> Hello, Alex.
>
> Based on our private discussion I've additionally migrated
> `totalExecutionTime` and `totalWaitingTime` counters.
> Can you review the PR [1]?
>
> [1] https://github.com/apache/ignite/pull/6622


Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Nikolay Izhikov
Hello, Alex.

Based on our private discussion I've additionally migrated `totalExecutionTime` 
and `totalWaitingTime` counters.
Can you review the PR [1]?

[1] https://github.com/apache/ignite/pull/6622
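
A minimal sketch of how an external collector might use such cumulative counters: the average execution time between two scrapes is the delta of `totalExecutionTime` divided by the delta of the finished-jobs count. The class name, the millisecond unit, and the assumption that the counter grows when a job finishes are illustrative only and are not taken from the PR.

```java
/** One scrape of two cumulative counters (hypothetical holder, not an Ignite class). */
final class JobCounterSample {
    /** Cumulative execution time; the millisecond unit is an assumption. */
    final long totalExecutionTimeMs;

    /** Cumulative number of finished jobs. */
    final long finishedJobs;

    JobCounterSample(long totalExecutionTimeMs, long finishedJobs) {
        this.totalExecutionTimeMs = totalExecutionTimeMs;
        this.finishedJobs = finishedJobs;
    }

    /** Average execution time of jobs finished between two scrapes, or 0 if none finished. */
    static double avgExecutionTimeMs(JobCounterSample prev, JobCounterSample cur) {
        long jobs = cur.finishedJobs - prev.finishedJobs;

        if (jobs <= 0)
            return 0.0;

        return (double)(cur.totalExecutionTimeMs - prev.totalExecutionTimeMs) / jobs;
    }
}
```

Only two scalar reads per scrape are needed, so no per-job list has to be iterated to get an average.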





Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Nikolay Izhikov

Hello, Ivan.

> Ignite is a cluster which almost always
> assumes external monitoring for production use.

+1.

> 1. Are we going to preserve compatibility with the metrics present
> before? Or are we going to keep only those that make sense today?

1. Backward compatibility is preserved.
2. Deprecated metrics (and metric APIs) will be removed in Ignite 3.
3. We should decide which numbers make sense and which don't.

> 2. Can we configure which supported metrics are calculated/exposed? Or
> do we calculate/expose everything every time?

1. You can configure a filter for the exposed metrics. Only the required
subset of metrics will be exported.
2. For now, all metrics (not lists!) will be calculated. Please note that
every metric is a simple long (double) counter.
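
As a rough illustration of the filtering idea, here is a sketch in which every metric is calculated but only the names accepted by a predicate are exported. `MetricExportFilter`, its `export` method, and the metric names used with it are hypothetical and do not represent the actual Ignite exporter API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical export filter: all metrics are calculated, only a subset is exposed. */
final class MetricExportFilter {
    /** Returns the metrics whose names pass the filter, preserving iteration order. */
    static Map<String, Long> export(Map<String, Long> all, Predicate<String> filter) {
        Map<String, Long> res = new LinkedHashMap<>();

        for (Map.Entry<String, Long> e : all.entrySet()) {
            if (filter.test(e.getKey()))
                res.put(e.getKey(), e.getValue());
        }

        return res;
    }
}
```

For example, a filter such as `name -> name.startsWith("jobs.")` would export only the job counters, assuming metric names carry such a prefix.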





Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Nikolay Izhikov
Hello, Alex.

Thanks for the answer.

1. Actually, I don't understand your proposal :)
Can you write it down?
What numbers should be additionally migrated in this PR?
Or is it OK for now?

> I think "idle time" is a useful metric

I think the "usefulness" or "uselessness" of a specific metric depends on the
questions we can answer with it.
What questions about an Ignite instance can we answer with the "idle time"
metric?

> About execution and waiting time, it's not the right way to calculate it
> using a jobs list.

Same question here.

What questions can we answer with the current numbers?

> Will jobs list contain only active jobs?

All jobs that are scheduled for execution on the node (active + waiting)
should be in the list.
I'll try to give more details here to show how I think about metrics and
lists:

If you have some issues with the jobs on a node, they can be of 2 kinds:
1. You are waiting for the results of some job and want to know why it
doesn't execute.

In this case, you should query the "jobs list" from Ignite.
You can get answers to:
* What jobs are currently executing?
* How long has your job been waiting to be executed?

You can also check graphs of the "activeJobs" and "waitingJobs" metrics
to see how the jobs queue changes over time.
It seems you can predict the start of your job from these numbers.

2. You want to understand the lifecycle of some finished (failed) job.

In this case, you should analyze the log of the node.
It should contain the time when:
* the node received the job information
* the job was added to the queue
* the job started execution
* the job finished (failed) execution.
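
To make the first case concrete, here is a sketch of answering "how long has my job been waiting?" from a jobs list. The `JobRow` shape (id, state, queue timestamp) is an assumption for illustration; the real list may expose different fields.

```java
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical view over a "jobs list"; field names are assumptions. */
final class JobsListQuery {
    enum State { WAITING, ACTIVE }

    /** One row of the list: a job scheduled for execution on the node. */
    static final class JobRow {
        final String id;
        final State state;
        final long queuedAtMs;

        JobRow(String id, State state, long queuedAtMs) {
            this.id = id;
            this.state = state;
            this.queuedAtMs = queuedAtMs;
        }
    }

    /** Waiting time of the given job so far, or empty if it is not waiting (or not listed). */
    static OptionalLong waitingTimeMs(List<JobRow> jobs, String jobId, long nowMs) {
        for (JobRow j : jobs) {
            if (j.id.equals(jobId) && j.state == State.WAITING)
                return OptionalLong.of(nowMs - j.queuedAtMs);
        }

        return OptionalLong.empty();
    }
}
```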

I don't see questions we can't answer from these sources.
Do we have any?
How can numbers from the current GridJobMetrics help with these questions?


> But, what if a user doesn't use any
> external monitoring system and wants to know the health of Ignite instance?

It depends on how we define "health".
And it's not a trivial question :)

> Do we have any plans to implement some simple aggregator and ship it with 
> Ignite?

I think NO.
We shouldn't do it.

> Do we have plans to provide some presets for Ignite monitoring for
> popular monitoring systems?

I think we shouldn't do it,
because monitoring presets heavily depend on the usage scenario,
and usage scenarios can vary heavily for Ignite.






Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Павлухин Иван
Hi Nikolay, Alex,

A couple of my humble comments:
> Aggregation should be done with the metric collect system(Prometheus, 
> Graphite, etc.).
I like that statement very much!

> But, what if a user doesn't use any external monitoring system and wants to 
> know the health of Ignite instance?
I think that we can add more capabilities if real user demand
appears in the future. Generally, Ignite is a cluster which almost always
assumes external monitoring for production use.

And a couple of general questions regarding monitoring. If they are
answered in the IEP, you can simply redirect me there.
1. Are we going to preserve compatibility with the metrics present
before? Or are we going to keep only those that make sense today?
2. Can we configure which supported metrics are calculated/exposed? Or
do we calculate/expose everything every time?




-- 
Best regards,
Ivan Pavlukhin


Re: [IEP-35] GridJobProcessorMetrics migration

2019-06-24 Thread Alex Plehanov
Hi Nikolay,

I think "idle time" is a useful metric, but it can be calculated outside of
Ignite using an external monitoring system.

About execution and waiting time: using a jobs list is not the right way to
calculate them. Will the jobs list contain only active jobs? In this case,
you can't calculate these metrics at all, since you don't know the times of
finished jobs. If the list contains all jobs (will it be unbounded?),
iterating over it will be resource-consuming. In any case, it's much
simpler (and sometimes only possible) for an external monitoring system to
just get a scalar metric than to iterate over a list with some condition.

About aggregation: yes, in an ideal world aggregation should be done by
the external monitoring system. But what if a user doesn't use any
external monitoring system and wants to know the health of an Ignite instance?
Do we have any plans to implement a simple aggregator and ship it with
Ignite? Do we have plans to provide presets for Ignite monitoring for
popular monitoring systems? (These questions are not related to this PR, but
to the IEP as a whole.)

Also, some aggregate metrics ("max", for example) can't be calculated
effectively by an external system (you would have to iterate over the jobs
list again, and the precision of such a calculation would still be no better
than the time between probes).
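
The point about "max" can be sketched as follows: a maximum maintained inside the node sees every finished job, while a poller sees only the values present at probe time. `MaxExecutionTimeSketch` and its methods are hypothetical names used for illustration, not existing Ignite code.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-process tracker of the maximum job execution time. */
final class MaxExecutionTimeSketch {
    private final AtomicLong maxMs = new AtomicLong();

    /** Called once per finished job; keeps the largest value seen so far. */
    void onJobFinished(long executionTimeMs) {
        maxMs.accumulateAndGet(executionTimeMs, Math::max);
    }

    long max() {
        return maxMs.get();
    }

    /** Maximum over the values an external poller happened to observe. */
    static long maxOfObserved(long... observedMs) {
        long max = 0;

        for (long v : observedMs)
            max = Math.max(max, v);

        return max;
    }
}
```

If jobs took 120, 950 and 40 ms, and the poller only observed the first and the last, `maxOfObserved(120, 40)` would miss the 950 ms spike that the in-process tracker records.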


[IEP-35] GridJobProcessorMetrics migration

2019-06-20 Thread Nikolay Izhikov
Hello, Igniters.

Especially, Ignite veterans.

I've prepared PR [1] for the ticket IGNITE-11926 [2].

I found that we don't have any tests for the current GridJobMetrics 
implementation.
So I added basic tests for the current implementation in the PR.

Guys, do we have real-world usages of numbers from these metrics?

Back to my PR: I think we should migrate only a few of the existing
GridJobProcessor metrics.
Here is why:

1. We shouldn't migrate aggregate metrics - max*, avg*
Aggregation should be done by the metrics collection system (Prometheus,
Graphite, etc.).

2. We shouldn't migrate `cpuLoadAvg`
Metrics for CPU should be available from separate sources (OS sensors or
similar).

3. We shouldn't migrate `curidleTime`, `totalIdleTime`.
Idle metrics don't make sense to me.

They can be obtained from a regularly scraped `activeJobs` value.
It seems they can't be used in the real world. Imagine a 32-CPU server with
only one active job:
idle time will be 0 in this scenario.

4. Execution (waiting) time should be available per job in the job list.
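
For point 3, a rough sketch of how idle time could be estimated from scraped `activeJobs` values. The class name and the rule "an interval counts as idle only if both neighbouring scrapes saw zero active jobs" are assumptions; the result can never be more precise than the scrape interval.

```java
/** Hypothetical estimator of idle time from periodically scraped activeJobs values. */
final class IdleTimeEstimator {
    /**
     * @param timestampsMs Scrape times in milliseconds, in ascending order.
     * @param activeJobs Value of activeJobs observed at each scrape.
     * @return Total length of intervals whose both endpoints saw no active jobs.
     */
    static long estimateIdleMs(long[] timestampsMs, long[] activeJobs) {
        if (timestampsMs.length != activeJobs.length)
            throw new IllegalArgumentException("arrays must have equal length");

        long idle = 0;

        for (int i = 1; i < timestampsMs.length; i++) {
            // A job that starts and finishes between two scrapes is invisible here.
            if (activeJobs[i - 1] == 0 && activeJobs[i] == 0)
                idle += timestampsMs[i] - timestampsMs[i - 1];
        }

        return idle;
    }
}
```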

So my PR contains counters for the following numbers.
All the code that belongs to the GridJobProcessor becomes deprecated.

Can someone do the review?

```
/** Number of started jobs. */
final LongMetricImpl startedJobsMetric;

/** Number of active jobs currently executing. */
final LongMetricImpl activeJobsMetric;

/** Number of currently queued jobs waiting to be executed. */
final LongMetricImpl waitingJobsMetric;

/** Number of cancelled jobs that are still running. */
final LongMetricImpl canceledJobsMetric;

/** Number of jobs rejected after more recent collision resolution operation. */
final LongMetricImpl rejectedJobsMetric;

/** Number of finished jobs. */
final LongMetricImpl finishedJobsMetric;
```
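
To illustrate how counters like these might move through a job's lifecycle, here is a sketch that uses `java.util.concurrent.atomic.LongAdder` as a stand-in for `LongMetricImpl`. The class and the `onQueued`/`onStarted`/`onFinished` hooks are hypothetical, only four of the six counters are modeled, and the sketch assumes every job is queued before it starts; it is not the code from the PR.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical lifecycle bookkeeping; LongAdder stands in for LongMetricImpl. */
final class JobCountersSketch {
    final LongAdder started = new LongAdder();
    final LongAdder active = new LongAdder();
    final LongAdder waiting = new LongAdder();
    final LongAdder finished = new LongAdder();

    /** Job was put into the queue. */
    void onQueued() {
        waiting.increment();
    }

    /** Job left the queue and began executing. */
    void onStarted() {
        waiting.decrement();
        started.increment();
        active.increment();
    }

    /** Job completed execution. */
    void onFinished() {
        active.decrement();
        finished.increment();
    }
}
```

Here `started` and `finished` only grow, while `active` and `waiting` go up and down, so an external system would likely treat the former as counters and the latter as gauges.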




[1] https://github.com/apache/ignite/pull/6622
[2] https://issues.apache.org/jira/browse/IGNITE-11926

