Re: [ovirt-devel] [vdsm] exploring a possible integration between collectd and Vdsm

2016-10-12 Thread Francesco Romani
- Original Message -
> From: "Francesco Romani" <from...@redhat.com>
> To: "devel" <devel@ovirt.org>
> Sent: Wednesday, October 12, 2016 9:29:37 AM
> Subject: Re: [ovirt-devel] [vdsm] exploring a possible integration between 
> collectd and Vdsm
> 
> - Original Message -
> > From: "Yaniv Kaul" <yk...@redhat.com>
> > To: "Francesco Romani" <from...@redhat.com>
> > Cc: "devel" <devel@ovirt.org>
> > Sent: Tuesday, October 11, 2016 10:31:14 PM
> > Subject: Re: [ovirt-devel] [vdsm] exploring a possible integration between
> > collectd and Vdsm
> > 
> > On Tue, Oct 11, 2016 at 2:05 PM, Francesco Romani <from...@redhat.com>
> > wrote:
> > 
> > > Hi all,
> > >
> > > In the last 2.5 days I was exploring if and how we can integrate collectd
> > > and Vdsm.
> [...]
> > This generally sounds like a good idea - and I hope it is coordinated with
> > our efforts for monitoring (see [1], [2]).
> 
> Sure it will. 

Sure it will be coordinated with the monitoring efforts*.

-- 
Francesco Romani
Red Hat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani
___
Devel mailing list
Devel@ovirt.org
http://lists.ovirt.org/mailman/listinfo/devel


Re: [ovirt-devel] [vdsm] exploring a possible integration between collectd and Vdsm

2016-10-12 Thread Francesco Romani
- Original Message -
> From: "Yaniv Kaul" <yk...@redhat.com>
> To: "Francesco Romani" <from...@redhat.com>
> Cc: "devel" <devel@ovirt.org>
> Sent: Tuesday, October 11, 2016 10:31:14 PM
> Subject: Re: [ovirt-devel] [vdsm] exploring a possible integration between 
> collectd and Vdsm
> 
> On Tue, Oct 11, 2016 at 2:05 PM, Francesco Romani <from...@redhat.com>
> wrote:
> 
> > Hi all,
> >
> > In the last 2.5 days I was exploring if and how we can integrate collectd
> > and Vdsm.
[...]
> This generally sounds like a good idea - and I hope it is coordinated with
> our efforts for monitoring (see [1], [2]).

Sure it will. I played a couple of days with collectd to "just" see
a. how hard it is to write a collectd plugin, and/or if it is feasible to
   ship it out-of-tree for the initial few releases, until it stabilizes
   so it can be submitted upstream
b. if we can get events/notifications from collectd
c. if we can integrate those notifications with Vdsm

And it turns out we *can* do all of the above, with various degrees of difficulty.

> Few notes:
> - I think the most compelling reason to move to collectd is actually to
> benefit from the plugins that it already has, which will cover
> a lot of the missing monitoring requirements and wishes we have (example:
> local disk usage on the host), as well as integrate it into
> Engine monitoring (example: postgresql performance monitoring).

Agreed

> - You can't remove monitoring from VDSM - as a new VDSM may work against
> older Engine setups. You can gradually remove them.

Yes, for example we can make Vdsm poll collectd and act as a facade for old
Engines, while new ones should skip this step and ask collectd or the metrics
aggregator service you mention below.
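
A minimal sketch of the facade idea, under loud assumptions: `StatsFacade`,
`fetch` and the `cpuUser` key are hypothetical names, not Vdsm or collectd
API. `fetch` stands for whatever call would read the current values from
collectd; the facade only caches the last sample so old Engines can keep
polling Vdsm as they do today.

```python
import time


class StatsFacade(object):
    """Serve collectd-provided stats to old Engines (illustrative only)."""

    def __init__(self, fetch, max_age=2.0, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable returning a dict
        self._max_age = max_age    # seconds a cached sample stays valid
        self._clock = clock
        self._sample = None
        self._stamp = None

    def get_stats(self):
        """Return the cached sample, refreshing it when it is too old."""
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._max_age:
            self._sample = self._fetch()
            self._stamp = now
        return dict(self._sample)
```

New Engines would bypass this class entirely and read from collectd or the
metrics store.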

> I'd actually begin with cleanup - there are some 'metrics' that are simply
> not needed and should not be reported in the first place and
> are there for historic reasons only. Remove them - from Engine first, from
> the DB and all, then later we can either send fake values or remove
> from VDSM.

Yes, this is the first place where we need to coordinate with the metrics
effort.

> - If you are moving to collectd, as you can see from the metrics effort,
> we'd really want to send it elsewhere - and Engine should consume it from
> there.
> Metrics storages usually have a very nice REST UI with the ability to bring
> series with average, with different criteria (such as per hour, per minute
> or what not stats), etc.

Fully agreed

> - I agree with Nir about separating between our core business and the
> monitoring we do for extra. Keep in mind that some of the stats are for SLA
> and critical scheduling decisions as well.

Yes, of course adding a dependency for core monitoring is risky.
So far the bottom line is that relying on collectd for this is just one more
option on the table now.

[mostly brainstorming from now on]

However, I'd like to highlight that it is not just risky: it is a different
tradeoff. Doing the core monitoring in Vdsm (so in Python, essentially in a
single threaded server) is not a free lunch, because it has quite a high price
in terms of performance.

If the main Vdsm process is overloaded, then the polling cycle can get longer,
and the overall response time of processing system events (e.g. disk detected
full) can get longer as well.
We've observed high response times from heavily loaded Vdsm in the
not-so-distant past.

I think the idea of having different instances for different monitoring purposes
(credit to Nir) is the best shot at the moment.
We could maybe have one standard system collectd for regular monitoring,
and perhaps one special purpose, very limited collectd instance for critical
information.
On top of that, Vdsm could double-check and keep doing the core monitoring
itself, albeit at a lower rate (e.g. every 10s instead of every 2s; every 60s
instead of every 15s).

Leveraging libvirt events is *the* right thing, no doubt about that, but it
would be very nice to have a dependable external service which can generate
the events we need based on libvirt data, and move the notification logic on
top of it.

Something like (final picture, excluding intermediate compatibility layers)

[data source]
-+---
 |
 `-> [monitoring/metrics collection]
  -+---
   |
   +--> [metrics store] -{data}-> [Engine]
   |
   `--> [notification service] -{events}-> [Vdsm]


Not all the "boxes" need to be separate processes; for example, collectd has
some support for thresholds and notifications which is ~80% of what Vdsm needs
(again not considering reliability, just feature-wise).
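
To make the picture above a bit more concrete, here is a rough sketch of
the two paths in one process. All the names (`Collector`,
`ThresholdNotifier`, the metric names) are made up for illustration; this is
not how collectd is implemented, it only mirrors the boxes in the diagram.

```python
class ThresholdNotifier(object):
    """The [notification service] box: constant thresholds only."""

    def __init__(self, limits, on_event):
        self._limits = limits      # metric name -> constant upper limit
        self._on_event = on_event  # callback standing in for Vdsm

    def check(self, name, value):
        limit = self._limits.get(name)
        if limit is not None and value > limit:
            self._on_event(name, value, limit)


class Collector(object):
    """The [monitoring/metrics collection] box."""

    def __init__(self, store, notifier):
        self._store = store        # list standing in for the metrics store
        self._notifier = notifier

    def submit(self, name, value):
        self._store.append((name, value))  # {data} path, read by Engine
        self._notifier.check(name, value)  # {events} path, sent to Vdsm


store = []
events = []
collector = Collector(
    store,
    ThresholdNotifier({"disk.used_pct": 80.0},
                      lambda name, value, limit: events.append(name)))
collector.submit("disk.used_pct", 91.0)
```

Every sample reaches the store, while only samples crossing a limit produce
an event.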

[end brainstorm]

> - The libvirt collectd plugin 

Re: [ovirt-devel] [vdsm] exploring a possible integration between collectd and Vdsm

2016-10-11 Thread Yaniv Kaul
On Tue, Oct 11, 2016 at 2:05 PM, Francesco Romani 
wrote:

> Hi all,
>
> In the last 2.5 days I was exploring if and how we can integrate collectd
> and Vdsm.
>
> The final picture could look like:
> 1. collectd does all the monitoring and reporting currently Vdsm does
> 2. Engine consumes data from collectd
> 3. Vdsm consumes *notifications* from collectd - for few but important
> tasks like Drive high water mark monitoring
>
> Benefits (aka: why to bother?):
> 1. less code in Vdsm / long-awaited modularization of Vdsm
> 2. better integration with the system, reuse of well-known components
> 3. more flexibility in monitoring/reporting: collectd is special purpose
> existing solution
> 4. faster, more scalable operation because all the monitoring can be done
> in C
>
> At first glance, Collectd seems to have all the tools we need.
> 1. A plugin interface
> (https://collectd.org/wiki/index.php/Plugin_architecture and
> https://collectd.org/wiki/index.php/Table_of_Plugins)
> 2. Support for notifications and thresholds
> (https://collectd.org/wiki/index.php/Notifications_and_thresholds)
> 3. a libvirt plugin https://collectd.org/wiki/index.php/Plugin:virt
>
> So, the picture is like
>
> 1. we start requiring collectd as dependency of Vdsm
> 2. we either configure it appropriately (collectd support config drop-ins:
> /etc/collectd.d) or we document our requirements (or both)
> 3. collectd monitors the hosts and libvirt
> 4. Engine polls collectd
> 5. Vdsm listens from notifications
>
> Should libvirt deliver us the event we need (see
> https://bugzilla.redhat.com/show_bug.cgi?id=1181659),
> we can just stop using collectd notifications, everything else works as
> previously.
>
> Challenges:
> 1. Collectd does NOT consider the plugin API stable
>    (https://collectd.org/wiki/index.php/Plugin_architecture#The_interface.27s_stability)
>    so the plugins should be included in the main tree, much like the
>    modules of the Linux kernel.
>    Worth mentioning that the plugin API itself has a good deal of rough
>    edges.
>    We will need to maintain this plugin ourselves, *and* we need to
>    maintain our thin API layer, to make sure the plugin loads and works
>    with recent versions of collectd.
> 2. The virt plugin is out of date and doesn't report some data we need: see
>    https://github.com/collectd/collectd/issues/1945
> 3. The notification message(s) are tailored for human consumption; those
>    messages are not easy for machines to parse.
> 4. The threshold support in collectd seems to match values against
>    constants; it doesn't seem possible to match a value against another
>    one, as we need to do for high water monitoring (capacity VS allocation).
>
> How I'm addressing, or how I plan to address those challenges (aka action
> items):
> 1. I've been experimenting with out-of-tree plugins, and I managed to
>    develop, build, install and run one out-of-tree plugin:
>    https://github.com/mojaves/vmon/tree/master/collectd
>    The development pace of collectd looks sustainable, so this doesn't
>    look like such a big deal.
>    Furthermore, we can engage with upstream to merge our plugins, either
>    as-is or to extend existing ones.
> 2. Write another collectd plugin based on the Vdsm python code and/or my
>    past accelerator executable project (https://github.com/mojaves/vmon)
> 3. Patch the collectd notification code. It is yet another plugin
>    OR
> 4. Send notifications from the new virt module as per #2, bypassing the
>    threshold system. This move could prevent the new virt module from
>    being merged in the collectd tree.
>
> Current status of the action items:
> 1. done BUT PoC quality
> 2. To be done (more work than #1/possible dupe with github issue)
> 3. need more investigation, conflicts with #4
> 4. need more investigation, conflicts with #3
>
> All the code I'm working on will be found on https://github.com/mojaves/vmon
>
> Comments are appreciated
>

This generally sounds like a good idea - and I hope it is coordinated with
our efforts for monitoring (see [1], [2]).
Note that ages ago, ovirt-node actually had it already[3].

Few notes:
- I think the most compelling reason to move to collectd is actually to
benefit from the plugins that it already has, which will cover
a lot of the missing monitoring requirements and wishes we have (example:
local disk usage on the host), as well as integrate it into
Engine monitoring (example: postgresql performance monitoring).
- You can't remove monitoring from VDSM - as a new VDSM may work against
older Engine setups. You can gradually remove them.
I'd actually begin with cleanup - there are some 'metrics' that are simply
not needed and should not be reported in the first place and
are there for historic reasons only. Remove them - from Engine first, from
the DB and all, then later we can either send fake values or remove
from VDSM.
- If you are moving to collectd, as you can see from the metrics effort,
we'd 

Re: [ovirt-devel] [vdsm] exploring a possible integration between collectd and Vdsm

2016-10-11 Thread Nir Soffer
On Tue, Oct 11, 2016 at 2:05 PM, Francesco Romani  wrote:
> Hi all,
>
> In the last 2.5 days I was exploring if and how we can integrate collectd and 
> Vdsm.

Some comments regarding storage high watermarks only. I will comment later
on other aspects.

> The final picture could look like:
> 1. collectd does all the monitoring and reporting currently Vdsm does
> 2. Engine consumes data from collectd
> 3. Vdsm consumes *notifications* from collectd - for few but important tasks 
> like Drive high water mark monitoring

Drive high watermark is our core business,  we cannot outsource
it to collectd.

Vdsm will always monitor high watermarks directly from libvirt.

> Benefits (aka: why to bother?):
> 1. less code in Vdsm / long-awaited modularization of Vdsm
> 2. better integration with the system, reuse of well-known components
> 3. more flexibility in monitoring/reporting: collectd is special purpose 
> existing solution
> 4. faster, more scalable operation because all the monitoring can be done in C

If the problem in monitoring is python, we can have a small and simple
helper doing the monitoring (for storage), like ioprocess.

> At first glance, Collectd seems to have all the tools we need.
> 1. A plugin interface 
> (https://collectd.org/wiki/index.php/Plugin_architecture and 
> https://collectd.org/wiki/index.php/Table_of_Plugins)
> 2. Support for notifications and thresholds 
> (https://collectd.org/wiki/index.php/Notifications_and_thresholds)

Setting a threshold and getting notifications when the threshold is reached
sounds like the best design for monitoring drive high watermarks.

But I would like to depend on a component that does *only* this task, and
serves only vdsm.

> 3. a libvirt plugin https://collectd.org/wiki/index.php/Plugin:virt
>
> So, the picture is like
>
> 1. we start requiring collectd as dependency of Vdsm
> 2. we either configure it appropriately (collectd support config drop-ins: 
> /etc/collectd.d) or we document our requirements (or both)
> 3. collectd monitors the hosts and libvirt
> 4. Engine polls collectd
> 5. Vdsm listens from notifications

Sounds good

>
> Should libvirt deliver us the event we need (see 
> https://bugzilla.redhat.com/show_bug.cgi?id=1181659),
> we can just stop using collectd notifications, everything else works as 
> previously.
>
> Challenges:
> 1. Collectd does NOT consider the plugin API stable
>    (https://collectd.org/wiki/index.php/Plugin_architecture#The_interface.27s_stability)
>    so the plugins should be included in the main tree, much like the
>    modules of the Linux kernel.
>    Worth mentioning that the plugin API itself has a good deal of rough
>    edges.
>    We will need to maintain this plugin ourselves, *and* we need to
>    maintain our thin API layer, to make sure the plugin loads and works
>    with recent versions of collectd.
> 2. The virt plugin is out of date and doesn't report some data we need: see
>    https://github.com/collectd/collectd/issues/1945
> 3. The notification message(s) are tailored for human consumption; those
>    messages are not easy for machines to parse.
> 4. The threshold support in collectd seems to match values against
>    constants; it doesn't seem possible to match a value against another
>    one, as we need to do for high water monitoring (capacity VS allocation).
>
> How I'm addressing, or how I plan to address those challenges (aka action 
> items):
> 1. I've been experimenting with out-of-tree plugins, and I managed to
>    develop, build, install and run one out-of-tree plugin:
>    https://github.com/mojaves/vmon/tree/master/collectd
>    The development pace of collectd looks sustainable, so this doesn't
>    look like such a big deal.
>    Furthermore, we can engage with upstream to merge our plugins, either
>    as-is or to extend existing ones.
> 2. Write another collectd plugin based on the Vdsm python code and/or my
>    past accelerator executable project (https://github.com/mojaves/vmon)
> 3. Patch the collectd notification code. It is yet another plugin
>    OR
> 4. Send notifications from the new virt module as per #2, bypassing the
>    threshold system. This move could prevent the new virt module from
>    being merged in the collectd tree.
>
> Current status of the action items:
> 1. done BUT PoC quality
> 2. To be done (more work than #1/possible dupe with github issue)
> 3. need more investigation, conflicts with #4
> 4. need more investigation, conflicts with #3
>
> All the code I'm working on will be found on https://github.com/mojaves/vmon
>
> Comments are appreciated
>
> --
> Francesco Romani
> RedHat Engineering Virtualization R & D
> Phone: 8261328
> IRC: fromani


[ovirt-devel] [vdsm] exploring a possible integration between collectd and Vdsm

2016-10-11 Thread Francesco Romani
Hi all,

In the last 2.5 days I was exploring if and how we can integrate collectd and 
Vdsm.

The final picture could look like:
1. collectd does all the monitoring and reporting currently Vdsm does
2. Engine consumes data from collectd
3. Vdsm consumes *notifications* from collectd - for few but important tasks 
like Drive high water mark monitoring

Benefits (aka: why to bother?):
1. less code in Vdsm / long-awaited modularization of Vdsm
2. better integration with the system, reuse of well-known components
3. more flexibility in monitoring/reporting: collectd is special purpose 
existing solution
4. faster, more scalable operation because all the monitoring can be done in C

At first glance, Collectd seems to have all the tools we need.
1. A plugin interface (https://collectd.org/wiki/index.php/Plugin_architecture 
and https://collectd.org/wiki/index.php/Table_of_Plugins)
2. Support for notifications and thresholds 
(https://collectd.org/wiki/index.php/Notifications_and_thresholds)
3. a libvirt plugin https://collectd.org/wiki/index.php/Plugin:virt

So, the picture is like

1. we start requiring collectd as a dependency of Vdsm
2. we either configure it appropriately (collectd supports config drop-ins:
   /etc/collectd.d) or we document our requirements (or both)
3. collectd monitors the hosts and libvirt
4. Engine polls collectd
5. Vdsm listens for notifications

Should libvirt deliver us the event we need (see
https://bugzilla.redhat.com/show_bug.cgi?id=1181659),
we can just stop using collectd notifications; everything else works as
before.

Challenges:
1. Collectd does NOT consider the plugin API stable
   (https://collectd.org/wiki/index.php/Plugin_architecture#The_interface.27s_stability)
   so the plugins should be included in the main tree, much like the
   modules of the Linux kernel.
   Worth mentioning that the plugin API itself has a good deal of rough edges.
   We will need to maintain this plugin ourselves, *and* we need to maintain
   our thin API layer, to make sure the plugin loads and works with recent
   versions of collectd.
2. The virt plugin is out of date and doesn't report some data we need: see
   https://github.com/collectd/collectd/issues/1945
3. The notification message(s) are tailored for human consumption; those
   messages are not easy for machines to parse.
4. The threshold support in collectd seems to match values against constants;
   it doesn't seem possible to match a value against another one, as we need
   to do for high water monitoring (capacity VS allocation).
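
To illustrate challenge #4, a small sketch of the difference between the two
kinds of check. The function names and the `min_free` parameter are
hypothetical, and the rule "free space = capacity - allocation" is a
simplification I am assuming here; the real Vdsm logic may well differ.

```python
GiB = 1024 ** 3


def constant_threshold_hit(allocation, limit):
    """What a constant-only threshold can express: one value VS a number."""
    return allocation > limit


def high_water_hit(capacity, allocation, min_free):
    """What high water monitoring needs: two live values compared."""
    return capacity - allocation < min_free


# Two drives with the same allocation but very different capacity.
small = {"capacity": 4 * GiB, "allocation": 3 * GiB}
large = {"capacity": 40 * GiB, "allocation": 3 * GiB}

# A constant limit cannot tell them apart: both trip at 2 GiB.
constant_results = [
    constant_threshold_hit(d["allocation"], 2 * GiB) for d in (small, large)
]

# Comparing against capacity flags only the drive that is really filling up.
high_water_results = [
    high_water_hit(d["capacity"], d["allocation"], 2 * GiB)
    for d in (small, large)
]
```

With a single constant both drives raise an alarm, while the two-value
comparison fires only for the small one.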

How I'm addressing, or how I plan to address those challenges (aka action 
items):
1. I've been experimenting with out-of-tree plugins, and I managed to
   develop, build, install and run one out-of-tree plugin:
   https://github.com/mojaves/vmon/tree/master/collectd
   The development pace of collectd looks sustainable, so this doesn't look
   like such a big deal.
   Furthermore, we can engage with upstream to merge our plugins, either
   as-is or to extend existing ones.
2. Write another collectd plugin based on the Vdsm python code and/or my past
   accelerator executable project (https://github.com/mojaves/vmon)
3. Patch the collectd notification code. It is yet another plugin
   OR
4. Send notifications from the new virt module as per #2, bypassing the
   threshold system. This move could prevent the new virt module from
   being merged in the collectd tree.
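
For items #3/#4, one possible shape of a machine-friendly notification is
sketched below. This is only an assumption for discussion: JSON as the
payload format, the `high_water` event name and all the field names are
hypothetical, and nothing here is existing collectd or Vdsm code.

```python
import json


def encode_event(vm_uuid, drive, capacity, allocation):
    """Build the notification text the sending side would emit."""
    return json.dumps({
        "event": "high_water",     # hypothetical event name
        "vm": vm_uuid,
        "drive": drive,
        "capacity": capacity,      # bytes
        "allocation": allocation,  # bytes
    }, sort_keys=True)


def decode_event(message):
    """Parse the notification text on the receiving (Vdsm) side."""
    data = json.loads(message)
    if data.get("event") != "high_water":
        raise ValueError("unexpected event: %r" % (data.get("event"),))
    return data


message = encode_event("vm-1234", "vda", 4 * 1024 ** 3, 3 * 1024 ** 3)
event = decode_event(message)
```

The point is that the receiver never has to scrape a human-oriented
sentence: it gets named fields it can act on directly.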

Current status of the action items:
1. done BUT PoC quality
2. To be done (more work than #1/possible dupe with github issue)
3. need more investigation, conflicts with #4
4. need more investigation, conflicts with #3

All the code I'm working on will be found on https://github.com/mojaves/vmon

Comments are appreciated

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani