Re: Using net-snmp to monitor a dynamic cloud-based back-end

Fulko Hew Tue, 05 Nov 2013 05:44:59 -0800

On Tue, Oct 29, 2013 at 11:17 AM, Maarten Wijnants <
maarten.wijna...@uhasselt.be> wrote:


... snip...


>
> We would like to monitor not only the spawned VMs globally, but also the
> individual server applications which they accommodate. The monitoring
> should in addition involve not only “physical” resource usage (i.e., CPU,
> memory, network bandwidth, …), but also application-layer metrics (e.g.,
> number of clients that are connected to a particular application server).
> Finally, we would like the monitoring to be automated as much as possible.
> For example, as soon as a new VM is spawned, we would like the NMS to
> automatically start collecting statistics about it; the same holds true for
> situations where a new application server is started on an existing VM. We
> are currently using Cacti [2] to aggregate the SNMP data and to visualize
> the result.
>
> Is it possible to realize this with net-snmp (and Cacti)? We have found
> tutorials and documentation about how to extend the net-snmp agent to
> support “non-standard” metrics. We have read about writing snmpd extensions
> (e.g., [3]), and we have successfully experimented with the “extends”
> directive to introduce new OID values (e.g., [4]). We succeeded in using
> the latter approach to attach a “client_count” OID to a local file on a
> VM’s disk that is continuously updated with the current client count of a
> server application that is running on that VM. Unfortunately, we are
> struggling to generalize this approach so that it works for dynamic numbers
> of application servers (and VMs).
>
> From the information that we have collected so far, we feel like our best
> bet is to define a custom MIB tree, with a branch for each of the (physical
> or application-layer) metrics that we would like to monitor from the
> application servers. For instance, the “client_count” branch of the MIB
> tree could hold one instance/row for each currently active application
> server to specify the current client count of this particular server.
> Another branch could hold the CPU usage of each application server. The NMS
> could then do a SNMP walk of these subtrees to obtain, for example, the
> client count values of all currently running application servers on a
> particular VM. Is this approach viable? If so, how could it best be
> realized? If not, does any of you have any suggestions about how our
> problem could be addressed?
>

... snip ...

I have instituted exactly this philosophy among the application developers
within my company.

After a few decades of creating and supporting applications and devices,
I came to the conclusion that standalone applications should not be looked
at any differently than 'embedded' applications in dedicated devices (e.g.
routers). After all, even a router is 'just another application'.

So I then abstracted out the 'common items' that all applications have
and bundled them into a MIB. I also implemented a Perl and a Java library
that allow our apps (written in those languages only at the moment)
to provide and receive info (SNMP get and set) and send traps for
noteworthy items. These libraries interact with Net-SNMP via AgentX.

The abstracted stuff fell into 5 categories:
- Dependencies - All apps 'need' stuff, like files/connections/other apps,
  etc. as inputs and outputs, i.e. things 'this app' depends on to work; if
  a dependency isn't being fulfilled, the app will not work as needed.
  This table also holds simple usage and response stats on each 'dependency'
  along with traps for up/down, usage threshold alarms and response time
  alarms.
- Capacity/resource - Where 'dependencies' are a 'comms' kind of thing,
  the 'capacity' table allows the app developers to communicate info/status
  on 'consumable' things an app needs or consumes (buffer pools, memory,
  sockets, etc.). It also defines traps for capacity alarms and for when the
  consumption goes back to normal.
- Response time - Even though the dependency table has response time info,
  it is very simple. This table allows you to report _any_ kind of thing
  that you want to communicate the response time of... with thresholds and
  good/bad traps.
- SLA Criteria - Allows a designer to communicate the 'things' that should be
  measured, and their threshold targets, in order to calculate the
  application's availability for your service level agreement.
- Info - A place to communicate versioning info about application
  components, show and remotely control debug/logging levels, remotely
  restart an application, etc.
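
To make the 'SLA Criteria' idea a bit more concrete, here is a minimal
Python sketch. It is not the proprietary implementation described in this
message; the names (Criterion, availability), the 'lower is better' default
and the definition of availability as 'percentage of measurement intervals
in which every criterion met its target' are all assumptions made purely
for illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One hypothetical row of an 'SLA criteria' table."""
    name: str
    target: float                   # threshold target for this measurement
    higher_is_better: bool = False  # assumption: default is 'lower is better'

    def met(self, measured):
        """True if the measured value satisfies the threshold target."""
        if self.higher_is_better:
            return measured >= self.target
        return measured <= self.target


def availability(criteria, samples):
    """Percentage of intervals in which all criteria met their targets.

    `samples` is a list of dicts (criterion name -> measured value),
    one dict per measurement interval. Returns None if there are no samples.
    """
    if not samples:
        return None
    good = sum(
        1 for sample in samples
        if all(c.met(sample[c.name]) for c in criteria)
    )
    return 100.0 * good / len(samples)


if __name__ == "__main__":
    criteria = [
        Criterion("response_ms", 200.0),
        Criterion("success_ratio", 0.99, higher_is_better=True),
    ]
    samples = [
        {"response_ms": 120.0, "success_ratio": 1.00},
        {"response_ms": 250.0, "success_ratio": 1.00},  # too slow
        {"response_ms": 180.0, "success_ratio": 0.995},
        {"response_ms": 90.0, "success_ratio": 0.999},
    ]
    print(availability(criteria, samples))
```

A real SLA would probably weight intervals or exclude maintenance windows;
the sketch ignores both.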

So these tables define the 'things' that should or could be communicated;
now it's up to the application designers to populate these tables... put
rows in the tables for each of their application's 'needs', and that will
be different for each application.
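
The 'one row per thing' idea, and the SNMP walk mentioned in the original
question, can be sketched in a few lines of Python. This is only an
in-memory model of one table column, not Net-SNMP or AgentX code. The base
OID uses a placeholder enterprise number, and the AppTable class and its
method names are hypothetical.

```python
from bisect import bisect_right

# Placeholder OID for a 'client count' column; the enterprise number
# 99999 is an assumption for illustration, not one to use in practice.
CLIENT_COUNT_COLUMN = (1, 3, 6, 1, 4, 1, 99999, 1, 1, 2)


class AppTable:
    """In-memory stand-in for one column of a conceptual SNMP table."""

    def __init__(self, column_oid):
        self.column_oid = tuple(column_oid)
        self.rows = {}  # row index (int) -> value

    def set_row(self, index, value):
        """Add or update the row for one application server."""
        self.rows[index] = value

    def delete_row(self, index):
        """Remove the row when its application server goes away."""
        self.rows.pop(index, None)

    def get_next(self, oid):
        """Return (oid, value) for the first instance after `oid`, or None."""
        oids = sorted(self.column_oid + (i,) for i in self.rows)
        pos = bisect_right(oids, tuple(oid))
        if pos == len(oids):
            return None
        nxt = oids[pos]
        return nxt, self.rows[nxt[-1]]

    def walk(self):
        """Emulate a walk of the column by repeated GETNEXT-style steps."""
        result = []
        current = self.column_oid
        while True:
            step = self.get_next(current)
            if step is None:
                break
            current, value = step
            result.append((".".join(str(n) for n in current), value))
        return result


if __name__ == "__main__":
    table = AppTable(CLIENT_COUNT_COLUMN)
    table.set_row(1, 12)   # application server 1 has 12 clients
    table.set_row(2, 7)    # application server 2 has 7 clients
    for oid, value in table.walk():
        print(oid, value)
```

Because rows appear and disappear as application servers start and stop,
a management station that walks the column always sees the current set
without being reconfigured.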

And then applications also have application-specific 'other stuff' that they
may want to expose, that doesn't fit into the above generic tables.

So as you browse a device, these tables let you find out what different
things may be running, how they are interconnected, and how well they are
running (or not). They also provide capacity planning info (and alarms on
potential capacity problems), so you don't actually have to do capacity
planning until you get 'close' to your 'gee maybe I should start paying
attention' usage level.
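
The capacity alarm and 'back to normal' behaviour can be sketched as a small
state machine. The Python below is an illustration under assumptions, not
the implementation described here: the class name, the trap names
(capacityAlarm, capacityNormal), the 80%/70% thresholds and the use of a
separate, lower clearing threshold to avoid flapping are all hypothetical.
The send_trap callback stands in for whatever actually emits an SNMP trap.

```python
class CapacityMonitor:
    """Tracks one consumable resource and reports alarm/normal transitions."""

    def __init__(self, name, limit, raise_pct=80.0, clear_pct=70.0,
                 send_trap=print):
        self.name = name
        self.limit = limit            # total capacity of the resource
        self.raise_pct = raise_pct    # raise the alarm at or above this usage
        self.clear_pct = clear_pct    # clear the alarm at or below this usage
        self.send_trap = send_trap    # stand-in for a real trap sender
        self.alarmed = False

    def update(self, used):
        """Record current usage; send a trap only on a state change."""
        pct = 100.0 * used / self.limit
        if not self.alarmed and pct >= self.raise_pct:
            self.alarmed = True
            self.send_trap(f"capacityAlarm {self.name} {pct:.0f}%")
        elif self.alarmed and pct <= self.clear_pct:
            self.alarmed = False
            self.send_trap(f"capacityNormal {self.name} {pct:.0f}%")
        return self.alarmed


if __name__ == "__main__":
    monitor = CapacityMonitor("buffer_pool", limit=100)
    for used in (50, 85, 90, 75, 60):
        monitor.update(used)
```

With these assumed thresholds, the usage sequence above produces one alarm
(at 85) and one clear (at 60); the readings at 90 and 75 cause no traps
because the state does not change.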

So I think this design (and implementation) covers the kind of stuff you
are thinking about (and more).  Unfortunately, my implementation is
(at the moment) company proprietary; but I can still 'talk' about it.

Fulko Hew


