Hi Andreas,

please see my other email regarding the overrun.

The problem that loopback traffic is also counted comes from the perfstat_netinterface_total() routine itself. Looking at some of the metric implementations for bytes_in, I found the following:
- AIX, your version and my version have that "problem"
- Linux, specifically "filters" out the "lo:" network traffic
- Solaris, seems to also have that "problem"
- FreeBSD, specifically "filters" out the loopback traffic
- HP-UX, no implementation for bytes_in
- IRIX, no implementation for bytes_in
- Darwin, specifically "filters" out the loopback traffic
- ....

So I would guess that filtering out the loopback traffic from the total number of bytes_in is the intended behavior. The same would apply to bytes_out, pkts_in and pkts_out.
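
As a rough illustration of that filtering, independent of libperfstat: the sketch below sums per-interface counters and skips the loopback device by name. The iface_stat struct and both function names are hypothetical and made up for this example, and the assumption that loopback is called "lo0" (AIX) or "lo" (Linux) is only a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-interface record, NOT the libperfstat layout. */
struct iface_stat {
    const char        *name;    /* interface name, e.g. "en0" or "lo0" */
    unsigned long long ibytes;  /* bytes received on this interface */
};

/* Assumption: the loopback device is called "lo0" (AIX) or "lo" (Linux). */
static int is_loopback(const char *name)
{
    return strcmp(name, "lo0") == 0 || strcmp(name, "lo") == 0;
}

/* Sum the received bytes of all interfaces except loopback. */
static unsigned long long
sum_ibytes_no_loopback(const struct iface_stat *stats, size_t count)
{
    unsigned long long total = 0;
    size_t i;

    assert(stats != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (!is_loopback(stats[i].name))
            total += stats[i].ibytes;
    }
    return total;
}
```

With entries en0 = 1000, lo0 = 500 and en1 = 250, the function returns 1250, i.e. the loopback bytes are left out of the total.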

Here is a version which hopefully implements all of that and should take care of the integer overrun too:

g_val_t
bytes_in_func( void )
{
  g_val_t val;
  perfstat_netinterface_total_t n;
  perfstat_id_t name;
  perfstat_netinterface_t nif;
  static u_longlong_t last_bytes_in = 0;
  u_longlong_t bytes_in;
  longlong_t d;
  static double last_time = 0.0;
  double now, delta_t;
  struct timeval timeValue;
  struct timezone timeZone;

  /* default result if any of the perfstat calls fails */
  val.f = 0.0;

  gettimeofday( &timeValue, &timeZone );

  now = (double) (timeValue.tv_sec - boottime) + (timeValue.tv_usec / 1000000.0);

  if (perfstat_netinterface_total( NULL, &n, sizeof( perfstat_netinterface_total_t ), 1 ) != -1)
  {
     /* fetch the statistics of the loopback device */
     strcpy( name.name, "lo0" );

     if (perfstat_netinterface( &name, &nif, sizeof( perfstat_netinterface_t ), 1 ) != -1)
     {
        /* subtract the loopback device bytes, check for integer overrun */
        d = n.ibytes - nif.ibytes;
        if (d < 0) d += ULONG_MAX;
        bytes_in = d;

        /* get the number of bytes transferred in, check for integer overrun */
        d = bytes_in - last_bytes_in;
        if (d < 0) d += ULONG_MAX;

        delta_t = now - last_time;

        if ( delta_t )
           val.f = (double) d / delta_t;

        /* remember the loopback-free total (not the delta) for the next call */
        last_bytes_in = bytes_in;
     }
  }

  last_time = now;

  return( val );
}

Any comments/takers ?

Best Regards,
Michael

Andreas Schoenfeld wrote:
Hi Michael,

the fix for an overrun looks good to me. But your code still has the
problem that loopback traffic is counted, too.
perfstat_netinterface_total is the sum of all network devices, including
lo0, etc.

Best regards
   Andreas




Michael Perzl wrote:
 Andreas,

thank you for taking the blame but you are off the hook here.  ;-)

If I understood David correctly, he is using my AIX Ganglia RPM packages
with POWER5 extensions. In these, most if not all of the implementations
of how the metrics are collected under AIX have been changed. Everything
is documented on my homepage (http://www.perzl.org/ganglia/), though.
So everything that goes wrong here is entirely my fault :-[

After some investigating and some discussions with Nigel I have come to
terms with the following facts regarding the bytes_in/bytes_out problem:
- libperfstat (the library on AIX which obtains all the system
performance data) uses u_longlong_t data types (these are definitely
64-bit large).
- The AIX kernel internally, though, is probably not using 64-bit
data types. Unsigned 32-bit is more likely, in order not to break
compatibility (my personal opinion).
- The consequence is that an integer overrun occurs much more easily
with 32-bit data types than with 64-bit data types (with 64-bit counters
we would all probably not live long enough to see it happen).
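
To put rough numbers on that last point, here is a small sketch that estimates how long a byte counter of a given width takes to wrap at a constant rate. The function name seconds_to_wrap() and the example rate of 125,000,000 bytes/s (about 1 Gbit/s) are assumptions chosen for illustration, not measurements from any system.

```c
#include <assert.h>

/* Seconds until an unsigned counter of the given width wraps,
 * assuming a constant transfer rate in bytes per second. */
static double seconds_to_wrap(int bits, double bytes_per_sec)
{
    double range = 1.0;   /* becomes 2^bits after the loop */
    int i;

    assert(bits > 0 && bytes_per_sec > 0.0);

    for (i = 0; i < bits; i++)
        range *= 2.0;

    return range / bytes_per_sec;
}
```

At the assumed rate, seconds_to_wrap(32, 125000000.0) is roughly 34 seconds, while seconds_to_wrap(64, 125000000.0) corresponds to several thousand years.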

Please take a look at my implementation of the bytes_in metric (the
bytes_out implementation is analogous):

01  g_val_t
02  bytes_in_func( void )
03  {
04     g_val_t val;
05     perfstat_netinterface_total_t n;
06     static u_longlong_t last_bytes_in = 0, bytes_in;
07     static double last_time = 0.0;
08     double now, delta_t;
09     struct timeval timeValue;
10     struct timezone timeZone;
11
12     gettimeofday( &timeValue, &timeZone );
13
14     now = (double) (timeValue.tv_sec - boottime) + (timeValue.tv_usec / 1000000.0);
15
16     if (perfstat_netinterface_total( NULL, &n, sizeof( perfstat_netinterface_total_t ), 1 ) == -1)
17        val.f = 0.0;
18     else
19     {
20        bytes_in = n.ibytes;
21
22        delta_t = now - last_time;
23
24        if ( delta_t )
25           val.f = (double) (bytes_in - last_bytes_in) / delta_t;
26        else
27           val.f = 0.0;
28
29        last_bytes_in = bytes_in;
30     }
31
32     last_time = now;
33
34     return( val );
35  }

In my opinion the overrun occurs in line #25 when "bytes_in <
last_bytes_in".
In my naivety I had assumed that, as both are of type u_longlong_t, an
integer overrun would never happen.
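
To make the effect concrete, here is a standalone sketch of what a plain unsigned subtraction does in that case. The function naive_rate() is a hypothetical stand-in for the expression in line #25 and uses unsigned long long (assumed to be 64 bits wide) in place of u_longlong_t.

```c
#include <assert.h>

/* Rate computed with a plain unsigned subtraction, as in line #25. */
static double naive_rate(unsigned long long bytes_in,
                         unsigned long long last_bytes_in,
                         double delta_t)
{
    assert(delta_t > 0.0);

    /* If bytes_in < last_bytes_in, the unsigned difference does not go
     * negative; it wraps around modulo 2^64 and becomes enormous. */
    return (double) (bytes_in - last_bytes_in) / delta_t;
}
```

For example, naive_rate(100, 4294967000ULL, 20.0) yields a value on the order of 10^17 bytes/s instead of a small rate, which is the kind of spike the metric would then report.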

So to solve the overrun a check for "bytes_in < last_bytes_in" must be
introduced, something like:

longlong_t d;   /* must be signed, otherwise "d < 0" can never be true */
d = bytes_in - last_bytes_in;
if (d < 0) d += ULONG_MAX;

and line #25 would essentially become
25           val.f = (double) d / delta_t;
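
A minimal standalone sketch of such a check, under the assumption that the underlying counter is 32 bits wide and wraps at most once between two samples. The helper name counter_delta32() is hypothetical. Note that it adds 2^32 (UINT32_MAX + 1) rather than ULONG_MAX, because a wrapped counter restarts at 0 and because ULONG_MAX is not 32 bits wide on 64-bit builds.

```c
#include <assert.h>
#include <stdint.h>

/* Difference between two samples of a counter that is assumed to be
 * 32 bits wide and to have wrapped at most once between the samples. */
static uint64_t counter_delta32(uint64_t now, uint64_t last)
{
    const uint64_t modulus = (uint64_t) UINT32_MAX + 1;   /* 2^32 */

    assert(now < modulus && last < modulus);

    if (now >= last)
        return now - last;          /* no wrap since the last sample */

    return now + modulus - last;    /* counter wrapped once */
}
```

If the counter turns out to be wider than 32 bits, the assumption (and the modulus) would have to change accordingly.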

Comments ?

Regards,
Michael

PS: David, the reason why you don't see it happen with pkts_in and
pkts_out is probably that no overrun has occurred so far, but at some
point it will happen there too.

PPS: David, if this is a solution (I would like some comments on that
first, though), then I will build new RPMs with the then hopefully
correct code.

