Michael,

 

If d is an unsigned integer, won't the test (d < 0) always be false?

 

My initial thought on the problem was to test if d is abnormally large,
which is what I would expect to happen when you subtract a larger
unsigned value from a smaller one.
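
To illustrate the point, here is a minimal sketch in C. The helper names
are hypothetical and not part of Ganglia or libperfstat, and uint64_t is
only a stand-in for u_longlong_t. It shows that the unsigned difference
can never be negative, so a wrap has to be detected by comparing the two
counters rather than by testing the sign of the result:

```c
#include <stdint.h>

/* Hypothetical helper, not part of Ganglia or libperfstat.
 * Returns 1 if the counter appears to have wrapped (cur < last), else 0.
 * The comparison is made on the operands, because the unsigned
 * difference itself can never be negative. */
int counter_wrapped(uint64_t cur, uint64_t last)
{
    return cur < last;
}

/* Unsigned subtraction is defined modulo 2^64, so a "negative" result
 * shows up as a very large positive value instead. */
uint64_t raw_delta(uint64_t cur, uint64_t last)
{
    return cur - last;
}
```

With cur = 5 and last = 10, counter_wrapped() reports a wrap and
raw_delta() returns 2^64 - 5, which is the abnormally large value I
would expect to see.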

 

Another option:

 

As long as the kernel is using 32 bit structures, we could convert the
unsigned integers to signed integers before doing the subtraction and
test for negative.  But when the AIX kernel "upgrades" to 64 bit
structures, I imagine we would need to remove the "workaround" or we
risk the possibility of losing precision?
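
A rough sketch of that option follows. It assumes (and this is only an
assumption, nobody has confirmed it) that the underlying counter is a
32-bit value that wraps at 2^32, and that it wraps at most once between
two samples. The function name is made up for illustration:

```c
#include <stdint.h>

/* Hypothetical helper. Assumes the counter is really a 32-bit value
 * that wraps at 2^32, even though it is reported in a 64-bit field,
 * and that at most one wrap happened between the two samples. */
uint64_t delta_assuming_32bit_counter(uint64_t cur, uint64_t last)
{
    /* Signed difference: negative when cur < last, i.e. after a wrap.
     * The casts are only safe while both values fit in 63 bits. */
    int64_t d = (int64_t)cur - (int64_t)last;

    if (d < 0)
        d += (int64_t)UINT32_MAX + 1;   /* add 2^32, the wrap modulus */

    return (uint64_t)d;
}
```

Note that the wrap modulus is one more than the counter's maximum value,
so the correction is 2^32 rather than 2^32 - 1. The casts to a signed
type are exactly where the precision concern would apply if the counters
ever become true 64-bit values.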

 

David Wong
Senior Systems Engineer
Management Dynamics, Inc.
Phone: 201-804-6127
[EMAIL PROTECTED]

________________________________

From: Michael Perzl [mailto:[EMAIL PROTECTED] 
Sent: Friday, March 30, 2007 12:25 PM
To: Andreas Schoenfeld
Cc: David Wong; [EMAIL PROTECTED];
[email protected]; [EMAIL PROTECTED]
Subject: Re: [Ganglia-general] Help! I have a petabyte/s network (Martin
Knoblauch)

 

Andreas,

thank you for taking the blame but you are off the hook here.  ;-) 

If I understood David correctly, he is using my AIX Ganglia RPM packages
with POWER5 extensions. In these packages, most if not all of the code
that collects the metrics under AIX has been changed. Everything is
documented on my homepage (http://www.perzl.org/ganglia/), though.
So everything that goes wrong here is entirely my fault :-[ 

After some investigation and some discussions with Nigel, I have come to
terms with the following facts regarding the bytes_in/bytes_out problem:
- libperfstat (the library on AIX which obtains all the system
performance data) uses u_longlong_t data types (these are definitely
64 bits wide).
- The AIX kernel, though, is probably not using 64-bit data types
internally. Unsigned 32-bit is more realistic, in order not to break
compatibility (my personal opinion).
- The consequence is that integer overflow occurs much more easily with
32-bit data types than with 64-bit data types (with the latter, we
would probably not live long enough to see it happen).
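
To put rough numbers on the last point, here is a small sketch that
estimates how long a byte counter takes to wrap at a constant rate. The
function name and the 1 Gbit/s figure are illustrative assumptions, not
measurements from any real system:

```c
/* Hypothetical helper: seconds until a byte counter of the given width
 * (in bits) wraps around, assuming a constant transfer rate. */
double seconds_until_wrap(unsigned bits, double bytes_per_second)
{
    double range = 1.0;
    unsigned i;

    for (i = 0; i < bits; i++)
        range *= 2.0;   /* builds 2^bits, exact in a double for bits <= 64 */

    return range / bytes_per_second;
}
```

At 125,000,000 bytes/s (a saturated 1 Gbit/s link), a 32-bit counter
wraps after about 34 seconds, while a 64-bit counter would take several
thousand years.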

Please take a look at my implementation of the bytes_in metric (the
bytes_out implementation is analogous):

01  g_val_t
02  bytes_in_func( void )
03  {
04     g_val_t val;
05     perfstat_netinterface_total_t n;
06     static u_longlong_t last_bytes_in = 0, bytes_in;
07     static double last_time = 0.0;
08     double now, delta_t;
09     struct timeval timeValue;
10     struct timezone timeZone;
11
12     gettimeofday( &timeValue, &timeZone );
13
14     now = (double) (timeValue.tv_sec - boottime) + (timeValue.tv_usec / 1000000.0);
15
16     if (perfstat_netinterface_total( NULL, &n, sizeof( perfstat_netinterface_total_t ), 1 ) == -1)
17        val.f = 0.0;
18     else
19     {
20        bytes_in = n.ibytes;
21
22        delta_t = now - last_time;
23
24        if ( delta_t )
25           val.f = (double) (bytes_in - last_bytes_in) / delta_t;
26        else
27           val.f = 0.0;
28
29        last_bytes_in = bytes_in;
30     }
31
32     last_time = now;
33
34     return( val );
35  }

In my opinion the overflow occurs in line #25 when "bytes_in <
last_bytes_in".
In my naivety I had assumed that, as both are of type u_longlong_t, an
integer overflow would never happen.

So to solve the overflow, a check for "bytes_in < last_bytes_in" must be
introduced, something like:

u_longlong_t d;
d = bytes_in - last_bytes_in;
if (d < 0) d += ULONG_MAX;

and line #25 would essentially become
25           val.f = (double) d / delta_t;

Comments ?

Regards,
Michael

PS: David, the reason why you don't see it happen with pkts_in and
pkts_out is probably that no overflow has occurred so far, but at some
point it will happen there too.

PPS: David, if this is a solution (I would like some comments on it
first, though), then I will build new RPMs with the then hopefully
correct code.

Andreas Schoenfeld wrote: 

Hi David and Martin,
 
I suppose the network code is still the code I wrote, so there are two
problems I know of:
1. Yes, there is a problem with overflows.
2. The network traffic shown is the sum of all network interfaces,
including local loopback devices (lo0...).
 
Both problems could lead to astonishing data transfer rates in Ganglia.
 
Sorry, I had promised to fix the problems, but there was too much other
work ...
 
Best regards
   Andreas
 
  

        Date: Thu, 29 Mar 2007 08:21:38 -0700 (PDT)
        From: Martin Knoblauch <[EMAIL PROTECTED]>
        Subject: Re: [Ganglia-general] Help! I have a petabyte/s network
        To: David Wong <[EMAIL PROTECTED]>, [EMAIL PROTECTED],
          [email protected]
        Message-ID: <[EMAIL PROTECTED]>
        Content-Type: text/plain; charset=iso-8859-1
         
        David,
         
         good catch. I will have to look at it for a bit.
         
        Cheers
        Martin
        --- David Wong <[EMAIL PROTECTED]> wrote:
         
            

                        I don't write much code nowadays, so I'm going to need a lot of help
                        with this.
                         
                        I dug through the ganglia code and I found this interesting tidbit in
                        libmetrics/aix/metrics.c which may be indicative of the problem.
                         
                        There's an assignment from cur_ninfo.ibytes to cur_net_stat.ibytes,
                        but the types of the two variables are different.
                         
                        net_stat::ibytes is a double: 
                         
                        struct net_stat{
                          double ipackets;
                          double opackets;
                          double ibytes;
                          double obytes;
                        } cur_net_stat;
                         
                        and we have *ninfo declared here:
                         
                        perfstat_netinterface_total_t ninfo[2],*last_ninfo, *cur_ninfo ;
                         
                        libperfstat.h has perfstat_netinterface_total_t::ibytes as
                        u_longlong_t.
                         
                        Does this code try to do what I think it is doing, i.e. assign an
                        unsigned 64-bit integer to a signed 64-bit integer?
                         
                        I'm willing to test the code if someone who's more adept at coding
                        and building will take on the challenge.
                         
                        It looks to me that the type mismatch will have to be fixed in a few
                        places, such as CALC_NETSTAT, and we'll have to add an unsigned long
                        long to g_val_t too.  Those are the ones I can see so far.
                         
                        David Wong
                        Senior Systems Engineer
                        Management Dynamics, Inc.
                        Phone: 201-804-6127
                        [EMAIL PROTECTED]
                         
                        -----Original Message-----
                        From: Martin Knoblauch [mailto:[EMAIL PROTECTED] 
                        Sent: Wednesday, March 28, 2007 12:00 PM
                        To: David Wong; [email protected]
                        Subject: Re: [Ganglia-general] Help! I have a petabyte/s network
                         
                        David,
                         
                         as far as I remember, the AIX metrics code had an overflow/wrap-around
                        problem prior to 3.0.4. Maybe the fixes are not thorough enough.
                         
                         The packets/sec are of course less affected.
                         
                        Cheers
                        Martin
                                

 
 
  
