Hi Michael,

the fix for the overrun looks good to me. However, your code still counts loopback traffic as well: perfstat_netinterface_total returns the sum over all network interfaces, including lo0 etc.
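As a rough sketch of the two points above, not a tested patch: the first function below sums the receive counters per interface and skips loopback, and the second computes a wrap-safe difference between two samples. Note that in the quoted proposal d is unsigned, so the test "d < 0" can never be true; truncating both samples to the counter width before subtracting avoids the comparison altogether. The struct if_sample and both function names are hypothetical, the "lo" name prefix and the 32-bit counter width are assumptions, and on AIX the per-interface values would have to come from perfstat_netinterface() rather than from a hand-filled array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-interface sample. On AIX these values would be read
 * with perfstat_netinterface(); that call is left out so the sketch
 * stays portable. */
struct if_sample {
    const char *name;    /* interface name, e.g. "en0" or "lo0" */
    uint64_t    ibytes;  /* bytes received on this interface */
};

/* Sum received bytes over all interfaces except loopback.
 * Assumption: a loopback interface is one whose name starts with "lo". */
uint64_t sum_ibytes_no_loopback(const struct if_sample *ifs, size_t count)
{
    uint64_t total = 0;

    assert(ifs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (strncmp(ifs[i].name, "lo", 2) == 0)
            continue;                 /* skip lo0 and similar */
        total += ifs[i].ibytes;
    }
    return total;
}

/* Difference between two readings of a counter that wraps at 2^32.
 * Assumptions: the underlying counter is 32 bits wide and wraps at
 * most once between two samples. */
uint64_t counter_delta32(uint64_t now, uint64_t last)
{
    /* Truncate to 32 bits and subtract modulo 2^32, so a wrapped
     * counter still gives the correct non-negative difference. */
    return (uint64_t)(uint32_t)((uint32_t)now - (uint32_t)last);
}
```

For example, counter_delta32(5, 4294967290) yields 11, the number of bytes counted across the wrap. If the kernel counters really are per-interface 32-bit values, the delta would have to be taken per interface before summing, since each counter wraps on its own.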

Best regards
Andreas

Michael Perzl wrote:
> Andreas,
>
> thank you for taking the blame but you are off the hook here. ;-)
>
> If I understood David correctly, he is using my AIX Ganglia RPM packages
> with POWER5 extensions. Here most if not all of the implementations of how
> the metrics are collected under AIX have been changed. Everything is
> documented on my homepage (http://www.perzl.org/ganglia/) though.
> So everything that goes wrong here is entirely my fault :-[
>
> After some investigation and some discussions with Nigel I have come to
> terms with the following facts regarding the bytes_in/bytes_out problem:
> - libperfstat (the library on AIX which obtains all the system
>   performance data) uses u_longlong_t data types (these are definitely
>   64 bits wide).
> - The AIX kernel internally, though, is probably not using 64-bit
>   data types - unsigned 32-bit is probably more realistic - in order not
>   to break compatibility (my personal opinion)
> - The consequence is that an integer overrun occurs much more easily with
>   32-bit data types than with 64-bit data types (with the latter, we all
>   probably don't live long enough to see it happen).
>
> Please take a look at my implementation of the bytes_in metric (the
> bytes_out implementation is analogous):
>
> 01 g_val_t
> 02 bytes_in_func( void )
> 03 {
> 04    g_val_t val;
> 05    perfstat_netinterface_total_t n;
> 06    static u_longlong_t last_bytes_in = 0, bytes_in;
> 07    static double last_time = 0.0;
> 08    double now, delta_t;
> 09    struct timeval timeValue;
> 10    struct timezone timeZone;
> 11
> 12    gettimeofday( &timeValue, &timeZone );
> 13
> 14    now = (double) (timeValue.tv_sec - boottime) + (timeValue.tv_usec / 1000000.0);
> 15
> 16    if (perfstat_netinterface_total( NULL, &n, sizeof( perfstat_netinterface_total_t ), 1 ) == -1)
> 17       val.f = 0.0;
> 18    else
> 19    {
> 20       bytes_in = n.ibytes;
> 21
> 22       delta_t = now - last_time;
> 23
> 24       if ( delta_t )
> 25          val.f = (double) (bytes_in - last_bytes_in) / delta_t;
> 26       else
> 27          val.f = 0.0;
> 28
> 29       last_bytes_in = bytes_in;
> 30    }
> 31
> 32    last_time = now;
> 33
> 34    return( val );
> 35 }
>
> In my opinion the overrun occurs in line #25 when "bytes_in <
> last_bytes_in".
> In my naivety I had assumed that, as both are of type u_longlong_t, an
> integer overrun would never happen.
>
> So to solve the overrun a check for "bytes_in < last_bytes_in" must be
> introduced, something like:
>
>    u_longlong_t d;
>    d = bytes_in - last_bytes_in;
>    if (d < 0) d += ULONG_MAX;
>
> and line #25 would essentially become
> 25          val.f = (double) d / delta_t;
>
> Comments ?
>
> Regards,
> Michael
>
> PS: David, the reason why you don't see it happen with pkts_in and
> pkts_out is probably that no overrun has occurred so far, but at some
> point it will happen there, too.
>
> PPS: David, if this is a solution (I want some comments on that first,
> though) then I will build new RPMs with the hopefully correct code.
>

--
Dr. Andreas Schoenfeld             | Dr. Andreas Schoenfeld
Technische Universitaet Darmstadt  | Technische Universitaet Darmstadt
Hochschulrechenzentrum (HRZ)       | University Computing Centre
Petersenstrasse 30                 | Petersenstrasse 30
64287 Darmstadt                    | 64287 Darmstadt
                                   | Germany
Tel. 06151-16 5608                 | Tel. +49 (0) 6151-16 5608
e-mail: [EMAIL PROTECTED]

