hi rocco,

yes, ok, i think i'm following you:

-i know that these ICMP Echos aren't leaving the NIC (during these multi-minute periods), from the sniffer i have posted just outside the box

-many ICMP Echos are exiting the NIC ... but i happen to know which boxes i'm pinging using Nagios, and which i'm pinging using NodeWatch (which employs POE), and the pings i'm seeing are headed to/from the Nagios-monitored boxes, not the NodeWatch-monitored boxes. additionally, NodeWatch logs to syslog when it is emitting pings ... and, in at least one of my traces, no ICMP Echos leave the box at all at that point ... (presumably, Nagios was off doing something else at the time)

-so these ICMP Echos are being dropped somewhere inside the box

-as you say, perhaps the NIC's outbound buffer is overflowing -- i hadn't thought of that. checking the output of 'netstat -i' ... i see zeros underneath the TX-ERR, TX-DRP, and TX-OVR columns (and under the RX-ERR/DRP/OVR columns as well). this suggests that the NIC isn't dropping frames ... but i haven't a clue how reliable this output is. i could imagine a NIC driver which doesn't update the relevant counter somewhere, when it discards a frame. so let's keep this possibility on the table

-i wonder if the OS tracks 'outbound ICMP drops'?

gnat> netstat -s -w
[...]
Icmp:
    3510705 ICMP messages received
    34835 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 111508
        timeout in transit: 26581
        echo requests: 84611
        echo replies: 3288005
    7155838 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 64167
        echo request: 7007060
        echo replies: 84611

i wonder what 'input ICMP message failed' means? are those ICMP Echos for which the OS didn't receive any replies? or are those ICMP messages which the OS received but dropped due to resource constraints? or ... ICMP messages which the OS was asked to send but which the OS dropped due to resource constraints? this morning, i poked through Benvenuti's "Understanding Linux Network Internals", without success. i have 'Understanding the Linux Kernel' on order. i should probably dig up Stevens' 'Implementation' (Vol 2?) as well. any other tips on where to dig to learn what "input ICMP message failed" means?
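one place i could start digging (a rough sketch, not verified against the net-tools source): on Linux, i believe 'netstat -s' takes these numbers from /proc/net/snmp, where each protocol has a header line of counter names followed by a line of values. if the 'input ICMP message failed' line corresponds to the Icmp InErrors counter, which is my assumption, then RFC 1213 describes icmpInErrors as ICMP messages received but found to have ICMP-specific errors (bad checksum, bad length, etc.), i.e. an inbound counter rather than anything about my outbound Echos. parse_snmp is a name i made up for this sketch:

```perl
use strict;
use warnings;

# Parse text in /proc/net/snmp format: each protocol contributes a header
# line of field names, then a line of values with the same "Proto:" prefix.
# Returns { Proto => { FieldName => value, ... }, ... }.
sub parse_snmp {
    my ($text) = @_;
    my (%stats, %fields);
    for my $line (split /\n/, $text) {
        my ($proto, $rest) = $line =~ /^(\w+):\s*(.*)$/ or next;
        my @cols = split ' ', $rest;
        if ($fields{$proto}) {
            # second line for this protocol: pair the values with the names
            @{ $stats{$proto} }{ @{ delete $fields{$proto} } } = @cols;
        } else {
            # first line for this protocol: remember the field names
            $fields{$proto} = \@cols;
        }
    }
    return \%stats;
}

# Print a few raw ICMP counters, if this box exposes /proc/net/snmp.
if (open my $fh, '<', '/proc/net/snmp') {
    local $/;    # slurp the whole file
    my $icmp = parse_snmp(<$fh>)->{Icmp} || {};
    printf "%-10s %s\n", $_, $icmp->{$_} // 'n/a'
        for qw(InMsgs InErrors OutMsgs OutErrors OutEchos);
}
```

sampling OutEchos and OutErrors just before and just after one of the silent periods would at least show whether the kernel believes it sent the Echos.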

-now, if my application were CPU-starved, wouldn't i see the pings go out eventually? i mean, my PoCo pinger might have timed out already and have declared the ping lost ... but wouldn't i see the outbound ICMP Echos appear, eventually, in the packet trace? [by 'eventually', i mean delayed by seconds, not delayed by minutes ... i haven't poked deep enough into the trace to see whether or not the ICMP Echos appear minutes late ... not sure how i would set up that experiment in fact]

-yes, i do tend to burst the pings:

  $max_pings = 100;

  [...]
  # Throw out more pings
  while (@addrs and $heap->{open_pings} < $max_pings) {
    my $addr = shift @addrs;
    log_it("Pinging $addr") if ($debug == 5 or $debug == 6);

    # "Pinger, do a ping and return the results as a pong event.  The
    # address to ping is $addr."
    $result = $kernel->post(  'pinger',
                              'ping',
                              'pong',
                              $addr,
                              $timeout{$addr}
                           );
  [...]

i suppose i could try dropping $max_pings to something smaller. but, i'm not ready to actually solve the problem; instead, i want to verify that i understand the problem, notably, where exactly the OS is dropping the ball


==> anyway, so this is why i'm now checking the return code on post() ... that's a start, perhaps, but as you point out, hardly the end of my trouble-shooting efforts. i'm beginning to think that what i really want is DTrace for Linux ... which doesn't exist ... but perhaps i can squeeze enough juice out of SystemTap to tackle this. stapprobes.netdev should allow me to see when data arrives on the NIC ... so i could at least tell whether or not the ICMP Echo arrives at the NIC ... and perhaps stapprobes.socket would tell me whether or not the OS received the request to create a raw socket to service the ICMP Echo
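for the post() check, a minimal sketch of what i mean. as i read the POE::Kernel docs, post() returns true if the event was enqueued and false otherwise, with $! set to explain the failure. post_or_log is a made-up helper name, and $logger is any code ref that takes a string (log_it in my script):

```perl
use strict;
use warnings;

# Wrap $kernel->post() so a failed enqueue gets logged instead of ignored.
# Assumes post() returns false and sets $! on failure.
sub post_or_log {
    my ($kernel, $logger, $session, $event, @args) = @_;
    my $ok = $kernel->post($session, $event, @args);
    unless ($ok) {
        my $err = "$!";    # copy $! before another call can overwrite it
        $logger->("post($session, $event) failed: $err");
    }
    return $ok;
}
```

in the loop this would be called as post_or_log($kernel, \&log_it, 'pinger', 'ping', 'pong', $addr, $timeout{$addr}). a true return only tells me the event was queued for the pinger session, not that an ICMP Echo was ever written to the socket.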


==> yes, i am poking at this from other angles: what changed recently? well, the answer is: a lot. i wiped the box, which was running SuSE 9.3, and installed CentOS 5.3. that's quite a bit of change, right there! i'm using perl-5.10.0 now, whereas i was using perl-5.8.8. my other NodeWatch boxes already run CentOS 5.3 ... so i'm gradually replacing perl-5.8.8 with perl-5.10.0 on them ... that would be a clue, if the problem starts appearing once i'm using perl-5.10.0. the actual applications & cron jobs i run today on the box are the same ones i was running prior to the OS change. yes, i'm staring at the timing involved ... at this point, the timing has become erratic: that 1:25am pattern has dissolved

anyway, if you have additional pointers, please let me know

--sk




Rocco Caputo wrote:
[...]
Time-based failures implicate periodic jobs. What's cron doing around that time?
[...]
