hi rocco,
yes, ok, i think i'm following you:
-i know that these ICMP Echos aren't leaving the NIC (during these multi-minute
periods), from the sniffer i have posted just outside the box
-many ICMP Echos are exiting the NIC ... but i happen to know which boxes i'm
pinging using Nagios, and which i'm pinging using NodeWatch (which employs POE),
and the pings i'm seeing are headed to/from the Nagios-monitored boxes, not the
NodeWatch-monitored boxes. additionally, NodeWatch logs to syslog when it is
emitting pings ... and, in at least one of my traces, no ICMP Echos leave the
box at all at that point ... (presumably, Nagios was off doing something else at
the time)
-so these ICMP Echos are being dropped somewhere inside the box
-as you say, perhaps the NIC's outbound buffer is overflowing -- i hadn't
thought of that. checking the output of 'netstat -i' ... i see zeros underneath
the TX-ERR, TX-DRP, and TX-OVR columns (and under the RX-ERR/DRP/OVR columns as
well). this suggests that the NIC isn't dropping frames ... but i haven't a
clue how reliable this output is. i could imagine a NIC driver which doesn't
update the relevant counter somewhere, when it discards a frame. so let's keep
this possibility on the table
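as another cross-check, 'netstat -i' is just reformatting kernel counters; the same per-interface statistics are exposed in sysfs and can be read directly. a small sketch -- 'lo' is used below only so it runs anywhere, substitute the real NIC name:

```shell
# Read the kernel's per-interface error/drop counters straight from sysfs.
# "lo" is a placeholder so this runs on any Linux box -- use the real NIC
# (e.g. eth0) when actually hunting for drops.
dev=lo
for stat in tx_errors tx_dropped tx_fifo_errors rx_errors rx_dropped; do
    printf '%s: %s\n' "$stat" "$(cat /sys/class/net/$dev/statistics/$stat)"
done
```

it may also be worth a look at 'tc -s qdisc show dev <nic>' -- packets dropped at the qdisc (the software transmit queue) are counted there, and i'm not sure those would ever show up under TX-DRP.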
-i wonder if the OS tracks 'outbound ICMP drops'?
gnat> netstat -s -w
[...]
Icmp:
    3510705 ICMP messages received
    34835 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 111508
        timeout in transit: 26581
        echo requests: 84611
        echo replies: 3288005
    7155838 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 64167
        echo request: 7007060
        echo replies: 84611
i wonder what 'input ICMP message failed' means? are those ICMP Echos for which
the OS didn't receive any replies? or are those ICMP messages which the OS
received but dropped due to resource constraints? or ... ICMP messages which
the OS was asked to send but which the OS dropped due to resource constraints?
this morning, i poked through Benvenuti's "Understanding Linux Network
Internals", without success. i have 'Understanding the Linux Kernel' on order.
i should probably dig up Stevens' 'TCP/IP Illustrated, Vol. 2: The
Implementation' as well. any other tips on where to dig to learn what "input
ICMP message failed" means?
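in the meantime, the raw counters behind 'netstat -s' live in /proc/net/snmp, and pairing the header row with the value row names each field. if i'm reading the net-tools source right, 'input ICMP message failed' is printed from the Icmp InErrors field, which the kernel bumps in its icmp_rcv() error paths -- i.e., ICMP the box received but threw away (bad checksum, truncated, etc.), not unanswered echos, and not failed sends (those would land under 'ICMP messages failed' in the output half). i'm treating that as an assumption until i verify it against the kernel source:

```shell
# Pair the header row with the value row of the Icmp lines in /proc/net/snmp
# (the same counters netstat -s prints).  Output is one "FieldName value"
# pair per line, e.g. "InErrors 34835" for "input ICMP message failed".
awk '/^Icmp:/ {
    if (!seen) { split($0, hdr); seen = 1 }              # first Icmp: line = field names
    else { for (i = 2; i <= NF; i++) print hdr[i], $i }  # second = values
}' /proc/net/snmp
```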
-now, if my application were CPU-starved, wouldn't i see the pings go out
eventually? i mean, my PoCo pinger might have timed out already and have
declared the ping lost ... but wouldn't i see the outbound ICMP Echos appear,
eventually, in the packet trace? [by 'eventually', i mean delayed by seconds,
not delayed by minutes ... i haven't poked deep enough into the trace to see
whether or not the ICMP Echos appear minutes late ... not sure how i would
set up that experiment, in fact]
-yes, i do tend to burst the pings:
$max_pings = 100;
[...]
# Throw out more pings
while (@addrs and $heap->{open_pings} < $max_pings) {
    my $addr = shift @addrs;
    log_it("Pinging $addr") if ($debug == 5 or $debug == 6);
    # "Pinger, do a ping and return the results as a pong event. The
    # address to ping is $addr."
    $result = $kernel->post( 'pinger',
                             'ping',
                             'pong',
                             $addr,
                             $timeout{$addr}
                           );
[...]
i suppose i could try dropping $max_pings to something smaller. but i'm not
ready to actually solve the problem yet; first, i want to verify that i
understand it -- namely, where exactly the OS is dropping the ball
==> anyway, so this is why i'm now checking the return code on post() ... that's
a start, perhaps, but as you point out, hardly the end of my troubleshooting
efforts. i'm beginning to think that what i really want is DTrace for Linux ...
which doesn't exist ... but perhaps i can squeeze enough juice out of SystemTap
to tackle this. stapprobes.netdev should allow me to see when data arrives on
the NIC ... so i could at least tell whether or not the ICMP Echo arrives at the
NIC ... and perhaps stapprobes.socket would tell me whether or not the OS
received the request to create a raw socket to service the ICMP Echo
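for example, something along these lines -- an untested sketch, assuming SystemTap plus matching kernel-debuginfo are installed and it's run as root. netdev.transmit fires per frame at the driver boundary, so it can't filter on ICMP by itself, but it would show whether anything at all is reaching the NIC during the quiet periods:

```shell
# Untested sketch: tally frames handed to each NIC's transmit routine,
# reported every 10 seconds.  Requires systemtap, kernel debuginfo, root.
stap -e '
global tx
probe netdev.transmit {
    tx[dev_name] <<< length
}
probe timer.s(10) {
    foreach (d in tx)
        printf("%s: %d frames queued for transmit\n", d, @count(tx[d]))
    delete tx
}
'
```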
==> yes, i am poking at this from other angles: what changed recently? well,
the answer is: a lot. i wiped the box, which was running SuSE 9.3, and
installed CentOS 5.3. that's quite a bit of change, right there! i'm using
perl-5.10.0 now, whereas i was using perl-5.8.8. my other NodeWatch boxes
already run CentOS 5.3 ... so i'm gradually replacing perl-5.8.8 with
perl-5.10.0 on them ... that would be a clue, if the problem starts appearing
once i'm using perl-5.10.0. the actual applications & cron jobs i run today on
the box are the same ones i was running prior to the OS change. yes, i'm staring at
the timing involved ... at this point, the timing has become erratic: that
1:25am pattern has dissolved
anyway, if you have additional pointers, please let me know
--sk
Rocco Caputo wrote:
[...]
Time-based failures implicate periodic jobs. What's cron doing around
that time?
[...]