hi rocco,
yes, ok, i think i'm following you:
-i know that these ICMP Echos aren't leaving the NIC (during these multi-minute
periods), from the sniffer i have posted just outside the box
-many ICMP Echos are exiting the NIC ... but i happen to know which boxes i'm
pinging using Nagios, and which i'm pinging using NodeWatch (which employs POE),
and the pings i'm seeing are headed to/from the Nagios-monitored boxes, not the
NodeWatch-monitored boxes. additionally, NodeWatch logs to syslog when it is
emitting pings ... and, in at least one of my traces, no ICMP Echos leave the
box at all at that point ... (presumably, Nagios was off doing something else at
the time)
-so these ICMP Echos are being dropped somewhere inside the box
-as you say, perhaps the NIC's outbound buffer is overflowing -- i hadn't
thought of that. checking the output of 'netstat -i' ... i see zeros underneath
the TX-ERR, TX-DRP, and TX-OVR columns (and under the RX-ERR/DRP/OVR columns as
well). this suggests that the NIC isn't dropping frames ... but i haven't a
clue how reliable this output is. i could imagine a NIC driver which doesn't
update the relevant counter somewhere, when it discards a frame. so let's keep
this possibility on the table
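as another cross-check, 'netstat -i' is just reformatting kernel counters; the same per-interface statistics are exposed in sysfs and can be read directly. a small sketch -- 'lo' is used below only so it runs anywhere, substitute the real NIC name:

```shell
# Read the kernel's per-interface error/drop counters straight from sysfs.
# "lo" is a placeholder so this runs on any Linux box -- use the real NIC
# (e.g. eth0) when actually hunting for drops.
dev=lo
for stat in tx_errors tx_dropped tx_fifo_errors rx_errors rx_dropped; do
    printf '%s: %s\n' "$stat" "$(cat /sys/class/net/$dev/statistics/$stat)"
done
```

it may also be worth a look at 'tc -s qdisc show dev <nic>' -- packets dropped at the qdisc (the software transmit queue) are counted there, and i'm not sure those would ever show up under TX-DRP.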
-i wonder if the OS tracks 'outbound ICMP drops'?
gnat> netstat -s -w
[...]
Icmp:
    3510705 ICMP messages received
    34835 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 111508
        timeout in transit: 26581
        echo requests: 84611
        echo replies: 3288005
    7155838 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 64167
        echo request: 7007060
        echo replies: 84611
i wonder what 'input ICMP message failed' means? are those ICMP Echos for which
the OS didn't receive any replies? or are those ICMP messages which the OS
received but dropped due to resource constraints? or ... ICMP messages which
the OS was asked to send but which the OS dropped due to resource constraints?
this morning, i poked through Benvenuti's "Understanding Linux Network
Internals", without success. i have 'Understanding the Linux Kernel' on order.
i should probably dig up Stevens' 'TCP/IP Illustrated, Vol. 2: The
Implementation' as well. any other tips on where to dig to learn what "input
ICMP message failed" means?
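in the meantime, the raw counters behind 'netstat -s' live in /proc/net/snmp, and pairing the header row with the value row names each field. if i'm reading the net-tools source right, 'input ICMP message failed' is printed from the Icmp InErrors field, which the kernel bumps in its icmp_rcv() error paths -- i.e., ICMP the box received but threw away (bad checksum, truncated, etc.), not unanswered echos, and not failed sends (those would land under 'ICMP messages failed' in the output half). i'm treating that as an assumption until i verify it against the kernel source:

```shell
# Pair the header row with the value row of the Icmp lines in /proc/net/snmp
# (the same counters netstat -s prints).  Output is one "FieldName value"
# pair per line, e.g. "InErrors 34835" for "input ICMP message failed".
awk '/^Icmp:/ {
    if (!seen) { split($0, hdr); seen = 1 }              # first Icmp: line = field names
    else { for (i = 2; i <= NF; i++) print hdr[i], $i }  # second = values
}' /proc/net/snmp
```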
-now, if my application were CPU-starved, wouldn't i see the pings go out
eventually? i mean, my PoCo pinger might have timed out already and have
declared the ping lost ... but wouldn't i see the outbound ICMP Echos appear,
eventually, in the packet trace? [by 'eventually', i mean delayed by seconds,
not delayed by minutes ... i haven't poked deep enough into the trace to see
whether or not the ICMP Echos appear minutes late ... not sure how i would
set up that experiment, in fact]
-yes, i do tend to burst the pings:
$max_pings = 100;
[...]
# Throw out more pings
while (@addrs and $heap->{open_pings} < $max_pings) {
    my $addr = shift @addrs;
    log_it("Pinging $addr") if ($debug == 5 or $debug == 6);
    # "Pinger, do a ping and return the results as a pong event. The
    # address to ping is $addr."
    $result = $kernel->post( 'pinger',
                             'ping',
                             'pong',
                             $addr,
                             $timeout{$addr}
                           );
[...]
i suppose i could try dropping $max_pings to something smaller. but i'm not
ready to actually solve the problem yet; first, i want to verify that i
understand it -- namely, where exactly the OS is dropping the ball
==> anyway, so this is why i'm now checking the return code on post() ... that's
a start, perhaps, but as you point out, hardly the end of my troubleshooting
efforts. i'm beginning to think that what i really want is DTrace for Linux ...
which doesn't exist ... but perhaps i can squeeze enough juice out of SystemTap
to tackle this. stapprobes.netdev should allow me to see when data arrives on
the NIC ... so i could at least tell whether or not the ICMP Echo arrives at the
NIC ... and perhaps stapprobes.socket would tell me whether or not the OS
received the request to create a raw socket to service the ICMP Echo
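for example, something along these lines -- an untested sketch, assuming SystemTap plus matching kernel-debuginfo are installed and it's run as root. netdev.transmit fires per frame at the driver boundary, so it can't filter on ICMP by itself, but it would show whether anything at all is reaching the NIC during the quiet periods:

```shell
# Untested sketch: tally frames handed to each NIC's transmit routine,
# reported every 10 seconds.  Requires systemtap, kernel debuginfo, root.
stap -e '
global tx
probe netdev.transmit {
    tx[dev_name] <<< length
}
probe timer.s(10) {
    foreach (d in tx)
        printf("%s: %d frames queued for transmit\n", d, @count(tx[d]))
    delete tx
}
'
```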
==> yes, i am poking at this from other angles: what changed recently? well,
the answer is: a lot. i wiped the box, which was running SuSE 9.3, and
installed CentOS 5.3. that's quite a bit of change, right there! i'm using
perl-5.10.0 now, whereas i was using perl-5.8.8. my other NodeWatch boxes
already run CentOS 5.3 ... so i'm gradually replacing perl-5.8.8 with
perl-5.10.0 on them ... that would be a clue, if the problem starts appearing
once i'm using perl-5.10.0. the actual applications & cron jobs i run today on
the box are the same ones i was running prior to the OS change. yes, i'm staring at
the timing involved ... at this point, the timing has become erratic: that
1:25am pattern has dissolved
anyway, if you have additional pointers, please let me know
--sk
Rocco Caputo wrote:
[...]
Time-based failures implicate periodic jobs. What's cron doing around
that time?
[...]