Re: Diagnosing packet loss

2011-11-24 Thread Matthew Seaman
On 24/11/2011 10:07, Kees Jan Koster wrote:
 This seems to be local to my machine. Here is another reason why I
 say that: I can reliably transmit data when I bind to the aliased IP
 address: If I use mtr to measure packet loss from saffron (the stricken
 machine) to cumin (another machine in a different data center) I see the
 following:
 
  saffron (ip address a) - cumin: packet loss
  saffron (ip address b) - cumin: no packet loss
 
  cumin - saffron (ip address a): packet loss
  cumin - saffron (ip address b): no packet loss
 
 This is consistent from running mtr for 5 minutes straight. This to
 me shows that the hardware is fine. Using the alias IP address I can
 run with no packet loss for as long as I like.
 
 Sooo... Now what? I am completely at a loss. :-/

Hmm... I wouldn't dismiss hardware problems just yet. Earlier you showed
the ifconfig output for your problem machine:

 [kjkoster@saffron ~]$ ifconfig bge0
 bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
    ether 00:e0:81:32:ed:b4
    inet 91.196.169.165 netmask 0xfffffff8 broadcast 91.196.169.167
    inet 91.196.169.166 netmask 0xffffffff broadcast 91.196.169.166
    media: Ethernet autoselect (100baseTX <full-duplex,flowcontrol,rxpause,txpause>)
    status: active

The two addresses differ only in their low-order bits: one is odd, the
other even.  Can you try
temporarily using two even-numbered addresses and then two odd-numbered
addresses and repeat your mtr tests?  If the packet loss problem
correlates with whether the address is even or odd, then I think that's
pretty good evidence for a dud network interface: a one-bit problem in a
memory register somewhere, occasionally flipping the least significant
bit in the address to 0.
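
To make the even/odd reasoning concrete, here is a small sketch in Python that prints the last octets of the two addresses from the ifconfig output in binary. It only illustrates the arithmetic and says nothing about what the NIC actually does. Note that .165 and .166 differ in their two lowest bits, which is why testing with two even and then two odd addresses isolates the least significant bit more cleanly than comparing these two.

```python
# Last octets of the two addresses shown in the ifconfig output.
primary, alias = 165, 166

for octet in (primary, alias):
    parity = "odd" if octet & 1 else "even"
    print(f"{octet:3d} = {octet:08b} ({parity})")

# XOR marks every bit position where the two octets disagree.
diff = primary ^ alias
print(f"xor = {diff:08b}")

# What the odd address would become if its lowest bit were cleared.
flipped = primary & ~1
print(f"{primary} with the lowest bit cleared is {flipped}")
```

If the lowest bit of the odd address really were being cleared now and then, the affected packets would carry .164 instead of .165, so a capture on another host could look for that address.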

Another test would be to swap the configuration order (ie. make .166 the
primary address and .165 the alias) -- if it's always the first
configured address that has problems, again that indicates memory
trouble in the hardware.

Are these NICs built into your motherboard?  If so, they will almost
certainly share a PHY, which is where the problem would be, and why
swapping the cables between interfaces made no difference.
Unfortunately in that case to fix the problem, you'll either have to
swap out the motherboard or add a separate NIC card to your system.
Hopefully the system is still under warranty.

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.   7 Priory Courtyard
  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey Ramsgate
JID: matt...@infracaninophile.co.uk   Kent, CT11 9PW





Re: Diagnosing packet loss

2011-11-22 Thread Gary Gatten
Well, 1% is not good but I've seen worse for sure!  Sounds like you tried the 
obvious.  I would recommend a different IP to rule out a dupe ip; else it must 
be NIC related - either hardware or driver.  Also, perhaps swap cables and 
ports with a working machine and see if the problem follows or stays put.

- Original Message -
From: Kees Jan Koster [mailto:kjkos...@gmail.com]
Sent: Tuesday, November 22, 2011 02:33 PM
To: freebsd-questions@freebsd.org freebsd-questions@freebsd.org
Subject: Diagnosing packet loss

Dear All,

I am stuck with a machine that shows serious packet loss (about 1% of all 
traffic is dropped). I tried the obvious (new network cable, different switch 
port, different ethernet interface on the machine), but the problems remain.

Another machine that sits in the same rack and is hooked up to the same switch 
shows no such packet loss issues. The problematic machine is a dual Opteron 
with FreeBSD 8.2-STABLE from Thu Aug 11 14:05:47 CEST 2011.

The machine is lightly loaded. A MySQL slave is running, but the machine is not 
serving queries. Plus a Munin server process.

I am at a loss where to start diagnosing this. Can you advise me where to look? 
Are there network buffers that may be overflowing?
--
Kees Jan

http://java-monitor.com/
kjkos...@kjkoster.org
+31651838192

Change is good. Granted, it is good in retrospect, but change is good.

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org





This email is intended to be reviewed by only the intended recipient
 and may contain information that is privileged and/or confidential.
 If you are not the intended recipient, you are hereby notified that
 any review, use, dissemination, disclosure or copying of this email
 and its attachments, if any, is strictly prohibited.  If you have
 received this email in error, please immediately notify the sender by
 return email and delete this email from your system.


Re: Diagnosing packet loss

2011-11-22 Thread Kees Jan Koster
Dear Gary,

Thank you for your reply. Your comment about dupe IP triggered something that I 
failed to mention: the interface is aliased. It has two IP addresses. IP 
address a and it has an alias IP address b. I just tested binding mtr to each 
of these addresses separately to measure packet loss.

If I use mtr to measure packet loss from saffron (the stricken machine) to 
cumin (another machine in a different data center) I see the following:

  saffron (ip address a) - cumin: packet loss
  saffron (ip address b) - cumin: no packet loss

  cumin - saffron (ip address a): packet loss
  cumin - saffron (ip address b): no packet loss

This is consistent from running mtr for 5 minutes straight. This to me shows 
that the hardware is fine. Using the alias IP address I can run with no packet 
loss for as long as I like.

Hum... Could it be that my switch does not support IP aliasing? Then why is 
there packet loss only on one IP and not on both?

This is getting weirder and weirder.

Kees Jan




--
Kees Jan

http://java-monitor.com/
kjkos...@kjkoster.org
+31651838192

I hate unit tests; I much prefer the illusion that there are no errors in my 
code.
 -- Hendrik Muller



Re: Diagnosing packet loss

2011-11-22 Thread Michael Sierchio
On Tue, Nov 22, 2011 at 1:58 PM, Kees Jan Koster kjkos...@gmail.com wrote:

 Thank you for your reply. Your comment about dupe IP triggered something that 
 I failed to mention: the interface is aliased. It has two IP addresses. IP 
 address a and it has an alias IP address b. I just tested binding mtr to each 
 of these interfaces separately to measure packet loss.

Show us the ifconfig output.  My guess is that the alias is
incorrectly configured.


Re: Diagnosing packet loss

2011-11-22 Thread Matthew Seaman
On 22/11/2011 20:33, Kees Jan Koster wrote:
 I am stuck with a machine that shows serious packet loss (about 1% of
 all traffic is dropped). I tried the obvious (new network cable,
 different switch port, different ethernet interface on the machine),
 but the problems remain.
 
 Another machine that sits in the same rack and is hooked up to the
 same switch shows no such packet loss issues. The problematic machine
 is a dual Opteron with FreeBSD 8.2-STABLE from Thu Aug 11 14:05:47
 CEST 2011.
 
 The machine is lightly loaded. A MySQL slave is running, but the
 machine is not serving queries. Plus a Munin server process.
 
 I am at a loss where to start diagnosing this. Can you advise me
 where to look? Are there network buffers that may be overflowing?

You say lightly loaded, but how much does that actually equate to in
kb/s or Mb/s?  I'd call anything less than about 1Mb/s on a gigabit ethernet
link pretty light, but other people have different ideas.

Check for duplex mismatch -- normally everything just works allowing the
NIC and the switch to autonegotiate, but every so often some bright
spark gets the idea that wiring down the speed setting is a good idea.
Trouble is you have to set *both* ends of the ethernet link to the same
settings -- if one end is trying to auto and the other is fixed, you'll
end up with the auto end defaulting to 100baseTX half-duplex and
performance will suck, and suck increasingly hard as network load goes
up.  Amazing how often that 'set both ends the same' thing leads to grief.
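
A quick way to check the negotiated duplex is to read the media line of the ifconfig output. The sketch below does that in Python. The helper name duplex_of is hypothetical, and it assumes the usual FreeBSD form in which the media options include either full-duplex or half-duplex. The first sample line is the one reported in this thread; the second is an invented example of what a mismatch might look like.

```python
def duplex_of(ifconfig_text):
    """Return 'full', 'half', or None from the media line of ifconfig output."""
    for line in ifconfig_text.splitlines():
        if line.strip().startswith("media:"):
            if "full-duplex" in line:
                return "full"
            if "half-duplex" in line:
                return "half"
            return None  # media line present, but duplex not stated
    return None  # no media line at all


# Media line as reported for bge0 in this thread.
reported = "media: Ethernet autoselect (100baseTX <full-duplex,flowcontrol,rxpause,txpause>)"
# Hypothetical line for a link that fell back to half duplex.
fallback = "media: Ethernet autoselect (100baseTX <half-duplex>)"

print(duplex_of(reported))
print(duplex_of(fallback))
```

This only shows the local end. The switch port has to be checked separately, since a mismatch is by definition a disagreement between the two ends.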

Another hideously embarrassing error would be to spend ages debugging
before finding out you had a duplicate IP number on your network.  Can
you definitely rule that out?

A third networking problem that also has the potential to make you the
butt of a few jokes is if your network cables are kinked, crushed, over
stretched or simply cable-tied too tightly.  Anything like that can
cause signal leakage between the pairs of conductors in the cable which
can be enough to disrupt packet transmission.  Simply snipping through a
too-tight cable tie can have a magical-seeming effect.

What sort of NICs are there on your machine?  It's well known that re(4)
interfaces simply cannot keep up with the throughput of a good server
NIC like em(4) or bge(4).  [But re(4)'s are cheap and good enough for
most home systems...]  If you can try swapping in a reasonably good NIC
card -- beg, borrow or steal from another machine just for a few hours
to use for testing -- and see if that cures the problem.

Other considerations: are you doing anything beyond just plain ethernet
networking?  Any VLANs?  What about IPsec or other
tunnelled/encapsulated traffic?  Are you using RSTP or lagg to make your
networking resilient to failures?  If the answer to any of these is
yes -- does temporarily disabling that feature and doing it simple and
stupid help with the packet loss?

Do you get the same sort of packet loss if you take the switch away and
just run a cable direct between two machines?  (Nb. If your NICs don't
support auto MDI-X you'll need a crossover cable.)

On another host on your net, can you use wireshark to capture and
examine the traffic from your failing machine?  For best results, either
wire the two machines directly together or configure the switch port
your wireshark box is connected to as a /monitor/ port so it sees all
the traffic coming out of your problem box.

Does your NIC have hardware checksumming?  If so, does disabling that
help with the error rate? (see ifconfig(8) and the man page for your NIC
in section 4 for how.)  There have been a number of instances of buggy
checksumming causing problems in the past.  Nb. with hardware
checksumming, the checksum field is calculated and inserted in packets
very late; after any way of examining the packets as they leave your
machine has ceased to be possible.  Makes it look like the checksums are
all wrong if you sample the traffic on the originating machine.  This is
why you need to use another, external machine to watch for this sort of
error.
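
For reference, the checksum in question is the 16-bit ones' complement Internet checksum. The Python sketch below computes it for a sample IPv4 header. The header bytes are an illustrative example and were not captured from the machine in this thread. It shows why a packet sampled before the NIC has filled in the field looks wrong, while the same packet seen on the wire verifies.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum as used by IPv4, TCP and UDP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Sample 20-byte IPv4 header with the checksum field (bytes 10-11) still zero,
# which is roughly what a capture on the sending host sees with offload on.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)
print(f"checksum to insert: 0x{checksum:04x}")

# Once the field is filled in, the checksum over the whole header is zero.
filled = header[:10] + checksum.to_bytes(2, "big") + header[12:]
print(f"verify on the wire: 0x{internet_checksum(filled):04x}")
```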

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.   7 Priory Courtyard
  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey Ramsgate
JID: matt...@infracaninophile.co.uk   Kent, CT11 9PW





Re: Diagnosing packet loss

2011-11-22 Thread J65nko
On Tue, Nov 22, 2011 at 9:33 PM, Kees Jan Koster kjkos...@gmail.com wrote:
 Dear All,

 I am stuck with a machine that shows serious packet loss (about 1% of all 
 traffic is dropped). I tried the obvious (new network cable, different switch 
 port, different ethernet interface on the machine), but the problems remain.

 Another machine that sits in the same rack and is hooked up to the same 
 switch shows no such packet loss issues. The problematic machine is a dual 
 Opteron with FreeBSD 8.2-STABLE from Thu Aug 11 14:05:47 CEST 2011.

 The machine is lightly loaded. A MySQL slave is running, but the machine is 
 not serving queries. Plus a Munin server process.

 I am at a loss where to start diagnosing this. Can you advise me where to 
 look? Are there network buffers that may be overflowing?
 --

To check input/output errors and collisions : netstat -in

Detailed TCP/IP statistics: netstat -s  or  netstat -ss

Checking Receive and Send Queue : netstat -an -f inet

Buffers: netstat -m
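
As a rough illustration of how the netstat -in counters could be turned into an error rate, here is a Python sketch. The helper name error_rates is hypothetical. It assumes the column headers Ipkts, Ierrs, Opkts and Oerrs appear in the header row, and the sample text and its counter values are invented for the example, not taken from the machine in this thread.

```python
def error_rates(netstat_text):
    """Map interface name to (input error fraction, output error fraction)."""
    lines = netstat_text.strip().splitlines()
    header = lines[0].split()
    rates = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows that do not line up with the header
        row = dict(zip(header, fields))
        if not row["Network"].startswith("<Link"):
            continue  # only the link-level row is used here
        ipkts, opkts = int(row["Ipkts"]), int(row["Opkts"])
        ierrs, oerrs = int(row["Ierrs"]), int(row["Oerrs"])
        rates[row["Name"]] = (
            ierrs / ipkts if ipkts else 0.0,
            oerrs / opkts if opkts else 0.0,
        )
    return rates


# Invented sample output; the counters are made up for illustration.
SAMPLE = """\
Name    Mtu Network           Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
bge0   1500 <Link#1>          00:e0:81:32:ed:b4   200000  2000     0   150000     0     0
bge0   1500 91.196.169.160/29 91.196.169.165      180000     -     -   140000     -     -
"""

print(error_rates(SAMPLE))
```

A nonzero input error fraction points at the receive path (cable, port, NIC). Zero errors alongside visible loss suggests the packets are being dropped somewhere else.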

Adriaan


Re: Diagnosing packet loss

2011-11-22 Thread Kees Jan Koster
Dear Michael,

[kjkoster@saffron ~]$ ifconfig bge0
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
   options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
   ether 00:e0:81:32:ed:b4
   inet 91.196.169.165 netmask 0xfffffff8 broadcast 91.196.169.167
   inet 91.196.169.166 netmask 0xffffffff broadcast 91.196.169.166
   media: Ethernet autoselect (100baseTX <full-duplex,flowcontrol,rxpause,txpause>)
   status: active
[kjkoster@saffron ~]$ fgrep bge0 /etc/rc.conf
ifconfig_bge0="inet 91.196.169.165 netmask 255.255.255.248"
ifconfig_bge0_alias0="91.196.169.166 netmask 255.255.255.255"

That broadcast address and netmask look wrong for sure.

Should I just change that to 255.255.255.248 as well?

Kees Jan




--
Kees Jan

http://java-monitor.com/
kjkos...@kjkoster.org
+31651838192

I hate unit tests; I much prefer the illusion that there are no errors in my 
code.
 -- Hendrik Muller



Re: Diagnosing packet loss

2011-11-22 Thread Adam Vande More
On Tue, Nov 22, 2011 at 4:11 PM, Kees Jan Koster kjkos...@gmail.com wrote:

 [kjkoster@saffron ~]$ ifconfig bge0
 bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
    ether 00:e0:81:32:ed:b4
    inet 91.196.169.165 netmask 0xfffffff8 broadcast 91.196.169.167
    inet 91.196.169.166 netmask 0xffffffff broadcast 91.196.169.166
    media: Ethernet autoselect (100baseTX <full-duplex,flowcontrol,rxpause,txpause>)
    status: active
 [kjkoster@saffron ~]$ fgrep bge0 /etc/rc.conf
 ifconfig_bge0="inet 91.196.169.165 netmask 255.255.255.248"
 ifconfig_bge0_alias0="91.196.169.166 netmask 255.255.255.255"

 That broadcast address and netmask look wrong for sure.

 Should I just change that to 255.255.255.248 as well?


No, that is correct.  Leave your alias alone if you want it to continue to
work.
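
The numbers in the ifconfig output can be checked with Python's standard ipaddress module. This is only a sanity check of the arithmetic: the primary address with netmask 255.255.255.248 sits in a /29 whose broadcast address is .167, the alias falls inside that same subnet, and an all-ones netmask makes the alias its own broadcast address, which is what ifconfig reported.

```python
import ipaddress

# Addresses and netmasks as configured in /etc/rc.conf in this thread.
primary = ipaddress.ip_interface("91.196.169.165/255.255.255.248")
alias = ipaddress.ip_interface("91.196.169.166/255.255.255.255")

print("primary network:  ", primary.network)
print("primary broadcast:", primary.network.broadcast_address)
print("alias broadcast:  ", alias.network.broadcast_address)
print("alias inside primary subnet:", alias.ip in primary.network)
```

So the values that looked wrong are simply what a host-only (/32) netmask produces for an alias in the same subnet as the primary address.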

-- 
Adam Vande More


Re: Diagnosing packet loss

2011-11-22 Thread Michael Sierchio
Matthew suggests turning off hardware checksums - it won't hurt to
give that a try:

ifconfig bge0 -rxcsum -txcsum




RE: Diagnosing packet loss

2011-11-22 Thread Gary Gatten
I noticed the Ethernet flow control is enabled - or seems to be?  Perhaps 
DISABLE Ethernet flow control (rxpause/txpause) in your box AND the switch.

