Re: Diagnosing packet loss
On 24/11/2011 10:07, Kees Jan Koster wrote:

> This seems to be local to my machine. Here is another reason why I say that: I can reliably transmit data when I bind to the aliased IP address.
>
> If I use mtr to measure packet loss from saffron (the stricken machine) to cumin (another machine in a different data center) I see the following:
>
> saffron (ip address a) -> cumin: packet loss
> saffron (ip address b) -> cumin: no packet loss
> cumin -> saffron (ip address a): packet loss
> cumin -> saffron (ip address b): no packet loss
>
> This is consistent from running mtr for 5 minutes straight. This to me shows that the hardware is fine. Using the alias IP address I can run with no packet loss for as long as I like.
>
> Sooo... Now what? I am completely at a loss. :-/

Hmm... I wouldn't dismiss hardware problems just yet. Earlier you showed the ifconfig output for your problem machine:

[kjkoster@saffron ~]$ ifconfig bge0
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
        ether 00:e0:81:32:ed:b4
        inet 91.196.169.165 netmask 0xfffffff8 broadcast 91.196.169.167
        inet 91.196.169.166 netmask 0xffffffff broadcast 91.196.169.166
        media: Ethernet autoselect (100baseTX <full-duplex,flowcontrol,rxpause,txpause>)
        status: active

The two addresses differ only in their lowest bits: .165 is odd and .166 is even. Can you try temporarily using two even-numbered addresses and then two odd-numbered addresses and repeat your mtr tests? If the packet loss problem correlates with whether the address is even or odd, then I think that's pretty good evidence for a dud network interface: a one-bit problem in a memory register somewhere, occasionally flipping the least significant bit in the address to 0.

Another test would be to swap the configuration order (i.e. make .166 the primary address and .165 the alias) -- if it's always the first configured address that has problems, again that indicates memory trouble in the hardware.

Are these NICs built in to your motherboard? If so, they will almost certainly share a PHY, which is where the problem would be, and why swapping the cables between interfaces made no difference. Unfortunately, in that case, to fix the problem you'll either have to swap out the motherboard or add a separate NIC to your system. Hopefully the system is still under warranty.

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
7 Priory Courtyard, Flat 3, Ramsgate, Kent, CT11 9PW
PGP: http://www.infracaninophile.co.uk/pgpkey
JID: matt...@infracaninophile.co.uk
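
[Editorial sketch, not part of the original mail.] The odd/even test Matthew proposes is easier to plan once you can see which bits actually separate two candidate addresses. The Python below is a minimal illustration; `differing_bits` and `is_odd` are hypothetical helper names, not anything from the thread. Strictly speaking, .165 (binary 10100101) and .166 (binary 10100110) differ in their two lowest bits, but they are still an odd/even pair, so the experiment remains a reasonable probe of the least-significant-bit hypothesis.

```python
import ipaddress

def differing_bits(addr_a, addr_b):
    """Return bit positions (0 = least significant) where two IPv4 addresses differ."""
    diff = int(ipaddress.IPv4Address(addr_a)) ^ int(ipaddress.IPv4Address(addr_b))
    return [bit for bit in range(32) if (diff >> bit) & 1]

def is_odd(addr):
    """True if the least significant bit of the IPv4 address is set."""
    return (int(ipaddress.IPv4Address(addr)) & 1) == 1

primary, alias = "91.196.169.165", "91.196.169.166"
print(differing_bits(primary, alias))   # [0, 1]
print(is_odd(primary), is_odd(alias))   # True False
```

If loss follows the odd (or even) addresses across several pairs, that supports the stuck-bit theory; if it follows whichever address is configured first, it points at the second test Matthew describes.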
Re: Diagnosing packet loss
Well, 1% is not good but I've seen worse for sure! Sounds like you tried the obvious. I would recommend a different IP to rule out a dupe IP; else it must be NIC related - either hardware or driver. Also, perhaps swap cables and ports with a working machine and see if the problem follows or stays put.

----- Original Message -----
From: Kees Jan Koster [mailto:kjkos...@gmail.com]
Sent: Tuesday, November 22, 2011 02:33 PM
To: freebsd-questions@freebsd.org <freebsd-questions@freebsd.org>
Subject: Diagnosing packet loss

Dear All,

I am stuck with a machine that shows serious packet loss (about 1% of all traffic is dropped). I tried the obvious (new network cable, different switch port, different ethernet interface on the machine), but the problems remain. Another machine that sits in the same rack and is hooked up to the same switch shows no such packet loss issues.

The problematic machine is a dual Opteron with FreeBSD 8.2-STABLE from Thu Aug 11 14:05:47 CEST 2011.

The machine is lightly loaded. A MySQL slave is running, but the machine is not serving queries. Plus a Munin server process.

I am at a loss where to start diagnosing this. Can you advise me where to look? Are there network buffers that may be overflowing?

-- 
Kees Jan

http://java-monitor.com/
kjkos...@kjkoster.org
+31651838192

Change is good. Granted, it is good in retrospect, but change is good.

_______________________________________________
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org

This email is intended to be reviewed by only the intended recipient and may contain information that is privileged and/or confidential. If you are not the intended recipient, you are hereby notified that any review, use, dissemination, disclosure or copying of this email and its attachments, if any, is strictly prohibited. If you have received this email in error, please immediately notify the sender by return email and delete this email from your system.
Re: Diagnosing packet loss
Dear Gary,

Thank you for your reply. Your comment about dupe IP triggered something that I failed to mention: the interface is aliased. It has two IP addresses. IP address a and it has an alias IP address b. I just tested binding mtr to each of these interfaces separately to measure packet loss.

If I use mtr to measure packet loss from saffron (the stricken machine) to cumin (another machine in a different data center) I see the following:

saffron (ip address a) -> cumin: packet loss
saffron (ip address b) -> cumin: no packet loss
cumin -> saffron (ip address a): packet loss
cumin -> saffron (ip address b): no packet loss

This is consistent from running mtr for 5 minutes straight. This to me shows that the hardware is fine. Using the alias IP address I can run with no packet loss for as long as I like.

Hum... Could it be that my switch does not support IP aliasing? Then why is there packet loss only on one IP and not on both?

This is getting weirder and weirder.

Kees Jan

-- 
Kees Jan

http://java-monitor.com/
kjkos...@kjkoster.org
+31651838192

I hate unit tests; I much prefer the illusion that there are no errors in my code. -- Hendrik Muller
Re: Diagnosing packet loss
On Tue, Nov 22, 2011 at 1:58 PM, Kees Jan Koster <kjkos...@gmail.com> wrote:

> Thank you for your reply. Your comment about dupe IP triggered something that I failed to mention: the interface is aliased. It has two IP addresses. IP address a and it has an alias IP address b. I just tested binding mtr to each of these interfaces separately to measure packet loss.

Show us the ifconfig output. My guess is that the alias is incorrectly configured.
Re: Diagnosing packet loss
On 22/11/2011 20:33, Kees Jan Koster wrote:

> I am stuck with a machine that shows serious packet loss (about 1% of all traffic is dropped). I tried the obvious (new network cable, different switch port, different ethernet interface on the machine), but the problems remain. Another machine that sits in the same rack and is hooked up to the same switch shows no such packet loss issues.
>
> The problematic machine is a dual Opteron with FreeBSD 8.2-STABLE from Thu Aug 11 14:05:47 CEST 2011.
>
> The machine is lightly loaded. A MySQL slave is running, but the machine is not serving queries. Plus a Munin server process.
>
> I am at a loss where to start diagnosing this. Can you advise me where to look? Are there network buffers that may be overflowing?

You say lightly loaded, but how much does that actually equate to in kb/s or Mb/s? I'd call anything less than about 1Mb/s on a gigabit ethernet link pretty light, but other people have different ideas.

Check for duplex mismatch -- normally everything just works when the NIC and the switch are allowed to autonegotiate, but every so often some bright spark gets the idea that wiring down the speed setting is a good idea. Trouble is you have to set *both* ends of the ethernet link to the same settings -- if one end is trying to autonegotiate and the other is fixed, you'll end up with the auto end defaulting to 100baseTX half-duplex and performance will suck, and suck increasingly hard as network load goes up. Amazing how often that 'set both ends the same' thing leads to grief.

Another hideously embarrassing error would be to spend ages debugging before finding out you had a duplicate IP number on your network. Can you definitely rule that out?

A third networking problem that also has the potential to make you the butt of a few jokes is if your network cables are kinked, crushed, over-stretched or simply cable-tied too tightly. Anything like that can cause signal leakage between the pairs of conductors in the cable, which can be enough to disrupt packet transmission. Simply snipping through a too-tight cable tie can have a magical-seeming effect.

What sort of NICs are there on your machine? It's well known that re(4) interfaces simply cannot keep up with the throughput of a good server NIC like em(4) or bge(4). [But re(4)'s are cheap and good enough for most home systems...] If you can, try swapping in a reasonably good NIC -- beg, borrow or steal one from another machine just for a few hours to use for testing -- and see if that cures the problem.

Other considerations: are you doing anything beyond just plain ethernet networking? Any VLANs? What about IPsec or other tunnelled/encapsulated traffic? Are you using RSTP or lagg to make your networking resilient to failures? If the answer to any of these is yes -- does temporarily disabling that feature and doing it simple and stupid help with the packet loss?

Do you get the same sort of packet loss if you take the switch away and just run a cable direct between two machines? (Nb. If your NICs don't support auto MDI-X you'll need a crossover cable.)

On another host on your net, can you use wireshark to capture and examine the traffic from your failing machine? For best results, either wire the two machines directly together or configure the switch port your wireshark box is connected to as a /monitor/ port so it sees all the traffic coming out of your problem box.

Does your NIC have hardware checksumming? If so, does disabling that help with the error rate? (See ifconfig(8) and the man page for your NIC in section 4 for how.) There have been a number of instances of buggy checksumming causing problems in the past. Nb. with hardware checksumming, the checksum field is calculated and inserted in packets very late, after the last point at which you can examine the packets as they leave your machine. That makes it look like the checksums are all wrong if you sample the traffic on the originating machine. This is why you need to use another, external machine to watch for this sort of error.

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.
7 Priory Courtyard, Flat 3, Ramsgate, Kent, CT11 9PW
PGP: http://www.infracaninophile.co.uk/pgpkey
JID: matt...@infracaninophile.co.uk
Re: Diagnosing packet loss
On Tue, Nov 22, 2011 at 9:33 PM, Kees Jan Koster <kjkos...@gmail.com> wrote:

> Dear All,
>
> I am stuck with a machine that shows serious packet loss (about 1% of all traffic is dropped). I tried the obvious (new network cable, different switch port, different ethernet interface on the machine), but the problems remain. Another machine that sits in the same rack and is hooked up to the same switch shows no such packet loss issues.
>
> The problematic machine is a dual Opteron with FreeBSD 8.2-STABLE from Thu Aug 11 14:05:47 CEST 2011.
>
> The machine is lightly loaded. A MySQL slave is running, but the machine is not serving queries. Plus a Munin server process.
>
> I am at a loss where to start diagnosing this. Can you advise me where to look? Are there network buffers that may be overflowing?
>
> --

To check input/output errors and collisions:

    netstat -in

Detailed TCP/IP statistics:

    netstat -s
    or
    netstat -ss

Checking Receive and Send Queue:

    netstat -an -f inet

Buffers:

    netstat -m

Adriaan
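
[Editorial sketch, not part of the original mail.] To turn the `netstat -in` counters into something comparable with the "about 1%" figure from the original question, a short Python sketch can compute per-interface error rates. It assumes the column layout `Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll`, which is what FreeBSD 8.x is expected to print; check the header on your own system. The sample counters are invented for illustration and `link_error_rates` is a hypothetical helper name.

```python
# Hypothetical sample: the counters below are made up, only the layout matters.
SAMPLE = """\
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
bge0   1500 <Link#1>      00:e0:81:32:ed:b4  1000000 10000     0  2000000     0     0
bge0   1500 91.196.169.16 91.196.169.165      990000     -     -  1990000     -     -
lo0   16384 <Link#3>                            5000     0     0     5000     0     0
"""

def link_error_rates(netstat_output):
    """Map interface name to input/output error ratios, using link-layer rows only."""
    lines = [ln for ln in netstat_output.splitlines() if ln.strip()]
    header = lines[0].split()
    rates = {}
    for line in lines[1:]:
        fields = line.split()
        # Skip per-address rows (their error columns are "-") and rows with a
        # blank Address column, which would shift every later field by one.
        if len(fields) != len(header) or not fields[2].startswith("<Link#"):
            continue
        row = dict(zip(header, fields))
        ipkts, opkts = int(row["Ipkts"]), int(row["Opkts"])
        rates[row["Name"]] = {
            "in": int(row["Ierrs"]) / ipkts if ipkts else 0.0,
            "out": int(row["Oerrs"]) / opkts if opkts else 0.0,
            "collisions": int(row["Coll"]),
        }
    return rates

for name, rate in link_error_rates(SAMPLE).items():
    print(name, "in: {:.2%}".format(rate["in"]), "out: {:.2%}".format(rate["out"]))
```

Note that interface error counters only see what the NIC itself reports; loss that happens elsewhere on the path (as measured by mtr) will not necessarily show up here.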
Re: Diagnosing packet loss
Dear Michael,

[kjkoster@saffron ~]$ ifconfig bge0
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
        ether 00:e0:81:32:ed:b4
        inet 91.196.169.165 netmask 0xfffffff8 broadcast 91.196.169.167
        inet 91.196.169.166 netmask 0xffffffff broadcast 91.196.169.166
        media: Ethernet autoselect (100baseTX <full-duplex,flowcontrol,rxpause,txpause>)
        status: active

[kjkoster@saffron ~]$ fgrep bge0 /etc/rc.conf
ifconfig_bge0="inet 91.196.169.165 netmask 255.255.255.248"
ifconfig_bge0_alias0="91.196.169.166 netmask 255.255.255.255"

That broadcast address and netmask look wrong for sure. Should I just change that to 255.255.255.248 as well?

Kees Jan

-- 
Kees Jan

http://java-monitor.com/
kjkos...@kjkoster.org
+31651838192

I hate unit tests; I much prefer the illusion that there are no errors in my code. -- Hendrik Muller
Re: Diagnosing packet loss
On Tue, Nov 22, 2011 at 4:11 PM, Kees Jan Koster <kjkos...@gmail.com> wrote:

> [kjkoster@saffron ~]$ ifconfig bge0
> bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>         options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
>         ether 00:e0:81:32:ed:b4
>         inet 91.196.169.165 netmask 0xfffffff8 broadcast 91.196.169.167
>         inet 91.196.169.166 netmask 0xffffffff broadcast 91.196.169.166
>         media: Ethernet autoselect (100baseTX <full-duplex,flowcontrol,rxpause,txpause>)
>         status: active
>
> [kjkoster@saffron ~]$ fgrep bge0 /etc/rc.conf
> ifconfig_bge0="inet 91.196.169.165 netmask 255.255.255.248"
> ifconfig_bge0_alias0="91.196.169.166 netmask 255.255.255.255"
>
> That broadcast address and netmask look wrong for sure. Should I just change that to 255.255.255.248 as well?

No, that is correct. Leave your alias alone if you want it to continue to work.

-- 
Adam Vande More
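
[Editorial sketch, not part of the original mail.] The reason the all-ones mask is right: the alias .166 sits inside the same /29 as the primary address, and ifconfig(8) on FreeBSD asks for a non-conflicting netmask (usually 0xffffffff) on an alias in the same subnet as the first address. A small Python check using only the standard `ipaddress` module shows the arithmetic behind the two `inet` lines:

```python
import ipaddress

# Addresses and masks exactly as given in the rc.conf excerpt from the thread.
primary = ipaddress.ip_interface("91.196.169.165/255.255.255.248")
alias = ipaddress.ip_interface("91.196.169.166/255.255.255.255")

print(primary.network)                    # 91.196.169.160/29
print(primary.network.broadcast_address)  # 91.196.169.167
print(hex(int(primary.netmask)))          # 0xfffffff8
print(alias.ip in primary.network)        # True: alias shares the primary's subnet
print(hex(int(alias.netmask)))            # 0xffffffff
```

So the broadcast address of .167 on the primary and the host-only mask on the alias are both consistent with the configuration, which is why the packet loss has to be explained some other way.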
Re: Diagnosing packet loss
Matthew suggests turning off hardware checksums - it won't hurt to give that a try:

    ifconfig bge0 -txcsum -rxcsum
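
[Editorial sketch, not part of the original mail.] In ifconfig(8), checksum offload is controlled by the `txcsum`/`rxcsum` parameters (with a leading `-` to disable), and the current state shows up as `TXCSUM`/`RXCSUM` on the `options=` line. To confirm the change took effect, a Python sketch like the following can read that line from captured ifconfig output. It assumes the usual `options=<hex><NAME,NAME,...>` form; `offload_flags` is a hypothetical helper name.

```python
import re

# options= line as posted in the thread, before any change was made.
SAMPLE = """\
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
"""

def offload_flags(ifconfig_output):
    """Return the set of capability names on the options= line (empty if absent)."""
    match = re.search(r"options=[0-9a-fA-F]+<([^>]*)>", ifconfig_output)
    return set(match.group(1).split(",")) if match else set()

def checksum_offload_enabled(ifconfig_output):
    """True if either receive or transmit checksum offload is still listed."""
    return bool({"RXCSUM", "TXCSUM"} & offload_flags(ifconfig_output))

print(checksum_offload_enabled(SAMPLE))  # True
```

After disabling offload, the same check run against fresh ifconfig output should report False; if it does not, the driver may not allow the capability to be switched off.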
RE: Diagnosing packet loss
I noticed the Ethernet flow control is enabled - or seems to be? Perhaps DISABLE Ethernet flow control (rxpause/txpause) in your box AND the switch.
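
[Editorial sketch, not part of the original mail.] Both the flow-control observation above and the duplex-mismatch check suggested earlier in the thread come down to reading the `media:` line of the ifconfig output. The Python below is a minimal parser for that line, assuming the `(<type> <opt,opt,...>)` form shown in the thread; `media_options` is a hypothetical helper name. How flow control is actually switched off depends on the driver and the switch, so this only reports the negotiated state.

```python
import re

# media: line as posted in the thread.
SAMPLE = "        media: Ethernet autoselect (100baseTX <full-duplex,flowcontrol,rxpause,txpause>)"

def media_options(ifconfig_output):
    """Return (active media type, set of media options), or None if no match."""
    match = re.search(r"media:.*\((\S+)\s*<([^>]*)>\)", ifconfig_output)
    if not match:
        return None
    return match.group(1), set(match.group(2).split(","))

media, options = media_options(SAMPLE)
print(media)                                           # 100baseTX
print("half-duplex" in options)                        # False: no sign of a duplex mismatch here
print(bool({"flowcontrol", "rxpause", "txpause"} & options))  # True: pause frames negotiated
```

Running the same check before and after a change on both the host and the switch makes it easy to confirm that the two ends really agree.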