Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken
On Sat, Apr 07, 2012 at 04:29:38AM +0100, Ben Hutchings wrote:
> I would like to take this upstream now, but first I need to check whether it has already been fixed after 2.6.32. Please can you test the current kernel package from testing, unstable or squeeze-backports (linux-image-3.2.0-2-amd64 or linux-image-3.2.0-0.bpo.2-amd64)?

I installed linux-image-3.2.0-0.bpo.2-amd64, plus the upgraded linux-base and initramfs-tools, plus the indicated firmware-bnx2 upgrade, and then rebooted into that kernel. The machine wouldn't respond to ping over the xenbr2 interface (the one with the default gateway). I logged into it fine through the xenbr54 interface and tried to ping the default gateway, and that didn't work either. This was with the workaround: only bnx2/eth2 in the bonding interface. Then I removed the default gateway and added it back, just to see if it would help, and then it started pinging. Weird.

After that, I tried to reproduce this bug but failed; it looks like the bug is fixed there. I noticed a significant lag with some of those bonding --detach/--change-active actions, but after a few seconds everything continued to work fine.

--
2. That which causes joy or happiness.

--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20120411122453.ga29...@entuzijast.net
Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken
On Wed, 2012-04-04 at 09:55 +0200, Josip Rodin wrote:
> On Mon, Apr 02, 2012 at 05:22:37AM +0100, Ben Hutchings wrote:
> > On Sun, 2012-04-01 at 12:40 +0200, Josip Rodin wrote:
> > > On Sun, Apr 01, 2012 at 03:09:56AM +0100, Ben Hutchings wrote:
> > > > I bet this is due to the combination of LRO plus bridging. We try to turn off LRO in devices under a bridge, but that won't work if there's an intermediate bonding device. If you run:
> > > > # ethtool -K eth0 lro off
> > > > # ethtool -K eth2 lro off
> > > > does the bridge start working?
> > >
> > > Err...
> > > % sudo ethtool -K eth0 lro off
> > > Cannot set large receive offload settings: Operation not supported
> > > % sudo ethtool -K eth2 lro off
> > > Cannot set large receive offload settings: Operation not supported
> >
> > Hmm. Well it shouldn't be a problem but you could try also turning off GRO (similar commands).
>
> Ah, there we go. Once I ran sudo ethtool -K eth0 gro off, sudo ifenslave bond54 eth0 produced a still-working bond54.

OK, this is quite unexpected. At least you have a workaround now (/usr/share/doc/ethtool/README.Debian.gz explains how to make this setting persistent).

> > > That's with eth0 removed from bonding, and eth2 inside.
> >
> > So the bonding device has only one slave now?
>
> Yes, it was like that.
>
> > What if you take the bonding device out completely and add eth2 directly to the bridge?
>
> I think I had already tested that and everything was fine, too. Do you want me to test that or is the GRO removal conclusive?

No need to test that.

I would like to take this upstream now, but first I need to check whether it has already been fixed after 2.6.32. Please can you test the current kernel package from testing, unstable or squeeze-backports (linux-image-3.2.0-2-amd64 or linux-image-3.2.0-0.bpo.2-amd64)?

Ben.

--
Ben Hutchings
Larkinson's Law: All laws are basically false.
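Once the workaround is made persistent, it may be worth confirming after a reboot that GRO (generic receive offload) really is off on each slave. A minimal sketch, assuming `ethtool -k <dev>` prints one `feature-name: on|off` line per feature; newer ethtool versions may indent sub-features or append annotations such as `[fixed]`, so the helper ignores leading whitespace and keeps only the first word after the colon. The name `offload_state` is hypothetical.

```shell
#!/bin/bash
# Sketch: print "on" or "off" for one offload feature, reading the output of
# `ethtool -k <dev>` on stdin. Assumes lines of the form "feature-name: on".
# Leading whitespace (indented sub-features) is ignored, and a trailing
# annotation such as "[fixed]" is dropped by keeping only the first word.
# Prints nothing if the feature name is not found.
offload_state() {
    awk -F': ' -v f="$1" '
        { name = $1; sub(/^[ \t]+/, "", name) }
        name == f { split($2, words, " "); print words[1]; exit }'
}
```

For example, `ethtool -k eth0 | offload_state generic-receive-offload` would be expected to print `off` once the workaround is in place.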
Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken
On Mon, Apr 02, 2012 at 05:22:37AM +0100, Ben Hutchings wrote:
> On Sun, 2012-04-01 at 12:40 +0200, Josip Rodin wrote:
> > On Sun, Apr 01, 2012 at 03:09:56AM +0100, Ben Hutchings wrote:
> > > I bet this is due to the combination of LRO plus bridging. We try to turn off LRO in devices under a bridge, but that won't work if there's an intermediate bonding device. If you run:
> > > # ethtool -K eth0 lro off
> > > # ethtool -K eth2 lro off
> > > does the bridge start working?
> >
> > Err...
> > % sudo ethtool -K eth0 lro off
> > Cannot set large receive offload settings: Operation not supported
> > % sudo ethtool -K eth2 lro off
> > Cannot set large receive offload settings: Operation not supported
>
> Hmm. Well it shouldn't be a problem but you could try also turning off GRO (similar commands).

Ah, there we go. Once I ran sudo ethtool -K eth0 gro off, sudo ifenslave bond54 eth0 produced a still-working bond54.

> > That's with eth0 removed from bonding, and eth2 inside.
>
> So the bonding device has only one slave now?

Yes, it was like that.

> What if you take the bonding device out completely and add eth2 directly to the bridge?

I think I had already tested that and everything was fine, too. Do you want me to test that or is the GRO removal conclusive?

--
2. That which causes joy or happiness.

Archive: http://lists.debian.org/20120404075557.ga3...@entuzijast.net
Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken
On Sun, Apr 01, 2012 at 03:09:56AM +0100, Ben Hutchings wrote:
> I bet this is due to the combination of LRO plus bridging. We try to turn off LRO in devices under a bridge, but that won't work if there's an intermediate bonding device. If you run:
> # ethtool -K eth0 lro off
> # ethtool -K eth2 lro off
> does the bridge start working?

Err...

% sudo ethtool -K eth0 lro off
Cannot set large receive offload settings: Operation not supported
% sudo ethtool -K eth2 lro off
Cannot set large receive offload settings: Operation not supported

That's with eth0 removed from bonding, and eth2 inside.

--
2. That which causes joy or happiness.

Archive: http://lists.debian.org/20120401104044.ga28...@entuzijast.net
Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken
On Sun, 2012-04-01 at 12:40 +0200, Josip Rodin wrote:
> On Sun, Apr 01, 2012 at 03:09:56AM +0100, Ben Hutchings wrote:
> > I bet this is due to the combination of LRO plus bridging. We try to turn off LRO in devices under a bridge, but that won't work if there's an intermediate bonding device. If you run:
> > # ethtool -K eth0 lro off
> > # ethtool -K eth2 lro off
> > does the bridge start working?
>
> Err...
> % sudo ethtool -K eth0 lro off
> Cannot set large receive offload settings: Operation not supported
> % sudo ethtool -K eth2 lro off
> Cannot set large receive offload settings: Operation not supported

Hmm. Well it shouldn't be a problem but you could try also turning off GRO (similar commands).

> That's with eth0 removed from bonding, and eth2 inside.

So the bonding device has only one slave now? What if you take the bonding device out completely and add eth2 directly to the bridge?

Ben.

--
Ben Hutchings
Reality is just a crutch for people who can't handle science fiction.
Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken
I bet this is due to the combination of LRO plus bridging. We try to turn off LRO in devices under a bridge, but that won't work if there's an intermediate bonding device. If you run:

# ethtool -K eth0 lro off
# ethtool -K eth2 lro off

does the bridge start working?

Ben.

--
Ben Hutchings
I'm always amazed by the number of people who take up solipsism because they heard someone else explain it. - E*Borg on alt.fan.pratchett
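Ben's hypothesis is that the bridge only turns off LRO (large receive offload) on its own ports, so the NICs hidden behind bond54 (and, for xenbr2, behind vlan2 and xenbr54 as well) keep it enabled. A minimal sketch of doing that step by hand: it walks from a bridge down to the physical NICs and prints, without running, one ethtool command per NIC and feature. It assumes the sysfs layout /sys/class/net/<bridge>/brif/ and /sys/class/net/<bond>/bonding/slaves plus /proc/net/vlan/config; the function names and the SYSROOT prefix are hypothetical, the latter only so the walk can be tried against a copied tree.

```shell
#!/bin/bash
# Sketch: walk from a bridge down to the devices beneath it, looking through
# bonding masters and 802.1Q VLAN devices, and print (not run) ethtool
# commands that switch off LRO and GRO on each leaf device.
# SYSROOT is a hypothetical prefix; leave it empty on a live system.
SYSROOT=${SYSROOT:-}

# Print the devices directly below one interface, one per line.
lower_devs() {
    local dev=$1
    local net=$SYSROOT/sys/class/net/$dev
    if [ -d "$net/brif" ]; then
        # Bridge: one entry per port.
        ls "$net/brif"
    elif [ -r "$net/bonding/slaves" ]; then
        # Bonding master: space-separated list of slaves.
        tr ' ' '\n' < "$net/bonding/slaves" | grep -v '^$'
    elif [ -r "$SYSROOT/proc/net/vlan/config" ]; then
        # VLAN: rows look like "vlan2          | 2  | xenbr54".
        awk -F'|' -v d="$dev" '
            { gsub(/[ \t]/, "", $1); gsub(/[ \t]/, "", $3) }
            $1 == d { print $3 }' "$SYSROOT/proc/net/vlan/config"
    fi
}

# Recurse downwards; anything with no lower devices is treated as a leaf.
leaf_devs() {
    local dev=$1 lower found=0
    for lower in $(lower_devs "$dev"); do
        found=1
        leaf_devs "$lower"
    done
    if [ "$found" -eq 0 ]; then
        echo "$dev"
    fi
}

# One command per leaf and feature, so that an unsupported "lro off"
# does not prevent "gro off" from being attempted.
offload_off_cmds() {
    local leaf feature
    for leaf in $(leaf_devs "$1"); do
        for feature in lro gro; do
            echo "ethtool -K $leaf $feature off"
        done
    done
}
```

With the setup from the original bug report, `offload_off_cmds xenbr2` would be expected to list eth0 and eth2; any Xen vif ports attached to the bridges would also show up as leaves. Emitting separate commands per feature matches what the thread shows: on these drivers `lro off` fails with "Operation not supported", while `gro off` is the setting that made a difference.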
Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken
Package: linux-image-2.6.32-5-xen-amd64
Version: 2.6.32-41

Hi,

The machine is a new IBM x3550 M3, with this network hardware:

% lspci | grep Ethernet
0b:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
0b:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
1a:00.0 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
1a:00.1 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)

One port of each brand (eth0 and eth2) has a working cable plugged into a working Ethernet switch that is set up to serve a native VLAN (otherwise known as ID 54) and VLAN ID 2 trunked (tagged), among others. The devices are:

lrwxrwxrwx 1 root root 0 Mar 19 15:42 /sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:07.0/0000:1a:00.0/net/eth0/
lrwxrwxrwx 1 root root 0 Mar 19 15:42 /sys/class/net/eth2 -> ../../devices/pci0000:00/0000:00:01.0/0000:0b:00.0/net/eth2/

So, if I read that right, eth0 is Intel, and eth2 is Broadcom. The desired network setup is, in interfaces(5) format:

iface bond54 inet manual
    slaves eth0 eth2
    bond_mode active-backup
    bond_miimon 100

iface xenbr54 inet static
    bridge-ports bond54
    bridge-fd 0
    address 192.168.54.2
    netmask 255.255.255.0

iface vlan2 inet manual
    vlan-raw-device xenbr54

iface xenbr2 inet static
    bridge-ports vlan2
    bridge-fd 0
    address 213.202.97.156
    netmask 255.255.255.240
    gateway 213.202.97.145

This used to work for me elsewhere; however, on this machine it's broken as follows:

Everything starts up fine, and the machine is perfectly usable (albeit I only used SSH) over the xenbr54 interface. However, over the xenbr2 interface, only the small network packets pass, such as ICMP, or the setup and teardown of HTTP connections. As soon as I try to actually GET something non-trivial over a seemingly established HTTP connection, the machine behaves as if it doesn't see that incoming traffic.
Like this:

% wget -O /dev/null http://ftp.hr.debian.org/debian/ls-lR.gz
--2012-03-30 11:15:23-- http://ftp.hr.debian.org/debian/ls-lR.gz
Resolving ftp.hr.debian.org... 161.53.160.11, 2001:b68:ff:1::11
Connecting to ftp.hr.debian.org|161.53.160.11|:80... connected.
HTTP request sent, awaiting response...

In parallel, the trace shows:

% sudo tshark -n -i xenbr2
 0.000000 213.202.97.156 -> 161.53.160.11 TCP 51657 > 80 [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=232632046 TSER=0 WS=1
 0.001797 161.53.160.11 -> 213.202.97.156 TCP 80 > 51657 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=643552423 TSER=232632046 WS=8
 0.001816 213.202.97.156 -> 161.53.160.11 TCP 51657 > 80 [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=232632046 TSER=643552423
 0.001906 213.202.97.156 -> 161.53.160.11 HTTP GET /debian/ls-lR.gz HTTP/1.0
 0.003625 161.53.160.11 -> 213.202.97.156 TCP 80 > 51657 [ACK] Seq=1 Ack=131 Win=6912 Len=0 TSV=643552423 TSER=232632046

And then it sits there. The server machine (which I happen to have control over) says:

 0.000000 213.202.97.156 -> 161.53.160.11 TCP 51660 > 80 [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=232668023 TSER=0 WS=1
 0.000023 161.53.160.11 -> 213.202.97.156 TCP 80 > 51660 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=643588400 TSER=232668023 WS=8
 0.003117 213.202.97.156 -> 161.53.160.11 TCP 51660 > 80 [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=232668024 TSER=643588400
 0.003125 213.202.97.156 -> 161.53.160.11 HTTP GET /debian/ls-lR.gz HTTP/1.0
 0.003145 161.53.160.11 -> 213.202.97.156 TCP 80 > 51660 [ACK] Seq=1 Ack=131 Win=6912 Len=0 TSV=643588401 TSER=232668024
 0.003480 161.53.160.11 -> 213.202.97.156 TCP [TCP segment of a reassembled PDU]
 0.003500 161.53.160.11 -> 213.202.97.156 TCP [TCP segment of a reassembled PDU]
 0.204965 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
 0.613959 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
 1.428964 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
 3.061959 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
 6.329958 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
12.853960 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]

And then I Ctrl+C that wget, and the traces show:

(on the client)
 8.017451 213.202.97.156 -> 161.53.160.11 TCP 51664 > 80 [FIN, ACK] Seq=131 Ack=1 Win=5840 Len=0 TSV=232696067 TSER=643614440
 8.057740 161.53.160.11 -> 213.202.97.156 TCP [TCP Previous segment lost] 80 > 51664 [ACK] Seq=4345 Ack=132 Win=6912 Len=0 TSV=643616454 TSER=232696067

(on the server)
 8.017218 213.202.97.156 -> 161.53.160.11 TCP 51664 > 80 [FIN, ACK] Seq=131 Ack=1 Win=5840 Len=0 TSV=232696067 TSER=643614440
 8.055647 161.53.160.11 -> 213.202.97.156 TCP 80 > 51664 [ACK] Seq=4345 Ack=132 Win=6912 Len=0
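In the server-side trace, the unacknowledged data is retransmitted at roughly doubling intervals, which is what exponential retransmission-timeout backoff looks like when no ACK for that data ever comes back. A small sketch to check the arithmetic: it reads one timestamp per line and prints each gap and its ratio to the previous gap. The input is the six [TCP Retransmission] timestamps copied from the trace above; `rto_gaps` is a hypothetical name.

```shell
#!/bin/bash
# Sketch: read one timestamp (seconds) per line and print, for each
# consecutive pair, the gap and the ratio of that gap to the previous one.
# The first gap has no predecessor, so its ratio is printed as "-".
rto_gaps() {
    awk '
        NR > 1 {
            gap = $1 - prev
            if (NR > 2) printf "%.6f %.2f\n", gap, gap / prevgap
            else        printf "%.6f -\n", gap
            prevgap = gap
        }
        { prev = $1 }'
}

# Retransmission timestamps copied from the server-side trace above.
printf '%s\n' 0.204965 0.613959 1.428964 3.061959 6.329958 12.853960 | rto_gaps
```

The gaps come out as about 0.41, 0.82, 1.63, 3.27 and 6.52 seconds, each close to twice the previous one. Together with the client-side capture on xenbr2 never showing those data segments at all, this is consistent with the segments being lost before they reach xenbr2, though the traces alone do not say at which layer of the bond/bridge/VLAN stack.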