Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken

2012-04-11 Thread Josip Rodin
On Sat, Apr 07, 2012 at 04:29:38AM +0100, Ben Hutchings wrote:
> I would like to take this upstream now, but first I need to check
> whether it has already been fixed after 2.6.32.  Please can you test the
> current kernel package from testing, unstable or squeeze-backports
> (linux-image-3.2.0-2-amd64 or linux-image-3.2.0-0.bpo.2-amd64)?

I installed linux-image-3.2.0-0.bpo.2-amd64, plus the upgraded linux-base
and initramfs-tools, plus the indicated firmware-bnx2 upgrade -- and then
rebooted into that kernel, but the machine wouldn't respond to ping over
the xenbr2 interface (the one with the default gateway).

I logged into it fine through the xenbr54 interface, and tried to ping the
default gateway, and it didn't work. This was with the workaround - only
bnx2/eth2 in the bonding interface. Then I removed the default gateway
and added it back just to see if it'll work, and then it started pinging.
Weird.

After that, I tried to reproduce this bug but couldn't; it looks like the
bug is fixed there. I noticed a significant lag with some of those bonding
--detach/--change-active actions, but after a few seconds everything
continued to work fine.

-- 
 2. That which causes joy or happiness.



-- 
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/20120411122453.ga29...@entuzijast.net



Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken

2012-04-06 Thread Ben Hutchings
On Wed, 2012-04-04 at 09:55 +0200, Josip Rodin wrote:
> On Mon, Apr 02, 2012 at 05:22:37AM +0100, Ben Hutchings wrote:
> > On Sun, 2012-04-01 at 12:40 +0200, Josip Rodin wrote:
> > > On Sun, Apr 01, 2012 at 03:09:56AM +0100, Ben Hutchings wrote:
> > > > I bet this is due to the combination of LRO plus bridging.  We try to
> > > > turn off LRO in devices under a bridge, but that won't work if there's
> > > > an intermediate bonding device.
> > > >
> > > > If you run:
> > > >
> > > > # ethtool -K eth0 lro off
> > > > # ethtool -K eth2 lro off
> > > >
> > > > does the bridge start working?
> > >
> > > Err...
> > >
> > > % sudo ethtool -K eth0 lro off
> > > Cannot set large receive offload settings: Operation not supported
> > > % sudo ethtool -K eth2 lro off
> > > Cannot set large receive offload settings: Operation not supported
> >
> > Hmm.  Well it shouldn't be a problem but you could try also turning off
> > GRO (similar commands).
>
> Ah, there we go. Once I ran sudo ethtool -K eth0 gro off,
> sudo ifenslave bond54 eth0 produced a still-working bond54.

OK, this is quite unexpected.  At least you have a workaround now
(/usr/share/doc/ethtool/README.Debian.gz explains how to make this
setting persistent).
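
For reference, the same workaround can be applied to every slave of the
bond in one go. The following is only a sketch: it assumes the bonding
driver exposes its slave list in /sys/class/net/bond54/bonding/slaves, and
the SYSFS_NET and DRY_RUN variables and both function names are made up
for illustration.

```shell
#!/bin/sh
# Sketch only: SYSFS_NET and DRY_RUN are illustrative knobs, not part of any tool.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Turn off GRO on every slave of the given bonding device.
disable_gro_on_slaves() {
    bond="$1"
    # The bonding driver lists its slaves, space-separated, in this file.
    for dev in $(cat "$SYSFS_NET/$bond/bonding/slaves"); do
        if [ -n "$DRY_RUN" ]; then
            # Only show what would be run.
            echo "ethtool -K $dev gro off"
        else
            ethtool -K "$dev" gro off
        fi
    done
}

# Read `ethtool -k DEV` output on stdin and print the state (on/off)
# of the named feature, e.g. generic-receive-offload.
offload_state() {
    awk -v name="$1" '
        { sub(/^[ \t]+/, "") }           # some versions indent feature lines
        $1 == (name ":") { print $2 }    # ignores a trailing "[fixed]" marker
    '
}
```

Running "disable_gro_on_slaves bond54" as root and then
"ethtool -k eth0 | offload_state generic-receive-offload" should report
"off" if the setting took effect.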

> > > That's with eth0 removed from bonding, and eth2 inside.
> >
> > So the bonding device has only one slave now?
>
> Yes, it was like that.
>
> > What if you take the bonding device out completely and add eth2 directly
> > to the bridge?
>
> I think I had already tested that and everything was fine, too.
> Do you want me to test that or is the GRO removal conclusive?

No need to test that.

I would like to take this upstream now, but first I need to check
whether it has already been fixed after 2.6.32.  Please can you test the
current kernel package from testing, unstable or squeeze-backports
(linux-image-3.2.0-2-amd64 or linux-image-3.2.0-0.bpo.2-amd64)?

Ben.

-- 
Ben Hutchings
Larkinson's Law: All laws are basically false.


signature.asc
Description: This is a digitally signed message part


Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken

2012-04-04 Thread Josip Rodin
On Mon, Apr 02, 2012 at 05:22:37AM +0100, Ben Hutchings wrote:
> On Sun, 2012-04-01 at 12:40 +0200, Josip Rodin wrote:
> > On Sun, Apr 01, 2012 at 03:09:56AM +0100, Ben Hutchings wrote:
> > > I bet this is due to the combination of LRO plus bridging.  We try to
> > > turn off LRO in devices under a bridge, but that won't work if there's
> > > an intermediate bonding device.
> > >
> > > If you run:
> > >
> > > # ethtool -K eth0 lro off
> > > # ethtool -K eth2 lro off
> > >
> > > does the bridge start working?
> >
> > Err...
> >
> > % sudo ethtool -K eth0 lro off
> > Cannot set large receive offload settings: Operation not supported
> > % sudo ethtool -K eth2 lro off
> > Cannot set large receive offload settings: Operation not supported
>
> Hmm.  Well it shouldn't be a problem but you could try also turning off
> GRO (similar commands).

Ah, there we go. Once I ran sudo ethtool -K eth0 gro off,
sudo ifenslave bond54 eth0 produced a still-working bond54.

> > That's with eth0 removed from bonding, and eth2 inside.
>
> So the bonding device has only one slave now?

Yes, it was like that.

> What if you take the bonding device out completely and add eth2 directly
> to the bridge?

I think I had already tested that and everything was fine, too.
Do you want me to test that or is the GRO removal conclusive?

-- 
 2. That which causes joy or happiness.



Archive: http://lists.debian.org/20120404075557.ga3...@entuzijast.net



Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken

2012-04-01 Thread Josip Rodin
On Sun, Apr 01, 2012 at 03:09:56AM +0100, Ben Hutchings wrote:
> I bet this is due to the combination of LRO plus bridging.  We try to
> turn off LRO in devices under a bridge, but that won't work if there's
> an intermediate bonding device.
>
> If you run:
>
> # ethtool -K eth0 lro off
> # ethtool -K eth2 lro off
>
> does the bridge start working?

Err...

% sudo ethtool -K eth0 lro off
Cannot set large receive offload settings: Operation not supported
% sudo ethtool -K eth2 lro off
Cannot set large receive offload settings: Operation not supported

That's with eth0 removed from bonding, and eth2 inside.

-- 
 2. That which causes joy or happiness.



Archive: http://lists.debian.org/20120401104044.ga28...@entuzijast.net



Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken

2012-04-01 Thread Ben Hutchings
On Sun, 2012-04-01 at 12:40 +0200, Josip Rodin wrote:
> On Sun, Apr 01, 2012 at 03:09:56AM +0100, Ben Hutchings wrote:
> > I bet this is due to the combination of LRO plus bridging.  We try to
> > turn off LRO in devices under a bridge, but that won't work if there's
> > an intermediate bonding device.
> >
> > If you run:
> >
> > # ethtool -K eth0 lro off
> > # ethtool -K eth2 lro off
> >
> > does the bridge start working?
>
> Err...
>
> % sudo ethtool -K eth0 lro off
> Cannot set large receive offload settings: Operation not supported
> % sudo ethtool -K eth2 lro off
> Cannot set large receive offload settings: Operation not supported

Hmm.  Well it shouldn't be a problem but you could try also turning off
GRO (similar commands).

> That's with eth0 removed from bonding, and eth2 inside.

So the bonding device has only one slave now?

What if you take the bonding device out completely and add eth2 directly
to the bridge?

Ben.

-- 
Ben Hutchings
Reality is just a crutch for people who can't handle science fiction.




Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken

2012-03-31 Thread Ben Hutchings
I bet this is due to the combination of LRO plus bridging.  We try to
turn off LRO in devices under a bridge, but that won't work if there's
an intermediate bonding device.

If you run:

# ethtool -K eth0 lro off
# ethtool -K eth2 lro off

does the bridge start working?
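
To see which devices actually sit under a bridge, including slaves hidden
behind a bonding device, something like the following sketch may help. It
assumes the usual sysfs layout (bridge ports under brif/, bonding slaves
in bonding/slaves) and looks through only one level of bonding; the
function name and the SYSFS_NET variable are made up for illustration.

```shell
#!/bin/sh
# Sketch only: SYSFS_NET is an illustrative override, not part of any tool.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the devices under a bridge, replacing a bonding port by its slaves.
bridge_lower_devices() {
    bridge="$1"
    for port in "$SYSFS_NET/$bridge/brif"/*; do
        [ -e "$port" ] || continue       # bridge has no ports
        name=$(basename "$port")
        slaves_file="$SYSFS_NET/$name/bonding/slaves"
        if [ -r "$slaves_file" ]; then
            # Port is a bonding device: report the slaves behind it.
            for dev in $(cat "$slaves_file"); do
                echo "$dev"
            done
        else
            echo "$name"
        fi
    done
}
```

Each name printed by "bridge_lower_devices xenbr54" can then be checked
with "ethtool -k".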

Ben.

-- 
Ben Hutchings
I'm always amazed by the number of people who take up solipsism because
they heard someone else explain it. - E*Borg on alt.fan.pratchett




Bug#666386: igb + bnx2 + ifenslave + brctl + vconfig = largely broken

2012-03-30 Thread Josip Rodin
Package: linux-image-2.6.32-5-xen-amd64
Version: 2.6.32-41

Hi,

The machine is a new IBM x3550 M3, with this network hardware:

% lspci | grep Ethernet
0b:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
0b:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
1a:00.0 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
1a:00.1 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)

One port of each brand (eth0 and eth2) has a working cable plugged into a
working Ethernet switch that's set up so that it serves a native VLAN
(otherwise known as ID 54) and VLAN ID 2 trunked (tagged), among others.

The devices are:

lrwxrwxrwx 1 root root 0 Mar 19 15:42 /sys/class/net/eth0 -> ../../devices/pci0000:00/0000:00:07.0/0000:1a:00.0/net/eth0/
lrwxrwxrwx 1 root root 0 Mar 19 15:42 /sys/class/net/eth2 -> ../../devices/pci0000:00/0000:00:01.0/0000:0b:00.0/net/eth2/

So, if I read that right, eth0 is Intel, and eth2 is Broadcom.

The desired network setup is, in interfaces(5) format:

iface bond54 inet manual
  slaves eth0 eth2
  bond_mode active-backup
  bond_miimon 100

iface xenbr54 inet static
  bridge-ports bond54
  bridge-fd 0
  address 192.168.54.2
  netmask 255.255.255.0

iface vlan2 inet manual
  vlan-raw-device xenbr54

iface xenbr2 inet static
  bridge-ports vlan2
  bridge-fd 0
  address 213.202.97.156
  netmask 255.255.255.240
  gateway 213.202.97.145

This used to work for me elsewhere; on this machine, however, it's broken as
follows:

Everything starts up fine, and the machine is perfectly usable (albeit I
only used SSH) over the xenbr54 interface.

However, over the xenbr2 interface, all the small network packets pass, such
as ICMP, or the bringup and teardown of HTTP connections, but as soon as I
try to actually GET something non-trivial over a seemingly established HTTP
connection, the machine pretends it doesn't see that incoming traffic.

Like this:

% wget -O /dev/null http://ftp.hr.debian.org/debian/ls-lR.gz
--2012-03-30 11:15:23--  http://ftp.hr.debian.org/debian/ls-lR.gz
Resolving ftp.hr.debian.org... 161.53.160.11, 2001:b68:ff:1::11
Connecting to ftp.hr.debian.org|161.53.160.11|:80... connected.
HTTP request sent, awaiting response...

In parallel, the trace shows:

% sudo tshark -n -i xenbr2
  0.000000 213.202.97.156 -> 161.53.160.11 TCP 51657 > 80 [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=232632046 TSER=0 WS=1
  0.001797 161.53.160.11 -> 213.202.97.156 TCP 80 > 51657 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=643552423 TSER=232632046 WS=8
  0.001816 213.202.97.156 -> 161.53.160.11 TCP 51657 > 80 [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=232632046 TSER=643552423
  0.001906 213.202.97.156 -> 161.53.160.11 HTTP GET /debian/ls-lR.gz HTTP/1.0
  0.003625 161.53.160.11 -> 213.202.97.156 TCP 80 > 51657 [ACK] Seq=1 Ack=131 Win=6912 Len=0 TSV=643552423 TSER=232632046

And then it sits there. The server machine (which I happen to have control
over) says:

  0.000000 213.202.97.156 -> 161.53.160.11 TCP 51660 > 80 [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=232668023 TSER=0 WS=1
  0.000023 161.53.160.11 -> 213.202.97.156 TCP 80 > 51660 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=643588400 TSER=232668023 WS=8
  0.003117 213.202.97.156 -> 161.53.160.11 TCP 51660 > 80 [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=232668024 TSER=643588400
  0.003125 213.202.97.156 -> 161.53.160.11 HTTP GET /debian/ls-lR.gz HTTP/1.0
  0.003145 161.53.160.11 -> 213.202.97.156 TCP 80 > 51660 [ACK] Seq=1 Ack=131 Win=6912 Len=0 TSV=643588401 TSER=232668024
  0.003480 161.53.160.11 -> 213.202.97.156 TCP [TCP segment of a reassembled PDU]
  0.003500 161.53.160.11 -> 213.202.97.156 TCP [TCP segment of a reassembled PDU]
  0.204965 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
  0.613959 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
  1.428964 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
  3.061959 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
  6.329958 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]
 12.853960 161.53.160.11 -> 213.202.97.156 TCP [TCP Retransmission] [TCP segment of a reassembled PDU]

And then I Ctrl+C that wget, and the traces show:

(on the client)
  8.017451 213.202.97.156 -> 161.53.160.11 TCP 51664 > 80 [FIN, ACK] Seq=131 Ack=1 Win=5840 Len=0 TSV=232696067 TSER=643614440
  8.057740 161.53.160.11 -> 213.202.97.156 TCP [TCP Previous segment lost] 80 > 51664 [ACK] Seq=4345 Ack=132 Win=6912 Len=0 TSV=643616454 TSER=232696067

(on the server)
  8.017218 213.202.97.156 -> 161.53.160.11 TCP 51664 > 80 [FIN, ACK] Seq=131 Ack=1 Win=5840 Len=0 TSV=232696067 TSER=643614440
  8.055647 161.53.160.11 -> 213.202.97.156 TCP 80 > 51664 [ACK] Seq=4345 Ack=132 Win=6912 Len=0