Bug#754294: Regression: While routing Kernel chokes on spurious too big IP packets

2014-07-10 Thread Teodor Milkov
I wonder if this is the same bug I've been experiencing? See it reported 
and discussed at the following places:


  https://bugzilla.kernel.org/show_bug.cgi?id=79891
  http://www.spinics.net/lists/netdev/msg288798.html


Best regards,
Teodor


--
To UNSUBSCRIBE, email to debian-kernel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: https://lists.debian.org/53bf795a.20...@del.bg



Bug#754294: Regression: While routing Kernel chokes on spurious too big IP packets

2014-07-09 Thread Marc A. Donges
Package: linux
Version: 3.2.60-1+deb7u1
Severity: important

Dear Maintainer,

the kernel upgrade delivered via Debian Security on Friday 2014-07-04 has 
partially broken routing (in this case with NAT).

*High level description*
Users experience this as very slow network access to some servers, and only 
from some client computers.

*Setup*
A gateway running Debian stable (amd64), using Linux for routing and NAT, was 
updated from linux-image-3.2.0-4-amd64:amd64 version 3.2.57-3+deb7u2 to version 
3.2.60-1+deb7u1.

The gateway is supposed to route and NAT traffic from a private network to the 
public internet, translating RFC1918 client source addresses to public 
addresses.

In the following tcpdump output I have, for privacy reasons, replaced the 
RFC1918 address of a client with CLIENT, the public IP address of the relevant 
gateway with NAT, and the public IP address of a server on the Internet with 
SERVER, as the traces were gathered in a live environment.

*Problem description*
After the update, the kernel chokes on IP packets that are apparently too big 
to fit the MTU:

17:11:00.917355 IP SERVER.80 > NAT.44991: Flags [.], seq 1:2921, ack 98, win 5840, length 2920
17:11:00.917384 IP NAT > SERVER: ICMP NAT unreachable - need to frag (mtu 1500), length 556

Note the large IP packet (2960 > MTU of 1500). It has the DF bit set. The 
packet cannot have arrived over the network in this form, though, as the link 
is an Ethernet with an MTU of 1500, so this is odd. *1
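
The size in the note above can be checked with a little arithmetic. The 
following is a minimal sketch, assuming IPv4 and TCP headers of 20 bytes each 
with no options (an assumption; the capture does not show header lengths). The 
function names are illustrative only and do not correspond to kernel code.

```python
IPV4_HEADER_LEN = 20  # assumption: IPv4 header without options
TCP_HEADER_LEN = 20   # assumption: TCP header without options


def ip_total_length(tcp_payload_len,
                    ip_header_len=IPV4_HEADER_LEN,
                    tcp_header_len=TCP_HEADER_LEN):
    """Total IP datagram size for a given TCP payload length."""
    return ip_header_len + tcp_header_len + tcp_payload_len


def needs_frag_error(tcp_payload_len, mtu, df_set=True):
    """True if a router that treats the packet as one datagram has to
    answer with ICMP "need to frag": DF is set and the datagram exceeds
    the outgoing MTU."""
    return df_set and ip_total_length(tcp_payload_len) > mtu


if __name__ == "__main__":
    print(ip_total_length(2920))         # the merged packet from the trace
    print(needs_frag_error(2920, 1500))  # merged packet vs. MTU 1500
    print(needs_frag_error(1460, 1500))  # a single full-sized segment
```

With these assumptions a payload of 2920 bytes gives a 2960 byte datagram, 
while a 1460 byte payload gives exactly 1500 bytes and fits the MTU.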

*Workaround*
The problem disappears when GRO is deactivated:

ethtool -K eth0 gro off

The kernel then receives only valid packets of up to MTU in size:

17:14:53.288712 IP SERVER.80 > NAT.44996: Flags [.], seq 1:1461, ack 98, win 5840, length 1460
17:14:53.288730 IP SERVER.80 > CLIENT.44996: Flags [.], seq 1:1461, ack 98, win 5840, length 1460
17:14:53.288735 IP SERVER.80 > NAT.44996: Flags [.], seq 1461:2921, ack 98, win 5840, length 1460
17:14:53.27 IP SERVER.80 > CLIENT.44996: Flags [.], seq 1461:2921, ack 98, win 5840, length 1460

GRO (generic receive offload) is a performance optimization in which received 
packets of the same flow are merged into larger packets before they are passed 
up the network stack, reducing per-packet processing overhead. GRO defaults to 
on (on this hardware).
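
To check whether GRO is currently enabled on a gateway, the feature list 
printed by `ethtool -k eth0` can be parsed. The following is a minimal sketch 
that works on the text output only; the function name and the sample text are 
illustrative, and the interface name eth0 is taken from the workaround above.

```python
def parse_ethtool_features(text):
    """Parse the output of `ethtool -k <iface>` into a dict that maps
    each feature name to {"on": bool, "fixed": bool}."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep or not value.strip():
            continue  # skips blank lines and the "Features for eth0:" header
        words = value.split()
        features[name] = {"on": words[0] == "on", "fixed": "[fixed]" in words}
    return features


# Sample text in the format printed by `ethtool -k eth0`.
SAMPLE = """Features for eth0:
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
"""

if __name__ == "__main__":
    gro = parse_ethtool_features(SAMPLE)["generic-receive-offload"]
    print("GRO enabled:", gro["on"], "changeable:", not gro["fixed"])
```

On a live system the text could be obtained with 
subprocess.run(["ethtool", "-k", "eth0"], capture_output=True, text=True).stdout 
instead of the sample string.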

*Regression*
The problem did not exist in 3.2.57-3+deb7u2. In that version the Kernel 
forwards those big packets as many smaller packets of up to MTU size:

16:23:01.394351 IP SERVER.80 > NAT.44943: Flags [.], seq 1:2921, ack 98, win 5840, length 2920
16:23:01.394375 IP SERVER.80 > CLIENT.44943: Flags [.], seq 1:1461, ack 98, win 5840, length 1460
16:23:01.394525 IP SERVER.80 > CLIENT.44943: Flags [.], seq 1461:2921, ack 98, win 5840, length 1460

Note this is not IP fragmentation, as the smaller packets contain one TCP 
segment each.
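
The difference from IP fragmentation can be illustrated by splitting the 
sequence range of the merged packet back into segments of at most one MSS. 
This is only a sketch of the effect seen in the trace, assuming an MSS of 1460 
bytes (taken from the segment lengths above); it is not the kernel's 
segmentation code, and the function name is illustrative.

```python
def split_into_segments(seq_start, seq_end, mss):
    """Split the TCP sequence range [seq_start, seq_end) into consecutive
    ranges of at most mss bytes, each of which becomes its own TCP segment
    inside its own complete IP packet."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    segments = []
    start = seq_start
    while start < seq_end:
        end = min(start + mss, seq_end)
        segments.append((start, end))
        start = end
    return segments


if __name__ == "__main__":
    # Merged packet from the trace: seq 1:2921, length 2920
    for start, end in split_into_segments(1, 2921, 1460):
        print("seq %d:%d, length %d" % (start, end, end - start))
```

For the packet in the trace this yields the ranges 1:1461 and 1461:2921, which 
match the two packets forwarded to CLIENT by the old kernel.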

*Possible causes*

I suspect the reason for how the error manifests to end users (very slow 
network access to some servers, and only by some client computers) is that the 
actual operation of GRO is influenced by the NIC/driver, timing of packet flow, 
and IP/TCP options used (which depend on client OS and configuration and server 
OS and configuration). Then, the server's retransmit behaviour may cause single 
packets to be transmitted, which are then not mangled by GRO and can be 
successfully forwarded to clients, although that is very slow.

There are two changes between 3.2.57-3+deb7u2 and 3.2.60-1+deb7u1 that look 
related, because they were supposed to fix a similar issue with IP packets that 
arrive fragmented but have the DF bit set:

In the Debian specific patch set 
patches/bugfix/all/netfilter-ipv4-defrag-set-local_df-flag-on-defragmen.patch:
[quote]
From: Florian Westphal <f...@strlen.de>
Date: Fri, 2 May 2014 15:32:16 +0200
Subject: netfilter: ipv4: defrag: set local_df flag on defragmented skb
Origin: https://git.kernel.org/linus/895162b1101b3ea5db08ca6822ae9672717efec0

else we may fail to forward skb even if original fragments do fit
outgoing link mtu:

1. remote sends 2k packets in two 1000 byte frags, DF set
2. we want to forward but only see '2k > mtu and DF set'
3. we then send icmp error saying that outgoing link is 1500

But original sender never sent a packet that would not fit
the outgoing link.

Setting local_df makes outgoing path test size vs.
IPCB(skb)->frag_max_size, so we will still send the correct
error in case the largest original size did not fit
outgoing link mtu.

Reported-by: Maxime Bizon <mbi...@freebox.fr>
Suggested-by: Maxime Bizon <mbi...@freebox.fr>
Fixes: 5f2d04f1f9 (ipv4: fix path MTU discovery with connection tracking)
Signed-off-by: Florian Westphal <f...@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pa...@netfilter.org>
---
 net/ipv4/netfilter/nf_defrag_ipv4.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 12e13bd..f40f321 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -22,7 +22,6 @@
 #endif
 #include <net/netfilter/nf_conntrack_zones.h>
 
-/* Returns 

Bug#754294: Regression: While routing Kernel chokes on spurious too big IP packets

2014-07-09 Thread Marc A. Donges
The problem has been reported on hardware with two similar Broadcom Ethernet 
chipsets using the same bnx2 kernel driver:

NIC 1:
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
Subsystem: Hewlett-Packard Company NC382i Integrated Multi-port PCI Express Gigabit Server Adapter
Flags: bus master, fast devsel, latency 0, IRQ 30
Memory at f400 (64-bit, non-prefetchable) [size=32M]
[virtual] Expansion ROM at e710 [disabled] [size=64K]
Capabilities: [48] Power Management version 3
Capabilities: [50] Vital Product Data
Capabilities: [58] MSI: Enable- Count=1/16 Maskable- 64bit+
Capabilities: [a0] MSI-X: Enable+ Count=9 Masked-
Capabilities: [ac] Express Endpoint, MSI 00
Capabilities: [100] Device Serial Number [removed]
Capabilities: [110] Advanced Error Reporting
Capabilities: [150] Power Budgeting ?
Capabilities: [160] Virtual Channel
Kernel driver in use: bnx2

With the following default features:
Features for eth0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-unneeded: off [fixed]
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]

NIC 2:
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
Subsystem: Dell Device 01b3
Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 47
Memory at f800 (64-bit, non-prefetchable) [size=32M]
Capabilities: [40] PCI-X non-bridge device
Capabilities: [48] Power Management version 2
Capabilities: [50] Vital Product Data
Capabilities: [58] MSI: Enable+ Count=1/1 Maskable- 64bit+
Kernel driver in use: bnx2

With the following default features:
Features for eth0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-unneeded: off [fixed]
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp6-segmentation: off [fixed]
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off [fixed]
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: on
loopback: off [fixed]

