Hi Iljitsch,

In a previous message, regarding my server apparently sending out TCP
packets which were longer than the MTU, you wrote:
> Maybe some work is being offloaded to the NIC?

I should have checked this more carefully - you are right. The
Ethernet chip is a Broadcom BCM5751, which does "Large Send Offload",
AKA "task offload", "segmentation offload" or "stack offload":

  http://www.broadcom.com/collateral/pb/5751-PB03-R.pdf
  http://www.microsoft.com/whdc/device/network/taskoffload.mspx

tcpdump sees the output of the Ethernet driver, and it is the chip
which breaks the long packets up into ordinary-length TCP packets.
Sorry about the false alarm!

This page discusses these hardware-based techniques:

  http://kb.pert.geant2.net/PERTKB/LargeSendOffloadLSO

  (Transport) Protocol Fossilization

  The way it is defined by most of the industry, LSO needs to be
  aware of the transport protocols. In particular, it must be able
  to split over-large transport segments into suitable sub-segments,
  and generate transport (e.g. TCP) headers for these sub-segments.
  This function is typically implemented in the adapter's firmware,
  for some popular transport protocol such as TCP. This makes it
  hard to implement additional functions such as IPsec, the TCP MD5
  Authentication option, or even other transport protocols such as
  SCTP.

  There is a weakened form of LSO that requires the host operating
  system to prepare the segmentation and construct headers. This
  allows for "dumber" network adapters, and in particular it doesn't
  require them to be transport protocol-aware. It still provides a
  significant performance improvement, because multiple segments can
  be transferred between host and adapter in a single transaction,
  which reduces bus occupation and other overhead. Sun's Solaris
  operating system supports this variant of LSO under the name of
  "MDT" (Multidata Transmit), and the Linux kernel added something
  similar as part of "GSO" in 2.6.18.

This has turned up something potentially relevant to scalable routing
- the use of hardware to generate the final packets sent to
destination hosts.
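To make the segmentation step concrete, here is a toy Python sketch of
what LSO/GSO-style splitting amounts to - dividing one over-large TCP
"super packet" into MSS-sized sub-segments with adjusted sequence
numbers. The function name and simplifications are mine; the real
implementations also rebuild IP/TCP headers and checksums:

```python
# Toy sketch of LSO/GSO-style segmentation: split one over-large TCP
# payload into MSS-sized sub-segments, each tagged with the sequence
# number its rebuilt TCP header would carry. (Hypothetical helper -
# real offload code also regenerates headers and checksums.)

def segment(payload: bytes, seq: int, mss: int):
    """Return a list of (sequence_number, chunk) pairs covering payload."""
    segments = []
    for off in range(0, len(payload), mss):
        segments.append((seq + off, payload[off:off + mss]))
    return segments

# A 64 kbyte super packet sent over 9000-byte-MTU Ethernet: with
# 20-byte IP and 20-byte TCP headers the MSS is 8960, giving 8 packets
# on the wire instead of one oversized buffer from the host.
chunks = segment(b"x" * 65536, seq=1000, mss=8960)
```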
This is done today because it is marginally more efficient in terms
of CPU load when the MTU is around 1500 bytes. I am not sure how much
difference it would make if the artificially large TCP "super packet"
was up to 64kbytes long and the NIC split it into 9k packets.

GSO (Generic Segmentation Offload) does the splitting into smaller
packets in software:

  http://www.linuxfoundation.org/en/Net:GSO
  http://lwn.net/Articles/189970/  Herbert Xu (2006-06-20)

"ethtool eth0" shows the NIC is running at 100Mbps, although it is a
1Gbps device. A friend told me Internet servers are usually connected
at 100Mbps.

ethtool can be used to turn on and off TCP Segmentation Offload (TSO,
roughly equivalent to Large Send Offload), UFO (UDP Fragmentation
Offload) and GSO. The settings were:

  rx-checksumming: on
  tx-checksumming: on
  scatter-gather: on
  tcp segmentation offload: on      <<< TSO ~= LSO
  udp fragmentation offload: off
  generic segmentation offload: off

I gave the command "ethtool -K eth0 tso off" and there were no more
of these long packets reported by tcpdump.

Thanks for the link to the long discussion about Microsoft's network
ignoring PTB packets in May this year:

> Hm, it wasn't so recent apparently, and ack on the quality of the
> archive:
>
>   http://readlist.com/lists/trapdoor.merit.edu/nanog/7/35484.html

>>> MSS is end-to-end, you still need PMTUD or fragmentation.
>
>> Yes - and with Google sending out large packets with DF=0, it is
>> expecting any hapless router in the middle, with a lower next-hop
>> MTU than this length, to do a lot of work without complaint.
>
> Such are the perils of implementing RFC 791.

Yes, but fragmentation in the network was not included in IPv6 and I
am keen not to have it in ITRs, ETRs or in the path between them.

. . .

> Apparently there's a disconnect between spec writers and
> implementers on the one hand and people who have to debug
> connectivity problems on the other hand, with the clueless firewall
> admins living in a bubble disconnected from everything.
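The "lot of work" a router must do for a DF=0 packet under RFC 791
can be sketched in a few lines of Python - splitting the payload into
fragments whose offsets are multiples of 8 bytes, with the More
Fragments flag on all but the last. This is my own simplification
(a real router also copies and adjusts the IP header and checksum
for every fragment, per packet, in the forwarding path):

```python
# Toy sketch of RFC 791 in-network fragmentation: a router splitting
# a DF=0 packet whose payload exceeds the next-hop MTU. Fragment
# offsets are expressed in 8-byte units, so each fragment except the
# last carries a multiple of 8 data bytes. (Hypothetical helper -
# real code also rewrites the IP header and checksum per fragment.)

def fragment(payload: bytes, mtu: int, header_len: int = 20):
    """Return a list of (offset_in_8_byte_units, more_fragments, data)."""
    max_data = ((mtu - header_len) // 8) * 8
    frags = []
    off = 0
    while off < len(payload):
        chunk = payload[off:off + max_data]
        more = (off + len(chunk)) < len(payload)
        frags.append((off // 8, more, chunk))
        off += len(chunk)
    return frags

# A 9000-byte payload hitting a 1500-byte link: (1500-20)//8*8 = 1480
# data bytes per fragment, so 7 fragments instead of 1 packet.
frags = fragment(b"x" * 9000, mtu=1500)
```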
>> I still like RFC 1191 better. There's no fragmentation and the
>> sending host gets the fastest possible feedback that it needs to
>> send smaller packets.
>
> I'll take reliable over fast.

I think RFC 1191 is reliable if PTBs are not filtered. It is not a
serious problem if the odd PTB is lost due to congestion.

. . .

>> Unfortunately, the surviving packet fragment isn't much use to the
>> destination host, so it still takes 1.5 RTTs to get the data
>> there. Still, that is better than 3.5 RTTs with RFC 1191.
>
> You can still use the data, except that you can't check its
> integrity because the checksum is now incorrect.

But you just wrote:

> I'll take reliable over fast.

!

> So the semi-ACK asks the other side to send just the checksum over
> the data that was correctly received.

This sounds unreliable and slow.

>> My understanding of this is that if all hosts have a next-hop MTU
>> of 1500, and the core has an MTU of 9000, then it is no problem if
>> the destination network blocks PTBs from leaving that network,
>> since no host would be sending packets bigger than 1500 anyway.
>
> Your premise is invalid so the conclusion is meaningless.

I was describing an artificial situation for the sake of discussion.

> Actually, it would be interesting to do some research into the MTU
> distribution across the internet.

Indeed.

>> But plenty of servers - probably most by now - have gigabit
>> ethernet and so have a real PMTU for most of the core, and into
>> quite a few edge networks, of 9k or so.

A friend told me today that most servers are connected with 100Mbps
links, even if they have gigabit NICs. Part of the reason is to
reduce the burstiness of each server's traffic. The trick would be
to find a host with 1G links all the way to several border routers,
which themselves have 1G links to . . .
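The RFC 1191 feedback loop discussed above can be illustrated with a
toy simulation: the sender starts at its first-hop MTU, and each
Packet Too Big message immediately lowers the path MTU estimate to
the next-hop MTU the router reports. The function and the example
MTU values are mine, chosen only to show that the sender converges
after one PTB per constricting link:

```python
# Toy simulation of RFC 1191 Path MTU Discovery: the sender's PMTU
# estimate starts at its first-hop MTU, and every router whose
# next-hop MTU is smaller sends back a Packet Too Big (PTB) message
# carrying that MTU. (Hypothetical values - a real host also ages
# the estimate and clamps the TCP MSS accordingly.)

def discover_pmtu(link_mtus, first_hop_mtu):
    """Return (final_pmtu, number_of_ptb_messages) for a path."""
    pmtu = first_hop_mtu
    ptbs = 0
    for hop_mtu in link_mtus:
        if pmtu > hop_mtu:      # packet too big for this link: PTB
            pmtu = hop_mtu
            ptbs += 1
    return pmtu, ptbs

# A 9000-byte first hop crossing a path with 4470- and 1500-byte
# links: the sender learns PMTU 1500 after just two PTB messages.
pmtu, ptbs = discover_pmtu([9000, 4470, 1500, 9000], first_hop_mtu=9000)
```

Each PTB costs the sender one retransmission of the over-sized
packet, which is the "fastest possible feedback" claim - but only if
no firewall on the return path filters the PTBs.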
> Note that although the 9000-byte jumboframe capability is common,
> there are also very many implementations that use different sizes,
> so it's impossible to standardize on anything, even if you could
> ignore the fact that the current internet expects 1500.
>
> Also, because of the 802.3 spec, the jumboframe capability must be
> enabled administratively, and because of the IP-over-ethernet
> specs, all hosts on a subnet must use the same MTU, so basically
> deployment is impossible.

Yes - I understand that a single 1500-byte-MTU device on an Ethernet
switch (such as a 100Mbps NIC, or perhaps a 1Gbps NIC running with a
1500-byte MTU) forces all other devices to use 1500.

> (This is what my draft addresses.)

  http://tools.ietf.org/html/draft-van-beijnum-multi-mtu-02

OK - for IPv6, and necessarily with host and router changes.

>> When they send a packet to some edge network with 1500 MTU links,
>> which blocks the PTBs which should go back to the sending host,
>> then there is a black hole.
>
> You mean: a network that doesn't generate them in the outgoing
> direction?
>
> In practice this won't be a problem because few people will connect
> to the internet with a 1500+ MTU and then not generate too bigs.
> Since routers generate them out of the box and ISPs usually don't
> have firewalls in the middle of their networks and don't like
> support calls, ISPs tend to generate them.
>
> If you use an MTU bigger than the standard 1500, then you shoot
> yourself in the foot with ICMP filtering, so you're not likely to
> do both. The trouble is mainly with using a smaller MTU: then the
> problem is caused by _other_ people not listening to _your_ too
> bigs and there is little that you can do.

OK - I read the first message in the NANOG "Microsoft.com PMTUD
black hole?" thread:

  http://readlist.com/lists/trapdoor.merit.edu/nanog/7/35484.html

which involves Microsoft servers in a whole /16 ignoring PTB messages
sent by routers in other networks.
>> I guess the majority of websites now can send jumboframes, like my
>> server can.

Apparently not, if most are connected to 100Mbps Ethernet switches -
though I guess it would be possible to use a 1Gbps NIC and switch but
somehow throttle the speed to something lower, while keeping the
usual ~9k MTU of 1Gbps Ethernet.

> That doesn't mean that all the stuff in the middle can handle
> jumboframes. The core of the network generally can, but the stuff
> around the edges, like the cheap switches that connect dozens of
> servers like yours, are likely to only support small packet sizes,
> either 1500 or "mini jumbos" of 1500 - 2000 bytes.

OK - just as my friend told me.

Thanks for pursuing this discussion. Sorry about the false alarm with
these too-long packets.

 - Robin

--
to unsubscribe send a message to [EMAIL PROTECTED] with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg
