<I agree with pretty much everything said here, so I'll just snip it>
> I think our requirements are:
>
> 1. do not impede 1500-byte operation
> 2. discover and utilize jumboframe capability where possible
> 3. discover and utilize (close to) the maximum MTU
> 4. recover from sudden MTU reductions fast enough for TCP and similar
>    to survive

I'd add to this list:

5. Must be fully automatic and not require any admin intervention to do
   the "right" thing.
6. Minimise the resources used.

> So it makes sense to come up with a new protocol for this. An
> interesting notion here is that this protocol doesn't have to be IPv6-
> specific. However, in IPv6 we have neighbor unreachability detection
> which we can use to find MTU reductions fast enough to fall back to
> 1500 bytes before bad things happen. In IPv4 or pure ethernet, we don't
> have that, and we also don't have neighbor discovery to exchange
> per-host MRU/MTU information.
>
> If we use IPv6 for this, I think a new ICMP type makes sense.

Yep, makes sense to me. :)

> Whenever two systems (hosts or routers) on a link perform neighbor
> discovery, they can trigger the MTU verification immediately
> afterward, and if jumboframe support is confirmed by receiving the
> larger packets, the MTU for the neighbor can be updated. If the
> larger packets don't make it to the neighbor there is no complexity
> and no delay: communication was already underway at 1500 bytes and
> continues without the need for further action.

Yep! I'd considered this, but at 3am didn't want to confuse the issue by
introducing too many differing ideas at the same time. (Also, I wanted
to go to sleep :) You seem to have thought through some of the issues a
bit better than I had, too :)

> However, this doesn't accommodate finding out jumboframe support at
> reduced sizes very well. For this, I think we should use an additional
> exchange, but this one should probably happen over multicast.

I disagree.
There is no need for every host to have a full understanding of the
layer 2 topology of the network it is on. We're starting to see some
very large L2 networks as MANs (eg NLR[1]), and IPv6's /64 per subnet
puts no real practical limit on how large a single L2 segment can be.

The storage requirements for all of this information alone are going to
be very large (remembering MTUs for every neighbour on the link?!),
and/or the discovery is going to have to be repeated almost constantly
as hosts want to discover MTUs to specific neighbours.

Nodes that are reasonably "idle" on the link are going to have to
participate in this and therefore avoid entering low power modes.

Multicasting out "jumbo" packets is going to tax switch queue sizes to
the extreme. A 48 port switch, receiving a single multicast 9k packet
(that it can forward) may end up with 48*9216 = 442,368 bytes (nearly
half a megabyte) of buffer space used from a single arriving packet.

What happens on L2s where not every node can see every other node? Some
L2s only allow end hosts to talk to a master host. What about a network
with vlans where some hosts are on differing sets of vlans?

> Hosts/routers could take turns in a distributed search for the largest
> supported framesize. I think it's important that all jumbo-capable
> systems take part in this in order to deal with unusual topologies. For
> instance, consider a network with three switches: one supports 9000,
> another 8000 and the two are connected through a third switch that only
> supports 3000 bytes:
>
>       A                C
>       |                |
>     +-+--+  +----+  +--+-+
>     |9000+--+3000+--+8000|
>     +-+--+  +----+  +--+-+
>       |                |
>       B                D
>
> Suppose all hosts support 9216 byte jumboframes.

If A and B are talking to each other and C and D are talking to each
other, why do A and B need to talk to C and D?
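To put a number on the buffer claim above, here is the arithmetic as a quick sanity check. The port count and frame size are the figures from the example (a 48 port switch flooding one 9216-byte jumbo frame), not from any spec:

```python
# Back-of-the-envelope check of the switch buffer claim above: a single
# 9216-byte multicast jumbo frame replicated for every port of a
# 48 port switch. PORTS and JUMBO are illustrative figures.

PORTS = 48      # ports the frame may be flooded out of (worst case)
JUMBO = 9216    # bytes per replicated jumbo frame

buffer_bytes = PORTS * JUMBO
print(buffer_bytes)                     # 442368
print(round(buffer_bytes / 2**20, 2))   # ~0.42 MiB: "nearly half a megabyte"
```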
> I think the most efficient way to handle this is to do two concurrent
> searches: one for the maximum packet size that can be used to at least
> one correspondent, and one for the minimum jumboframe size that is
> supported by all jumboframe supporting systems.

Why not a simpler protocol?

Host A sends an NS to Host B with Host A's MRU. Host B replies with an
NA to Host A with Host B's MRU. Host A can now start transmitting the
data it wants to send.

Host A now sends Host B an ICMP MTU Probe, at some size less than or
equal to Host B's MRU. If Host B receives the packet, it replies with an
ICMP MTU Probe reply saying what it received.

By working "up" known sizes instead of doing a binary search (or working
down), Host A can quickly ratchet up sizes without waiting for a
timeout, gaining immediate benefit from larger MTUs as they are
discovered.

Which sizes should it use? Well, there should probably be a hard coded
list of "well known MTU sizes", which could perhaps be expanded at
runtime (if you find a host with an MTU that's not in the list, add it
so you can find that MTU more quickly in future).

> So first A sends out an announcement that it's going to send a 9216
> byte and a 5596 (1500 + 4096) byte packet, and then sends the packets.
> Nobody receives the first packet, but everyone knows A sent it because
> of the preceding announcement, and B receives the second packet.

If you persist with this idea, switch the order of the packets sent:
send the packet first, then send the announcement that it was sent.
Assuming no reordering, you then don't have to wait for a timeout. If
reordering does occur, you send a "Whoops! Reordering! Didn't expect
that on the same L2!" message, everyone flags that interface as
"possible reordering", and from then on always waits for a timeout. In
the common case of no reordering this will be much faster due to not
waiting for timeouts.
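The ratchet-up probing described above can be sketched roughly as follows. This is a minimal illustration, not an implementation: probe() stands in for a hypothetical "send ICMP MTU Probe, wait for reply or timeout" exchange, and the well-known-sizes list is an assumed example, not taken from any specification:

```python
# Illustrative sketch of the "ratchet up known sizes" loop. probe(size)
# is a hypothetical callback: True if the peer acknowledged an ICMP MTU
# Probe of that size, False on timeout. All names are assumptions.

WELL_KNOWN_MTUS = [1500, 4352, 4464, 8192, 9000, 9216]  # illustrative list

def ratchet_mtu(peer_mru, probe, known_sizes=WELL_KNOWN_MTUS):
    """Walk up the known sizes, never exceeding the peer's advertised MRU.

    Each successful probe immediately raises the usable MTU, so traffic
    benefits from larger frames as soon as they are confirmed; the first
    failed (timed out) probe ends the search, so there is at most one
    timeout to wait for.
    """
    mtu = 1500  # baseline: communication is already under way at 1500
    for size in sorted(known_sizes):
        if size <= mtu or size > peer_mru:
            continue        # nothing to gain, or beyond what B can receive
        if probe(size):
            mtu = size      # ratchet up and keep going
        else:
            break           # first failure ends the search
    return mtu
```

For example, against a peer advertising an MRU of 9216 on a path that silently drops anything over 9000 bytes, `ratchet_mtu(9216, lambda s: s <= 9000)` settles on 9000 after a single timeout.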
> Then B would (for instance) send out its 9216 byte packet along with a
> 1500 + 2048 = 3548 byte packet, and also indicates the largest size
> that worked (5596) and the smallest size that didn't work (9216). A
> receives the 3548 byte packet but not the 9216 byte one.
>
> C is next and sends out 9216 and (1500 + 1024 = ) 2524 byte packets,
> along with the information that no jumboframe size has worked so far.
> A, B and D all receive the 2524 byte packet.
>
> D then sends out 9216 and (1500 + 1536 = ) 3036 byte packets with
> information that it received 2524 but not 3548. C receives the 3036
> byte packet.
>
> It's now A's turn again. A knows that the size that everyone can
> receive is between 2524 and 3036 and the size that at least one
> correspondent can receive is between 5596 and 9216. So it sends out
> 2780 and 7406 byte packets.
>
> And so on.
>
> After a few rounds like this, each system knows the maximum jumboframe
> size it can send/receive (so it can adjust its announcements in the ND
> option), and the minimum jumboframe size that everyone supports. It's
> probably doable to generalize this into any given number of levels, but
> I doubt that more than 3 is worth the trouble, and maybe having two
> levels even isn't. On the other hand, if some hosts support 9000 but
> the majority support 8192 it may be a good idea to forget 9000 and just
> do 8192.
>
> This may sound horribly complex, but it really isn't. :-)
>
> The biggest challenge is probably making the different systems talk in
> turn, but that can probably be done by having a timer that depends on
> the difference in MAC address between the last system to transmit and
> the prospective next one.

Not everything has a MAC address. Difference in link local addresses?

This sounds very much like turning Ethernet into token ring <grin>.

> Extra credit: monitor spanning tree events for quick adaptation to
> changing layer 2 topologies.

Ooh, very good call. I'd not considered this!
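For reference, the quoted scheme is two concurrent binary searches: each node tracks bounds for (a) the largest size everyone is known to receive and (b) the largest size at least one correspondent can receive, and probes the midpoints on its turn. A rough sketch, with illustrative names (not from any proposal):

```python
# Sketch of the two concurrent bisections in the quoted turn-taking
# scheme. Each pair (lo, hi) brackets a search: lo is a size known to
# work, hi a size known (or assumed) to fail. Names are illustrative.

def next_probes(all_lo, all_hi, any_lo, any_hi):
    """Midpoint probe sizes for the 'everyone' and 'at least one' searches."""
    return (all_lo + all_hi) // 2, (any_lo + any_hi) // 2

# Reproducing A's second turn from the example above: the everyone-size
# is bracketed by 2524 (worked) and 3036 (not yet confirmed), the
# at-least-one size by 5596 (worked) and 9216 (failed).
print(next_probes(2524, 3036, 5596, 9216))  # (2780, 7406)
```

This reproduces the 2780 and 7406 byte probes in the example, which is what makes each search converge in O(log n) rounds of turns.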
> Alternatively, we could add an RA option that administrators can use to
> tell hosts the jumboframe size the layer 2 network supports. (The RA
> option doesn't say anything about the capabilities of the _router_.)
> Then the whole multicast taking-turns discovery isn't necessary, and we
> can suffice with a quick one-to-one verification before jumboframes are
> used.

This still seems to fall foul of either requiring the administrator to
configure the router, or degrading the entire network to the level of
the router. Routers are often the bottleneck (eg DSL in NZ is currently
<= ~2 Mbit/s, so why have a DSL router that's faster than 10 Mbit/s?).
A 10 Mbit/s router still means a low MTU, but hosts within the site
should still try to use jumboframes.

----
[1]: National Lambda Rail claim to be the first national switched
Ethernet (http://www.nlr.net/architecture.html)

--------------------------------------------------------------------
IETF IPv6 working group mailing list
[email protected]
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------
