<I agree with pretty much everything said here, so I'll just snip it>
> I think our requirements are:
>
> 1. do not impede 1500-byte operation
> 2. discover and utilize jumboframe capability where possible
> 3. discover and utilize (close to) the maximum MTU
> 4. recover from sudden MTU reductions fast enough for TCP and similar
>    to survive

I'd add to this list:

5. Must be fully automatic and not require any admin intervention to do
   the "right" thing.
6. Minimise the resources used.

> So it makes sense to come up with a new protocol for this. An
> interesting notion here is that this protocol doesn't have to be IPv6-
> specific. However, in IPv6 we have neighbor unreachability detection
> which we can use to find MTU reductions fast enough to fall back to
> 1500 bytes before bad things happen. In IPv4 or pure ethernet, we don't
> have that, and we also don't have neighbor discovery to exchange
> per-host MRU/MTU information.
>
> If we use IPv6 for this, I think a new ICMP type makes sense.

Yep, makes sense to me. :)

> Whenever two systems (hosts or routers) on a link perform neighbor
> discovery, they can trigger the MTU verification immediately
> afterward, and if jumboframe support is confirmed by receiving the
> larger packets, the MTU for the neighbor can be updated. If the
> larger packets don't make it to the neighbor there is no complexity
> and no delay: communication was already underway at 1500 bytes and
> continues without the need for further action.

Yep! I'd considered this, but at 3am didn't want to confuse the issue by
introducing too many differing ideas at the same time. (Also, I wanted
to go to sleep :) You seem to have thought through some of the issues a
bit better than I had, too :)

> However, this doesn't accommodate finding out jumboframe support at
> reduced sizes very well. For this, I think we should use an additional
> exchange, but this one should probably happen over multicast.

I disagree.
There is no need for every host to have a full understanding of the
layer 2 topology of the network it is on. We're starting to see some
very large L2 networks as MANs (eg NLR[1]), and IPv6's /64 per subnet
puts no real practical limit on how large a single L2 segment can be.

The storage requirements for all of this information alone are going to
be very large (remembering MTUs for every neighbour on the link?!),
and/or the discovery is going to have to be repeated almost constantly
as hosts want to discover MTUs to specific neighbours.

Nodes that are reasonably "idle" on the link are going to have to
participate in this and therefore avoid entering low power modes.

Multicasting out "jumbo" packets is going to tax switch queue sizes to
the extreme. A 48 port switch, receiving a single multicast 9k packet
(that it can forward) may end up with 48*9216 = 442,368 bytes (nearly
half a megabyte) of buffer space used from a single arriving packet.

What happens on L2s where not every node can see every other node? Some
L2s only allow end hosts to talk to a master host. What about a network
with vlans where some hosts are on differing sets of vlans?

> Hosts/routers could take turns in a distributed search for the largest
> supported framesize. I think it's important that all jumbo-capable
> systems take part in this in order to deal with unusual topologies. For
> instance, consider a network with three switches: one supports 9000,
> another 8000 and the two are connected through a third switch that only
> supports 3000 bytes:
>
>       A                C
>       |                |
>     +-+--+  +----+  +--+-+
>     |9000+--+3000+--+8000|
>     +-+--+  +----+  +--+-+
>       |                |
>       B                D
>
> Suppose all hosts support 9216 byte jumboframes.

If A and B are talking to each other and C and D are talking to each
other, why do A and B need to talk to C and D?
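To put a number on the buffer claim above, here is the arithmetic as a quick sanity check. The port count and frame size are the figures from the example (a 48 port switch flooding one 9216-byte jumbo frame), not from any spec:

```python
# Back-of-the-envelope check of the switch buffer claim above: a single
# 9216-byte multicast jumbo frame replicated for every port of a
# 48 port switch. PORTS and JUMBO are illustrative figures.

PORTS = 48      # ports the frame may be flooded out of (worst case)
JUMBO = 9216    # bytes per replicated jumbo frame

buffer_bytes = PORTS * JUMBO
print(buffer_bytes)                     # 442368
print(round(buffer_bytes / 2**20, 2))   # ~0.42 MiB: "nearly half a megabyte"
```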
> I think the most efficient way to handle this is to do two concurrent
> searches: one for the maximum packet size that can be used to at least
> one correspondent, and one for the minimum jumboframe size that is
> supported by all jumboframe supporting systems.

Why not a simpler protocol?

Host A sends an NS to Host B with Host A's MRU. Host B replies with an
NA to Host A with Host B's MRU. Host A can now start transmitting the
data it wants to send.

Host A now sends Host B an ICMP MTU Probe, at some size less than or
equal to Host B's MRU. If Host B receives the packet, it replies with an
ICMP MTU Probe reply saying what it received.

By working "up" known sizes instead of doing a binary search (or working
down), Host A can quickly ratchet up sizes without waiting for a
timeout, gaining immediate benefit from larger MTUs as they are
discovered.

Which sizes should it use? Well, there should probably be a hard coded
list of "well known MTU sizes", which could perhaps be expanded at
runtime (if you find a host with an MTU that's not in the list, add it
so you can find that MTU more quickly in future).

> So first A sends out an announcement that it's going to send a 9216
> byte and a 5596 (1500 + 4096) byte packet, and then sends the packets.
> Nobody receives the first packet, but everyone knows A sent it because
> of the preceding announcement, and B receives the second packet.

If you persist with this idea, switch the order of the packets sent:
send the packet first, then send the announcement that it was sent.
Assuming no reordering, you then don't have to wait for a timeout. If
reordering does occur, you send a "Whoops! Reordering! Didn't expect
that on the same L2!" message, everyone flags that interface as
"possible reordering", and from then on always waits for a timeout. In
the common case of no reordering this will be much faster due to not
waiting for timeouts.
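The ratchet-up probing described above can be sketched roughly as follows. This is a minimal illustration, not an implementation: probe() stands in for a hypothetical "send ICMP MTU Probe, wait for reply or timeout" exchange, and the well-known-sizes list is an assumed example, not taken from any specification:

```python
# Illustrative sketch of the "ratchet up known sizes" loop. probe(size)
# is a hypothetical callback: True if the peer acknowledged an ICMP MTU
# Probe of that size, False on timeout. All names are assumptions.

WELL_KNOWN_MTUS = [1500, 4352, 4464, 8192, 9000, 9216]  # illustrative list

def ratchet_mtu(peer_mru, probe, known_sizes=WELL_KNOWN_MTUS):
    """Walk up the known sizes, never exceeding the peer's advertised MRU.

    Each successful probe immediately raises the usable MTU, so traffic
    benefits from larger frames as soon as they are confirmed; the first
    failed (timed out) probe ends the search, so there is at most one
    timeout to wait for.
    """
    mtu = 1500  # baseline: communication is already under way at 1500
    for size in sorted(known_sizes):
        if size <= mtu or size > peer_mru:
            continue        # nothing to gain, or beyond what B can receive
        if probe(size):
            mtu = size      # ratchet up and keep going
        else:
            break           # first failure ends the search
    return mtu
```

For example, against a peer advertising an MRU of 9216 on a path that silently drops anything over 9000 bytes, `ratchet_mtu(9216, lambda s: s <= 9000)` settles on 9000 after a single timeout.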
> Then B would (for instance) send out its 9216 byte packet along with a
> 1500 + 2048 = 3548 byte packet, and also indicates the largest size
> that worked (5596) and the smallest size that didn't work (9216). A
> receives the 3548 byte packet but not the 9216 byte one.
>
> C is next and sends out 9216 and (1500 + 1024 = ) 2524 byte packets,
> along with the information that no jumboframe size has worked so far.
> A, B and D all receive the 2524 byte packet.
>
> D then sends out 9216 and (1500 + 1536 = ) 3036 byte packets with
> information that it received 2524 but not 3548. C receives the 3036
> byte packet.
>
> It's now A's turn again. A knows that the size that everyone can
> receive is between 2524 and 3036 and the size that at least one
> correspondent can receive is between 5596 and 9216. So it sends out
> 2780 and 7406 byte packets.
>
> And so on.
>
> After a few rounds like this, each system knows the maximum jumboframe
> size it can send/receive (so it can adjust its announcements in the ND
> option), and the minimum jumboframe size that everyone supports. It's
> probably doable to generalize this into any given number of levels, but
> I doubt that more than 3 is worth the trouble, and maybe having two
> levels even isn't. On the other hand, if some hosts support 9000 but
> the majority support 8192 it may be a good idea to forget 9000 and just
> do 8192.
>
> This may sound horribly complex, but it really isn't. :-)
>
> The biggest challenge is probably making the different systems talk in
> turn, but that can probably be done by having a timer that depends on
> the difference in MAC address between the last system to transmit and
> the prospective next one.

Not everything has a MAC address. Difference in link local addresses?

This sounds very much like turning Ethernet into token ring <grin>.

> Extra credit: monitor spanning tree events for quick adaptation to
> changing layer 2 topologies.

Ooh, very good call. I'd not considered this!
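For reference, the quoted scheme is two concurrent binary searches: each node tracks bounds for (a) the largest size everyone is known to receive and (b) the largest size at least one correspondent can receive, and probes the midpoints on its turn. A rough sketch, with illustrative names (not from any proposal):

```python
# Sketch of the two concurrent bisections in the quoted turn-taking
# scheme. Each pair (lo, hi) brackets a search: lo is a size known to
# work, hi a size known (or assumed) to fail. Names are illustrative.

def next_probes(all_lo, all_hi, any_lo, any_hi):
    """Midpoint probe sizes for the 'everyone' and 'at least one' searches."""
    return (all_lo + all_hi) // 2, (any_lo + any_hi) // 2

# Reproducing A's second turn from the example above: the everyone-size
# is bracketed by 2524 (worked) and 3036 (not yet confirmed), the
# at-least-one size by 5596 (worked) and 9216 (failed).
print(next_probes(2524, 3036, 5596, 9216))  # (2780, 7406)
```

This reproduces the 2780 and 7406 byte probes in the example, which is what makes each search converge in O(log n) rounds of turns.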
> Alternatively, we could add an RA option that administrators can use to
> tell hosts the jumboframe size the layer 2 network supports. (The RA
> option doesn't say anything about the capabilities of the _router_.)
> Then the whole multicast taking-turns discovery isn't necessary, and we
> can suffice with a quick one-to-one verification before jumboframes are
> used.

This still seems to fall foul of either requiring the administrator to
configure the router, or degrading the entire network to the level of
the router. Routers are often the bottleneck (eg DSL in NZ is currently
<= ~2 Mbit/s, so why have a DSL router that's faster than 10 Mbit/s?).
A 10 Mbit/s router still means a low MTU, but hosts within the site
should still try to use jumboframes.

----
[1]: National Lambda Rail claim to be the first national switched
Ethernet (http://www.nlr.net/architecture.html)

--------------------------------------------------------------------
IETF IPv6 working group mailing list
[email protected]
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------
