On 6/3/2013 4:46 PM, Carlos Pignataro (cpignata) wrote:
Hi, Joe,
On Jun 3, 2013, at 1:43 PM, Joe Touch <[email protected]> wrote:
Hi, Carlos,
On 6/2/2013 12:22 PM, Carlos Pignataro (cpignata) wrote:
Joe,
On May 29, 2013, at 4:31 PM, Joe Touch <[email protected]> wrote:
...
I agree; my general recommendation has always been "the egress should
always clean up any mess created by an ingress" - which means using
"outer" fragmentation rather than "inner".
Why do you characterize this as a "mess"?
The ingress is creating a situation that takes work to correct (the "mess" I
refer to).
A. Using outer fragmentation places that work at the egress.
B. Using inner fragmentation pushes that work to the destination.
Note that fragmentation is a lot cheaper (less work) than reassembly. So (A)
makes it easy for routers to support GRE more cost-effectively, but drains
things like my iPhone battery as a result.
[I think you mean (B) and not (A) in the sentence above]
Yup.
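
To make the (A)/(B) distinction concrete, here is a minimal sketch that counts fragments and names who has to reassemble in each case. The header sizes (20-byte IPv4 header without options, 4-byte base GRE header) and all function names are assumptions for illustration; they do not come from this thread.

```python
import math

IPV4_HEADER = 20  # bytes, no options (assumption)
GRE_HEADER = 4    # bytes, base GRE header without optional fields (assumption)


def fragment_count(data_len, mtu, per_fragment_overhead):
    """Number of IPv4 fragments needed to carry data_len bytes of payload."""
    # Every fragment payload except the last must be a multiple of 8 bytes.
    chunk = (mtu - per_fragment_overhead) // 8 * 8
    if chunk <= 0:
        raise ValueError("MTU too small for the assumed headers")
    return max(1, math.ceil(data_len / chunk))


def outer_fragmentation(inner_packet_len, path_mtu):
    """(A) Encapsulate first, then fragment the delivery packet."""
    data = GRE_HEADER + inner_packet_len  # payload of the outer IPv4 header
    count = fragment_count(data, path_mtu, IPV4_HEADER)
    who = "tunnel egress" if count > 1 else "nobody"
    return {"fragments": count, "reassembled_by": who}


def inner_fragmentation(inner_packet_len, path_mtu):
    """(B) Fragment the original packet, then encapsulate each piece."""
    data = inner_packet_len - IPV4_HEADER  # payload of the inner IPv4 header
    # Each piece carries outer IP + GRE + inner IP headers.
    overhead = IPV4_HEADER + GRE_HEADER + IPV4_HEADER
    count = fragment_count(data, path_mtu, overhead)
    who = "destination host" if count > 1 else "nobody"
    return {"fragments": count, "reassembled_by": who}


if __name__ == "__main__":
    print(outer_fragmentation(2000, 1500))
    print(inner_fragmentation(2000, 1500))
```

For a 2000-byte packet over a 1500-byte path both approaches produce two pieces; the difference is only where the reassembly cost lands.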
A different way of looking at it is that a Host is generally more
optimized to perform this function than a router.
Some hosts are; others (like my phone) do this at a hefty cost (to me).
Some routers can do this (ones that are designed well); others cannot
(like ones that claim they support GRE but really don't because they
don't implement at-rate reassembly).
Or, if you prefer the
view that tunnel endpoints act as hosts for the Tunnel (delivery), then
a "host" is more "host" than a router for reassembly.
I disagree; the router liked being a host when it encapsulated the
packets. That includes generating unique IPv4 IDs (if DF=0), generating
checksums, etc. - that part it's OK with. If it wants to play host
there, it can play host at the egress too.
Frankly, for the case of draining smartphone battery, you could just use IPv6
or set DF.
I can't set DF for incoming packets, nor can I force use of IPv6 unless
I disable my IPv4 address.
I don't like that optimization. If you make a mess, IMO you should clean it up.
I understand you have a preference and you do not like it. On the
other hand, it is still a valid case. It is documented, for example, in
the second para of S4.1.4 of RFC 3931
http://tools.ietf.org/html/rfc3931#section-4.1.4
There are lots of things documented in many IETF protocols that aren't
always the best way to do things. This may be just another good example
of that.
...
...
Here's the difference for GRE:
- GRE adds N bytes total (GRE header + IP header)
- GRE over IP supports 65536-byte packets
So if a packet arrives that is smaller than 65536-N, IMO GRE ought to fragment
and reassemble it.
If a packet arrives that is larger than that, then GRE *cannot* tunnel it, and
*MUST* drop it and send a PTB.
I.e., only 65536 is "too big". Everything else is just whining about "bigger than I
want it to be" ;-)
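
The rule above can be sketched as a small decision function. This is only an illustration: the function name, the return strings, and the default overhead of 24 bytes for N are assumptions. The IPv4 Total Length field is 16 bits, so the sketch uses 65535 as the ceiling where the message rounds to 65536.

```python
IPV4_MAX_TOTAL_LENGTH = 65535  # largest value of the 16-bit Total Length field


def gre_ingress_action(packet_len, path_mtu, overhead=24):
    """Decide what a GRE ingress does with a packet under the rule above.

    overhead is N (GRE header + delivery IP header); 24 is an assumed value.
    """
    encapsulated_len = packet_len + overhead
    if encapsulated_len > IPV4_MAX_TOTAL_LENGTH:
        # The only case that is genuinely "too big" for the tunnel.
        return "drop and send PTB"
    if encapsulated_len > path_mtu:
        # Bigger than the path likes, but the tunnel can still carry it.
        return "encapsulate, fragment at ingress, reassemble at egress"
    return "encapsulate and forward"


if __name__ == "__main__":
    for size in (1000, 2000, 65512):
        print(size, "->", gre_ingress_action(size, path_mtu=1500))
```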
[For scope, this is an IPv4 discussion; with IPv4, the mechanism for this is to
set DF.]
DF means "don't fragment this packet and deliver the fragments". It's
always the choice of a capable link layer to fragment and then
reassemble packets.
E.g., DF=1 can go over ATM cells just fine; ATM fragments and
reassembles the packet within ATM. A tunnel ought to behave the same way.
The exception is when the tunnel - or link - cannot fragment and
reassemble. That's when the source should drop and send PTB.
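
A rough sketch of that view of DF, treating the tunnel like any other link. The function, its parameters, and the return strings are hypothetical; the 48-byte figure is the ATM cell payload size, used here only as an example of a small native unit.

```python
def link_action(packet_len, df, unit_size, link_reassembles):
    """What happens to an IP packet entering a link (or tunnel).

    unit_size: largest unit the link carries natively (hypothetical parameter).
    link_reassembles: True if the link splits and rebuilds packets internally.
    """
    if packet_len <= unit_size:
        return "forward"
    if link_reassembles:
        # Invisible to IP, so DF is irrelevant here (the ATM case).
        return "forward (link fragments and reassembles internally)"
    if df:
        return "drop and send PTB"
    return "IP-fragment; destination reassembles"


if __name__ == "__main__":
    # DF=1 over ATM-like cells with a 48-byte payload: still delivered.
    print(link_action(1500, df=True, unit_size=48, link_reassembles=True))
    # DF=1 over a link that cannot reassemble: PTB.
    print(link_action(1500, df=True, unit_size=576, link_reassembles=False))
```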
If DF is not set, which specification defines that a packet cannot
be fragmented?
Nothing says you cannot; I claim that you should not unless you HAVE to.
Tunnels like GRE don't have to.
It really is a trade-off; and as such, IMHO, there is not
one-size-fits-all but there are deployment decisions and practices.
Yes, this can cause repeated frag/reassembly, but the alternative is
to shift work to the end host, which I think is inappropriate.
I do not understand why it is "inappropriate" -- when thinking of the
tunnel as a link.
See above.
I am still not sure how "inappropriate" is a technical term. I do
understand your preference and I do follow the logic. My point is that
it does not seem to be a mandated behavior by a spec., and seems to be a
trade-off to be documented with pros and cons for this case.
The spec for DF behavior has a purpose - when DF=1, it's to help
endpoints discover ways to adjust their MTU so packets can traverse
links that *cannot* otherwise be traversed by larger packets.
When DF=0, it's to help endpoints traverse paths by allowing a packet
that *cannot* traverse a link to be broken down and reassembled at the
destination.
Nothing about DF says "hey, fragment anytime you feel like it". RFC1812
specifically says you're not supposed to fragment unless you have to
(that's the relevant RFC from the viewpoint of the traversing packet -
the tunnel looks like a link to a router).
My complaint for GRE is that there are two different MTUs - a 'native'
MTU that may describe efficiency but does NOT limit traversal, and a
true MTU (payload - N, where N is the size of the GRE+IP header). Using
the latter is the only choice that complies with RFC1812.
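
As a worked example of the two MTUs, under one reading of the paragraph above: the 'native' MTU is taken as the path MTU minus N, and the true MTU as the maximum IPv4 datagram size minus N. That reading, the function name, and the sample values (1500-byte path, N = 24) are assumptions.

```python
IPV4_MAX_TOTAL_LENGTH = 65535  # largest value of the 16-bit Total Length field


def tunnel_mtus(path_mtu, overhead):
    """Return (native_mtu, true_mtu) for a GRE tunnel.

    native_mtu: largest payload that avoids fragmentation on the path.
    true_mtu:   largest payload the tunnel can carry at all.
    """
    native_mtu = path_mtu - overhead
    true_mtu = IPV4_MAX_TOTAL_LENGTH - overhead
    return native_mtu, true_mtu


if __name__ == "__main__":
    native, true = tunnel_mtus(path_mtu=1500, overhead=24)
    print("native MTU:", native)
    print("true MTU:", true)
```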
Let me try a different example, and let me know which step you disagree with :-)
0. GRE Tunnel between R1 and R2, tunneling IPv4 from end hosts.
OK.
1. The delivery (encapsulating) header sets DF (as allowed by
RFC791, for example because R2 does not have sufficient resources to reassemble
internet fragments.)
This is the incorrect step; either DF=0 for IPv4 in general, or DF=1 and
you run PLMTUD so you can generate (outer) fragments that will traverse
the path correctly (e.g., as would be needed for IPv6, but would also be
useful for IPv4).
2. The GRE Tunnel is realized as a logical interface.
Sure.
3. A 2000 octet IPv4 datagram from an end host has the GRE tunnel as
the next-hop; R1 encapsulates in GRE, then IPv4 with DF set, and then
when trying to send the datagram sees that it is larger than the
physical MTU of the out interface.
There are two cases:
- you are running PLMTUD in background, at which point you already knew
this was going to happen at the GRE ingress, and you should have
fragmented correctly at the ingress (IPv6-like situation)
or
- your ingress should have set DF=0
4. R1 sends an ICMPv4 PTB to itself and drops the datagram, and the tunnel
learns its MTU.
Only if the ICMP error is correctly received.
5. Another datagram is received for the tunnel, with the DF bit
clear, what should R1 do? -> fragment the encapsulated packet and encapsulate it.
It should have done the same thing with both packets.
There is also a potential mischaracterization when you say "shift work
to the end host", because that can lead to assumptions that there is "a
single" end host (singular). In the case of a p2p tunnel, the challenge
is that there is a single pair of endpoints in the tunnel, but a
multitude of hosts behind and before them. A different view would be
that it's more efficient to distribute reassembly to many endpoints
instead of attacking the tunnel tailend and making it work on behalf of
many hosts.
It would be an attack, except that you're talking about a tunnel.
A tunnel is an ingress and an egress. Are you suggesting that making an
egress do its job is an "attack"?
A different perspective: If the tunnel is a link in a logical
topology, then this preference would be equivalent to asking every link
to perform data-link-level fragmentation and reassembly and allow any
size packets.
Up to the size it *cannot* handle as payload, yes.
Joe
_______________________________________________
Int-area mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/int-area