There has been a bunch of discussion about the reassembly of
fragmented packets. Because fragmentation and hop-by-hop reliability
are related, we've been considering them both together. Here is a
proposal for how to address these and other related issues. Although
some of this is in normative text, much of it is descriptive and
normative text will need to be written for the draft.
Bruce
Topics that need to be addressed include:
* Via List changes
* hop-by-hop fragmenting, end-to-end-reassembly of fragmented messages
* MTU discovery
* TFRC and semi-reliable retransmission for unreliable links
* framing for stream-based links (TLS/TCP)
* queueing overlay messages
* link failure detection
* end-to-end retransmission
* removal of route-logging
* negotiating new overlay link protocols
* Via List changes
The current via-list structure will be changed. The shortest via
list entry (other than for the origination and destination) is a
15-bit opaque identifier with meaning only to the forwarding peer.
This 15-bit information is encoded in only 16 bits to minimize the
size of the via list. We have chosen to go with this very tight
encoding because the entire forwarding header must be present in each
fragment, and the protocol must work well over links with very small
MTUs.
The remainder of the encoding will remain as it is now, although a
variable-length opaque identifier will also be allowed for situations
where the implementer or deployer knows that more than 15 bits of
information is required. An implementation SHOULD use the shortest
possible entry for its application.
If the first bit is 0, then the Destination is 16 bits long, and the
destination_data is an opaque 15-bit identifier.
(these structure definitions are conceptual only and intended for
discussion. the final representation may be different.)
struct {
uint8 isLong:1; // =0
uint16 dest_info:15;
}
struct {
uint8 isLong:1; // =1
uint8 length:7
DestinationType type;
DestinationData destination_data;
}
add a DestinationType for opaque id;
One possible encoding of the dest_info opaque identifier is to encode
an index into a connection table. To avoid misrouting responses in
the event a response is delayed and the connection table entry has
changed, the identifier should be split between an index and a
generation counter for that index. At startup, the generation
counters should be initialized to random values. An implementation
could use 12 bits for the connection table index and 3 bits for the
generation counter. (Note that this does not suggest a 4096 entry
connection table for every node, only the ability to encode for a
larger connection table.) When a connection table slot is used for a
new connection, the generation counter is incremented (with wrapping).
Connection table slots are used on a rotating basis to maximize the
time interval between using the same slot for different connections.
When routing a message to an entry in the destination list encoding a
connection table entry, the node confirms that the generation counter
matches the current generation counter of that index before forwarding
the message. If it does not match, the message is silently dropped.
Regardless of how the dest_info field or opaque DestinationType is
used, the encoding MUST include a generation counter designed to
prevent misrouting of responses due to the connection table entry
having changed since the request message was originally forwarded.
* hop-by-hop fragmenting, end-to-end-reassembly of fragmented messages
Any node along the path can fragment the message but only the final
destination reassembles the fragments. When a node takes a packet and
fragments it, each fragment has a full copy of the Forwarding Header
but the data after the Forwarding Header is broken up in appropriate
sized chunks. The size of the payload chunks needs to take into
account space to allow the via and lists to grow. Each fragment MUST
contain a full copy of the via and destination list and MUST contain
at least 256 bytes of the message body. If the via and destination
list are so large that this is not possible, RELOAD fragmentation is
not performed and IP-layer fragmentation is allowed to occur. When a
message must be fragmented, it SHOULD be split into equal-sized
fragments that are no larger than the PMTU of the next overlay link
minus 32 bytes. This is to allow the via list to grow before further
fragmentation is required.
Note that this fragmentation is not optimal for the end-to-end
path---a message may be refragmented multiple times as it traverses
the overlay. This option has been chosen as it is far easier to
implement than e2e PMTU discovery across an ever-changing overlay, and
effectively addresses the reliability issues of relying on IP-layer
fragmentation. However, PING will be extended to allow e2e PMTU to be
implemented if desired.
The final node that was receiving the data would need to reassemble
it. Any fragment older than 15 seconds SHOULD be discarded. However,
it does need to store all the fragments it receives and once it has
all the fragments for a messages. The amount of buffer space it needs
for this is proportional to whatever bit rate it is wiling to precess
messages at. Will have to specify something allowing excess fragments
to be dropped in case of attack.
* MTU discovery
MTU discovery is required both at the overlay link level and
end-to-end across the overlay (being the minimum of the overlay link
MTUs along the path). We will refer to these as overlay link MTU and
end-to-end MTU.
PING is used for overlay link MTU discovery. After establishing a new
connection, a node uses PING sent to this neighbor with the DF bit set
and a NULL (need to define) forwarding option header that extends the
message to the desired length. A series of these probes is used to
identify the overlay link MTU for this connection. See RFC4821 for
more details on this algorithm.
If a node choses not to perform overlay link MTU discovery, or before
it has determined a larger overlay link MTU via a PING, it SHOULD use
a default overlay link MTU of the overlay link of 576 bytes for IPv4
and 1280 bytes for IPv6 (RFC5405).
For end-to-end MTU discovery, the PING response will have an
attribute added indicating the largest fragment of the PING request
that was received. Calculating this value is OPTIONAL and if not
implemented, 0 should be reported in this field. Note that use of
this field measures a lower-bound, and not an upper bound on the
end-to-end MTU. It also does not ensure that each hop along the path
did not use IP-layer fragmentation. End-to-end MTU discovery is not
required nor specified by this document.
* TFRC and semi-reliable retransmission for unreliable links
Goal of algorithm specified in document is:
- provide a stop and wait algorithm that is as simple to implement as
possible
- provide a somewhat more complicated AIMD algorithm that is also more
conservative than TFRC and allows pipelining
- allow TFRC to be implemented
- do this with a single simple receiver, so all changes to allow more
complicated congestion control are done on the sender side.
Much of this is the same as what is in the current draft, but there
are some clarifications here.
Each packet sent has a sequence number. A sequence number is never
reused (until wraparound), if a retransmission occurs it receives a
new sequence number. It is up to the sender to keep track of that.
When the receiver sends an ACK, it ACKs a particular sequence number
and sets bits in the mask for recently received packets that it has a
record of. A bit not being set in the mask is not a NACK. Nodes
SHOULD remember the previous 32 sequence numbers. A receiver MUST NOT
use any form of delayed ACKs.
A receiver does not implement any form of a buffer or memory of which
packets it has or has not received outside of the most recent 32. It
is purely the job for the sender to implement semi-reliability.
The hop-by-hop link retransmission is intended to provide
semi-reliable service. (good-effort service.) It does not, and
cannot, provide total reliability. Among other problems, an overlay
network may experience congestion or stability problems, so attempts
to achieve perfect hop-by-hop reliability will not achieve the desired
effect of perfect message reliability.
Now details of the sender algorithms:
stop-and-wait:
Stop and wait allows only one fragment to be in transit per DTLS
connection. It has abysmal performance for WAN connections, but is
typically acceptable for LAN performance (for low-bandwidth overlays).
A peer SHOULD retransmit a fragment if it has not received an ACK for
that fragment starting with an interval of RTO ("Retransmission
TimeOut"), doubling after each retransmission. In each retransmission,
the sequence number is incremented. The RTO is an estimate of the
round-trip time (RTT), and is computed as described in RFC 2988
[RFC2988], with two exceptions. First, the initial value for RTO
SHOULD be calculated from the STUN connectivity checks used to
establish the connection. If that is not possible, the initial value
for RTO SHOULD be configurable (rather than the 3 s recommended in RFC
2988) and SHOULD be greater than 500 ms. In fixed-line access links,
a value of 500 ms is RECOMMENDED. Second, the value of RTO SHOULD NOT
be rounded up to the nearest second. Rather, a 1 ms accuracy SHOULD be
maintained. As with TCP, the usage of Karn's algorithm is RECOMMENDED
[TODO REF KARN87] which means that RTT estimates SHOULD NOT be
computed from transactions that result in the retransmission of a
request. The value for RTO is calculated separately for each DTLS
session.
Retransmissions continue until a response is received, or until a
total of Rc requests have been sent or there has been a hard ICMP
error [RFC1122]. The RECOMMENDED value of Rc is 5 and it MUST be at
least 3. The receiver knows a responses was received by receiving and
ACK with a sequence number that indicates it is a response to one of
the transmissions of this fragment. For example, assuming an RTO of
500 ms and Rc of 5, requests would be sent at times 0 ms, 500 ms, 1500
ms, 3500 ms, and 7500 ms. If all retransmissions for a fragment fail,
the DTLS connection SHOULD be closed.
Once an ACK has been received for a fragment, the next fragment can be
sent but the peer SHOULD ensure that there is at least 10 ms between
sending any two fragments.
AIMD:
A sender can implement TCP or TCP-like algorithms provided that they
meet the aggressiveness requirement. The algorithm here is only the
AIMD portion of TCP. All other features are restricted to simplify
the implementation, i.e. no slow start (initial window is 1), no fast
retransmission, and no fast recovery. An implementation may use one
timer per fragment or a single timer for only the oldest outstanding
fragment.
AIMD extends stop and wait by allowing multiple fragments to be
pending at the same time. The sender allows w unacknowledged
fragments to be outstanding at any given time. w is initially set to
one. Every RTO interval that no retransmissions occur, w is increased
by one. When a loss occurs, w is halved. After halving w, if there
are more than w fragments for which an ACK is pending, no further
retransmissions of the most recently initiated fragments are performed
until they fit in the window w, at which point they begin the
retransmission algorithm again. w is held fixed for one RTO. After that
point, if additional retransmissions occur, it will be halved again,
otherwise it may be incremented after an additional RTO without loss.
If w drops to one and the one pending fragment is not ACKed by the
other side after 5 requests are sent, the link is considered to have
failed. Otherwise, unACKed fragments are simply dropped.
TFRC:
TFRC (RFC5348) can be implemented in the Sender-Based Variant format
with the same receiver algorithm. This implementation requires the
sender to maintain precise timestamps in ms of the transmission time
of each sequence number as well as the segment sizes. That, combined
with the ACKs (and that the ACKs are not delayed) allows the sender to
calculate the performance required by TFRC, including information
calculated by the receiver in the conventional form of TFRC.
TFRC is used for congestion control. For reliability, an individual
fragment is retransmitted up to twice at RTO intervals, pending the
availability of room in the congestion window. If it is not
ACKed after another RTO following its last retransmission, it is dropped.
* framing for stream-based links (TLS/TCP)
A framing header is specified for TCP links. The framing headers are
used to calculate RTT and detect when the TCP link has failed.
* queueing overlay messages
For all link types, a peer MUST implement an appropriate queueing
discipline for determining which messages are forwarded. For all link
types, the peer MUST queue at least 5 messages. For overlay link
algorithms that implement a congestion window, the peer SHOULD queue
the greater of 5 messages or the current size of the congestion
window. For overlay link algorithms that rely on an underlying
buffer, such as TCP, the peer SHOULD queue 5 messages beyond what fits
in the socket buffer. Tuning for performance of the overlay link
SHOULD be accomplished by adjusting the size of the socket buffer
appropriately (ref). In no case should a message be queued longer than
500ms.
A node MAY implement a separate queue for internally originated
messages, but when operating in a congestion-limited mode, the node
MUST NOT allow its internally originated traffic to dominate the link.
* link failure detection
Links are detected to have failed under these conditions:
- periodic STUN keepalive fails (need to specify when these are used)
- no new TCP or UDP framing acknowledgements are received for 10RTT
(Note that ICE and ICE-TCP both require the use of keepalives. A node
MAY elect to substitute receiving framing acknowledgements for
separate active STUN keepalives.)
* end-to-end retransmission
End-to-end retransmission will be accomplished using exponential
backoff. An implementation SHOULD track round-trip times of messages
sent on the overlay and use set end-to-end RTO to 1.5 times the
average RTT. An implementation MAY track RTTs and set RTOs on a
per-destination basis. If an implementation does not track end-to-end
RTT, then RTO SHOULD default to 3000ms.
* removal of route-logging
route log will be removed from the forwarding header
* negotiating new overlay link protocols
a protocol ID will be added to the ICE negotiation to specify what
link protocol is being run between nodes, analogous to how media is
negotiated in SDP. To avoid offer/answer headaches, we will not
import the concept of an offer-less attach, and there will be a
mandatory protocol that must be included in each offer so there cannot
be a failure to share a protocol. (optionally include "better" base
option in overlay config?)
_______________________________________________
P2PSIP mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/p2psip