Re: [P2PSIP] proposal for reload congestion control, fragmentation, and retransmission

Saverio Mascolo Fri, 27 Feb 2009 04:26:31 -0800

dear bruce,

at your best  knowledge    is TFRC the only non-TCP protocol considered for
multimedia transport? i would say that, as a matter of fact, TCP is
currently used by all P2P applications and video streaming applications...


do we need a rate-based protocol? and if yes, wht TFRC seems not
appropriate?

saverio

On Fri, Feb 27, 2009 at 12:58 PM, Bruce Lowekamp <[email protected]> wrote:

> There has been a bunch of discussion about the reassembly of
> fragmented packets.  Because fragmentation and hop-by-hop reliability
> are related, we've been considering them both together. Here is a
> proposal for how to address these and other related issues.  Although
> some of this is in normative text, much of it is descriptive and
> normative text will need to be written for the draft.
>
> Bruce
>
> Topics that need to be addressed include:
>
> * Via List changes
> * hop-by-hop fragmenting, end-to-end-reassembly of fragmented messages
> * MTU discovery
> * TFRC and semi-reliable retransmission for unreliable links
> * framing for stream-based links (TLS/TCP)
> * queueing overlay messages
> * link failure detection
> * end-to-end retransmission
> * removal of route-logging
> * negotiating new overlay link protocols
>
>
> * Via List changes
>
> The current via-list structure will be changed.  The shortest via
> list entry (other than for the origination and destination) is a
> 15-bit opaque identifier with meaning only to the forwarding peer.
> This 15-bit information is encoded in only 16 bits to minimize the
> size of the via list.  We have chosen to go with this very tight
> encoding because the entire forwarding header must be present in each
> fragment, and the protocol must work well over links with very small
> MTUs.
>
> The remainder of the encoding will remain as it is now, although a
> variable-length opaque identifier will also be allowed for situations
> where the implementer or deployer knows that more than 15 bits of
> information is required.  An implementation SHOULD use the shortest
> possible entry for its application.
>
> If the first bit is 0, then the Destination is 16 bits long, and the
> destination_data is an opaque 15-bit identifier.
>
> (these structure definitions are conceptual only and intended for
> discussion.  the final representation may be different.)
>
> struct {
>  uint8 isLong:1; // =0
>  uint16 dest_info:15;
> }
>
> struct {
>  uint8 isLong:1; // =1
>  uint8 length:7
>  DestinationType type;
>  DestinationData destination_data;
> }
>
> add a DestinationType for opaque id;
>
>
> One possible encoding of the dest_info opaque identifier is to encode
> an index into a connection table.  To avoid misrouting responses in
> the event a response is delayed and the connection table entry has
> changed, the identifier should be split between an index and a
> generation counter for that index.  At startup, the generation
> counters should be initialized to random values.  An implementation
> could use 12 bits for the connection table index and 3 bits for the
> generation counter.  (Note that this does not suggest a 4096 entry
> connection table for every node, only the ability to encode for a
> larger connection table.)  When a connection table slot is used for a
> new connection, the generation counter is incremented (with wrapping).
> Connection table slots are used on a rotating basis to maximize the
> time interval between using the same slot for different connections.
> When routing a message to an entry in the destination list encoding a
> connection table entry, the node confirms that the generation counter
> matches the current generation counter of that index before forwarding
> the message.  If it does not match, the message is silently dropped.
>
> Regardless of how the dest_info field or opaque DestinationType is
> used, the encoding MUST include a generation counter designed to
> prevent misrouting of responses due to the connection table entry
> having changed since the request message was originally forwarded.
>
>
> * hop-by-hop fragmenting, end-to-end-reassembly of fragmented messages
>
>
> Any node along the path can fragment the message but only the final
> destination reassembles the fragments. When a node takes a packet and
> fragments it, each fragment has a full copy of the Forwarding Header
> but the data after the Forwarding Header is broken up in appropriate
> sized chunks. The size of the payload chunks needs to take into
> account space to allow the via and lists to grow.  Each fragment MUST
> contain a full copy of the via and destination list and MUST contain
> at least 256 bytes of the message body.  If the via and destination
> list are so large that this is not possible, RELOAD fragmentation is
> not performed and IP-layer fragmentation is allowed to occur.  When a
> message must be fragmented, it SHOULD be split into equal-sized
> fragments that are no larger than the PMTU of the next overlay link
> minus 32 bytes.  This is to allow the via list to grow before further
> fragmentation is required.
>
> Note that this fragmentation is not optimal for the end-to-end
> path---a message may be refragmented multiple times as it traverses
> the overlay.  This option has been chosen as it is far easier to
> implement than e2e PMTU discovery across an ever-changing overlay, and
> effectively addresses the reliability issues of relying on IP-layer
> fragmentation.  However, PING will be extended to allow e2e PMTU to be
> implemented if desired.
>
> The final node that was receiving the data would need to reassemble
> it. Any fragment older than 15 seconds SHOULD be discarded. However,
> it does need to store all the fragments it receives and once it has
> all the fragments for a messages.  The amount of buffer space it needs
> for this is proportional to whatever bit rate it is wiling to precess
> messages at.  Will have to specify something allowing excess fragments
> to be dropped in case of attack.
>
>
> * MTU discovery
>
> MTU discovery is required both at the overlay link level and
> end-to-end across the overlay (being the minimum of the overlay link
> MTUs along the path).  We will refer to these as overlay link MTU and
> end-to-end MTU.
>
> PING is used for overlay link MTU discovery.  After establishing a new
> connection, a node uses PING sent to this neighbor with the DF bit set
> and a NULL (need to define) forwarding option header that extends the
> message to the desired length.  A series of these probes is used to
> identify the overlay link MTU for this connection.  See RFC4821 for
> more details on this algorithm.
>
> If a node choses not to perform overlay link MTU discovery, or before
> it has determined a larger overlay link MTU via a PING, it SHOULD use
> a default overlay link MTU of the overlay link of 576 bytes for IPv4
> and 1280 bytes for IPv6 (RFC5405).
>
> For end-to-end MTU discovery, the PING response will have an
> attribute added indicating the largest fragment of the PING request
> that was received.  Calculating this value is OPTIONAL and if not
> implemented, 0 should be reported in this field.  Note that use of
> this field measures a lower-bound, and not an upper bound on the
> end-to-end MTU.  It also does not ensure that each hop along the path
> did not use IP-layer fragmentation.  End-to-end MTU discovery is not
> required nor specified by this document.
>
> * TFRC and semi-reliable retransmission for unreliable links
>
> Goal of algorithm specified in document is:
>
> - provide a stop and wait algorithm that is as simple to implement as
>  possible
> - provide a somewhat more complicated AIMD algorithm that is also more
>  conservative than TFRC and allows pipelining
> - allow TFRC to be implemented
> - do this with a single simple receiver, so all changes to allow more
>  complicated congestion control are done on the sender side.
>
> Much of this is the same as what is in the current draft, but there
> are some clarifications here.
>
> Each packet sent has a sequence number.  A sequence number is never
> reused (until wraparound), if a retransmission occurs it receives a
> new sequence number.  It is up to the sender to keep track of that.
>
> When the receiver sends an ACK, it ACKs a particular sequence number
> and sets bits in the mask for recently received packets that it has a
> record of.  A bit not being set in the mask is not a NACK.  Nodes
> SHOULD remember the previous 32 sequence numbers.  A receiver MUST NOT
> use any form of delayed ACKs.
>
> A receiver does not implement any form of a buffer or memory of which
> packets it has or has not received outside of the most recent 32.  It
> is purely the job for the sender to implement semi-reliability.
>
> The hop-by-hop link retransmission is intended to provide
> semi-reliable service.  (good-effort service.)  It does not, and
> cannot, provide total reliability. Among other problems, an overlay
> network may experience congestion or stability problems, so attempts
> to achieve perfect hop-by-hop reliability will not achieve the desired
> effect of perfect message reliability.
>
> Now details of the sender algorithms:
>
> stop-and-wait:
>
> Stop and wait allows only one fragment to be in transit per DTLS
> connection.  It has abysmal performance for WAN connections, but is
> typically acceptable for LAN performance (for low-bandwidth overlays).
>
> A peer SHOULD retransmit a fragment if it has not received an ACK for
> that fragment starting with an interval of RTO ("Retransmission
> TimeOut"), doubling after each retransmission. In each retransmission,
> the sequence number is incremented. The RTO is an estimate of the
> round-trip time (RTT), and is computed as described in RFC 2988
> [RFC2988], with two exceptions. First, the initial value for RTO
> SHOULD be calculated from the STUN connectivity checks used to
> establish the connection.  If that is not possible, the initial value
> for RTO SHOULD be configurable (rather than the 3 s recommended in RFC
> 2988) and SHOULD be greater than 500 ms.  In fixed-line access links,
> a value of 500 ms is RECOMMENDED. Second, the value of RTO SHOULD NOT
> be rounded up to the nearest second. Rather, a 1 ms accuracy SHOULD be
> maintained. As with TCP, the usage of Karn's algorithm is RECOMMENDED
> [TODO REF KARN87] which means that RTT estimates SHOULD NOT be
> computed from transactions that result in the retransmission of a
> request. The value for RTO is calculated separately for each DTLS
> session.
>
> Retransmissions continue until a response is received, or until a
> total of Rc requests have been sent or there has been a hard ICMP
> error [RFC1122]. The RECOMMENDED value of Rc is 5 and it MUST be at
> least 3.  The receiver knows a responses was received by receiving and
> ACK with a sequence number that indicates it is a response to one of
> the transmissions of this fragment. For example, assuming an RTO of
> 500 ms and Rc of 5, requests would be sent at times 0 ms, 500 ms, 1500
> ms, 3500 ms, and 7500 ms. If all retransmissions for a fragment fail,
> the DTLS connection SHOULD be closed.
>
> Once an ACK has been received for a fragment, the next fragment can be
> sent but the peer SHOULD ensure that there is at least 10 ms between
> sending any two fragments.
>
>
> AIMD:
>
> A sender can implement TCP or TCP-like algorithms provided that they
> meet the aggressiveness requirement.  The algorithm here is only the
> AIMD portion of TCP.  All other features are restricted to simplify
> the implementation, i.e. no slow start (initial window is 1), no fast
> retransmission, and no fast recovery.  An implementation may use one
> timer per fragment or a single timer for only the oldest outstanding
> fragment.
>
> AIMD extends stop and wait by allowing multiple fragments to be
> pending at the same time.  The sender allows w unacknowledged
> fragments to be outstanding at any given time.  w is initially set to
> one.  Every RTO interval that no retransmissions occur, w is increased
> by one.  When a loss occurs, w is halved.  After halving w, if there
> are more than w fragments for which an ACK is pending, no further
> retransmissions of the most recently initiated fragments are performed
> until they fit in the window w, at which point they begin the
> retransmission algorithm again.  w is held fixed for one RTO.  After that
> point, if additional retransmissions occur, it will be halved again,
> otherwise it may be incremented after an additional RTO without loss.
>
> If w drops to one and the one pending fragment is not ACKed by the
> other side after 5 requests are sent, the link is considered to have
> failed.  Otherwise, unACKed fragments are simply dropped.
>
> TFRC:
>
> TFRC (RFC5348) can be implemented in the Sender-Based Variant format
> with the same receiver algorithm.  This implementation requires the
> sender to maintain precise timestamps in ms of the transmission time
> of each sequence number as well as the segment sizes.  That, combined
> with the ACKs (and that the ACKs are not delayed) allows the sender to
> calculate the performance required by TFRC, including information
> calculated by the receiver in the conventional form of TFRC.
>
> TFRC is used for congestion control.  For reliability, an individual
> fragment is retransmitted up to twice at RTO intervals, pending the
> availability of room in the congestion window.  If it is not
> ACKed after another RTO following its last retransmission, it is dropped.
>
>
> * framing for stream-based links (TLS/TCP)
>
> A framing header is specified for TCP links.  The framing headers are
> used to calculate RTT and detect when the TCP link has failed.
>
>
> * queueing overlay messages
>
> For all link types, a peer MUST implement an appropriate queueing
> discipline for determining which messages are forwarded.  For all link
> types, the peer MUST queue at least 5 messages.  For overlay link
> algorithms that implement a congestion window, the peer SHOULD queue
> the greater of 5 messages or the current size of the congestion
> window.  For overlay link algorithms that rely on an underlying
> buffer, such as TCP, the peer SHOULD queue 5 messages beyond what fits
> in the socket buffer.  Tuning for performance of the overlay link
> SHOULD be accomplished by adjusting the size of the socket buffer
> appropriately (ref).  In no case should a message be queued longer than
> 500ms.
>
> A node MAY implement a separate queue for internally originated
> messages, but when operating in a congestion-limited mode, the node
> MUST NOT allow its internally originated traffic to dominate the link.
>
> * link failure detection
>
> Links are detected to have failed under these conditions:
>
> - periodic STUN keepalive fails (need to specify when these are used)
> - no new TCP or UDP framing acknowledgements are received for 10RTT
>
> (Note that ICE and ICE-TCP both require the use of keepalives.  A node
> MAY elect to substitute receiving framing acknowledgements for
> separate active STUN keepalives.)
>
> * end-to-end retransmission
>
> End-to-end retransmission will be accomplished using exponential
> backoff.  An implementation SHOULD track round-trip times of messages
> sent on the overlay and use set end-to-end RTO to 1.5 times the
> average RTT.  An implementation MAY track RTTs and set RTOs on a
> per-destination basis.  If an implementation does not track end-to-end
> RTT, then RTO SHOULD default to 3000ms.
>
> * removal of route-logging
>
> route log will be removed from the forwarding header
>
> * negotiating new overlay link protocols
>
> a protocol ID will be added to the ICE negotiation to specify what
> link protocol is being run between nodes, analogous to how media is
> negotiated in SDP.  To avoid offer/answer headaches, we will not
> import the concept of an offer-less attach, and there will be a
> mandatory protocol that must be included in each offer so there cannot
> be a failure to share a protocol.  (optionally include "better" base
> option in overlay config?)
> _______________________________________________
> P2PSIP mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/p2psip
>



-- 
Prof. Saverio Mascolo
Dipartimento di Elettrotecnica ed Elettronica
Politecnico di Bari
Via Orabona 4
70125 Bari
Italy
Tel. +39 080 5963621
Fax. +39 080 5963410
email:[email protected] <email%[email protected]>

http://c3lab.poliba.it


=================================
This message may contain confidential and/or legally privileged information.
 If you are not the intended recipient of the message, please destroy it.
Any unauthorized dissemination, distribution, or copying of the material in
this message, and any attachments to the message, is strictly forbidden.

_______________________________________________
P2PSIP mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/p2psip

Re: [P2PSIP] proposal for reload congestion control, fragmentation, and retransmission

Reply via email to