Gerrit Renker
Wed, 29 Nov 2006 07:15:18 -0800
As promised, I sat down and wrote up conditions and current problems with
some of the calculations in the CCID 3 module. Changes are required only for
the sender.
In a nutshell, the strategy solves
* the problem that dividing-by-seconds is not handled correctly
* defines a sufficient resolution for tracking sending rates
* gets rid of the overflow problem
The cost is that
* at least 1 and ideally 3 structure elements of tfrc_tx_info make it to __u64
===> Arnaldo are you OK with this?
but this has the advantage of tackling both overflow and the missing resolution
of
lower send rates.
The detailed derivation is below, I am putting this into patches at the moment.
It
would be good to have this codified before Ian starts testing.
Reasonable assumptions for and a corresponding scheme for the
calculation of CCID 3 rates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Problem:
========
The CCID 3 specification uses bytes, seconds, and bytes/second as units.
Additionally, the
timestamp options (Timestamp, Timestamp Echo, and Elapsed Time) require a
representation of time
in units of 10 * microseconds. The kernel internally uses microseconds to
represent time.
This results in many divisions of the type `s * 1E6 / t' where s is a packet
size in bytes and
t is a time unit in microseconds.
(a) Problem 1: Overflow
-------------------
Since currently receive/sending rates are stored as u32, this leads to
overflow.
There are several ways of tackling this:
* the patch by Nishida-san implements a subset of rational
arithmetic where
all numbers are expressed as a
struct fix-point { long long num;
long long denom;
};
The problem with this is that it has to be carried through for
each single
elementary operation (+,-,*,==,>,<,/,%), which is messy.
* the current Linux implementation uses a function
usecs_div(u32 a, u32 b) = a * 1E6 / b
which computes a divisor in such a way that u32 overflow is
avoided for a.
While this goes a some way towards avoiding overflow, it comes
at a price:
since a piecewise-linear function is used, a constant is
substituted
for many different values. The result is a substantial
computation error
- this can be as high as 100% (I have a proof and test
program).
(b) Problem 2: What to do with non-microsecond time units
-----------------------------------------------------
There are a few computations and constants in RFC 3448 and RFC 4342 which
are not based
on microsecond time units. The following table lists these
s/t_mbi 4.3 bytes/second 0 if s < 64
s/(2*t_mbi) 4.4 bytes/second 0 if s < 32
X_recv = X_calc/4 4.4 bytes/second 0 if X_calc < 4
bytes/second
X/2 4.4 bytes second
These expressions are currently not handled correctly - this is a BUG in
CCID 3.
Solution:
=========
A brute-force idea is to scaled everything by 1E6 and use bytes/microsecond
(or 1e6 bytes/second)
as basis for sending rates. But this introduces overflow for many
calculations - even on 64 bit.
i) Minimal granularity for sending/receiving rate
--------------------------------------------------
A high granularity is not necessary:
* [RFC 3448, 4.3] specifies the minimum sending rate as 1 packet in 64
seconds
* the packet size `s' is never smaller than 1 byte
and never larger than 2^32 - 1 (4 GByte), since
this is
(a) the largest size of an IPv6 jumbogram (RFC 2765) and
(b) the type limit of the current representation
* hence,
1/64 <= X <= 2^32 - 1
It is therefore fully sufficient to store X (X_recv, X_calc) in units of
64bytes/second.
Internet speeds are growing fast, having a resolution of speeds higher
than 1/64 is thus
a truly wasted effort.
ii) Implications for storage size
-----------------------------------
Currently X, X_calc, X_recv are all stored as u32. Due to the above this
is no longer
sufficient: we are now at least 6 bits short.
* the receiver does not necessarily need a modification since it sends
X_recv as u32 anyway
* at the sender it is however better to scale all X quantities. It is
like enlarging a
picture - everything needs to keep its proportions.
* hence the changes will look like:
struct tfrc_tx_info {
__u64 tfrctx_x; /* this is really
required */
__u64 tfrctx_x_recv; /* would be better to
have as 64-bit */
__u64 tfrctx_x_calc; /* ditto
*/
/* ... */
};
* the advantages are that
(a) usecs_div is now much simpler
(b) it would mean a headache and a mess (but it is possible) to
keep the sending
rate in different sizes (scaled/unscaled); but with the
above we don't need to
(c) u64 is sufficient even when scaling both by 64 and by 1E6
iii) Implications for algorithm
-------------------------------
This is a clerical job but needs to be done systematically: at the
sender, all operations
and comparisons have to be changed with regard to the scaled format. This
is now done below.
For the receiver, no changes are required.
* Initialisation [RFC 3448, 4.2]
Set X to 64 * 1 packet / second (i.e. X = s << 6)
* Sender behaviour when feedback is received [RFC 3448, 4.3 - step (4)]
==> When X_recv is computed set
X_recv = hctx->packet_options_x_recv << 6;
If (p > 0) {
X_calc = calcX(R, s, p); /* we leave that as it is for
the moment */
X_calc << 6; /* scale by 64
*/
X = max(min(X_calc, 2*X_recv), /* X_recv is also scaled */
(s << 6) / t_mbi) ); /* now it is correct */
} Else If(t_now - tld >= R) {
X = max(2 * min(X, X_recv), scaled_div(s << 6, R); /* now
also works */
tld = t_now
}
Where the analogue of usecs_div now is renamed into:
u64 scaled_div(u64 size, u32 time) { return (size * 1000000) /
time; }
==> step (5): the nofeedback timer now expires in
max(4 * R, 2 * scaled_div(s, X >> 6)) /* to obtain
microseconds */
This use is why I would like to suggest a renaming into scaled_div
- we are not
dividing by microseconds here.
* Expiration of nofeedback timer [RFC 3448, 4.4]
(a) If the sender already has got some feedback:
If (X_calc > 2 * X_recv) /* comparison
is in proportion */
X_recv = max(X_recv/2, (s << 6)/(2*t_mbi)); /* now works as
expected */
Else
X_recv = X_calc/4
==> expire the nofeedback timer after
max(4 * R, 2 * scaled_div(s, X >> 6)) /* to obtain
microseconds */
(b) If the sender does not yet have any feedback then
X = max(X/2, (s << 6)/t_mbi) /* now also
works as expected */
==> expire the nofeedback timer after 2 seconds
* Preventing oscillations [RFC 3448, 4.5]
This is currently not supported, but the solution is
forward-compatible, since the scaled
X can be kept as it is.
* Scheduling of Packet Transmissions [RFC 3448, 4.6]
t_ipi = scaled_div(s, X >> 6) /* yes,
microseconds */
* Larger initial windows [RFC 4342, sec. 5]
w_init = min(4 * s, max(2 * s, 4380))
X = scaled_div(w_init << 6, R)
That's all !
-
To unsubscribe from this list: send the line "unsubscribe dccp" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html