As a result of these problems Rx was periodically not sending the
anticipated acknowledgment packet which in turn resulted in a timeout
and retransmission. The Rx stack was also frequently running out of
free packets and was forced to block on a global lock while additional
packet structures were allocated from the process's memory pool.
The end result was a performance improvement of more than
9.5% when comparing Rx in 1.4.8 against 1.4.7.
Rough tests show that the 1.4.8 Rx stack is capable of 124 MBytes/second
over a 10Gbit link. There is still a long way to go to fill a 10Gbit
pipe but it is a start. Now we are only off by one order of magnitude.
Having in the past repeatedly dug into the RX code (without spotting
those problems!) I am of course very interested and will try the new
code as soon as possible!
Just a few findings on RX from my previous (vain) attempts to make it
"lightning fast" - perhaps they trigger ideas for whoever is still
working on it or corrections from those who know better:
1. as latency grows when crossing routers or even public networks, the
default window of 32 packets is too small. On the other hand, the
handling of the transmission queue grows with n**2, and even fast
processors are quickly overwhelmed. Here's where "oprofile" is a
valuable tool. Some of this can be reduced with queue hints, wisely
posting retransmit events and trying to avoid scanning the whole queue
in several places;
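The n**2 behaviour in point 1 can be illustrated with a toy model (my
own sketch, not the actual Rx code): if every incoming ACK walks the
transmit queue from the head, the total work per window grows
quadratically, whereas a queue hint that remembers where the last scan
stopped keeps it roughly linear.

```python
# Toy model of transmit-queue scanning cost, NOT actual Rx code.

def scans_full(window):
    """Each of `window` ACKs rescans the whole queue from the head."""
    ops = 0
    for acked in range(1, window + 1):
        ops += window  # walk every entry looking for the acked packet
    return ops

def scans_hinted(window):
    """A queue hint resumes where the previous scan stopped."""
    return window  # each entry is visited roughly once in total

for w in (32, 255):
    print(w, scans_full(w), scans_hinted(w))
```

At the default window of 32 the difference is modest (1024 vs 32
visits), but it grows brutally as the window is widened.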
2. jumbograms are a pain: years ago we had a research network dropping
fragmented packets and spent weeks pinning that down. Currently we
suspect another one. Firewalls choke on them. They also increase
complexity for access lists in routers. And of course the probability
increases of having to retransmit the whole jumbogram because one
fragment got lost. What makes me frown is that it is apparently
faster for the kernels to split and reassemble jumbograms on the fly
than for Rx to do it, even though Rx has much more knowledge about the
state;
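A quick back-of-the-envelope calculation (my own numbers, not from any
measurement) shows why jumbogram retransmits hurt: if any single
fragment is dropped the whole jumbogram goes again, so the effective
loss rate scales with the fragment count.

```python
# Illustrative arithmetic: the whole jumbogram must be retransmitted
# when any one of its fragments is lost.

def jumbogram_loss(p, fragments):
    """Probability that at least one of `fragments` fragments is lost,
    assuming independent per-fragment loss probability p."""
    return 1.0 - (1.0 - p) ** fragments

p = 0.001  # assumed 0.1% per-fragment loss, purely for illustration
for k in (1, 4, 6):
    print(k, round(jumbogram_loss(p, k), 6))
```

Even at a modest per-packet loss rate, a six-fragment jumbogram is
roughly six times as likely to need a (six-times-larger) retransmit.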
3. the path for handling an ACK packet is very long, I measured on the
order of 10 microseconds on average on a modern processor. At over
100 MB/s you'd be handling ~50000 ACKs per second in a non-jumbogram
configuration and have hardly any time left to send out new packets. A
lot is spent on waiting for the call-lock: even when that one is
released quickly (which it isn't in the standard implementation, as
the code leisurely walks around with it for extended periods, but I
experimented with a "release" flag), the detour through the scheduler
slows things down dramatically. The lock structure should probably be
revisited to make contention between ack recv & transmit threads less
likely;
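The arithmetic behind point 3 is easy to check; the ~10 microsecond
per-ACK cost is the figure measured above, and the 50000 ACKs/second
is the rate cited for >100 MB/s without jumbograms:

```python
# CPU budget check for ACK handling, using the figures from the text.

ack_cost_s = 10e-6      # ~10 microseconds spent per ACK
acks_per_sec = 50_000   # non-jumbogram ACK rate at over 100 MB/s
busy = ack_cost_s * acks_per_sec
print(f"fraction of one core spent on ACKs: {busy:.0%}")
```

Half of one core gone just to process acknowledgments, before a single
new packet has been queued for transmission.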
4. slow start is implemented in a state-of-the-art fashion; fast
recovery, however, looks odd to me (actually: "nonexistent", but I may
be fooled by some jumbogram smoke). When it comes to congestion
avoidance, a lot of the research that went into TCP in the last ten
years is obviously missing. I started experimenting with CUBIC in the
hope that it helps reduce retransmits and keep a constant flow; let's
see;
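For reference, the CUBIC window-growth function from RFC 8312 looks
like this; the constants C = 0.4 and beta = 0.7 are the RFC defaults,
nothing Rx-specific:

```python
# CUBIC congestion window growth after a loss event (RFC 8312).
C = 0.4      # scaling constant
BETA = 0.7   # multiplicative-decrease factor

def cubic_window(t, w_max):
    """Congestion window t seconds after a loss, where w_max was the
    window size when the loss occurred."""
    k = (w_max * (1 - BETA) / C) ** (1 / 3)  # time to climb back to w_max
    return C * (t - k) ** 3 + w_max

w_max = 100.0  # window (in packets) before the loss
# the window dips, plateaus near w_max, then probes cautiously beyond:
for t in (0.0, 2.0, 4.5, 6.0):
    print(t, round(cubic_window(t, w_max), 1))
```

The attractive property for long-haul links is the long plateau around
the previous maximum, which should translate into a steadier flow and
fewer retransmits than a sawtooth.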
5. earlier this year we mentioned handling of new calls, which is
again a quadratic problem due to the mixture of service classes. This
makes it impractical to allow for thousands of waiting calls, creating
a problem on a cluster with thousands of nodes.
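One obvious (if naive) way around the quadratic new-call handling
would be a separate FIFO of waiting calls per service class, so that
dispatching the next call for a freed thread is O(1) instead of a scan
over all waiting calls. A sketch, purely illustrative and not how Rx
is actually structured:

```python
# Illustrative per-service-class call queues; names are made up.
from collections import deque

waiting = {}  # service class -> FIFO of waiting call ids

def enqueue(service, call_id):
    """A new call arrives for the given service class."""
    waiting.setdefault(service, deque()).append(call_id)

def dispatch(service):
    """A worker thread for `service` frees up: pop the oldest call."""
    q = waiting.get(service)
    return q.popleft() if q else None

enqueue("fileserver", 1)
enqueue("fileserver", 2)
enqueue("volserver", 3)
print(dispatch("fileserver"))  # -> 1
```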
With those observations... does rx-over-tcp look like a solution? On
the packet-transmission side probably, but the encapsulation very
likely still demands significant processing power. And running a
server with 10000 or 20000 TCP connections does not sound
straightforward either.
Voilà... my 0.02 €. Sorry for being verbose, I couldn't resist.
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics (CERN) - Geneva, Switzerland
Phone: +41 22 767 8985 Fax: +41 22 767 7155
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info