As a result of these problems, Rx was periodically not sending the
anticipated acknowledgment packet, which in turn resulted in a timeout
and retransmission.  The Rx stack was also frequently finding itself
out of free packets and was forced to block on a global lock while
additional packet structures were allocated from the process's memory
pool.  The end result was a performance improvement of greater than
9.5% when comparing the Rx performance of 1.4.8 against 1.4.7.
Rough tests show that the 1.4.8 Rx stack is capable of 124 MBytes/second
over a 10Gbit link.  There is still a long way to go to fill a 10Gbit
pipe, but it is a start.  Now we are only off by one order of magnitude.


Having repeatedly dug into the RX code in the past (without spotting those problems!), I am of course very interested and will try the new code as soon as possible!

Just a few findings on RX from my previous (vain) attempts to make it "lightning fast" - perhaps they will trigger ideas for whoever is still working on it, or corrections from those who know better:

1. as latency grows when crossing routers or even public networks, the default window of 32 packets is too small. On the other hand, the cost of handling the transmission queue grows with n**2, so enlarging the window quickly overwhelms even fast processors. Here "oprofile" is a valuable tool. Some of this can be reduced with queue hints, by posting retransmit events wisely, and by avoiding scans of the whole queue in several places;

2. jumbograms are a pain: years ago we had a research network dropping fragmented packets and spent weeks pinning that down, and currently we suspect another one. Firewalls choke on them. They also complicate access lists in routers. And of course the probability of having to retransmit the whole jumbogram because one fragment got lost increases. What makes me frown is that it is apparently faster to let the kernels split and reassemble jumbograms on the fly than to have RX do it, even though RX has much more knowledge about the state;

3. the path for handling an ACK packet is very long: I measured on the order of 10 microseconds on average on a modern processor. At over 100 MB/s you'd be handling ~50000 ACKs per second in a non-jumbogram configuration and have hardly any time left to send out new packets. A lot of that time is spent waiting for the call lock: even when it is released quickly (which it isn't in the standard implementation, as the code leisurely walks around with it for extended periods, though I experimented with a "release" flag), the detour through the scheduler slows things down dramatically. The lock structure should probably be revisited to make contention between the ACK-receive and transmit threads less likely;

4. slow start is implemented in a state-of-the-art fashion; fast recovery, however, looks odd to me (actually: "nonexistent", but I may be fooled by some jumbogram smoke). When it comes to congestion avoidance, a lot of the research that went into TCP in the last ten years is obviously missing. I have started experimenting with CUBIC in the hope that it helps to reduce retransmits and keep a constant flow; let's see;

5. earlier this year we mentioned the handling of new calls, which is again a quadratic problem due to the mixture of service classes. This makes it impractical to allow for thousands of waiting calls, which creates a problem on a cluster with thousands of nodes.

With those observations in mind, does rx-over-tcp look like a solution? On the packet-transmission side probably, but the encapsulation very likely still demands significant processing power. And running a server with 10000 or 20000 TCP connections does not sound trivial either.

Voilà... my 0.02 €. Sorry for being verbose; I couldn't resist.


--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
