I saw this 350 ms delay oddity about a year ago during tests, but I have not been able to reproduce the problem. At the time I was convinced that it was caused by ACKs occasionally being lost, in fact by the "ack every other packet" algorithm.

Lately, however, we've run RX tests again, worried about the pthreaded stack's performance, which is significantly worse than that of the LWP one.

. There are a few more places in the protocol that need a "dpf" macro in order to make the RX trace useful. A lock (...), the current thread ID in the output, and microsecond resolution in rx_debugPrint are a must for any serious work.
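To make the thread-ID and microsecond points concrete, a trace macro could look roughly like this. The macro name "dpf" matches the mail; the helper and its exact format are illustrative, not the real rx_debugPrint:

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

/* Sketch of a richer trace macro: a microsecond timestamp and the
 * calling thread's ID on every line.  Purely illustrative; the real
 * rx_debugPrint would be extended along these lines. */
static int trace_prefix(char *buf, size_t len)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return snprintf(buf, len, "%ld.%06ld [%lu] ",
                    (long)tv.tv_sec, (long)tv.tv_usec,
                    (unsigned long)pthread_self());
}

#define dpf(fmt, ...) do {                                   \
        char _buf[64];                                       \
        trace_prefix(_buf, sizeof(_buf));                    \
        fprintf(stderr, "%s" fmt "\n", _buf, ##__VA_ARGS__); \
    } while (0)
```

With microsecond resolution, two ACK events a few hundred microseconds apart become distinguishable in the trace, which is exactly what chasing sub-millisecond lock delays needs.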

. In order to counter performance drops due to high latency one might be tempted to increase the window sizes; however, given the way the (single) send queue is organized, this causes repeated traversals (in order to recalculate the timeouts, for example) to start taking macroscopic amounts of time under locks. I worked on this a little, with so far the only result being more timeouts... ;-)
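A toy model of why this hurts: each recalculation walks the whole send queue while holding the call lock, so the cost per event is O(W) for a window of W packets. The structs and names below are illustrative, not the real RX ones:

```c
#include <pthread.h>
#include <stddef.h>

/* Toy model: a single send queue whose retransmit timeouts are
 * recalculated by walking the entire list under the call lock.
 * Doubling the window doubles the time spent holding the lock. */
struct packet {
    struct packet *next;
    long retrans_at;            /* retransmit deadline, arbitrary units */
};

static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;

/* Recompute every queued packet's timeout; returns packets touched. */
static int recalc_timeouts(struct packet *head, long now, long rtt)
{
    int n = 0;
    pthread_mutex_lock(&call_lock);
    for (struct packet *p = head; p; p = p->next) {
        p->retrans_at = now + rtt;   /* whole list rewritten under lock */
        n++;
    }
    pthread_mutex_unlock(&call_lock);
    return n;
}
```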

. The maximum window size is 255 (or 254...) due to the way the ACKs work.
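The limit falls out of the ACK wire format: the count of per-packet acknowledgement flags is carried in a single byte. A rough approximation of the layout (field names only loosely follow the real RX structure):

```c
/* Approximation of the RX ACK packet layout: the number of explicit
 * per-packet acknowledgements is a single byte, so one ACK can never
 * describe more than 255 outstanding packets -- hence the window-size
 * ceiling.  Illustrative, not the exact on-the-wire struct. */
struct ack_packet {
    unsigned int  first_packet;   /* sequence number of acks[0] */
    unsigned char n_acks;         /* 0..255: the hard limit */
    unsigned char acks[255];      /* one ACK/NACK flag per packet */
};
```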

. With bigger windows and a routed network, the 350 ms window for ACKs is actually low, and the price for retransmits is high. Here it makes sense to increase the timeout.

. Allocating new packets is done under a lock. As a result, incoming ACKs get processed late and contribute to keeping the queue size high. I introduced a "hint" in the call which causes the allocator to release and re-grab the lock between packets. That helped quite a lot.
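The "hint" idea can be sketched like this: when the caller signals it can tolerate the extra lock traffic, the allocator drops and re-acquires the free-packet lock between packets so the listener thread can get in and process ACKs. All names here are hypothetical:

```c
#include <pthread.h>
#include <stddef.h>

/* Sketch of a multi-packet allocator that optionally yields the lock
 * between packets (the "hint"), letting ACK processing interleave.
 * Illustrative only; not the actual RX allocator. */
static pthread_mutex_t freepkt_lock = PTHREAD_MUTEX_INITIALIZER;

struct packet { struct packet *next; };
static struct packet *free_list;

static struct packet *alloc_one_locked(void)
{
    struct packet *p = free_list;
    if (p)
        free_list = p->next;
    return p;
}

static int alloc_packets(struct packet **out, int want, int yield_hint)
{
    int got = 0;
    pthread_mutex_lock(&freepkt_lock);
    while (got < want) {
        struct packet *p = alloc_one_locked();
        if (!p)
            break;
        out[got++] = p;
        if (yield_hint && got < want) {
            /* Let a waiting thread (e.g. ACK processing) in
             * before grabbing the next packet. */
            pthread_mutex_unlock(&freepkt_lock);
            pthread_mutex_lock(&freepkt_lock);
        }
    }
    pthread_mutex_unlock(&freepkt_lock);
    return got;
}
```

The trade-off is more lock operations per allocation in exchange for shorter individual hold times, which is exactly what helps when the contending work (ACK processing) is latency-sensitive.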

. In the past free packets were queued instead of stacked, something which is counter-productive for the level-2 cache (for the headers). With the new allocation system this might be different; I haven't checked.
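The cache argument is the classic LIFO-vs-FIFO free-list one: a stack hands back the most recently freed packet, whose header is likely still warm in L2, while a queue hands out the coldest one. A minimal sketch (illustrative, not the RX code):

```c
#include <stddef.h>

/* LIFO free list: the packet just freed is the next one handed out,
 * so its header is likely still resident in the L2 cache.  A FIFO
 * queue would return the least recently touched (coldest) packet. */
struct packet { struct packet *next; };

static struct packet *free_stack;

static void pkt_free(struct packet *p)        /* push: most recent on top */
{
    p->next = free_stack;
    free_stack = p;
}

static struct packet *pkt_alloc(void)         /* pop: reuse the warm packet */
{
    struct packet *p = free_stack;
    if (p)
        free_stack = p->next;
    return p;
}
```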

. I'm currently trying to understand another puzzling case of ACKs being received but processed only about a millisecond later. Probably yet another locking problem.

I manage to fill the GigE interface at about 100-110 MB/s (megabytes) when the machines are on the same switch, versus the 50-60 you see when crossing a router. This is admittedly my own RX application, not rxperf.

Performance, however, drops dramatically once the sending end has to do something in addition, such as reading a disk. No double-buffering trick helps: if you're slow in producing data the send queue is empty, whereas if you're fast it's no better either, with sweet spots depending on window sizes of 16 / 32 / 48 packets. Again, I suspect the implementation of a single send/resend queue degrades once it fills up.
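For reference, the double-buffering scheme being dismissed here looks roughly like this: the reader fills one buffer while the sender drains the other, then the two swap roles. The structure and names are hypothetical; the mail's point is that no such arrangement saves you once the single send/resend queue fills:

```c
/* Minimal double-buffering sketch: producer and sender each own one
 * of two buffers and trade them on swap.  If filling is slower than
 * sending, the send queue runs dry anyway; if faster, it backs up.
 * Illustrative only. */
enum { DBUF_SZ = 4096 };

struct dbuf {
    char buf[2][DBUF_SZ];
    int  fill;                  /* index currently being produced into */
};

static void dbuf_swap(struct dbuf *d)
{
    d->fill ^= 1;               /* producer and sender trade buffers */
}

static char *dbuf_fill_slot(struct dbuf *d) { return d->buf[d->fill]; }
static char *dbuf_send_slot(struct dbuf *d) { return d->buf[d->fill ^ 1]; }
```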

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Rainer Toebbicke
European Laboratory for Particle Physics(CERN) - Geneva, Switzerland
Phone: +41 22 767 8985       Fax: +41 22 767 7155
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
