On 8/7/2019 9:35 PM, xg...@reliancememory.com wrote:
> Hello,
> 
> Can someone kindly explain again the possible reasons why Rx is so painfully
> slow for a high latency (~230ms) link? 

As Simon Wilkinson said on slide 5 of "RX Performance"

  https://indico.desy.de/indico/event/4756/session/2/contribution/22

  "There's only two things wrong with RX
    * The protocol
    * The implementation"

This presentation was given at DESY on 5 Oct 2011.  Although there have
been some improvements in the OpenAFS RX implementation since then, the
fundamental issues described in that presentation still remain.

To explain slides 3 and 4: prior to the 1.5.53 release the following
commit was merged, which increased the default maximum window size from
32 packets to 64 packets.

  commit 3feee9278bc8d0a22630508f3aca10835bf52866
  Date:   Thu May 8 22:24:52 2008 +0000

    rx-retain-windowing-per-peer-20080508

    we learned about the peer in a previous connection... retain the
    information and keep using it. widen the available window.
    makes rx perform better over high latency wans. needs to be present
    in both sides for maximal effect.

Then prior to 1.5.66 this commit raised the maximum window size to 128

  commit 310cec9933d1ff3a74bcbe716dba5ade9cc28d15
  Date:   Tue Sep 29 05:34:30 2009 -0400

    rx window size increase

    window size was previously pushed to 64; push to 128.

and then, prior to 1.5.78 (just before the 1.6 release), this commit
reduced the maximum again:

  commit a99e616d445d8b713934194ded2e23fe20777f9a
  Date:   Thu Sep 23 17:41:47 2010 +0100

    rx: Big windows make us sad

    The commit which took our Window size to 128 caused rxperf to run
    40 times slower than before. All of the recent rx improvements have
    reduced this to being around 2x slower than before, but we're still
    not ready for large window sizes.

    As 1.6 is nearing release, reset back to the old, fast, window size
    of 32. We can revist this as further performance improvements and
    restructuring happen on master.

After 1.6 AuriStor Inc. (then Your File System Inc.) continued to work
on reducing the overhead of RX packet processing.  Some of the results
were presented in Simon Wilkinson's 16 October 2012 talk entitled "AFS
Performance" slides 25 to 30

  http://conferences.inf.ed.ac.uk/eakc2012/

The performance of OpenAFS 1.8 RX is roughly the same as the OpenAFS
master performance from slide 28.  The Experimental RX numbers were from
the AuriStor RX stack at the time, which was not contributed to OpenAFS.

Since 2012 AuriStor has addressed many of the issues raised in
the "RX Performance" presentation:

 0. Per-packet processing expense
 1. Bogus RTT calculations
 2. Bogus RTO implementation
 3. Lack of Congestion avoidance
 4. Incorrect window estimation when retransmitting
 5. Incorrect window handling during loss recovery
 6. Lock contention

The current AuriStor RX state machine implements SACK based loss
recovery as documented in RFC6675, with elements of New Reno from
RFC6582, on top of TCP-style congestion control as documented in
RFC5681. The new RX also implements RFC2861 style congestion window
validation.
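
For illustration only, here is a minimal sketch of the RFC5681-style
behaviour referred to above: slow start, congestion avoidance, and a
multiplicative decrease when loss recovery decides a packet is gone.
It is not the AuriStor or OpenAFS RX code; the names and the initial
values are made up for the example.

  #include <stdio.h>

  struct cc_state {
      int cwnd;        /* congestion window, in packets */
      int ssthresh;    /* slow start threshold, in packets */
  };

  /* Called roughly once per fully acknowledged window (about once per RTT). */
  static void on_ack_window(struct cc_state *cc)
  {
      if (cc->cwnd < cc->ssthresh)
          cc->cwnd *= 2;              /* slow start: exponential growth */
      else
          cc->cwnd += 1;              /* congestion avoidance: linear growth */
  }

  /* Called when SACK/NewReno-style recovery concludes a packet was lost. */
  static void on_loss(struct cc_state *cc)
  {
      cc->ssthresh = cc->cwnd / 2 > 2 ? cc->cwnd / 2 : 2;
      cc->cwnd = cc->ssthresh;        /* multiplicative decrease */
  }

  int main(void)
  {
      struct cc_state cc = { 4, 64 }; /* made-up initial values */
      int rtt;

      for (rtt = 0; rtt < 8; rtt++)
          on_ack_window(&cc);
      on_loss(&cc);
      printf("cwnd=%d ssthresh=%d\n", cc.cwnd, cc.ssthresh);
      return 0;
  }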

When sending data the RX peer implementing these changes will be more
likely to sustain the maximum available throughput while at the same
time improving fairness towards competing network data flows. The
improved estimation of available pipe capacity permits an increase in
the default maximum window size from 60 packets (84.6 KB) to 128 packets
(180.5 KB). The larger window size increases the per call theoretical
maximum throughput on a 1ms RTT link from 693 mbit/sec to 1478 mbit/sec
and on a 30ms RTT link from 23.1 mbit/sec to 49.39 mbit/sec.
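
Those figures follow directly from the window size and the round trip
time; the roughly 1444 byte payload per packet is inferred from the
128 packet / 180.5 KB figure above, so treat it as an assumption.
A quick check:

  #include <stdio.h>

  /* At most one full window can be in flight per round trip. */
  static double mbit_per_sec(int window_pkts, double payload_bytes,
                             double rtt_sec)
  {
      return window_pkts * payload_bytes * 8.0 / rtt_sec / 1e6;
  }

  int main(void)
  {
      double payload = 1444.0;   /* bytes per packet, inferred from above */

      printf(" 60 pkts,  1ms RTT: %7.1f Mbit/sec\n", mbit_per_sec(60, payload, 0.001));
      printf("128 pkts,  1ms RTT: %7.1f Mbit/sec\n", mbit_per_sec(128, payload, 0.001));
      printf(" 60 pkts, 30ms RTT: %7.1f Mbit/sec\n", mbit_per_sec(60, payload, 0.030));
      printf("128 pkts, 30ms RTT: %7.1f Mbit/sec\n", mbit_per_sec(128, payload, 0.030));
      return 0;
  }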

AuriStor RX also includes experimental support for RX windows larger
than 255 packets (360KB). This release extends the RX flow control state
machine to support windows larger than the Selective Acknowledgment
table. The new maximum of 65535 packets (90MB) could theoretically fill
a 100 Gbit/second pipe provided that the packet allocator and packet
queue management strategies could keep up.  Hint: at present, they don't.

To saturate a 60 Mbit/sec link with 230ms latency with rxmaxmtu set to
1344 requires a window size of approximately 1284 packets.
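
That figure is just the bandwidth-delay product expressed in packets.
A quick check, using the 1344 byte rxmaxmtu from your test:

  #include <math.h>
  #include <stdio.h>

  /* Packets that must be in flight to keep a link of the given rate
   * busy at the given round trip time. */
  static double window_packets(double link_bits_per_sec, double rtt_sec,
                               double payload_bytes)
  {
      double bdp_bytes = link_bits_per_sec * rtt_sec / 8.0;
      return ceil(bdp_bytes / payload_bytes);
  }

  int main(void)
  {
      /* 60 Mbit/sec at 230ms with 1344 byte packets -> about 1284 packets. */
      printf("%.0f packets\n", window_packets(60e6, 0.230, 1344.0));
      return 0;
  }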

> From a user perspective, I wonder if there is any *quick Rx code hacking*
> that could help reduce the throughput gap of (iperf2 = 30Mb/s vs rxperf =
> 800Kb/s) for the following specific case. 

Probably not.  AuriStor's RX is a significant re-implementation of the
protocol, with one eye focused on backward compatibility and the other
on the future.

> We are considering the possibility of including two hosts ~230ms RTT apart
> as server and client. I used iperf2 and rxperf to test throughput between
> the two. There is no other connection competing with the test. So this is
> different from a low-latency, thread or udp buffer exhaustion scenario. 
> 
> iperf2's UDP test shows a bandwidth of ~30Mb/s without packet loss, though
> some of them have been re-ordered at the receiver side. Below 5 Mb/s, the
> receiver sees no packet re-ordering.  Above 30 Mb/s, packet loss is seen by
> the receiver. Test result is pretty consistent at multiple time points
> within 24 hours. UDP buffer size used by iperf is 208 KB. Write length is
> set at 1300 (-l 1300) which is below the path MTU. 

Out of order packet delivery and packet loss have significant
performance impacts on OpenAFS RX.

> Interestingly, a quick skim through the iperf2 source code suggests that an
> iperf sender does not wait for the receiver's ack. It simply keeps
> write(mSettins->mSock, mBuf, mSettings->mBufLen) and timing it to extract
> the numerical value for the throughput. It only checks, in the end, to see
> if the receiver complains about packet loss. 

This is because iperf2 does not attempt any flow control, error
recovery, or fairness.  RX calls are sequenced data flows that are
modeled on the same principles as TCP.

> rxperf, on the other hand, only gets ~800 Kb/s. What makes it worse is that
> it does not seem to be dependent on the window size (-W 32~255), or udpsize
> (-u default~512*1024). I tried to re-compile rxperf that has #define
> RXPERF_BUFSIZE (1024 * 1024 * 64) instead of the original (512 * 1024). I
> did not see a throughput improvement from going above -u 512K. Occasionally
> some packets are re-transmitted. If I reduce -W or -u to very small values,
> I see some penalty. 

Changing RXPERF_BUFSIZE to 64MB is not going to help when the total
number of bytes being sent per call is 1000KB.  Given how little data
is being sent per call, and the fact that each RX call begins in "slow
start", I suspect that your test isn't growing the window size to 32
packets, let alone 255.
>
>[snip]
>
> The theory goes if I have a 32-packet recv/send window (Ack Count) with 1344
> bytes of packet size and RTT=230ms, I should expect a theoretical upper
> bound of 32 x 8 x 1344 / 0.23 / 1000000 =  1.5 Mb/s. If the AFS-implemented
> Rx windows size (32) is really the limiting factor of the throughput, then
> the throughput should increase when I increase the window size (-w) above 32
> and configure a sufficiently big kernel socket buffer size.

The fact that OpenAFS RX requires large kernel socket buffers to get
reasonable performance is a bad sign.  It means that for OpenAFS RX it
is better to deliver packets after long queuing delays than to drop
them and permit timely congestion detection.

> I did not see either of the predictions by the theory above. I wonder if
> some light could be shed on:
> 
> 1. What else may be the limiting factor in my case

Not enough data is being sent per call.  iperf2 is sending 30MB while
rxperf is sending 1000KB.  It's not an equivalent comparison.

> 2. If there is a quick way to increase recv/send window from 32 to 255 in Rx
> code without breaking other parts of AFS. 

As shown in the commits specified above, it doesn't take much to
increase the default maximum window size.  However, performance is
unlikely to increase unless the root causes are addressed.

> 3. If there is any quick (maybe dirty) way to leverage the iperf2
> observation, relax the wait for ack as long as the received packets are in
> order and not lost (that is, get me up to 5Mb/s...)

Not without further violating the TCP fairness principle.

> Thank you in advance.
> ==========================
> Ximeng (Simon) Guan, Ph.D.
> Director of Device Technology
> Reliance Memory
> ==========================

Jeffrey Altman
