Since 20 July 2021, AuriStor has presented on the RX Extended SACK Protocol at the Fall 2021 HEPiX
https://indico.cern.ch/event/1078853/contributions/4583101/

and privately received positive feedback on the implementation from
three parties familiar with the RX protocol.

AuriStor has also extended its implementation of RX Extended SACK to
three extra SACK tables for a total of 8192 packets (approximately 11MB).

No changes to the proposed protocol extension were required subsequent
to the 20 July 2021 update to https://gerrit.openafs.org/#/c/14693.

AuriStor intends to ship these extensions to end users next month and
will offer to deliver an updated version of the Fall 2021 HEPiX
presentation at the June 2022 AFS Technology Workshop.

Jeffrey Altman

On 7/20/2021 2:14 AM, Jeffrey E Altman (jalt...@auristor.com) wrote:
> Throughout the history of AFS there has been recognition that growing
> the Rx window size is necessary to increase the throughput on high
> latency or fat pipes where the meaning of "high-latency" and "fat" have
> changed over time as networks have become faster. The maximum window
> sizes were increased in both IBM AFS 3.4 and 3.5 resulting in the
> current default OpenAFS Rx window size of 32 packets (44KB). Prior to
> the release of OpenAFS 1.6, there were efforts to grow the default Rx
> window size to 64 packets (88KB) in May 2008 and then to 128 packets
> (176KB) in Sept 2009 with the expectation that there would be an
> increase in throughput. These changes were reverted in Sept 2010 after
> the late Andrei Maslennikov presented his findings in Pilsen that
> OpenAFS 1.5.77 was 50-60% slower than 1.4.12.
>
> At DESY in 2011 Simon Wilkinson presented his findings and the
> improvements that were subsequently made to OpenAFS Rx to slightly
> improve the situation. Simon said at the time, "There's only two things
> wrong with RX: the protocol and the implementation". To sustain a
> 10 Gbit/second flow Rx needs to consistently process 175,000 DATA
> packets/second as well as the matching ACK packets.
> That requires not only highly efficient packet processing but also
> the ability to maintain a full network pipe instead of stalling each
> time the DATA sender has filled the peer's advertised receive window.
>
> Over the last decade AuriStor has continued to invest in its Rx
> implementation in order to reduce the costs associated with DATA and ACK
> packet processing, more effectively measure the pipe's congestion
> window, more efficiently recover from packet loss, and improve
> fairness. These efforts have paid off in that AuriStor has been able
> to increase the default window size to 60 packets (82KB) in 2014, 128
> packets (176KB) in 2018, and 255 packets (351KB) in 2021.
>
> One of the reasons that filesystems such as Lustre and GPFS can achieve
> high throughput is that they support TCP window sizes of 8MB or
> larger. In order for AFS to match their performance Rx needs to
> support window sizes on the order of 6000 packets. The ACK packet's
> receiveWindow field has ample room to advertise larger window sizes as
> it is an unsigned 32-bit integer. In 2018 AuriStor removed the
> restriction that the maximum window size be limited by the number of
> packets that can be represented in the ACK packet's Selective
> Acknowledgement (SACK) table. There is TCP research that describes how
> to perform congestion avoidance when the SACK provides limited
> visibility into the state of the in-flight packets. However, it is
> always preferred to have access to SACK data for all of the in-flight
> packets.
>
> AuriStor is therefore proposing a backward compatible protocol extension
> which will permit incrementally growing the ACK packet's SACK table and
> address two other design weaknesses in the ACK packet: the inconsistent
> use of the 'previousPacket' field which makes it unusable and the lack
> of a count for the number of ACK trailer fields.
>
> There are three commits in OpenAFS Gerrit.
>
> "rx: compare RX_ACK_TYPE_ACK as a bit-field"
> https://gerrit.openafs.org/#/c/14465/ is a code change that ensures that
> OpenAFS Rx will only examine Bit-0 of each SACK table element. This
> permits Bit-1 through Bit-7 of each SACK element to be defined for
> future use when the rx_maxWindow is increased above 255 packets.
> AuriStor Rx already implements this behavior.
>
> "doc: rx-spec Update for accuracy with current Rx implementations"
> https://gerrit.openafs.org/#/c/14692/2 is an update to Nickolai
> Zeldovich's Rx Specification. I hope it improves the description of the
> protocol, corrects a number of misconceptions, and explains how it
> should be used. The Historical Implementation Notes section is
> particularly important in the context of ACK packet processing and
> possible extensions.
>
> "doc: rx-spec Document the Extended SACK Table protocol extension"
> https://gerrit.openafs.org/#/c/14693/2 describes the proposed
> EXTENDED-SACK ACK packet protocol extension which defines ACK packet
> Flags Bit-3 as EXTENDED-SACK when set in an ACK packet; Bit-3 currently
> only has meaning for DATA packets (MORE-PACKETS). When the
> EXTENDED-SACK flag is set the following is true:
>
> * The previousPacket field must be the largest DATA packet sequence
>   number accepted by the peer. This allows
>   (previousPacket - firstPacket + 1) to represent the number of DATA
>   packets that should be represented in SACK tables.
>
> * The SACK table can grow up to 256 octets instead of 255 octets by
>   leveraging one of the three unused octets between the SACK and the
>   first trailer.
>
> * The SACK table can represent the ACK/NACK state for up to 2048 DATA
>   packets using horizontal striping.
>
> * The second unused octet between the SACK and the first trailer is
>   used for a count of the number of unsigned 32-bit trailer fields.
>   This will permit future extensibility. The current value for this
>   field is 4.
>
> * The third unused octet is a count of the number of additional SACK
>   tables which are appended after the final trailer field. Each SACK
>   table is variable length and can grow up to 256 octets representing
>   up to 2048 DATA packets.
>
> With these changes up to 2048 DATA packets can be represented by an ACK
> packet that fits within the minimum IPv4 MTU size and up to 8192 DATA
> packets can be represented by an ACK packet that fits within the minimum
> IPv6 MTU size. Larger window sizes can be represented with larger ACK
> packets but 8192 DATA packets is 11MB which should be more than
> sufficient for now.
>
> Even though it is unlikely that OpenAFS Rx will be able to increase the
> default window sizes to benefit from these changes in the near term,
> there are still benefits to OpenAFS Rx implementing the EXTENDED-SACK
> flag and its associated meanings of previousPacket and the unused
> octets. As documented by gerrit 14692 the prior usage of
> previousPacket makes the field unusable as a means of detecting
> out-of-sequence ACK packets and having an accurate view of the leading
> edge of the in-flight window that has been received by the peer. The
> trailer and extra SACK counts provide much-needed clarity of the ACK
> packet size before Path MTU discovery padding.
>
> AuriStor has implemented the EXTENDED-SACK proposal with up to one extra
> SACK table or 4096 DATA packets (5.5MB). With these changes AuriStor
> is prepared to ship a default window size of 4096 packets in our
> September 2021 release provided that there is review from and consensus
> with the OpenAFS community.
>
> Your review and feedback will be appreciated. AuriStor is prepared to
> make changes as needed.
>
> Sincerely,
>
> Jeffrey Altman
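
For readers who want to check the figures quoted above, here is a minimal
sketch of how one 256-octet SACK table could carry the state of 2048 DATA
packets. It is not taken from the specification or from any Rx
implementation. The layout is an assumption: stripe k is taken to occupy
Bit-k of every octet, which matches the statement that a legacy receiver
examines only Bit-0. The table is always built at full length for
simplicity, the function names are hypothetical, and the 1408 bytes per
packet is inferred from "32 packets (44KB)" rather than stated anywhere
in the message.

```python
"""Sketch of an Extended SACK table using horizontal striping.

Assumption (not from the specification): stripe k occupies Bit-k of every
octet, so packet offset i lands in octet i % 256, bit i // 256.
"""

SACK_OCTETS = 256                           # max table length with EXTENDED-SACK
STRIPES = 8                                 # one stripe per bit of an octet
PACKETS_PER_TABLE = SACK_OCTETS * STRIPES   # 2048
BYTES_PER_PACKET = 1408                     # inferred from "32 packets (44KB)"


def slot(offset):
    """Return (octet index, bit index) for an offset from firstPacket."""
    if not 0 <= offset < PACKETS_PER_TABLE:
        raise ValueError("offset does not fit in one SACK table")
    return offset % SACK_OCTETS, offset // SACK_OCTETS


def encode(acked_offsets):
    """Build a full-length table with one bit set per ACKed offset."""
    table = bytearray(SACK_OCTETS)
    for offset in acked_offsets:
        octet, bit = slot(offset)
        table[octet] |= 1 << bit
    return bytes(table)


def is_acked(table, offset):
    """Report whether the packet at this offset is marked ACKed."""
    octet, bit = slot(offset)
    return bool(table[octet] >> bit & 1)


def legacy_view(table):
    """What a receiver that examines only Bit-0 sees for each octet."""
    return [octet & 1 for octet in table]


def window_mib(tables):
    """Window size in MiB covered by the given number of SACK tables."""
    return tables * PACKETS_PER_TABLE * BYTES_PER_PACKET / 2**20


if __name__ == "__main__":
    table = encode([0, 1, 300, 2047])
    print(slot(300))             # (44, 1)
    print(is_acked(table, 2))    # False
    print(window_mib(2))         # 5.5
    print(window_mib(4))         # 11.0
```

Under these assumptions one table plus one extra table covers 4096 packets
(5.5MB) and one table plus three extra tables covers 8192 packets (11MB),
which agrees with the sizes given in both messages.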