In discussions with the HEPiX community during 2007, it was made clear to the gatekeepers that identifying and correcting trouble spots within Rx was one of the most important areas in which OpenAFS needed to improve in order to maintain the existing deployments within that community of users. No one had the resources to devote to such a pursuit, and it was suggested that OpenAFS apply for a United States Small Business Innovation Research (SBIR) grant to fund the work.
There were two problems with such an approach. First, OpenAFS does
not legally exist. Second, even once it has a legal Foundation, that
Foundation would not be eligible for an SBIR grant due to its
not-for-profit status. Since we did not have any other source of
funding for the work, it was suggested that one of the existing
commercial support companies submit a grant application.
In October 2007 I founded Your File System Inc. as a for-profit
company that would be eligible to receive an SBIR grant and use the
funding to accomplish two goals. First, to benefit the OpenAFS
community by documenting the existing architectures and protocols
used by OpenAFS, and by engaging in profiling and performance
analysis that could serve as input to the development of next
generation implementations. Second, because the company is receiving
SBIR funding, to develop a sustainable business model that could
support the development of a next generation distributed storage
system.
The SBIR grant has provided funding for developer hours as well
as test equipment. In particular, it has provided a 10 Gbit/second
network testbed which is being used for Rx profiling.
I am pleased to announce the first public benefit to the OpenAFS
community as a result of the SBIR grant, with the expectation
that there will be much more to come.
There have been many efforts over the last five years to improve Rx.
Tom Keiser implemented per-thread free packet queues to reduce the
contention for the global lock protecting the free packet queue.
Other work has been performed to reduce the dependency on global
locks. Rx hot threads have been implemented on a broader range of
platforms. Various bug fixes have been accepted as they have been
validated. Yet even with all of this work, Rx has continued to
experience noticeable performance problems. In November 2006 there
was discussion regarding a 350ms hiccup that was experienced
repeatedly and was significantly hampering performance. Several
people have tried to pin it down over the years, without success.
Funded by the SBIR grant, there have been efforts over the last
couple of months to analyze Rx performance data from a number of
sources. Several symptoms were identified that might or might not
have been related to the hiccup but were worth investigating. First,
there was a periodic out-of-memory error experienced in Windows test
clients. Second, there was a consistent lack of free packets. Third,
there were far more retries than could be explained by packet loss
on the network.
What the investigations uncovered was a related set of problems,
some of which affect all implementations of Rx derived from the
Transarc implementation. The problems fall into several categories:
1. Resetting a Call object emptied packet queues without adding the
packets to the free packet queue. rxi_ResetCall() would call
queue_Init() on queues with active rx_packets on them. Once the
queues were cleared, the packets were leaked, and any acknowledgment
of receipt or transmission of other outgoing data would be lost.
Instead of re-initializing the queues, their contents should simply
be freed, either by a call to rxi_FreePackets() or by setting
the force flag on rxi_ClearTransmitQueue() and rxi_ClearReceiveQueue().
2. Packets queued for transmission would not be sent.
In rx.c there were two instances of RX_GLOBAL_RXLOCK_KERNEL which
should have been AFS_GLOBAL_RXLOCK_KERNEL. This oversight caused
calls that were actively transmitting packets to be reset
prematurely, leaking the outgoing packets.
3. Packets would be leaked while read operations were in progress.
rxi_ReadProc()/rxi_ReadProc32() failed to remove the currentPacket
and put it on the call's iov queue once all of its data had been
read. This resulted in the packet being lost either when the next
packet was fetched for reading, when the next packet was
transmitted, or when the call was reset.
4. The algorithm OpenAFS uses to allocate additional packets when
there are no free packets was overly aggressive. The size of each
new batch was based on the overall number of packets previously
allocated, so each allocation was larger than the one before it.
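The leak pattern in problem 1 can be illustrated with a minimal sketch. The structures and function names below are hypothetical stand-ins, not the actual OpenAFS rx code:

```c
/*
 * Minimal sketch of the leak in problem 1.  The types and names here
 * are illustrative stand-ins, not the actual OpenAFS rx structures.
 */
#include <assert.h>
#include <stddef.h>

struct pkt {
    struct pkt *next;
};

struct queue {
    struct pkt *head;
};

static struct queue free_q;     /* global free packet queue */
static int free_count;          /* packets currently on free_q */

static void queue_push(struct queue *q, struct pkt *p)
{
    p->next = q->head;
    q->head = p;
}

/*
 * The bug: re-initializing a queue that still holds packets discards
 * the only references to them.  The packets can never be returned to
 * the free queue and are effectively leaked.
 */
static void reset_call_buggy(struct queue *xmit_q)
{
    xmit_q->head = NULL;        /* active packets leaked here */
}

/*
 * The fix: drain the queue back onto the free packet queue before
 * clearing it, so every packet is recycled rather than leaked.
 */
static void reset_call_fixed(struct queue *xmit_q)
{
    while (xmit_q->head != NULL) {
        struct pkt *p = xmit_q->head;
        xmit_q->head = p->next;
        queue_push(&free_q, p);
        free_count++;
    }
}
```

Because the buggy reset never touches the free queue, the free packet count stays flat even as packets vanish, which is exactly why the symptom shows up as a consistent lack of free packets rather than an obvious error.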
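Problem 4 can likewise be sketched in miniature. The batch sizes and function names below are hypothetical, chosen only to show why basing each grow step on the running total produces geometric rather than linear growth:

```c
/*
 * Illustrative sketch of problem 4.  The constants and names are
 * hypothetical, not the real rx packet allocator.
 */
#include <assert.h>

static int total_allocated;

/*
 * Aggressive policy: allocate a batch proportional to the running
 * total, so every shortfall roughly doubles the pool.
 */
static int grow_aggressive(void)
{
    int batch = total_allocated ? total_allocated : 64;
    total_allocated += batch;
    return batch;
}

/*
 * Bounded policy: allocate a fixed-size batch per shortfall, so the
 * pool grows linearly with demand instead of geometrically.
 */
static int grow_bounded(void)
{
    int batch = 64;
    total_allocated += batch;
    return batch;
}
```

Under these illustrative numbers, five shortfalls allocate 1024 packets with the aggressive policy but only 320 with the bounded one; combined with the leaks above, which manufacture shortfalls continuously, the aggressive policy explains the periodic out-of-memory errors.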
The side effects of these issues have been present in AFS for a very
long time and have been seen in both clients and servers. Corrections
for these errors have been integrated into 1.5.53 and 1.4.8-pre1.
As a result of these problems, Rx was periodically not sending the
anticipated acknowledgment packet, which in turn resulted in a
timeout and retransmission. The Rx stack was also frequently finding
itself out of free packets and was forced to block on a global lock
while additional packet structures were allocated from the process's
memory pool. With these problems corrected, the end result is a
performance improvement of greater than 9.5% when comparing the Rx
performance of 1.4.8 with 1.4.7.
Rough tests show that the 1.4.8 Rx stack is capable of 124
MBytes/second over a 10 Gbit link. Since a 10 Gbit link can carry
roughly 1250 MBytes/second, there is still a long way to go to fill
the pipe, but it is a start: we are now only off by one order of
magnitude.
Some might ask, "how is it that these bugs remained present in the
OpenAFS source tree for all of these years?" The answer is quite
simple: no one ever thought to look for packet leaks. Many
organizations still perform weekly server restarts and so never
noticed the memory leaks. The rxdebug -rxstats output lists the free
packet count, but no one ever thought it important to report the
number of allocated packets. As a result, no one noticed that the
reason free packets were available was that packets were constantly
being allocated instead of recycled. Over the years many individuals
have noticed the extra resends; it is just that no one was able to
identify why they were being sent. The resends did not prevent the
system from functioning. It was just slower than it should have
been.
As these changes become available for both clients and servers, I
expect users to see much improved throughput rates, and several
previously unexplained server and client crashes will now be a thing
of the past.
In order to get OpenAFS 1.4.8 released, we need the assistance of
the community to test the pre-releases. 1.4.8-pre1 was announced
yesterday. The best way to move the release process along is for
organizations that deploy OpenAFS to test the pre-releases and send
e-mail to this mailing list confirming what works. Silence cannot be
interpreted by the gatekeepers as a sign that all is well.
I look forward to your reports of success and to announcing further
grant-funded contributions in the future.
Jeffrey Altman
