Re: [OpenAFS-devel] read performance 1.5.74 Linux

Jeffrey Altman Tue, 18 May 2010 11:51:26 -0700

On 5/18/2010 12:25 PM, Jeffrey Hutzelman wrote:
> --On Tuesday, May 18, 2010 12:03:46 PM -0400 Jeffrey Altman
> <[email protected]> wrote:
> 
>> On 5/18/2010 11:16 AM, Jeffrey Hutzelman wrote:
>>> I'm concerned here that this might mean you are lying about window
>>> sizes.
>>
>> I (personally) am not lying about anything.  I really wish that *you*
>> could make a distinction between *me* and the open source code when
>> making comments.
> 
> Apologies.  Of course I meant that the rx implementation was lying.
> And apparently I also misinterpreted what you wrote, because it sounded
> to me like you were describing changes you and Derrick made which
> resulted in dropping received in-window data.


Thank you.  I didn't really want to single you out but I do think that
we have to be careful about what we say.  I find that software engineers
all too often when describing interactions among software interfaces use
personal pronouns.  This has the negative consequence of blurring the
lines between the ability to say that the code does not meet its
requirements and that the person that wrote the code is bad in some way.

OpenAFS used to be a very small community of close friends.  We now have
more than 250 contributors over the last decade and more than 50 in the
last year.  Those numbers are continuously increasing as OpenAFS gains
more exposure.

It is critical to our ability to recruit and retain new developers that
we communicate in a highly respectful manner with one another.  Not just
for the sake of the feelings of the people involved in the discussion at
the time but to ensure that lurkers feel comfortable speaking up and
becoming an active part of our community.

I have certainly slipped as well.  One thing working with Google Summer
of Code has taught me is that its a lesson that is easy to learn and a
habit that is hard to break.

>>> The reason some data buffers are allocated in advance is because you
>>> must be prepared to receive any data that can be in flight according to
>>> the advertised window size _without blocking_ or at least without
>>> blocking in a way that prevents traffic in another stream from being
>>> received and processed.
>>
>> A packet is made up of multiple data buffers which are themselves
>> packets.  The window size in Rx is not measured in bytes.  It is
>> measured in packets and we have no idea how large the incoming packet
>> might be.  It can be as large as RX_MAX_PACKET_SIZE.  As such, before
>> any receive operation is performed the library must ensure that the full
>> number of data buffers has been attached to the packet.
> 
> Well, it has to have someplace to put the received UDP datagram.  That
> doesn't necessarily mean "attached to the packet", which I gather is the
> part that created the concurrency problem you saw in July 2009.  But
> doing it some other way isn't necessarily easier, since you still need
> to account for the total number of buffers that must be available to
> meet window commitments.

If the Rx window size was tracked by the number of bytes in the window,
then it would be a lot easier to ensure that the right number of buffers
were available for any outstanding connection/call.  As you are aware,
the receiver never wants to be in a position where it has to allocate
memory or reclaim packet buffers at the moment it is supposed to be
pulling data off the socket.

>>> Tearing apart other packets to reclaim buffers
>>> is acceptable, but not if it means you need to wait for a buffer to
>>> drain before you can receive more packets.
>>
>> It is acceptable but it is also extremely inefficient because most
>> packets are not jumbo packets and as such only require a single data
>> buffer beyond the buffer used for the header.
> 
> Yes.
> 
> 
> 
>>>  Dropping received data on
>>> the floor when it was received within the advertised window is _not_
>>> acceptable; that breaks flow control and exacerbates congestion.
>>
>> Of course it is but I believe that the early developers made a wise
>> choice being causing a kernel panic and being inefficient on the wire.
>> If you have to choose one, drop the data on the floor and let it be
>> retransmitted.
> 
> Well, sure.  But the "right" way to be inefficient here is to advertise
> a smaller window, so that you don't get data that has to be
> retransmitted. Nonetheless, this is a decision that was made ages ago,
> and now that I realize that, there's not much benefit in debating its
> merits.

And here is where the bug that was just fixed comes into play.  When the
window size is increased, there is a request for more packets that is
signaled by setting rxi_NeedMorePackets.  The assumption is that this
event will trigger the allocation of more packet buffers when it is safe
/ efficient to do so.  Unfortunately, the only function that tests to
see if rxi_NeedMorePackets is non-zero, rx_CheckPackets() was never
called except during rx_InitHost().

This wasn't a big problem for userland implementations because they
would just allocate packets as the free packet queue became empty.
However, in the kernel implementations this is not possible.  The #ifdef
KERNEL code in rxi_ReceiveDataPacket() is a last ditch effort to keep
the overall system running when the number of free packets gets too low.
 What the code does is blow away the current call, reclaim the data
buffers, and sets the 'rxi_NeedMorePackets' flag to TRUE.  In the hopes
that once this call terminates that rx_CheckPackets() would notice that
more packets need to be allocated.  Since rx_CheckPackets() was never
called, more packets were never allocated and without performing the
rx_TrimDataBuffers dance the system would eventually panic.

>> The goal is to ensure that we never get into this case which is why if
>> the rxi_NeedMorePackets global variable is TRUE we must actually go and
>> allocate more packets the next time it is safe to do so.
>>
>> The patch that was committed today does that for the first time in the
>> history of Rx.
> 
> Now I'm really going to have to go back and reread things, because I
> examined this fairly closely a couple of months ago when I was working
> out fileserver tuning, and one of the conclusions I came to at the time
> was that at least in user mode, Rx would always allocate more packets
> when needed, so setting the fileserver's -rxpck parameter should never
> be necessary.

The rxpck parameter should not be required anywhere.  It was added at a
time when no one had time to profile and performance test the
implementation.  When Derrick and I removed the rx_TimeDataBuffers()
dance we tested in userland.  It was only after Hartmut noticed this
problem with his kernel testing that I went back and noticed that packet
counts never increase in the kernel after the initial allocation.  This
problem should now be created and in the future, increases in the window
size should behave appropriately.

As an aside, I find that when I'm working on code that was written years
ago and has been in deployment for decades that I tend to make the
assumption that the code must be working as designed.  It never occurred
to me to check whether or not rx_CheckPackets() was doing the job it was
written for.  rx_CheckPackets() and the rxi_NeedMorePackets variable
were added in AFS 3.6.  It is unfortunate that rx_CheckPackets() was not
actually called back then.  If someone has the RCS data from the commit
that added it, I would be very interested in seeing what the log entry says.


Jeffrey Altman

smime.p7s
Description: S/MIME Cryptographic Signature

Re: [OpenAFS-devel] read performance 1.5.74 Linux

Reply via email to