On 5/18/2010 12:25 PM, Jeffrey Hutzelman wrote:
> --On Tuesday, May 18, 2010 12:03:46 PM -0400 Jeffrey Altman
> <[email protected]> wrote:
>
>> On 5/18/2010 11:16 AM, Jeffrey Hutzelman wrote:
>>> I'm concerned here that this might mean you are lying about window
>>> sizes.
>>
>> I (personally) am not lying about anything. I really wish that *you*
>> could make a distinction between *me* and the open source code when
>> making comments.
>
> Apologies. Of course I meant that the rx implementation was lying.
> And apparently I also misinterpreted what you wrote, because it sounded
> to me like you were describing changes you and Derrick made which
> resulted in dropping received in-window data.
Thank you. I didn't really want to single you out, but I do think we
have to be careful about what we say. I find that software engineers,
when describing interactions among software interfaces, all too often
use personal pronouns. This has the negative consequence of blurring
the line between saying that the code does not meet its requirements
and saying that the person who wrote the code is bad in some way.

OpenAFS used to be a very small community of close friends. We now
have more than 250 contributors over the last decade and more than 50
in the last year. Those numbers are continuously increasing as OpenAFS
gains more exposure. It is critical to our ability to recruit and
retain new developers that we communicate with one another in a highly
respectful manner, not just for the sake of the feelings of the people
involved in the discussion at the time, but to ensure that lurkers
feel comfortable speaking up and becoming an active part of our
community. I have certainly slipped as well. One thing working with
Google Summer of Code has taught me is that it's a lesson that is easy
to learn and a habit that is hard to break.

>>> The reason some data buffers are allocated in advance is because you
>>> must be prepared to receive any data that can be in flight according to
>>> the advertised window size _without blocking_ or at least without
>>> blocking in a way that prevents traffic in another stream from being
>>> received and processed.
>>
>> A packet is made up of multiple data buffers which are themselves
>> packets. The window size in Rx is not measured in bytes. It is
>> measured in packets and we have no idea how large the incoming packet
>> might be. It can be as large as RX_MAX_PACKET_SIZE. As such, before
>> any receive operation is performed the library must ensure that the
>> full number of data buffers has been attached to the packet.
>
> Well, it has to have someplace to put the received UDP datagram. That
> doesn't necessarily mean "attached to the packet", which I gather is the
> part that created the concurrency problem you saw in July 2009. But
> doing it some other way isn't necessarily easier, since you still need
> to account for the total number of buffers that must be available to
> meet window commitments.

If the Rx window size were tracked as a number of bytes, it would be a
lot easier to ensure that the right number of buffers was available for
any outstanding connection/call. As you are aware, the receiver never
wants to be in a position where it has to allocate memory or reclaim
packet buffers at the moment it is supposed to be pulling data off the
socket.
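To make the accounting difference concrete, here is a rough sketch of
the worst-case reservation each scheme implies. The constants and both
function names are made up for illustration; they are not identifiers
from the rx tree:

    /* Rough sketch only; BUFFER_SIZE, MAX_BUFS_PER_PKT, and both
     * function names are hypothetical, not identifiers from rx. */

    #include <stddef.h>

    #define BUFFER_SIZE      1412  /* assumed payload bytes per data buffer */
    #define MAX_BUFS_PER_PKT 5     /* assumed buffers in a maximal jumbogram */

    /* Packet-denominated window: any in-flight packet may be as large
     * as RX_MAX_PACKET_SIZE, so the receiver must reserve the worst
     * case for every packet in the window. */
    static size_t bufs_for_packet_window(size_t window_packets)
    {
        return window_packets * MAX_BUFS_PER_PKT;
    }

    /* Byte-denominated window: the advertised byte count bounds the
     * buffer requirement exactly, whatever packet sizes the sender
     * chooses to use. */
    static size_t bufs_for_byte_window(size_t window_bytes)
    {
        return (window_bytes + BUFFER_SIZE - 1) / BUFFER_SIZE;
    }

Under these assumed numbers, a 32 packet window forces the first scheme
to hold 160 buffers in reserve even though a typical packet consumes
only two, while the second scheme reserves no more than was actually
advertised.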
But the "right" way to be inefficient here is to advertise > a smaller window, so that you don't get data that has to be > retransmitted. Nonetheless, this is a decision that was made ages ago, > and now that I realize that, there's not much benefit in debating its > merits. And here is where the bug that was just fixed comes into play. When the window size is increased, there is a request for more packets that is signaled by setting rxi_NeedMorePackets. The assumption is that this event will trigger the allocation of more packet buffers when it is safe / efficient to do so. Unfortunately, the only function that tests to see if rxi_NeedMorePackets is non-zero, rx_CheckPackets() was never called except during rx_InitHost(). This wasn't a big problem for userland implementations because they would just allocate packets as the free packet queue became empty. However, in the kernel implementations this is not possible. The #ifdef KERNEL code in rxi_ReceiveDataPacket() is a last ditch effort to keep the overall system running when the number of free packets gets too low. What the code does is blow away the current call, reclaim the data buffers, and sets the 'rxi_NeedMorePackets' flag to TRUE. In the hopes that once this call terminates that rx_CheckPackets() would notice that more packets need to be allocated. Since rx_CheckPackets() was never called, more packets were never allocated and without performing the rx_TrimDataBuffers dance the system would eventually panic. >> The goal is to ensure that we never get into this case which is why if >> the rxi_NeedMorePackets global variable is TRUE we must actually go and >> allocate more packets the next time it is safe to do so. >> >> The patch that was committed today does that for the first time in the >> history of Rx. > > Now I'm really going to have to go back and reread things, because I > examined this fairly closely a couple of months ago when I was working > out fileserver tuning, and one of the conclusions I came to at the time > was that at least in user mode, Rx would always allocate more packets > when needed, so setting the fileserver's -rxpck parameter should never > be necessary. The rxpck parameter should not be required anywhere. It was added at a time when no one had time to profile and performance test the implementation. When Derrick and I removed the rx_TimeDataBuffers() dance we tested in userland. It was only after Hartmut noticed this problem with his kernel testing that I went back and noticed that packet counts never increase in the kernel after the initial allocation. This problem should now be created and in the future, increases in the window size should behave appropriately. As an aside, I find that when I'm working on code that was written years ago and has been in deployment for decades that I tend to make the assumption that the code must be working as designed. It never occurred to me to check whether or not rx_CheckPackets() was doing the job it was written for. rx_CheckPackets() and the rxi_NeedMorePackets variable were added in AFS 3.6. It is unfortunate that rx_CheckPackets() was not actually called back then. If someone has the RCS data from the commit that added it, I would be very interested in seeing what the log entry says. Jeffrey Altman
>> The goal is to ensure that we never get into this case which is why if
>> the rxi_NeedMorePackets global variable is TRUE we must actually go and
>> allocate more packets the next time it is safe to do so.
>>
>> The patch that was committed today does that for the first time in the
>> history of Rx.
>
> Now I'm really going to have to go back and reread things, because I
> examined this fairly closely a couple of months ago when I was working
> out fileserver tuning, and one of the conclusions I came to at the time
> was that at least in user mode, Rx would always allocate more packets
> when needed, so setting the fileserver's -rxpck parameter should never
> be necessary.

The rxpck parameter should not be required anywhere. It was added at a
time when no one had time to profile and performance test the
implementation. When Derrick and I removed the rx_TrimDataBuffers()
dance, we tested in userland. It was only after Hartmut noticed this
problem in his kernel testing that I went back and noticed that packet
counts never increase in the kernel after the initial allocation. This
problem should now be corrected, and in the future, increases in the
window size should behave appropriately.
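In outline, the fix amounts to honoring the flag from a path that runs
regularly and where allocation is safe. Schematically (simplified, not
the literal patch; the function name is a stand-in):

    /* Simplified outline, not the literal patch. */

    void rx_CheckPackets(void);    /* tests rxi_NeedMorePackets */

    static void rxi_PeriodicWork(void)         /* assumed name */
    {
        /* ... existing periodic work: retransmits, keepalives ... */

        /* Honor rxi_NeedMorePackets at a point where allocation is
         * safe, so window growth actually enlarges the packet pool. */
        rx_CheckPackets();
    }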
As an aside, I find that when I'm working on code that was written
years ago and has been deployed for decades, I tend to assume that the
code must be working as designed. It never occurred to me to check
whether rx_CheckPackets() was doing the job it was written for.
rx_CheckPackets() and the rxi_NeedMorePackets variable were added in
AFS 3.6. It is unfortunate that rx_CheckPackets() was not actually
called back then. If someone has the RCS data from the commit that
added it, I would be very interested in seeing what the log entry says.

Jeffrey Altman
