"Kenneth D. Merry" wrote: > If the receive ring for that packet size is full, it will hold off on > DMAs. If all receive rings are full, there's no reason to send more > interrupts.
I think that this does nothing, in the FreeBSD case, since the data
from the card will generally be drained into the input queue much
faster than it accrues. Whether it gets processed out of there
before you run out of mbufs is another matter.

[ ... ]

> Anyway, if all three rings fill up, then yes, there won't be a reason to
> send receive interrupts.

I think this can't really happen, since interrupt processing has the
highest priority, compared to stack processing or application level
processing. 8-(.

> > OK, assuming you meant that the copies would stall, and the
> > data not be copied (which is technically the right thing to
> > do, assuming a source quench style livelock avoidance, which
> > doesn't currently exist)...
>
> The data isn't copied, it's DMAed from the card to host memory. The card
> will save incoming packets to a point, but once it runs out of memory to
> store them it starts dropping packets altogether.

I think that the DMA will not be stalled, at least as the driver
currently exists; you and I agreed on that already (see below). My
concern in this case is that, if the card is using the bus to copy
packets from card memory to the receive ring, then the bus isn't
available for other work, which is bad. It's better to drop the
packets before putting them in card memory (FIFO drop fails to
avoid the case where a continuous attack pushes all good packets
out).

> > The problem is still that you end up doing interrupt processing
> > until you run out of mbufs, and then you have the problem of
> > not being able to transmit responses, for lack of mbufs.
>
> In theory you would have configured your system with enough mbufs
> to handle the situation, and the slowness of the system would cause
> the windows on the sender to fill up, so they'll stop sending data
> until the receiver starts responding again.
> That's the whole purpose
> of backoff and slow start -- to find a happy medium for the
> transmitter and receiver so that data flows at a constant rate.

In practice, mbuf memory is just as overcommitted as all other
memory, and given a connection count target, you are talking a
full transmit and full receive window worth of data at 16k a
pop -- 32k per connection.

Even a modest maximum connection count of ~30,000 connections --
something even an unpatched 4.3 FreeBSD could handle -- means
that you need 1G of RAM for the connections alone, if you
disallow overcommit. In practice, that would mean ~20,000
connections, when you count page table entries, open file table
entries, vnodes, inpcb's, tcpcb's, etc. And that's a generous
estimate, which assumes that you tweak your kernel properly.

One approach to this is to control the window sizes based on
the amount of free reserve you have available, but this will
actually damage overall throughput, particularly on links
with a higher latency.

> > In the ti driver case, the inability to get another mbuf to
> > replace the one that will be taken out of the ring means that
> > the mbuf gets reused for more data -- NOT that the data flow
> > in the form of DMA from the card ends up being halted until
> > mbufs become available.
>
> True.

This is actually very bad: you want to drop packets before you
insert them into the queue, rather than after they are in the
queue. This is because you want the probability of the drop
(assuming the queue is not maxed out: otherwise, the probability
should be 100%) to be proportional to the exponential moving
average of the queue depth, after that depth exceeds a drop
threshold. In other words, you want to use RED.

> > Please look at what happens in the case of an allocation
> > failure, for any driver that does not allow shrinking the
> > ring of receive mbufs (the ti is one example).
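A minimal sketch of the RED-style drop decision described above, in
Python rather than kernel C. The class name, its parameters and the
default weight are all assumptions made for illustration; none of this
is FreeBSD or ti driver code. It is also a simplification: the drop
probability here ramps linearly from 0 at the threshold to 1 at
capacity, whereas full RED has a separate maximum-probability
parameter and spaces its drops out.

```python
import random


class RedQueue:
    """Toy RED-style admission check for a bounded queue (illustrative only)."""

    def __init__(self, capacity, drop_threshold, weight=0.002, rng=None):
        if not 0 <= drop_threshold < capacity:
            raise ValueError("need 0 <= drop_threshold < capacity")
        self.capacity = capacity
        self.drop_threshold = drop_threshold
        self.weight = weight              # EMA smoothing factor, 0 < weight <= 1
        self.rng = rng or random.Random()
        self.depth = 0                    # instantaneous queue depth
        self.avg = 0.0                    # exponential moving average of depth

    def drop_probability(self):
        if self.depth >= self.capacity:
            return 1.0                    # queue maxed out: always drop
        if self.avg <= self.drop_threshold:
            return 0.0                    # at or below the threshold: never drop
        # Linear ramp from 0 at the threshold toward 1 at capacity.
        span = self.capacity - self.drop_threshold
        return min(1.0, (self.avg - self.drop_threshold) / span)

    def offer(self):
        """Decide on arrival, before enqueueing. True means the packet was queued."""
        self.avg += self.weight * (self.depth - self.avg)
        if self.rng.random() < self.drop_probability():
            return False                  # dropped before it entered the queue
        self.depth += 1
        return True

    def take(self):
        """Consumer removes one packet, if any."""
        if self.depth > 0:
            self.depth -= 1
```

The point of the sketch is only the ordering: the decision is made on
the arriving packet, so packets already queued are never displaced by
newer ones.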
> It doesn't spam things, which is what you were suggesting before, but
> as you pointed out, it will effectively drop packets if it can't get new
> mbufs.

Maybe I'm being harsh in calling it "spam'ming". It does the
wrong thing, by dropping the oldest unprocessed packets first.
A FIFO drop is absolutely the wrong thing to do in an attack
or overload case, when you want to shed load. I consider the
packet that is being dropped to have been "spam'med" by the
card replacing it with another packet, rather than dropping
the replacement packet instead.

The real place for this drop is "before it gets to card memory",
not "after it is in host memory"; Floyd, Jacobson, Mogul, etc.,
all agree on that.

> Yes, it could shrink the pool, by just not replacing those mbufs in the
> ring (and therefore not notifying the card that that slot is available
> again), but then it would likely need some mechanism to allow it to be
> notified that another buffer is available for it, so it can then allocate
> receive buffers again.
>
> In practice I haven't found the number of mbufs in the system to be a
> problem that would really make this a big issue. I generally configure
> the number of mbufs to be high enough that it isn't a problem in the
> first place.

I have a nice test that I would be happy to run for you in the
lab. It loads a server up with 100,000 simultaneous long
duration downloads, replacing each client with a new one when
that client's download is complete.

In the case that you run out of mbufs, FreeBSD just locks up
solid, unless you make modifications to the way that packets
are processed (or maintain a "transmit mbufs free reserve" to
insure against the deadly embrace deadlock).

If you have another approach to resolving the deadlock, I'm all
ears... I'll repeat it for you in the lab, if you are willing
to stare at it with me... we would probably be willing to put
you on full time, if you were able to do something about the
problem, other than what I've done. 8-).
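As an illustration of the "transmit mbufs free reserve" idea, here is
a minimal sketch in Python. The class and method names are
hypothetical and the pool is just a counter, not the real mbuf
allocator; it only shows the rule that receive-side allocations fail
once the free count falls to the reserve, so the transmit side can
still get buffers to send responses.

```python
class MbufPool:
    """Toy buffer pool with a transmit-only reserve (names are hypothetical)."""

    def __init__(self, total, tx_reserve):
        if not 0 <= tx_reserve <= total:
            raise ValueError("need 0 <= tx_reserve <= total")
        self.total = total
        self.tx_reserve = tx_reserve
        self.free = total

    def alloc(self, for_transmit):
        """Take one buffer; returns False when the caller may not have one."""
        # Receive allocations must leave tx_reserve buffers untouched;
        # transmit allocations may drain the pool all the way to zero.
        floor = 0 if for_transmit else self.tx_reserve
        if self.free <= floor:
            return False
        self.free -= 1
        return True

    def release(self):
        """Return one buffer to the pool."""
        if self.free >= self.total:
            raise RuntimeError("release without matching alloc")
        self.free += 1
```

With `MbufPool(total=4, tx_reserve=2)`, the third receive allocation
fails while two transmit allocations still succeed, which is the
property that breaks the deadlock described above.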
> > The driver does it on purpose, by not giving away the mbuf
> > in the receive ring, until it has an mbuf to replace it.
>
> The driver does effectively drop packets, but it doesn't spam
> over another packet that has already gone up the stack.

It wastes a DMA, by DMA'ing over the already DMA'ed packet, so
we eat unnecessary bus bandwidth lossage as a result (see above,
as to why I called it "spam").

> > Maybe this should be rewritten to not mark the packet as
> > received, and thus allow the card to overwrite it.
>
> That wouldn't really work, since the card knows it has DMAed into that
> slot, and won't DMA another packet in its place. The current approach is
> the equivalent, really. The driver tells the card the packet is received,
> and if it can't allocate another mbuf to replace that mbuf, it just puts
> that mbuf back in place. So the card will end up overwriting that packet.

I'd actually prefer to avoid the other DMA; I'd also like to avoid
the packet receipt order change that results from DMA'ing over the
previous contents, in the case that an mbuf can't be allocated. I'd
rather just let good packets in with a low (but non-zero) statistical
probability, relative to a slew of bad packets, rather than letting
a lot of bad packets from a persistent attacker push my good data
out with the bad.

[ ... ]

> The main thing I would see is that when an interrupt handler takes
> a long time to complete, it's going to hold off other devices
> sharing that interrupt. (Or interrupt mask, perhaps.)

If you are sharing interrupts at this network load, then you are
doing the wrong thing in your system setup. If you don't have such
a high load, then it's not a problem; either way, you can avoid the
problem, for the most part.

> This may have changed in -current with interrupt threads, though.

It hasn't, as far as I can tell; in fact, the move to a separate
kernel thread in order to process the NETISR makes things worse,
from what I can see.
The thing you have to do to reintroduce fairness is to make a
decision to _not_ reenable the interrupts.

The simplest way to do this is to maintain queue depth counts
for the amount of data on its way to user space, and then make
a conscious decision at the high watermark to _not_ reenable
interrupts. The draining of the queue then looks for the low
watermark, and when you hit it, reenables the interrupts.

This is a really crude form of what's called "Weighted Fair
Share Queueing". There's actually a good paper on it off the
main page of the QLinux project (second hit on a Google search
for "QLinux").

> > Is this a request for me to resend the diffs?
>
> Yes.

OK, I will rediff and generate context diffs; expect them to
be sent in 24 hours or so from now.

> > Sure. So you set the DF bit, and then start with honking big
> > packets, sending progressively smaller ones, until you get
> > a response.
>
> Generally the ICMP response tells you how big the maximum MTU is, so you
> don't have to guess.

Maybe it's the ICMP response; I still haven't had a chance to
hold Michael down and drag the information out of him. 8-).

[ ... ]

> The two Product X boxes use TCP connections between each other, and happily
> negotiate a MSS of 8960 or so. They start sending data packets, but
> nothing gets through.
>
> How would product X detect this situation? Most switches I've seen don't
> send back ICMP packets to tell the sender to change his route MTU. They
> just drop the packets. In that situation, though, you can't tell the
> difference between the other end being down, the cable getting pulled,
> switch getting powered off or the MTU on the switch being too small.

Cisco boxes detect "black hole" routes; I'd have to read the
white paper, rather than just its abstract, to tell you how,
though...

> It's a lot easier to just have the user configure the MTU.

Not for the user.
Maybe making gigabit cards not need twisty cables to wire them
together has just set expectations too high... 8-).

> So, what if you did try to back off and retransmit at progressively smaller
> sizes? That won't work in all cases. If you're the receiver, and the
> sender isn't one of your boxes, you have no way of knowing whether the
> sender is down or what, and you have no way of guessing that his problem is
> that the switch doesn't support the large MSS you've negotiated. There's
> also no way for you to back off, since you're not the one transmitting the
> data, and your acks get through just fine.

It's ugly, but possible. You can always detect this by dicking
with the hop count, etc. Actually, given gigabit speeds, you
should be able to sink an incredible amount of processing into
it, and still get done relatively quickly after the carrier
goes on.

In any case, Intel cards appear to do it, and so do Tigon III's.

> > Ugh. I was unaware that FreeBSD's stack would not honor a 9k MTU.
>
> It will receive packets that size, but won't send them because it's less
> efficient that way. Why bother, anyway? Windows boxes will send packets
> that are 9K-header length, but that's generally less efficient from a
> buffer management standpoint.

Depends on the buffer management. Yeah, it'll waste 3/4 of a
page, but the FreeBSD implementation isn't too concerned about
page sizing mbufs anyway, or we wouldn't need a cluster for
describing them.

> On i386 (4K pages), three pages plus a header mbuf are allocated and put
> into an extended jumbo receive BD.
>
> I'd really be more worried about performance than memory waste nowadays,
> though. If you're wasting a little bit of space in an mbuf on a 4GB
> machine, who cares? I'd suggest going with larger mbufs if anything, that
> way you can minimize list traversal. mbuf size, at least, is tweakable, so
> individual users can tune it for their typical packet size.

Yep. My thoughts exactly.
Plus, the faster you push out the data, the faster you get the
memory back, and the lower your pool retention time for any given
allocation set.

In practice, the packet sizes could also be calculated based on
the amount of data you intend to send; so a 12k send might take
8k + 4k instead of 9k + 3k. Whether that works out would depend
on whether you ended up being CPU bound or I/O bound. Dropping
interrupt overhead would definitely help tip that balance,
though... ;-).

-- Terry