On 02/25/15 13:02, Bruce Richardson wrote:
> On Wed, Feb 25, 2015 at 11:40:36AM +0200, Vlad Zolotarov wrote:
>> Hi, I have a question about the "scattered Rx" feature: why does enabling it
>> disable the "bulk allocation" feature?
> The "bulk-allocation" feature is one where a more optimized RX code path is
> used. For the sake of performance, certain assumptions were made in that code
> path, one of which was that packets would fit inside a single mbuf. Not
> having this assumption makes the receiving of packets much more complicated
> and therefore slower. [For similar reasons, the optimized TX routines, e.g.
> vector TX, are only used if it is guaranteed that no hardware offload
> features are going to be used.]
>
> Now, it is possible, though challenging, to write optimized code for these
> more complicated cases, such as scattered RX, or TX with offloads or
> scattered packets. In general, we will always want separate routines for the
> simple case and the complicated cases, as the performance hit of checking for
> offloads or multi-mbuf packets will be significant enough to hurt our
> performance badly when they are not needed. In the case of the vector PMD for
> ixgbe - our highest performance path right now - we do indeed have two
> receive routines, for the simple and scattered cases. For TX, we only have an
> optimized path for the simple case, but that is not to say that at some point
> someone may not provide one for the offload case too.
>
> A final note on scattered packets in particular: if packets are too big to
> fit in a single mbuf, then they are not small packets, and the processing
> time available per packet is, by definition, larger than for packets that fit
> in a single mbuf. For 64-byte packets, the packet inter-arrival time is 67ns
> @ 10G, or approx 200 cycles at 3GHz.
> If we assume a standard 2k mbuf, then a packet which spans two mbufs takes at
> least 1654ns on the wire, so a 3GHz CPU has nearly 5000 cycles to process
> that same packet. Since the processing budget is so much bigger, the need to
> optimize is much smaller, and it's therefore more important to focus on the
> small-packet case, which is what we have done.
Sure. I'm doing my best not to harm the existing code paths: the RSC handler
is a separate function (I first patched the scalar scattered function, but now
I'm rewriting it as a stand-alone routine), I don't change igb_rx_entry (I
leave it as a pointer), and I keep the additional info in separate descriptors
in a separate ring that is not accessed in a non-RSC flow.

>> There is a somewhat unclear comment in ixgbe_recv_scattered_pkts():
>>
>> /*
>>  * Descriptor done.
>>  *
>>  * Allocate a new mbuf to replenish the RX ring descriptor.
>>  * If the allocation fails:
>>  *    - arrange for that RX descriptor to be the first one
>>  *      being parsed the next time the receive function is
>>  *      invoked [on the same queue].
>>  *
>>  *    - Stop parsing the RX ring and return immediately.
>>  *
>>  * This policy does not drop the packet received in the RX
>>  * descriptor for which the allocation of a new mbuf failed.
>>  * Thus, it allows that packet to be later retrieved if
>>  * mbuf have been freed in the mean time.
>>  * As a side effect, holding RX descriptors instead of
>>  * systematically giving them back to the NIC may lead to
>>  * RX ring exhaustion situations.
>>  * However, the NIC can gracefully prevent such situations
>>  * to happen by sending specific "back-pressure" flow control
>>  * frames to its peer(s).
>>  */
>>
>> Why can't the same "policy" be applied to the bulk-context allocation? -
>> Don't advance the RDT until you've refilled the ring. What am I missing
>> here?
> A lot of the optimizations done in other code paths, such as bulk alloc, may
> well be applicable here; it's just that the work has not been done yet, as
> the focus is elsewhere. For vector PMD RX, we now have routines that work on
> both regular and scattered packets, and both perform much better than the
> scalar equivalents. Also note that in every RX (and TX) routine, the NIC
> tail pointer update is always done just once, at the end of the function.

I see. Thanks for the detailed clarification.
Although I've spent some time with DPDK, I still sometimes feel that I don't
fully understand the original author's idea, and clarifications like yours
really help. I looked at the vectored receive function (_recv_raw_pkts_vec())
and it is one cryptic piece of code! ;) Since you've brought it up - could you
direct me to the measurements comparing the vectored and scalar DPDK data
paths, please? I wonder how working without CSUM offload, for instance, may be
faster even for small packets like you mentioned above. One would have to
calculate the checksum in SW in that case, and I'm puzzled how this may be
faster than letting the HW do it...

>> Another question is about the LRO feature - is there a reason why it's not
>> implemented? I've implemented LRO support in the ixgbe PMD to begin with -
>> I used "scattered Rx" as a template and now I'm tuning it (things like the
>> stuff above).
>>
>> Is there any philosophical reason why it hasn't been implemented in *any*
>> PMD so far? ;)
> I'm not aware of any philosophical reasons why it hasn't been done. Patches
> are welcome, as always. :-)

Great! So I'll send what I have once it's ready... ;) Again, thanks for a
great clarification.

>
> /Bruce
>