On Mon, Nov 13, 2017 at 7:45 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 11/13/2017 06:18 PM, Michael Ma wrote:
>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.duma...@gmail.com>
>>> wrote:
>>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger
>>>>>> <step...@networkplumber.org>:
>>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>>> Michael Ma <make0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate
>>>>>>>> any early feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Currently txq/qdisc selection is based on the flow hash, so packets
>>>>>>>>> from the same flow keep the order in which they enter the qdisc/txq,
>>>>>>>>> which avoids the out-of-order problem.
>>>>>>>>>
>>>>>>>>> To improve the concurrency of the QoS algorithm we plan to have
>>>>>>>>> multiple per-cpu queues for a single TC class and do busy polling
>>>>>>>>> from a per-class thread to drain these queues. If we can do this
>>>>>>>>> frequently enough, the out-of-order situation in this polling
>>>>>>>>> thread should not be that bad.
>>>>>>>>>
>>>>>>>>> To give more details - in the send path we introduce per-cpu
>>>>>>>>> per-class queues so that packets from the same class and same core
>>>>>>>>> will be enqueued to the same place. Then a per-class thread polls
>>>>>>>>> the queues belonging to its class from all the cpus and aggregates
>>>>>>>>> them into another per-class queue. This can effectively reduce
>>>>>>>>> contention but inevitably introduces a potential out-of-order
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>>
>>>>>>> In general, there are no meta design discussions in Linux
>>>>>>> development. Several developers have tried to do lockless qdiscs
>>>>>>> and similar things in the past.
>>>>>>>
>>>>>>> The devil is in the details, show us the code.
>>>>>>
>>>>>> Thanks for the response, Stephen. The code is fairly straightforward;
>>>>>> we have a per-cpu per-class queue defined like this:
>>>>>>
>>>>>> struct bandwidth_group
>>>>>> {
>>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>>     struct skb_list drain;
>>>>>> };
>>>>>>
>>>>>> The "drain" queue is used to aggregate the per-cpu queues belonging
>>>>>> to the same class. In the enqueue function, we determine the cpu
>>>>>> where the packet is processed and enqueue it to the corresponding
>>>>>> per-cpu queue:
>>>>>>
>>>>>> int cpu;
>>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>>
>>>>>> cpu = get_cpu();    /* pin the cpu while touching its queue */
>>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>> put_cpu();
>>>>>>
>>>>>> Here we don't check the flow of the packet, so if there is task
>>>>>> migration or multiple threads sending packets through the same flow
>>>>>> we can theoretically have packets enqueued to different queues and
>>>>>> aggregated to the "drain" queue out of order.
>>>>>>
>>>>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>>>>> currently; however, if there is already a similar effort ongoing
>>>>>> please let me know.
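For concreteness, here is a minimal sketch of the per-class drain
thread described above. struct bandwidth_group and skb_list follow
Michael's pseudo-types, and skb_list_splice_init() is a hypothetical
helper in the same spirit; only the kthread/cpumask calls are existing
kernel APIs.

#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/sched.h>

static int bwg_drain_thread(void *arg)
{
	struct bandwidth_group *bwg = arg;
	int cpu;

	while (!kthread_should_stop()) {
		/*
		 * Move everything queued for this class on any cpu onto
		 * the shared drain queue. Polling frequently keeps the
		 * reordering window between enqueue and aggregation small.
		 */
		for_each_possible_cpu(cpu)
			skb_list_splice_init(&bwg->queues[cpu], &bwg->drain);

		/* Rate limiting / dequeue of bwg->drain would run here. */

		cond_resched();
	}
	return 0;
}

A kthread_run() per class would start one such thread; how aggressively
it polls trades cpu time against the size of the reordering window.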
>>>>> The question I would have is how would this differ from using XPS
>>>>> w/ mqprio? Would this be a classful qdisc like HTB or a classless
>>>>> one like mqprio?
>>>>>
>>>>> From what I can tell XPS would be able to get you your per-cpu
>>>>> functionality; the benefit of it though would be that it would
>>>>> avoid out-of-order issues for sockets originating on the local
>>>>> system. The only thing I see as an issue right now is that the rate
>>>>> limiting with mqprio is assumed to be handled via hardware due to
>>>>> mechanisms such as DCB.
>>>>
>>>> I think one of the key points was in: "do busy polling from a
>>>> per-class thread to drain these queues."
>>>>
>>>> I mentioned this idea in the TX path of:
>>>>
>>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>>
>>> I think this is a bit different from that idea in that the busy
>>> polling is transferring packets from a per-cpu qdisc to a
>>> per-traffic-class queueing discipline. Basically it would be a
>>> busy_poll version of qdisc_run that would be transferring packets
>>> from one qdisc onto another instead of attempting to transmit them
>>> directly.
>>
>> The idea is to have the whole thing implemented as one classful qdisc
>> and remove the qdisc root lock since all the synchronization will be
>> handled internally (let's put aside for now that other
>> filters/actions/qdiscs might still require a root lock).
>>>
>>> What I think is tripping me up is that I don't think this is even
>>> meant to work with a multiqueue device. The description seems more
>>> like an mqprio implementation feeding into a prio qdisc which is
>>> then used for dequeue. To me it seems like this solution would be
>>> pulling complexity off of the enqueue side and just adding it to the
>>> dequeue side. The use of the "busy poll" is just to try to simplify
>>> things as the agent would then be consolidating traffic from
>>> multiple per-cpu queues onto one drain queue.
>>
>> We're essentially trying to spread the complexity from enqueue across
>> different stages such as enqueue/aggregation and rate
>> limiting/dequeue. Each stage will have a different degree of
>> parallelism. It should work with a multi-queue device since txq
>> selection can be the same as today. However, our concern is that
>> between enqueue and aggregation we have a small window which can let
>> packets go out of order, a sacrifice we make for better concurrency.
>>
> So OOO will happen when the application migrates between cpus,
> presumably? This is normally prevented with the skb ooo_okay flag,
> but it looks like you plan to violate this somehow. I think a design
> using ptr_rings/skb_arrays with bulk dequeue and a good concurrent
> token bucket ring would suffice and also not introduce OOO packets.
>
> But I don't completely understand your design, so I might be missing
> something.

I'm missing something here also. The check for ooo_okay is
straightforward to see if we can change the sk TX queue. Does the
design not use the TX queue any more?

Tom
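For reference, the ooo_okay check sits in the txq selection path;
roughly, from __netdev_pick_tx() in net/core/dev.c of kernels of that
era, the cached sk TX queue is reused unless the flow says reordering
is acceptable or the cache is stale:

static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	int queue_index = sk_tx_queue_get(sk);

	if (queue_index < 0 || skb->ooo_okay ||
	    queue_index >= dev->real_num_tx_queues) {
		int new_index = get_xps_queue(dev, skb);

		if (new_index < 0)
			new_index = skb_tx_hash(dev, skb);

		/* Cache the new queue only for full sockets with a dst,
		 * so later packets of the flow stay on one txq. */
		if (queue_index != new_index && sk &&
		    sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

		queue_index = new_index;
	}

	return queue_index;
}

A qdisc that queues per cpu rather than per txq would sidestep this
check, which seems to be the gap the question above points at.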
>>>
>>> Structure-wise this ends up looking not too different from mqprio;
>>> the main difference though would be that this would be a classful
>>> qdisc and that the virtual qdiscs we have for the traffic classes
>>> would need to be replaced with actual qdiscs for handling the
>>> "drain" aspect of this.
>>
>> Structure-wise it's similar to mqprio + a rate limiting qdisc without
>> the root lock, replacing txq/flow-level parallelism with cpu-level
>> parallelism. I'm actually not sure about the similarity with the busy
>> polling that Eric mentioned since I haven't read the slides yet.
>
> I pushed lockless qdisc patches again today and will repost when
> net-next opens. These plus a lockless version of tbf might be what
> you need. At one point I had a lockless tbf; I can probably dig that
> up as well if it's useful.
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg200244.html
>
> Thanks,
> John
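To illustrate the ptr_ring plus token bucket direction John suggests,
a sketch might look like the following. ptr_ring_produce() and
ptr_ring_consume() are the real <linux/ptr_ring.h> APIs; struct
tb_class and the helpers around it are hypothetical, and ring
setup/teardown is omitted.

#include <linux/ptr_ring.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/math64.h>
#include <linux/ktime.h>

struct tb_class {
	struct ptr_ring *rings;	/* nr_cpu_ids rings, one producer per cpu */
	s64 tokens;		/* bytes that may currently be sent */
	s64 burst;		/* cap on accumulated tokens */
	u64 rate;		/* refill rate in bytes per second */
	ktime_t last;		/* time of the last refill */
};

/* Producer path: no class-wide lock is taken, only the per-cpu ring's
 * producer lock, which is uncontended since each cpu uses its own ring. */
static int tb_enqueue(struct tb_class *c, struct sk_buff *skb)
{
	int cpu = get_cpu();
	int ret = ptr_ring_produce(&c->rings[cpu], skb);

	put_cpu();
	return ret;
}

/* One pass of the per-class service thread: refill the bucket, then
 * drain each cpu's ring while tokens last. */
static void tb_service_once(struct tb_class *c)
{
	ktime_t now = ktime_get();
	int cpu;

	c->tokens += div64_u64(c->rate * ktime_to_ns(ktime_sub(now, c->last)),
			       NSEC_PER_SEC);
	c->tokens = min(c->tokens, c->burst);
	c->last = now;

	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
		struct sk_buff *skb;

		while (c->tokens > 0 &&
		       (skb = ptr_ring_consume(&c->rings[cpu])) != NULL) {
			c->tokens -= skb->len;
			dev_queue_xmit(skb);	/* or feed the drain qdisc */
		}
	}
}

As in the thread above, ordering within a flow is preserved only as
long as the flow keeps producing on one cpu; task migration reopens
the same reordering window being discussed.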