On Mon, Nov 13, 2017 at 7:45 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 11/13/2017 06:18 PM, Michael Ma wrote:
>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.duma...@gmail.com>
>>> wrote:
>>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger
>>>>>> <step...@networkplumber.org>:
>>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>>> Michael Ma <make0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate
>>>>>>>> any early feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Currently txq/qdisc selection is based on the flow hash, so packets
>>>>>>>>> from the same flow keep the order in which they enter the qdisc/txq,
>>>>>>>>> which avoids the out-of-order problem.
>>>>>>>>>
>>>>>>>>> To improve the concurrency of the QoS algorithm we plan to have
>>>>>>>>> multiple per-cpu queues for a single TC class and do busy polling
>>>>>>>>> from a per-class thread to drain these queues. If we can do this
>>>>>>>>> frequently enough, the out-of-order situation in this polling
>>>>>>>>> thread should not be that bad.
>>>>>>>>>
>>>>>>>>> To give more details - in the send path we introduce per-cpu
>>>>>>>>> per-class queues so that packets from the same class and same core
>>>>>>>>> will be enqueued to the same place. Then a per-class thread polls
>>>>>>>>> the queues belonging to its class from all the cpus and aggregates
>>>>>>>>> them into another per-class queue. This can effectively reduce
>>>>>>>>> contention but inevitably introduces a potential out-of-order
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>>
>>>>>>> In general, there are no meta design discussions in Linux
>>>>>>> development. Several developers have tried to do lockless qdiscs
>>>>>>> and similar things in the past.
>>>>>>>
>>>>>>> The devil is in the details, show us the code.
>>>>>>
>>>>>> Thanks for the response, Stephen. The code is fairly straightforward;
>>>>>> we have a per-cpu per-class queue defined like this:
>>>>>>
>>>>>> struct bandwidth_group
>>>>>> {
>>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>>     struct skb_list drain;
>>>>>> };
>>>>>>
>>>>>> The "drain" queue is used to aggregate the per-cpu queues belonging
>>>>>> to the same class. In the enqueue function, we determine the cpu
>>>>>> where the packet is processed and enqueue it to the corresponding
>>>>>> per-cpu queue:
>>>>>>
>>>>>> int cpu;
>>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>>
>>>>>> cpu = get_cpu();    /* pin the cpu while touching its queue */
>>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>> put_cpu();
>>>>>>
>>>>>> Here we don't check the flow of the packet, so if there is task
>>>>>> migration or multiple threads sending packets through the same flow
>>>>>> we can theoretically have packets enqueued to different queues and
>>>>>> aggregated to the "drain" queue out of order.
>>>>>>
>>>>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>>>>> currently; however, if there is already a similar effort ongoing
>>>>>> please let me know.
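For concreteness, here is a minimal sketch of the per-class drain
thread described above. struct bandwidth_group and skb_list follow
Michael's pseudo-types, and skb_list_splice_init() is a hypothetical
helper in the same spirit; only the kthread/cpumask calls are existing
kernel APIs.

#include <linux/kthread.h>
#include <linux/cpumask.h>
#include <linux/sched.h>

static int bwg_drain_thread(void *arg)
{
	struct bandwidth_group *bwg = arg;
	int cpu;

	while (!kthread_should_stop()) {
		/*
		 * Move everything queued for this class on any cpu onto
		 * the shared drain queue. Polling frequently keeps the
		 * reordering window between enqueue and aggregation small.
		 */
		for_each_possible_cpu(cpu)
			skb_list_splice_init(&bwg->queues[cpu], &bwg->drain);

		/* Rate limiting / dequeue of bwg->drain would run here. */

		cond_resched();
	}
	return 0;
}

A kthread_run() per class would start one such thread; how aggressively
it polls trades cpu time against the size of the reordering window.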
>>>>> The question I would have is how would this differ from using XPS
>>>>> w/ mqprio? Would this be a classful qdisc like HTB or a classless
>>>>> one like mqprio?
>>>>>
>>>>> From what I can tell XPS would be able to get you your per-cpu
>>>>> functionality; the benefit of it though would be that it would
>>>>> avoid out-of-order issues for sockets originating on the local
>>>>> system. The only thing I see as an issue right now is that the rate
>>>>> limiting with mqprio is assumed to be handled via hardware due to
>>>>> mechanisms such as DCB.
>>>>
>>>> I think one of the key points was in: "do busy polling from a
>>>> per-class thread to drain these queues."
>>>>
>>>> I mentioned this idea in the TX path of:
>>>>
>>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>>
>>> I think this is a bit different from that idea in that the busy
>>> polling is transferring packets from a per-cpu qdisc to a
>>> per-traffic-class queueing discipline. Basically it would be a
>>> busy_poll version of qdisc_run that would be transferring packets
>>> from one qdisc onto another instead of attempting to transmit them
>>> directly.
>>
>> The idea is to have the whole thing implemented as one classful qdisc
>> and remove the qdisc root lock since all the synchronization will be
>> handled internally (let's put aside for now that other
>> filters/actions/qdiscs might still require a root lock).
>>>
>>> What I think is tripping me up is that I don't think this is even
>>> meant to work with a multiqueue device. The description seems more
>>> like an mqprio implementation feeding into a prio qdisc which is
>>> then used for dequeue. To me it seems like this solution would be
>>> pulling complexity off of the enqueue side and just adding it to the
>>> dequeue side. The use of the "busy poll" is just to try to simplify
>>> things as the agent would then be consolidating traffic from
>>> multiple per-cpu queues onto one drain queue.
>>
>> We're essentially trying to spread the complexity from enqueue across
>> different stages such as enqueue/aggregation and rate
>> limiting/dequeue. Each stage will have a different degree of
>> parallelism. It should work with a multi-queue device since txq
>> selection can be the same as today. However, our concern is that
>> between enqueue and aggregation we have a small window which can let
>> packets go out of order, a sacrifice we make for better concurrency.
>>
> So OOO will happen when the application migrates between cpus,
> presumably? This is normally prevented with the skb ooo_okay flag,
> but it looks like you plan to violate this somehow. I think a design
> using ptr_rings/skb_arrays with bulk dequeue and a good concurrent
> token bucket ring would suffice and also not introduce OOO packets.
>
> But I don't completely understand your design, so I might be missing
> something.

I'm missing something here also. The check for ooo_okay is
straightforward to see if we can change the sk TX queue. Does the
design not use the TX queue any more?

Tom
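For reference, the ooo_okay check sits in the txq selection path;
roughly, from __netdev_pick_tx() in net/core/dev.c of kernels of that
era, the cached sk TX queue is reused unless the flow says reordering
is acceptable or the cache is stale:

static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	int queue_index = sk_tx_queue_get(sk);

	if (queue_index < 0 || skb->ooo_okay ||
	    queue_index >= dev->real_num_tx_queues) {
		int new_index = get_xps_queue(dev, skb);

		if (new_index < 0)
			new_index = skb_tx_hash(dev, skb);

		/* Cache the new queue only for full sockets with a dst,
		 * so later packets of the flow stay on one txq. */
		if (queue_index != new_index && sk &&
		    sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

		queue_index = new_index;
	}

	return queue_index;
}

A qdisc that queues per cpu rather than per txq would sidestep this
check, which seems to be the gap the question above points at.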
>>>
>>> Structure-wise this ends up looking not too different from mqprio;
>>> the main difference though would be that this would be a classful
>>> qdisc and that the virtual qdiscs we have for the traffic classes
>>> would need to be replaced with actual qdiscs for handling the
>>> "drain" aspect of this.
>>
>> Structure-wise it's similar to mqprio + a rate limiting qdisc without
>> the root lock, replacing txq/flow-level parallelism with cpu-level
>> parallelism. I'm actually not sure about the similarity with the busy
>> polling that Eric mentioned since I haven't read the slides yet.
>
> I pushed lockless qdisc patches again today and will repost when
> net-next opens. These plus a lockless version of tbf might be what
> you need. At one point I had a lockless tbf; I can probably dig that
> up as well if it's useful.
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg200244.html
>
> Thanks,
> John
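To illustrate the ptr_ring plus token bucket direction John suggests,
a sketch might look like the following. ptr_ring_produce() and
ptr_ring_consume() are the real <linux/ptr_ring.h> APIs; struct
tb_class and the helpers around it are hypothetical, and ring
setup/teardown is omitted.

#include <linux/ptr_ring.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/math64.h>
#include <linux/ktime.h>

struct tb_class {
	struct ptr_ring *rings;	/* nr_cpu_ids rings, one producer per cpu */
	s64 tokens;		/* bytes that may currently be sent */
	s64 burst;		/* cap on accumulated tokens */
	u64 rate;		/* refill rate in bytes per second */
	ktime_t last;		/* time of the last refill */
};

/* Producer path: no class-wide lock is taken, only the per-cpu ring's
 * producer lock, which is uncontended since each cpu uses its own ring. */
static int tb_enqueue(struct tb_class *c, struct sk_buff *skb)
{
	int cpu = get_cpu();
	int ret = ptr_ring_produce(&c->rings[cpu], skb);

	put_cpu();
	return ret;
}

/* One pass of the per-class service thread: refill the bucket, then
 * drain each cpu's ring while tokens last. */
static void tb_service_once(struct tb_class *c)
{
	ktime_t now = ktime_get();
	int cpu;

	c->tokens += div64_u64(c->rate * ktime_to_ns(ktime_sub(now, c->last)),
			       NSEC_PER_SEC);
	c->tokens = min(c->tokens, c->burst);
	c->last = now;

	for (cpu = 0; cpu < nr_cpu_ids; cpu++) {
		struct sk_buff *skb;

		while (c->tokens > 0 &&
		       (skb = ptr_ring_consume(&c->rings[cpu])) != NULL) {
			c->tokens -= skb->len;
			dev_queue_xmit(skb);	/* or feed the drain qdisc */
		}
	}
}

As in the thread above, ordering within a flow is preserved only as
long as the flow keeps producing on one cpu; task migration reopens
the same reordering window being discussed.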