Hi, sorry for the slow reply, I was out on PTO.

On 08/04/2022 15:34, Wan Junjie wrote:
On Sat, Apr 2, 2022 at 12:25 AM Kevin Traynor <[email protected]> wrote:

On 29/03/2022 04:26, Wan Junjie wrote:
Hi Kevin,

Thanks for your comments.

On Fri, Mar 25, 2022 at 8:17 PM Kevin Traynor <[email protected]> wrote:

Hi Wan Junjie,

Not a full code review, but comments on the feature below.

On 02/03/2022 10:59, Wan Junjie wrote:
       A pmd polls all of its rxqs with no weighting. When a pmd has one
       rxq from a phy port and several from vhu ports, and both rx and tx
       are under high load, the vhu rxqs get polled far more often in
       aggregate. The phy port rx then receives much less polling than the
       vhu ports and the tx/rx load becomes unbalanced. With traffic in
       both directions, rx is limited to a very low rate because the phy
       port is polled less.


That's an interesting observation.

       For example, the original poll list for each pmd looks like this:
       pmd 0  phy0_0 vhu0_0 vhu0_4 vhu0_8 vhu0_12
       pmd 1  phy0_1 vhu0_1 vhu0_5 vhu0_9 vhu0_13
       pmd 2  phy0_2 vhu0_2 vhu0_6 vhu0_10 vhu0_14
       pmd 3  phy0_3 vhu0_3 vhu0_7 vhu0_11 vhu0_15

       With traffic in both directions, rx is limited to 2 Mpps while
       tx reaches 9 Mpps.


Can you explain this a bit more?

Are you saying you have two independent paths?
- an ingress path of phy->ovs->vm, 2 Mpps
- a separate egress path of vm->ovs->phy, 9 Mpps


Yes. There are two independent paths.
The pps numbers' ratio is nearly the same as the queues' ratio of 1:4.


Don't forget that if you poll the nic rxq more often and increase the
ingress throughput, there will be less time for the pmd core to spend
polling and processing packets from the vm rxqs, so you will reduce the
egress throughput.

The main problem is fairness. In this case, reducing the egress
throughput is what we want.

The throughput might be more even between the interfaces, but it should
be a lot lower than the 9 Mpps.

The numbers are real. This is a VM2VM test: while receiving, we start a
high-rate stream on egress. The numbers show that the pps limit follows
the phy/vhost queue ratio.

If this were a PVP test, then egress would be limited by ingress, so it
should balance out there too.



       This patch provides an option to reinforce phy port polling.
       It adds a configuration option for rxq scheduling that tries to
       balance polling between phy and vhu ports: it increases the number
       of poll slots for a phy port and interlaces the phy rxqs and vhu
       rxqs in the poll list.

       Scaled rxq poll list:
       pmd 0  phy0_0 vhu0_0 phy0_0 vhu0_4 phy0_0 vhu0_8 phy0_0 vhu0_12
       pmd 1  phy0_1 vhu0_1 phy0_1 vhu0_5 phy0_1 vhu0_9 phy0_1 vhu0_13
       pmd 2  phy0_2 vhu0_2 phy0_2 vhu0_6 phy0_2 vhu0_10 phy0_2 vhu0_14
       pmd 3  phy0_3 vhu0_3 phy0_3 vhu0_7 phy0_3 vhu0_11 phy0_3 vhu0_15
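
       A simplified sketch of the interleaving idea is below (illustrative
       only, not the actual patch code): one phy rxq entry is re-inserted
       before each vhu rxq, so a phy rxq gets one poll slot per vhu rxq.

       #include <stddef.h>

       struct rxq_ref {
           const char *port_name;    /* e.g. "phy0" or "vhu0". */
           int qid;
       };

       /* Build the scaled poll list: re-insert a phy rxq before every
        * vhu rxq, round-robin over the phy rxqs if the pmd has more
        * than one. */
       static size_t
       build_scaled_poll_list(const struct rxq_ref *phy, size_t n_phy,
                              const struct rxq_ref *vhu, size_t n_vhu,
                              struct rxq_ref *out)
       {
           size_t n = 0;

           if (!n_phy) {
               /* No phy rxq on this pmd: keep the plain vhu list. */
               for (size_t i = 0; i < n_vhu; i++) {
                   out[n++] = vhu[i];
               }
               return n;
           }
           for (size_t i = 0; i < n_vhu; i++) {
               out[n++] = phy[i % n_phy];
               out[n++] = vhu[i];
           }
           return n;
       }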


This looks like a custom scheme that might only suit a very limited
config, but it is not clear what that config needs to be, or how it
should work with a different one. The scheme should be more generic,
with clearly defined behaviour for any other config
(ports/rxqs/cores/pinning etc.), or it should be forbidden for other
configs.

I did some testing; after changing a few config items it stopped polling
several rxqs, which results in no traffic being passed. My testing notes
are below [0].

You mentioned in the commit message above that rxq weights were missing,
and it sounded like allowing the user to set port weights would be a
more generic version of what you are proposing. However, weights would
only be relevant between the rxqs on a particular pmd, not between rxqs
on different pmds, so I'm not sure it would work.

thanks,
Kevin.


Thanks for your detailed tests. Some tests did fail. In this patch I did
not take multiple queues on the phy port into account, and more than one
phy port does not work yet. If this patch's approach is worth
continuing, I'll fix it in v2.


I think it needs a fuller definition of what this feature is, the
benefit, and how it scales to different configs before it is worth
spending the time to write and test a v2. At the moment it's not clear
to me.

The main idea is fairness between the ports rather than between the
queues, especially when the ports play different roles.
This can benefit at least three cases.
First, suppose we have one phy port and several VMs on a host. A VM
usually has multiple queues, so in an extreme case there could be
hundreds of VM queues on one pmd alongside a single phy port queue. The
phy port then gets polled less and less as we create more and more VMs,
and its ingress throughput decays as the number of VMs grows. If the
polling of the phy port scales instead, its ingress will not decay.
Second, consider the traffic of all the VMs. If one VM has more queues
than the others on one pmd, it consumes more cpu cycles, so a VM with 2
queues on that pmd performs better than a VM with 1 queue. But if one of
them transmits at a high rate, the receive side of the other is
affected.
Third, for now we can only boost phy port polling by setting n_rxq to a
bigger number. The setting in this patch keeps the original number of
phy port queues on each pmd while increasing how often they are polled,
which helps the performance of an elephant flow.


I see your point. But a key thing is that those situations only arise when you are in an overloaded/under-resourced condition. As long as all rxqs/ports can have all their packets processed in time with no drops, it can be considered fair for every port.

So the case you are talking about only arises in an overload condition, in which some packets from some interface are going to be dropped in every case. That indicates that at least some PMD core resource is not sufficient, and dropped packets are a much bigger issue than fairness IMO. So that is the main thing to correct for the setup, either through more PMD cores or ALB (PMD auto load balance).

I agree that in the overloaded state some ports may be impacted more than others. I'm not sure how that can be avoided when the PMD resources are split independently between rxqs and there is no restriction on which rxqs are polled by a PMD (other than NUMA).

If you really wanted to do what you are suggesting, I think you would have to centralise it and have some form of rxq-to-PMD-core scheduling / poll order adjustment algorithm that caters for this case.

But that would conflict with the current schemes, which try to balance the load, so you could end up hitting the overload condition (dropping packets) sooner than you would with the existing schemes.

Another point is that just because an rxq exists, it does not mean it will have traffic. That depends on NIC RSS or the VM. A polling-order type scheme would not account for that.

You could also consider some type of Rate Limiting [0] on the ports for fairness during overload, but that would be hard to tune and correlate with PMD resource usage.
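
For reference, the egress-policer example from that doc looks roughly
like the below; the port name and the cir/cbs values are just
placeholders to adapt:

    ovs-vsctl set port vhost-user0 qos=@newqos -- \
        --id=@newqos create qos type=egress-policer other-config:cir=46000000 \
        other-config:cbs=2048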

thanks,
Kevin.

[0] https://docs.openvswitch.org/en/latest/topics/dpdk/qos/

And yes, I agree that there should be a more generic scheme.
Weights would only be effective between queues on the same pmd, and I
think that is the right target for the case of one phy port. No matter
how many vhu ports or phy queues we have on one pmd, the weight setting
should act as a limit when there are flows in both directions. If there
is only traffic in one direction, the setting won't be a bottleneck
either.


You are assuming that the phy rxqs will be distributed evenly over pmd
cores. The only limitations OVS has in assigning those phy rxqs are
NUMA (currently) and isolated pmds.

So, as you say, the weight/prio would only be relative to a pmd, but all
high prio rxqs could end up on the same pmd core. That would mean that
from a user perspective they have set weight/prio on all interfaces but
it is only being respected by OVS for a random group of rxqs.

The weight coefficient would be respected per pmd. If a pmd has one or
more queues of a phy port, the weight would be calculated from the vhu
queues that pmd has. From the user perspective, the isolated or group
policy would still be respected, just with different coefficients.
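
Roughly what I have in mind is the sketch below (illustrative only, not
code from this patch): the extra poll weight for the phy rxqs on a pmd
is derived from the number of vhu rxqs that the same pmd carries.

    #include <stddef.h>

    /* Illustrative only: per-pmd poll weight for phy rxqs, derived from
     * the number of vhu rxqs assigned to the same pmd.  With 1 phy rxq
     * and 4 vhu rxqs this returns 4, i.e. the phy rxq is given 4 poll
     * slots per pass over the poll list. */
    static unsigned int
    phy_poll_weight(size_t n_phy_rxqs, size_t n_vhu_rxqs)
    {
        if (n_phy_rxqs == 0 || n_vhu_rxqs == 0) {
            return 1;                 /* Nothing to balance against. */
        }
        unsigned int w = n_vhu_rxqs / n_phy_rxqs;
        return w ? w : 1;             /* Never poll less than once. */
    }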

You might be able to implement something like what you are describing in
a limited case if you took action like manually pinning one high
weight/prio rxq per pmd core with non-isolated pinning and allowing OVS
to assign the other low weight/prio rxqs. The high weight/prio rxqs
would then be on different cores and could be polled more often than the
lower prio rxqs.
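
For example (the interface name and core id below are just
placeholders), something along the lines of:

    ovs-vsctl set Interface phy0 other_config:pmd-rxq-affinity="0:8" \
              other_config:pmd-rxq-isolate=false

would pin phy0 rxq 0 to the pmd on core 8 without isolating that pmd, so
OVS could still assign other (lower prio) rxqs to it.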

Another option would be to add a new assignment method based on weight
and incorporate a scale option for high prio rxqs, but that would be at
the expense of basing it on the processing cycles required, so it could
lead to bottlenecks on some pmd cores.

These are just some thoughts about how you could achieve something close
to what you have implemented, but you still have issues with scale in
either of those cases. However, above all I think it needs a clearer
definition of the benefit etc. before considering how it could be done.

Thanks for your comments. I will reconsider it.

If we have more than one phy port, we cannot predict the distribution of
the load, so I don't think this is the right solution for that case.
Maybe we should forbid the setting in that case, or make the setting per
phy port instead. What's your opinion on this case?


I think it is very common to have multiple DPDK phy nic ports.

Then we should probably take all phy ports into the scaling list.
HTH,
Kevin.

Regards,
Wan Junjie

       To enable it, run:
       'ovs-vsctl set open . other_config:pmd-rxq-schedule=scaling'
       To disable it, remove the setting or set it to 'single'.
       It works fairly well when the n_rxq of the dpdk phy port equals the
       number of dpdk pmds, i.e. one phy rxq per pmd.

Signed-off-by: Wan Junjie <[email protected]>
Reviewed-by: He Peng <[email protected]>
---
    lib/dpif-netdev.c | 133 ++++++++++++++++++++++++++++++++++++++++++----
    1 file changed, 122 insertions(+), 11 deletions(-)


[0]
- enable 'scaling' and add myport dpdk phy nic and vhost port

pmd thread numa_id 0 core_id 8:
     isolated : false
     port: dpdkvhost0        queue-id:  0 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  1 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  2 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  3 (enabled)   pmd usage:  0 %
     port: myport            queue-id:  0 (rescaled)  pmd usage:  0 %


dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 1 rxq dpdkvhost0 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 2 rxq myport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 3 rxq dpdkvhost0 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 4 rxq myport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 5 rxq dpdkvhost0 3
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 6 rxq myport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 7 rxq dpdkvhost0 0


- Add another phy nic

pmd thread numa_id 0 core_id 8:
     isolated : false
     port: dpdkvhost0        queue-id:  0 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  1 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  2 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  3 (enabled)   pmd usage:  0 %
     port: myport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  0 (rescaled)  pmd usage:  0 %

dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 0 rxq urport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 1 rxq dpdkvhost0 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 2 rxq myport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 3 rxq dpdkvhost0 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 4 rxq urport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 5 rxq dpdkvhost0 3
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 6 rxq myport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 7 rxq dpdkvhost0 0


- change number of rxqs for myport, n_rxq=4

pmd thread numa_id 0 core_id 8:
     isolated : false
     port: dpdkvhost0        queue-id:  0 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  1 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  2 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  3 (enabled)   pmd usage:  0 %
     port: myport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  3 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  0 (rescaled)  pmd usage:  0 %


dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 0 rxq myport 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 1 rxq dpdkvhost0 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 2 rxq urport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 3 rxq dpdkvhost0 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 4 rxq myport 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 5 rxq dpdkvhost0 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 6 rxq myport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 7 rxq dpdkvhost0 3

*queue 3 for myport is not polled


- change number of rxq for urport, n_rxq=4

pmd thread numa_id 0 core_id 8:
     isolated : false
     port: dpdkvhost0        queue-id:  0 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  1 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  2 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  3 (enabled)   pmd usage:  0 %
     port: myport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  3 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  3 (rescaled)  pmd usage:  0 %


dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 0 rxq urport 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 1 rxq dpdkvhost0 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 2 rxq myport 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 3 rxq dpdkvhost0 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 4 rxq urport 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 5 rxq dpdkvhost0 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 6 rxq urport 3
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 7 rxq dpdkvhost0 3

*myport queues 0,2,3 not polled, urport queue 0 not polled


- shutdown vm

pmd thread numa_id 0 core_id 8:
     isolated : false
     port: dpdkvhost0        queue-id:  0 (enabled)   pmd usage:  0 %
     port: myport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  3 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  3 (rescaled)  pmd usage:  0 %


dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 0 rxq urport 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 1 rxq myport 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 2 rxq urport 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 3 rxq urport 3
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 4 rxq urport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 5 rxq myport 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 6 rxq myport 0
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 7 rxq myport 3

*dpdkvhost0 no longer polled

- change vm to 1 rxq and start vm

pmd thread numa_id 0 core_id 8:
     isolated : false
     port: dpdkvhost0        queue-id:  0 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  1 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  2 (enabled)   pmd usage:  0 %
     port: myport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  3 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  3 (rescaled)  pmd usage:  0 %


No debug log shown, as the vm rxq count is the same as when the vm was
shut down.

*rxq for vhost not polled


- change vm to 3 rxq

pmd thread numa_id 0 core_id 8:
     isolated : false
     port: dpdkvhost0        queue-id:  0 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  1 (enabled)   pmd usage:  0 %
     port: dpdkvhost0        queue-id:  2 (enabled)   pmd usage:  0 %
     port: myport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  3 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: urport            queue-id:  3 (rescaled)  pmd usage:  0 %


dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 0 rxq urport 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 1 rxq dpdkvhost0 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 2 rxq myport 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 3 rxq dpdkvhost0 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 4 rxq urport 2
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 5 rxq dpdkvhost0 0


*myport queues 0,2,3 not polled, urport queues 0,3 not polled

- Remove urport

pmd thread numa_id 0 core_id 8:
     isolated : false
     port: dpdkvhost0        queue-id:  0 (enabled)   pmd usage:  0 %
     port: myport            queue-id:  0 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  1 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  2 (rescaled)  pmd usage:  0 %
     port: myport            queue-id:  3 (rescaled)  pmd usage:  0 %


dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 0 rxq myport 1
dpif_netdev(pmd-c08/id:8)|DBG|PMD 8: 1 rxq dpdkvhost0 0

*myport queues 0,2,3 not being polled

Maybe just a debug logging issue? Checked with gdb:

(gdb) p poll_cnt
$31 = 2
(gdb) p poll_list[0]->rxq->rx->netdev->name
$32 = 0x3bfade0 "myport"
(gdb) p poll_list[0]->rxq->rx->queue_id
$33 = 1
(gdb) p poll_list[1]->rxq->rx->netdev->name
$34 = 0x3bd6ce0 "dpdkvhost0"
(gdb) p poll_list[1]->rxq->rx->queue_id
$35 = 0




