On Fri, Apr 12, 2019 at 2:12 PM Ilya Maximets <[email protected]>
wrote:
> On 11.04.2019 17:00, David Marchand wrote:
> > We currently poll all available queues based on the max queue count
> > exchanged with the vhost peer and rely on the vhost library in DPDK to
> > check the vring status beneath.
> > This can lead to some overhead when we have a lot of unused queues.
> >
> > To improve the situation, we can skip the disabled queues.
> > On rxq state notifications, we bump the netdev's change_seq number so
> > that the pmd thread main loop can periodically refresh its cached
> > queue states.
> >
> > $ ovs-appctl dpif-netdev/pmd-rxq-show
> > pmd thread numa_id 0 core_id 1:
> > isolated : true
> > port: dpdk0 queue-id: 0 pmd usage: 0 % polling: enabled
> > pmd thread numa_id 0 core_id 2:
> > isolated : true
> > port: vhost1 queue-id: 0 pmd usage: 0 % polling: disabled
> > port: vhost3 queue-id: 0 pmd usage: 0 % polling: disabled
> > pmd thread numa_id 0 core_id 15:
> > isolated : true
> > port: dpdk1 queue-id: 0 pmd usage: 0 % polling: enabled
> > pmd thread numa_id 0 core_id 16:
> > isolated : true
> > port: vhost0 queue-id: 0 pmd usage: 0 % polling: disabled
> > port: vhost2 queue-id: 0 pmd usage: 0 % polling: disabled
> >
> > $ while true; do
> > ovs-appctl dpif-netdev/pmd-rxq-show |awk '
> > /polling: / {
> > tot++;
> > if ($NF == "enabled") {
> > en++;
> > }
> > }
> > END {
> > print "total: " tot ", enabled: " en
> > }'
> > sleep 1
> > done
> >
> > total: 6, enabled: 2
> > total: 6, enabled: 2
> > ...
> >
> > # Started vm, virtio devices are bound to kernel driver which enables
> > # F_MQ + all queue pairs
> > total: 6, enabled: 2
> > total: 66, enabled: 66
> > ...
> >
> > # Unbound vhost0 and vhost1 from the kernel driver
> > total: 66, enabled: 66
> > total: 66, enabled: 34
> > ...
> >
> > # Configured kernel bound devices to use only 1 queue pair
> > total: 66, enabled: 34
> > total: 66, enabled: 19
> > total: 66, enabled: 4
> > ...
> >
> > # While rebooting the vm
> > total: 66, enabled: 4
> > total: 66, enabled: 2
> > ...
> > total: 66, enabled: 66
> > ...
> >
> > # After shutting down the vm
> > total: 66, enabled: 66
> > total: 66, enabled: 2
> >
> > Signed-off-by: David Marchand <[email protected]>
> > ---
>
> Hi.
> One important issue:
> You need to enable queue #0 in case there were no state changes.
> We're doing this for Tx queues in 'dpdk_vhost_reconfigure_helper'.
> This is because, if F_PROTOCOL_FEATURES is not negotiated, vrings are
> created in the enabled state. This is required to support older QEMU or
> legacy drivers.
>
Ah, OK, that explains the txq0 special treatment.
Thanks.
> I'm working on a DPDK patch to trigger 'vring_state_changed' in this
> case; however, it could take a long time or might never be accepted
> into LTS.
> I'd also prefer not to check for negotiated features in OVS code.
> I think this kind of stuff should be hidden from the application by
> the vhost library.
>
Ok, we will have to keep this workaround anyway.
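For v2, I will mirror the existing txq0 default in
'dpdk_vhost_reconfigure_helper'. A minimal sketch, using the naming from
this patch (the final form may differ):

    /* If F_PROTOCOL_FEATURES was not negotiated, no vring_state_changed()
     * callback ever fires, so keep the first rx queue pollable by default,
     * same reasoning as the existing txq0 treatment. */
    dev->rx_q[0].enabled = true;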
>
> More comments inline.
>
> Also, you might want to add Ian to the Cc list when sending userspace
> datapath related patches.
>
I dropped the previous Cc: list by mistake.
Sure, I will keep him in Cc.
> > lib/dpif-netdev.c | 27 ++++++++++++++++++++
> > lib/netdev-dpdk.c | 71 +++++++++++++++++++++++++++++++++++++++++----------
> > lib/netdev-provider.h | 5 ++++
> > lib/netdev.c | 10 ++++++++
> > lib/netdev.h | 1 +
> > 5 files changed, 101 insertions(+), 13 deletions(-)
> >
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 4d6d0c3..c95612f 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -591,6 +591,8 @@ struct polled_queue {
> > struct dp_netdev_rxq *rxq;
> > odp_port_t port_no;
> > bool emc_enabled;
> > + bool enabled;
> > + uint64_t change_seq;
> > };
> >
> > /* Contained by struct dp_netdev_pmd_thread's 'poll_list' member. */
> > @@ -1171,6 +1173,9 @@ pmd_info_show_rxq(struct ds *reply, struct dp_netdev_pmd_thread *pmd)
> > } else {
> > ds_put_format(reply, "%s", "NOT AVAIL");
> > }
> > + ds_put_format(reply, " polling: %s",
> > + (netdev_rxq_enabled(list[i].rxq->rx))
> > + ? "enabled" : "disabled");
> > ds_put_cstr(reply, "\n");
> > }
> > ovs_mutex_unlock(&pmd->port_mutex);
> > @@ -5198,6 +5203,11 @@ dpif_netdev_run(struct dpif *dpif)
> > }
> >
> > for (i = 0; i < port->n_rxq; i++) {
> > +
> > + if (!netdev_rxq_enabled(port->rxqs[i].rx)) {
> > + continue;
> > + }
> > +
> > if (dp_netdev_process_rxq_port(non_pmd,
> > &port->rxqs[i],
> > port->port_no)) {
> > @@ -5371,6 +5381,9 @@ pmd_load_queues_and_ports(struct dp_netdev_pmd_thread *pmd,
> > poll_list[i].rxq = poll->rxq;
> > poll_list[i].port_no = poll->rxq->port->port_no;
> > poll_list[i].emc_enabled = poll->rxq->port->emc_enabled;
> > + poll_list[i].enabled = netdev_rxq_enabled(poll->rxq->rx);
> > + poll_list[i].change_seq =
> > + netdev_get_change_seq(poll->rxq->port->netdev);
> > i++;
> > }
> >
> > @@ -5436,6 +5449,10 @@ reload:
> >
> > for (i = 0; i < poll_cnt; i++) {
> >
> > + if (!poll_list[i].enabled) {
> > + continue;
> > + }
> > +
> > if (poll_list[i].emc_enabled) {
> > atomic_read_relaxed(&pmd->dp->emc_insert_min,
> > &pmd->ctx.emc_insert_min);
> > @@ -5472,6 +5489,16 @@ reload:
> > if (reload) {
> > break;
> > }
> > +
> > + for (i = 0; i < poll_cnt; i++) {
> > + uint64_t current_seq =
> > +                netdev_get_change_seq(poll_list[i].rxq->port->netdev);
> > + if (poll_list[i].change_seq != current_seq) {
> > + poll_list[i].change_seq = current_seq;
> > + poll_list[i].enabled =
> > +                netdev_rxq_enabled(poll_list[i].rxq->rx);
> > + }
> > + }
> > }
> > pmd_perf_end_iteration(s, rx_packets, tx_packets,
> > pmd_perf_metrics_enabled(pmd));
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index 47153dc..9088ff4 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -317,6 +317,10 @@ struct dpdk_mp {
> > struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
> > };
> >
> > +struct dpdk_rx_queue {
> > + bool enabled;
> > +};
>
> I'm wondering if we need to mimic tx_q here.
> The rxq enabled state is vhost specific. We could use an array called
> 'vhost_rxq_enabled' and drop the structure. Freeing this array should
> happen in '*vhost_destruct', not in 'common_destruct'. Clearing could
> be done with a 'memset' in this case.
>
> What do you think?
>
I had something similar in my first attempt and dropped it, thinking I
would align with tx_q.
But indeed, tx_q is generic, while rx_q is really vhost specific.
We have no reason for a struct at the moment, so let's go with a simple
array.
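Roughly what I have in mind for v2, as a sketch only (exact comments and
placement still to be polished):

    /* In struct netdev_dpdk, next to 'tx_q': per-queue vhost rx enabled
     * state, updated from the vring_state_changed() callback. */
    bool *vhost_rxq_enabled;

    /* In vhost_common_construct(): */
    dev->vhost_rxq_enabled = dpdk_rte_mzalloc(OVS_VHOST_MAX_QUEUE_NUM *
                                              sizeof *dev->vhost_rxq_enabled);
    if (!dev->vhost_rxq_enabled) {
        return ENOMEM;
    }

    /* In netdev_dpdk_vhost_destruct(), before common_destruct(): */
    rte_free(dev->vhost_rxq_enabled);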
> > +
> > /* There should be one 'struct dpdk_tx_queue' created for
> > * each cpu core. */
> > struct dpdk_tx_queue {
> > @@ -424,6 +428,8 @@ struct netdev_dpdk {
> > OVSRCU_TYPE(struct ingress_policer *) ingress_policer;
> > uint32_t policer_rate;
> > uint32_t policer_burst;
> > +
> > + struct dpdk_rx_queue *rx_q;
>
> Comment would be nice here.
>
Sure.
> > );
> >
> > PADDED_MEMBERS(CACHE_LINE_SIZE,
> > @@ -1109,6 +1115,12 @@ netdev_dpdk_alloc(void)
> > return NULL;
> > }
> >
> > +static struct dpdk_rx_queue *
> > +netdev_dpdk_alloc_rxq(unsigned int n_rxqs)
> > +{
> > + return dpdk_rte_mzalloc(n_rxqs * sizeof(struct dpdk_rx_queue));
> > +}
> > +
> > static struct dpdk_tx_queue *
> > netdev_dpdk_alloc_txq(unsigned int n_txqs)
> > {
> > @@ -1235,6 +1247,10 @@ vhost_common_construct(struct netdev *netdev)
> > int socket_id = rte_lcore_to_socket_id(rte_get_master_lcore());
> > struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> >
> > + dev->rx_q = netdev_dpdk_alloc_rxq(OVS_VHOST_MAX_QUEUE_NUM);
>
> Do we need a function for this? Allocating in place, you'll be able to
> use the more OVS-style sizeof call:
>
>     dev->rx_q = dpdk_rte_mzalloc(OVS_VHOST_MAX_QUEUE_NUM
>                                  * sizeof *dev->rx_q);
>
Ok.
> > + if (!dev->rx_q) {
> > + return ENOMEM;
> > + }
> > dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
> > if (!dev->tx_q) {
>
> Need to free 'rx_q' before return.
>
Indeed.
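Will fix the error path, along these lines (sketch, reusing the array
name discussed above):

    dev->tx_q = netdev_dpdk_alloc_txq(OVS_VHOST_MAX_QUEUE_NUM);
    if (!dev->tx_q) {
        /* Do not leak the rx state array when the txq allocation fails. */
        rte_free(dev->vhost_rxq_enabled);
        return ENOMEM;
    }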
> > return ENOMEM;
> > @@ -1354,6 +1370,7 @@ common_destruct(struct netdev_dpdk *dev)
> > OVS_REQUIRES(dpdk_mutex)
> > OVS_EXCLUDED(dev->mutex)
> > {
> > + rte_free(dev->rx_q);
> > rte_free(dev->tx_q);
> > dpdk_mp_put(dev->dpdk_mp);
> >
> > @@ -2201,6 +2218,14 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> > }
> >
> > static int
> > +netdev_dpdk_vhost_rxq_enabled(struct netdev_rxq *rxq)
> > +{
> > + struct netdev_dpdk *dev = netdev_dpdk_cast(rxq->netdev);
> > +
> > + return dev->rx_q[rxq->queue_id].enabled;
> > +}
> > +
> > +static int
> > netdev_dpdk_rxq_recv(struct netdev_rxq *rxq, struct dp_packet_batch *batch,
> > int *qfill)
> > {
> > @@ -3531,6 +3556,17 @@ new_device(int vid)
> >
> > /* Clears mapping for all available queues of vhost interface. */
> > static void
> > +netdev_dpdk_rxq_map_clear(struct netdev_dpdk *dev)
>
> This isn't a 'map' actually.
>
I'll go with a simple memset as you suggested.
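i.e. roughly this in destroy_device(), assuming the bool array discussed
above (sketch):

    /* The device is gone, so no queue will receive traffic anymore;
     * a single memset replaces the per-queue loop. */
    memset(dev->vhost_rxq_enabled, 0,
           dev->up.n_rxq * sizeof *dev->vhost_rxq_enabled);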
> > + OVS_REQUIRES(dev->mutex)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < dev->up.n_rxq; i++) {
> > + dev->rx_q[i].enabled = false;
> > + }
> > +}
> > +
> > +static void
> > netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
> > OVS_REQUIRES(dev->mutex)
> > {
> > @@ -3563,6 +3599,7 @@ destroy_device(int vid)
> > ovs_mutex_lock(&dev->mutex);
> > dev->vhost_reconfigured = false;
> > ovsrcu_index_set(&dev->vid, -1);
> > + netdev_dpdk_rxq_map_clear(dev);
> > netdev_dpdk_txq_map_clear(dev);
> >
> > netdev_change_seq_changed(&dev->up);
> > @@ -3597,24 +3634,30 @@ vring_state_changed(int vid, uint16_t queue_id, int enable)
> > struct netdev_dpdk *dev;
> > bool exists = false;
> > int qid = queue_id / VIRTIO_QNUM;
> > + bool is_rx = (queue_id % VIRTIO_QNUM) == VIRTIO_TXQ;
> > char ifname[IF_NAME_SZ];
> >
> > rte_vhost_get_ifname(vid, ifname, sizeof ifname);
> >
> > - if (queue_id % VIRTIO_QNUM == VIRTIO_TXQ) {
> > - return 0;
> > - }
> > -
> > ovs_mutex_lock(&dpdk_mutex);
> > LIST_FOR_EACH (dev, list_node, &dpdk_list) {
> > ovs_mutex_lock(&dev->mutex);
> > if (nullable_string_is_equal(ifname, dev->vhost_id)) {
> > - if (enable) {
> > - dev->tx_q[qid].map = qid;
> > + if (is_rx) {
> > + bool enabled = dev->rx_q[qid].enabled;
> > +
> > + dev->rx_q[qid].enabled = enable != 0;
> > + if (enabled ^ dev->rx_q[qid].enabled) {
>
> Bit operations on a boolean don't seem like good practice. It's better
> to just use '!='.
>
Ok.
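Will rewrite that part like this (sketch, with the array rename applied):

    if (is_rx) {
        bool old_state = dev->vhost_rxq_enabled[qid];

        dev->vhost_rxq_enabled[qid] = enable != 0;
        if (old_state != dev->vhost_rxq_enabled[qid]) {
            /* Let the pmd threads refresh their cached queue states. */
            netdev_change_seq_changed(&dev->up);
        }
    }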
> > + netdev_change_seq_changed(&dev->up);
> > + }
> > } else {
> > - dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
> > + if (enable) {
> > + dev->tx_q[qid].map = qid;
> > + } else {
> > + dev->tx_q[qid].map = OVS_VHOST_QUEUE_DISABLED;
> > + }
> > + netdev_dpdk_remap_txqs(dev);
> > }
> > - netdev_dpdk_remap_txqs(dev);
> > exists = true;
> > ovs_mutex_unlock(&dev->mutex);
> > break;
> > @@ -3624,9 +3667,9 @@ vring_state_changed(int vid, uint16_t queue_id, int enable)
> > ovs_mutex_unlock(&dpdk_mutex);
> >
> > if (exists) {
> > -        VLOG_INFO("State of queue %d ( tx_qid %d ) of vhost device '%s' "
> > - "changed to \'%s\'", queue_id, qid, ifname,
> > - (enable == 1) ? "enabled" : "disabled");
> > +        VLOG_INFO("State of queue %d ( %s_qid %d ) of vhost device '%s' "
> > +                  "changed to \'%s\'", queue_id, is_rx == true ? "rx" : "tx",
> > + qid, ifname, (enable == 1) ? "enabled" : "disabled");
> > } else {
> > VLOG_INFO("vHost Device '%s' not found", ifname);
> > return -1;
> > @@ -4297,7 +4340,8 @@ static const struct netdev_class dpdk_vhost_class = {
> > .get_stats = netdev_dpdk_vhost_get_stats,
> > .get_status = netdev_dpdk_vhost_user_get_status,
> > .reconfigure = netdev_dpdk_vhost_reconfigure,
> > - .rxq_recv = netdev_dpdk_vhost_rxq_recv
> > + .rxq_recv = netdev_dpdk_vhost_rxq_recv,
> > + .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
> > };
> >
> > static const struct netdev_class dpdk_vhost_client_class = {
> > @@ -4311,7 +4355,8 @@ static const struct netdev_class dpdk_vhost_client_class = {
> > .get_stats = netdev_dpdk_vhost_get_stats,
> > .get_status = netdev_dpdk_vhost_user_get_status,
> > .reconfigure = netdev_dpdk_vhost_client_reconfigure,
> > - .rxq_recv = netdev_dpdk_vhost_rxq_recv
> > + .rxq_recv = netdev_dpdk_vhost_rxq_recv,
> > + .rxq_enabled = netdev_dpdk_vhost_rxq_enabled,
> > };
> >
> > void
> > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> > index fb0c27e..39354db 100644
> > --- a/lib/netdev-provider.h
> > +++ b/lib/netdev-provider.h
> > @@ -789,6 +789,11 @@ struct netdev_class {
> > void (*rxq_destruct)(struct netdev_rxq *);
> > void (*rxq_dealloc)(struct netdev_rxq *);
> >
> > +    /* A netdev can report if a queue won't get traffic and should be excluded
> > +     * from polling (no callback implicitly means that the queue is enabled).
> > +     */
> > + int (*rxq_enabled)(struct netdev_rxq *);
> > +
> > /* Attempts to receive a batch of packets from 'rx'. In 'batch', the
> > * caller supplies 'packets' as the pointer to the beginning of an array
> > * of NETDEV_MAX_BURST pointers to dp_packet. If successful, the
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index 7d7ecf6..88db6c7 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -682,6 +682,16 @@ netdev_rxq_close(struct netdev_rxq *rx)
> > }
> > }
> >
> > +int netdev_rxq_enabled(struct netdev_rxq *rx)
> > +{
> > + bool enabled = true;
> > +
> > + if (rx->netdev->netdev_class->rxq_enabled) {
> > + enabled = rx->netdev->netdev_class->rxq_enabled(rx);
> > + }
> > + return enabled;
>
> You're mixing and converting 'int' and 'bool' all the time.
> I think it's better to change the function prototype to return 'bool'.
>
Yes.
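Will switch to bool end to end. A sketch of what v2 could look like
(provider callback, header and helper; the vhost implementation would
return bool accordingly):

    /* lib/netdev-provider.h */
    bool (*rxq_enabled)(struct netdev_rxq *);

    /* lib/netdev.h */
    bool netdev_rxq_enabled(struct netdev_rxq *);

    /* lib/netdev.c */
    bool
    netdev_rxq_enabled(struct netdev_rxq *rx)
    {
        bool enabled = true;

        if (rx->netdev->netdev_class->rxq_enabled) {
            enabled = rx->netdev->netdev_class->rxq_enabled(rx);
        }
        return enabled;
    }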
> > +}
> > +
> > /* Attempts to receive a batch of packets from 'rx'. 'batch' should point to
> > * the beginning of an array of NETDEV_MAX_BURST pointers to dp_packet. If
> > * successful, this function stores pointers to up to NETDEV_MAX_BURST
> > diff --git a/lib/netdev.h b/lib/netdev.h
> > index d94817f..859f5ef 100644
> > --- a/lib/netdev.h
> > +++ b/lib/netdev.h
> > @@ -183,6 +183,7 @@ enum netdev_pt_mode netdev_get_pt_mode(const struct netdev *);
> > /* Packet reception. */
> > int netdev_rxq_open(struct netdev *, struct netdev_rxq **, int id);
> > void netdev_rxq_close(struct netdev_rxq *);
> > +int netdev_rxq_enabled(struct netdev_rxq *);
> >
> > const char *netdev_rxq_get_name(const struct netdev_rxq *);
> > int netdev_rxq_get_queue_id(const struct netdev_rxq *);
> >
>
Thanks a lot for the review, I will send a v2 on Monday if all goes well.
--
David Marchand
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev