Re: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.

Stokes, Ian Mon, 10 Jul 2017 04:02:48 -0700

> On 10.07.2017 13:42, O Mahony, Billy wrote:
> >
> >
> >> -----Original Message-----
> >> From: Stokes, Ian
> >> Sent: Monday, July 10, 2017 10:41 AM
> >> To: Ilya Maximets <i.maxim...@samsung.com>; O Mahony, Billy
> >> <billy.o.mah...@intel.com>; d...@openvswitch.org
> >> Cc: db...@vmare.com
> >> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds
> >> on non- local numa node.
> >>
> >>> On 08.07.2017 22:09, Stokes, Ian wrote:
> >>>>> Previously if there is no available (non-isolated) pmd on the numa
> >>>>> node for a port then the port is not polled at all. This can
> >>>>> result in a non- operational system until such time as nics are
> >>>>> physically repositioned. It is preferable to operate with a pmd on
> the 'wrong'
> >>>>> numa node albeit with lower performance. Local pmds are still
> >>>>> chosen
> >>> when available.
> >>>>>
> >>>>> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> >>>>> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
> >>>>> Co-authored-by: Ilya Maximets <i.maxim...@samsung.com>
> >>>>> ---
> >>>>> v9: v8 missed some comments on v7
> >>>>> v8: Some coding style issues; doc tweak
> >>>>> v7: Incorporate review comments on docs and implementation
> >>>>> v6: Change 'port' to 'queue' in a warning msg
> >>>>> v5: Fix warning msg; Update same in docs
> >>>>> v4: Fix a checkpatch error
> >>>>> v3: Fix warning messages not appearing when using multiqueue
> >>>>> v2: Add details of warning messages into docs
> >>>>>
> >>>>>  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
> >>>>>  lib/dpif-netdev.c                    | 41
> >>>>> +++++++++++++++++++++++++++++++++---
> >>>>>  2 files changed, 56 insertions(+), 6 deletions(-)
> >>>>>
> >>>>> diff --git a/Documentation/intro/install/dpdk.rst
> >>>>> b/Documentation/intro/install/dpdk.rst
> >>>>> index e83f852..89775d6 100644
> >>>>> --- a/Documentation/intro/install/dpdk.rst
> >>>>> +++ b/Documentation/intro/install/dpdk.rst
> >>>>> @@ -449,7 +449,7 @@ affinitized accordingly.
> >>>>>
> >>>>>    A poll mode driver (pmd) thread handles the I/O of all DPDK
> >>> interfaces
> >>>>>    assigned to it. A pmd thread shall poll the ports for incoming
> >>>>> packets,
> >>>>> -  switch the packets and send to tx port.  pmd thread is CPU
> >>>>> bound, and needs
> >>>>> +  switch the packets and send to tx port.  A pmd thread is CPU
> >>>>> + bound, and needs
> >>>>>    to be affinitized to isolated cores for optimum performance.
> >>>>>
> >>>>>    By setting a bit in the mask, a pmd thread is created and
> >>>>> pinned to the @@ -458,8 +458,23 @@ affinitized accordingly.
> >>>>>        $ ovs-vsctl set Open_vSwitch .
> >>>>> other_config:pmd-cpu-mask=0x4
> >>>>>
> >>>>>    .. note::
> >>>>> -    pmd thread on a NUMA node is only created if there is at least
> one
> >>>>> DPDK
> >>>>> -    interface from that NUMA node added to OVS.
> >>>>> +    A pmd thread on a NUMA node is only created if there is at
> >>>>> + least one
> >>>>> DPDK
> >>>>> +    interface from that NUMA node added to OVS.  A pmd thread is
> >>>>> + created
> >>>>> by
> >>>>> +    default on a core of a NUMA node or when a specified
> >>>>> + pmd-cpu-mask
> >>> has
> >>>>> +    indicated so.  Even though a PMD thread may exist, the thread
> >>>>> + only
> >>>>> starts
> >>>>> +    consuming CPU cycles if there is least one receive queue
> >>>>> + assigned
> >>> to
> >>>>> +    the pmd.
> >>>>> +
> >>>>> +  .. note::
> >>>>> +    On NUMA systems PCI devices are also local to a NUMA node.
> >>>>> + Unbound
> >>>>> rx
> >>>>> +    queues for a PCI device will assigned to a pmd on it's local
> >>>>> + NUMA
> >>>>
> >>>> Minor point but should read 'will be assigned'
> >
> > [[BO'M]]
> > +1
> >
> >>>>> node if a
> >>>>> +    non-isolated PMD exists on that NUMA node.  If not, the queue
> >>>>> + will
> >>> be
> >>>>> +    assigned to a non-isolated pmd on a remote NUMA node.  This
> >>>>> + will
> >>>>> result in
> >>>>> +    reduced maximum throughput on that device and possibly on
> >>>>> + other
> >>>>> devices
> >>>>> +    assigned to that pmd thread. In the case such, a queue
> >>>>> + assignment is
> >>>>> made a
> >>>>> +    warning message will be logged: "There's no available
> >>>>> + (non-isolated)
> >>>>> pmd
> >>>>
> >>>> Above should read 'In the case where such a queue assignment is
> >>>> made, a
> >>> warning message will be logged'
> >
> > [[BO'M]]
> > Suggesting a simpler:
> > 'If such a queue assignment is made a warning message ..."
> >
> >>>>> +    thread on numa node N. Queue Q on port P will be assigned to
> >>>>> + the pmd
> >>>>> on
> >>>>> +    core C (numa node N'). Expect reduced performance."
> >>>>>
> >>>>>  - QEMU vCPU thread Affinity
> >>>>>
> >>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> >>>>> 4e29085..7557f32
> >>>>> 100644
> >>>>> --- a/lib/dpif-netdev.c
> >>>>> +++ b/lib/dpif-netdev.c
> >>>>> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list
> >>>>> *rr, int
> >>>>> numa_id)
> >>>>>      return NULL;
> >>>>>  }
> >>>>>
> >>>>> +/* Returns next NUMA from rr list in round-robin fashion. Returns
> >>>>> +the first
> >>>>> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if
> >>>>> +list is empty. */ static struct rr_numa *
> >>>>> +rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa
> >>>>> +*numa) {
> >>>>
> >>>> The comment above can be tidied up a little to better clarify the
> >>> behavior of this function.
> >>>> I ended up reading the comments for hmap_next() and hmap_first()
> >>>> before
> >>> it made sense, and even then it's a bit ambiguous, it ends up being
> >>> the code that explains the comments.
> >>>>
> >>>> You could clarify the following 2 statements:
> >>>>
> >>>> (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I
> >>>> assume
> >>> you mean the function parameter 'const struct rr_numa *numa' but
> >>> it's not clear on first reading.
> >>>>
> >>>> (2) " or the last node passed" - again this makes sense only when
> >>>> you
> >>> look into the behavior of the call 'hmap_next(&rr->numas, &numa-
> >>> node)'.
> >>>>
> >>>> You could say something like:
> >>>>
> >>>> "Attempt to return the next NUMA from a numa list in a round robin
> >>> fashion. Return the first NUMA node if the struct rr_numa *numa
> >>> argument passed to the function is NULL or if the numa node argument
> >>> passed to hmap_next is already the last node. Return NULL if the
> >>> numa list is empty."
> >>>
> >>> I'm not sure that references to implementation is a good way to
> >>> write comments (I mean 'passed to hmap_next' part).
> >>>
> >>> How about this:
> >>> """
> >>> /* Returns the next node in numa list following 'numa' in
> >>> round-robin fashion.
> >>>  * Returns first node if 'numa' is a null pointer or the last node
> >>> in 'rr'. */ """
> >>>
> >>> or
> >>>
> >>> """
> >>> /* The next node in numa list following 'numa' in round-robin fashion.
> >>>  * Returns:
> >>>  *     - 'NULL' if 'rr' is an empty numa list.
> >>>  *     - First node in 'rr' if 'numa' is a null pointer.
> >>>  *     - First node in 'rr' if 'numa' is the last node in 'rr'.
> >>>  *     - Otherwise, the next node in numa list following 'numa'. */
> >>> """
> >>>
> >>> ?
> >>
> >> I'm happy with the first option you provided above, could you append
> >> returning NULL if the list is empty then I think we're good.
> >>
> >> /* Returns the next node in numa list following 'numa' in round-robin
> >> fashion.
> >>  * Returns first node if 'numa' is a null pointer or the last node in
> 'rr'.
> >>  * Returns NULL if 'rr' numa list is empty. */
> >
> > [[BO'M]]
> > Sounds good to me. Anyone object to this wording?
> 
> I don't like three 'Returns' in a row but LGTM in general.
>


Once these changes are made for the next patch version feel free to add Tested 
by and acked tags for myself.

Thanks
Ian
> >>
> >> Thanks
> >> Ian
> >>>
> >>>>
> >>>>> +    struct hmap_node *node = NULL;
> >>>>> +
> >>>>> +    if (numa) {
> >>>>> +        node = hmap_next(&rr->numas, &numa->node);
> >>>>> +    }
> >>>>> +    if (!node) {
> >>>>> +        node = hmap_first(&rr->numas);
> >>>>> +    }
> >>>>> +
> >>>>> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) :
> >>>>> + NULL; }
> >>>>> +
> >>>>>  static void
> >>>>>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list
> >>>>> *rr) { @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp,
> >>>>> bool
> >>>>> pinned)
> >>>>> OVS_REQUIRES(dp->port_mutex)  {
> >>>>>      struct dp_netdev_port *port;
> >>>>>      struct rr_numa_list rr;
> >>>>> +    struct rr_numa *non_local_numa = NULL;
> >>>>>
> >>>>>      rr_numa_list_populate(dp, &rr);
> >>>>>
> >>>>> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool
> >>>>> pinned)
> >>>>> OVS_REQUIRES(dp->port_mutex)
> >>>>>                  }
> >>>>>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
> >>>>>                  if (!numa) {
> >>>>> -                    VLOG_WARN("There's no available (non isolated)
> pmd
> >>>>> thread "
> >>>>> +                    /* There are no pmds on the queue's local
> >>>>> + NUMA
> >>> node.
> >>>>> +                       Round-robin on the NUMA nodes that do have
> >>> pmds.
> >>>>> */
> >>>>> +                    non_local_numa = rr_numa_list_next(&rr,
> >>>>> non_local_numa);
> >>>>> +                    if (!non_local_numa) {
> >>>>> +                        VLOG_ERR("There is no available
> >>>>> + (non-isolated)
> >>>>> pmd "
> >>>>> +                                 "thread for port \'%s\' queue %d.
> >>>>> + This
> >>>>> queue "
> >>>>> +                                 "will not be polled. Is
> >>>>> + pmd-cpu-mask set
> >>>>> to "
> >>>>> +                                 "zero? Or are all PMDs isolated
> >>>>> + to other
> >>>>> "
> >>>>> +                                 "queues?", netdev_get_name(port-
> >>>>>> netdev),
> >>>>> +                                 qid);
> >>>>> +                        continue;
> >>>>> +                    }
> >>>>> +                    q->pmd = rr_numa_get_pmd(non_local_numa);
> >>>>> +                    VLOG_WARN("There's no available
> >>>>> + (non-isolated) pmd
> >>>>> thread "
> >>>>>                                "on numa node %d. Queue %d on port
> >>> \'%s\'
> >>>>> will "
> >>>>> -                              "not be polled.",
> >>>>> -                              numa_id, qid, netdev_get_name(port-
> >>>>>> netdev));
> >>>>> +                              "be assigned to the pmd on core %d "
> >>>>> +                              "(numa node %d). Expect reduced
> >>>>> performance.",
> >>>>> +                              numa_id, qid, netdev_get_name(port-
> >>>>>> netdev),
> >>>>> +                              q->pmd->core_id, q->pmd->numa_id);
> >>>>>                  } else {
> >>>>> +                    /* Assign queue to the next (round-robin) PMD
> >>>>> + on it's
> >>>>> local
> >>>>> +                       NUMA node. */
> >>>>>                      q->pmd = rr_numa_get_pmd(numa);
> >>>>>                  }
> >>>>>              }
> >>>>> --
> >>>>> 2.7.4
> >>>> This tested fine for me, tested with multiple rxqs distributed and
> >>> isolated over pmds on 2 different numa nodes with varying pmd masks.
> >>> Also passed sanity checks (clang, sparse compilation etc.).
> >>>>
> >>>> You can add the tested by tag for me but I'd like to see the
> >>>> changes for
> >>> the documentation and function comments above before acking.
> >>>>
> >>>> Tested-by: Ian Stokes <ian.sto...@intel.com>
> >>>>>
> >>>>> _______________________________________________
> >>>>> dev mailing list
> >>>>> d...@openvswitch.org
> >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>>>
> >>>>
> >>>>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.

Reply via email to