> > -----Original Message-----
> > From: Stokes, Ian
> > Sent: Monday, July 10, 2017 10:41 AM
> > To: Ilya Maximets <i.maxim...@samsung.com>; O Mahony, Billy
> > <billy.o.mah...@intel.com>; d...@openvswitch.org
> > Cc: db...@vmare.com
> > Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on
> > non- local numa node.
> >
> > > On 08.07.2017 22:09, Stokes, Ian wrote:
> > > >> Previously if there is no available (non-isolated) pmd on the
> > > >> numa node for a port then the port is not polled at all. This can
> > > >> result in a non- operational system until such time as nics are
> > > >> physically repositioned. It is preferable to operate with a pmd on
> the 'wrong'
> > > >> numa node albeit with lower performance. Local pmds are still
> > > >> chosen
> > > when available.
> > > >>
> > > >> Signed-off-by: Billy O'Mahony <billy.o.mah...@intel.com>
> > > >> Signed-off-by: Ilya Maximets <i.maxim...@samsung.com>
> > > >> Co-authored-by: Ilya Maximets <i.maxim...@samsung.com>
> > > >> ---
> > > >> v9: v8 missed some comments on v7
> > > >> v8: Some coding style issues; doc tweak
> > > >> v7: Incorporate review comments on docs and implementation
> > > >> v6: Change 'port' to 'queue' in a warning msg
> > > >> v5: Fix warning msg; Update same in docs
> > > >> v4: Fix a checkpatch error
> > > >> v3: Fix warning messages not appearing when using multiqueue
> > > >> v2: Add details of warning messages into docs
> > > >>
> > > >> Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
> > > >> lib/dpif-netdev.c | 41
> > > >> +++++++++++++++++++++++++++++++++---
> > > >> 2 files changed, 56 insertions(+), 6 deletions(-)
> > > >>
> > > >> diff --git a/Documentation/intro/install/dpdk.rst
> > > >> b/Documentation/intro/install/dpdk.rst
> > > >> index e83f852..89775d6 100644
> > > >> --- a/Documentation/intro/install/dpdk.rst
> > > >> +++ b/Documentation/intro/install/dpdk.rst
> > > >> @@ -449,7 +449,7 @@ affinitized accordingly.
> > > >>
> > > >> A poll mode driver (pmd) thread handles the I/O of all DPDK
> > > interfaces
> > > >> assigned to it. A pmd thread shall poll the ports for incoming
> > > >> packets,
> > > >> - switch the packets and send to tx port. pmd thread is CPU
> > > >> bound, and needs
> > > >> + switch the packets and send to tx port. A pmd thread is CPU
> > > >> + bound, and needs
> > > >> to be affinitized to isolated cores for optimum performance.
> > > >>
> > > >> By setting a bit in the mask, a pmd thread is created and
> > > >> pinned to the @@ -458,8 +458,23 @@ affinitized accordingly.
> > > >> $ ovs-vsctl set Open_vSwitch .
> > > >> other_config:pmd-cpu-mask=0x4
> > > >>
> > > >> .. note::
> > > >> - pmd thread on a NUMA node is only created if there is at least
> one
> > > >> DPDK
> > > >> - interface from that NUMA node added to OVS.
> > > >> + A pmd thread on a NUMA node is only created if there is at
> > > >> + least one
> > > >> DPDK
> > > >> + interface from that NUMA node added to OVS. A pmd thread is
> > > >> + created
> > > >> by
> > > >> + default on a core of a NUMA node or when a specified
> > > >> + pmd-cpu-mask
> > > has
> > > >> + indicated so. Even though a PMD thread may exist, the
> > > >> + thread only
> > > >> starts
> > > >> + consuming CPU cycles if there is least one receive queue
> > > >> + assigned
> > > to
> > > >> + the pmd.
> > > >> +
> > > >> + .. note::
> > > >> + On NUMA systems PCI devices are also local to a NUMA node.
> > > >> + Unbound
> > > >> rx
> > > >> + queues for a PCI device will assigned to a pmd on it's local
> > > >> + NUMA
> > > >
> > > > Minor point but should read 'will be assigned'
>
> [[BO'M]]
> +1
>
> > > >> node if a
> > > >> + non-isolated PMD exists on that NUMA node. If not, the
> > > >> + queue will
> > > be
> > > >> + assigned to a non-isolated pmd on a remote NUMA node. This
> > > >> + will
> > > >> result in
> > > >> + reduced maximum throughput on that device and possibly on
> > > >> + other
> > > >> devices
> > > >> + assigned to that pmd thread. In the case such, a queue
> > > >> + assignment is
> > > >> made a
> > > >> + warning message will be logged: "There's no available
> > > >> + (non-isolated)
> > > >> pmd
> > > >
> > > > Above should read 'In the case where such a queue assignment is
> > > > made, a
> > > warning message will be logged'
>
> [[BO'M]]
> Suggesting a simpler:
> 'If such a queue assignment is made a warning message ..."
+1, LGTM.
>
> > > >> + thread on numa node N. Queue Q on port P will be assigned to
> > > >> + the pmd
> > > >> on
> > > >> + core C (numa node N'). Expect reduced performance."
> > > >>
> > > >> - QEMU vCPU thread Affinity
> > > >>
> > > >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > > >> 4e29085..7557f32
> > > >> 100644
> > > >> --- a/lib/dpif-netdev.c
> > > >> +++ b/lib/dpif-netdev.c
> > > >> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list
> > > >> *rr, int
> > > >> numa_id)
> > > >> return NULL;
> > > >> }
> > > >>
> > > >> +/* Returns next NUMA from rr list in round-robin fashion.
> > > >> +Returns the first
> > > >> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if
> > > >> +list is empty. */ static struct rr_numa *
> > > >> +rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa
> > > >> +*numa) {
> > > >
> > > > The comment above can be tidied up a little to better clarify the
> > > behavior of this function.
> > > > I ended up reading the comments for hmap_next() and hmap_first()
> > > > before
> > > it made sense, and even then it's a bit ambiguous, it ends up being
> > > the code that explains the comments.
> > > >
> > > > You could clarify the following 2 statements:
> > > >
> > > > (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I
> > > > assume
> > > you mean the function parameter 'const struct rr_numa *numa' but
> > > it's not clear on first reading.
> > > >
> > > > (2) " or the last node passed" - again this makes sense only when
> > > > you
> > > look into the behavior of the call 'hmap_next(&rr->numas, &numa-
> > >node)'.
> > > >
> > > > You could say something like:
> > > >
> > > > "Attempt to return the next NUMA from a numa list in a round robin
> > > fashion. Return the first NUMA node if the struct rr_numa *numa
> > > argument passed to the function is NULL or if the numa node argument
> > > passed to hmap_next is already the last node. Return NULL if the
> > > numa list is empty."
> > >
> > > I'm not sure that references to implementation is a good way to
> > > write comments (I mean 'passed to hmap_next' part).
> > >
> > > How about this:
> > > """
> > > /* Returns the next node in numa list following 'numa' in
> > > round-robin fashion.
> > > * Returns first node if 'numa' is a null pointer or the last node
> > > in 'rr'. */ """
> > >
> > > or
> > >
> > > """
> > > /* The next node in numa list following 'numa' in round-robin fashion.
> > > * Returns:
> > > * - 'NULL' if 'rr' is an empty numa list.
> > > * - First node in 'rr' if 'numa' is a null pointer.
> > > * - First node in 'rr' if 'numa' is the last node in 'rr'.
> > > * - Otherwise, the next node in numa list following 'numa'. */
> > > """
> > >
> > > ?
> >
> > I'm happy with the first option you provided above, could you append
> > returning NULL if the list is empty then I think we're good.
> >
> > /* Returns the next node in numa list following 'numa' in round-robin
> > fashion.
> > * Returns first node if 'numa' is a null pointer or the last node in
> 'rr'.
> > * Returns NULL if 'rr' numa list is empty. */
>
> [[BO'M]]
> Sounds good to me. Anyone object to this wording?
>
> >
> > Thanks
> > Ian
> > >
> > > >
> > > >> + struct hmap_node *node = NULL;
> > > >> +
> > > >> + if (numa) {
> > > >> + node = hmap_next(&rr->numas, &numa->node);
> > > >> + }
> > > >> + if (!node) {
> > > >> + node = hmap_first(&rr->numas);
> > > >> + }
> > > >> +
> > > >> + return (node) ? CONTAINER_OF(node, struct rr_numa, node) :
> > > >> + NULL; }
> > > >> +
> > > >> static void
> > > >> rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list
> > > >> *rr) { @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp,
> > > >> bool
> > > >> pinned)
> > > >> OVS_REQUIRES(dp->port_mutex) {
> > > >> struct dp_netdev_port *port;
> > > >> struct rr_numa_list rr;
> > > >> + struct rr_numa *non_local_numa = NULL;
> > > >>
> > > >> rr_numa_list_populate(dp, &rr);
> > > >>
> > > >> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool
> > > >> pinned)
> > > >> OVS_REQUIRES(dp->port_mutex)
> > > >> }
> > > >> } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
> > > >> if (!numa) {
> > > >> - VLOG_WARN("There's no available (non isolated)
> pmd
> > > >> thread "
> > > >> + /* There are no pmds on the queue's local
> > > >> + NUMA
> > > node.
> > > >> + Round-robin on the NUMA nodes that do
> > > >> + have
> > > pmds.
> > > >> */
> > > >> + non_local_numa = rr_numa_list_next(&rr,
> > > >> non_local_numa);
> > > >> + if (!non_local_numa) {
> > > >> + VLOG_ERR("There is no available
> > > >> + (non-isolated)
> > > >> pmd "
> > > >> + "thread for port \'%s\' queue %d.
> > > >> + This
> > > >> queue "
> > > >> + "will not be polled. Is
> > > >> + pmd-cpu-mask set
> > > >> to "
> > > >> + "zero? Or are all PMDs isolated
> > > >> + to other
> > > >> "
> > > >> + "queues?",
> > > >> + netdev_get_name(port-
> > > >>> netdev),
> > > >> + qid);
> > > >> + continue;
> > > >> + }
> > > >> + q->pmd = rr_numa_get_pmd(non_local_numa);
> > > >> + VLOG_WARN("There's no available
> > > >> + (non-isolated) pmd
> > > >> thread "
> > > >> "on numa node %d. Queue %d on port
> > > \'%s\'
> > > >> will "
> > > >> - "not be polled.",
> > > >> - numa_id, qid, netdev_get_name(port-
> > > >>> netdev));
> > > >> + "be assigned to the pmd on core %d "
> > > >> + "(numa node %d). Expect reduced
> > > >> performance.",
> > > >> + numa_id, qid,
> > > >> + netdev_get_name(port-
> > > >>> netdev),
> > > >> + q->pmd->core_id, q->pmd->numa_id);
> > > >> } else {
> > > >> + /* Assign queue to the next (round-robin)
> > > >> + PMD on it's
> > > >> local
> > > >> + NUMA node. */
> > > >> q->pmd = rr_numa_get_pmd(numa);
> > > >> }
> > > >> }
> > > >> --
> > > >> 2.7.4
> > > > This tested fine for me, tested with multiple rxqs distributed and
> > > isolated over pmds on 2 different numa nodes with varying pmd masks.
> > > Also passed sanity checks (clang, sparse compilation etc.).
> > > >
> > > > You can add the tested by tag for me but I'd like to see the
> > > > changes for
> > > the documentation and function comments above before acking.
> > > >
> > > > Tested-by: Ian Stokes <ian.sto...@intel.com>
> > > >>
> > > >> _______________________________________________
> > > >> dev mailing list
> > > >> d...@openvswitch.org
> > > >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > > >
> > > >
> > > >
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev