On 19.01.2018 14:04, Ilya Maximets wrote:
> On 18.01.2018 20:21, O Mahony, Billy wrote:
>> Hi All,
>>
>> The thread gets updated before I can finish my review, so I'm just going to
>> jump into this thread rather than reply to the o/p with a fresh thread. I
>> hope that's ok.
>>
>> I've run the patch and the issues with previous revisions are fixed.
>>
>> Perhaps add an error message for stats-clear when the pmd indicated by the
>> -pmd flag does not exist.
>>
>> One or two other comments/questions inline, below.
>>
>> Cheers,
>> /Billy. 
>>
>>
>>> -----Original Message-----
>>> From: Jan Scheurich [mailto:[email protected]]
>>> Sent: Thursday, January 18, 2018 4:35 PM
>>> To: Ilya Maximets <[email protected]>; [email protected]
>>> Cc: [email protected]; Stokes, Ian <[email protected]>; O Mahony,
>>> Billy <[email protected]>
>>> Subject: RE: [PATCH v7 2/3] dpif-netdev: Detailed performance stats for PMDs
>>>
>>> Hi Ilya,
>>>
>>> Thanks for your in-depth review and comments. Please find my answers
>>> below.
>>>
>>> BR, Jan
>>>
>>>> -----Original Message-----
>>>> From: Ilya Maximets [mailto:[email protected]]
>>>> Sent: Wednesday, 17 January, 2018 13:13
>>>> To: Jan Scheurich <[email protected]>; [email protected]
>>>> Cc: [email protected]; [email protected]; [email protected]
>>>> Subject: Re: [PATCH v7 2/3] dpif-netdev: Detailed performance stats for PMDs
>>>>
>>>> Not a full review.
>>>> Comments inline.
>>>>
>>>> On 16.01.2018 04:51, Jan Scheurich wrote:
>>>>> This patch instruments the dpif-netdev datapath to record detailed
>>>>> statistics of what is happening in every iteration of a PMD thread.
>>>>>
>>>>> The collection of detailed statistics can be controlled by a new
>>>>> Open_vSwitch configuration parameter "other_config:pmd-perf-metrics".
>>>>> By default it is disabled. The run-time overhead, when enabled, is
>>>>> in the order of 1%.
>>>>>
>>>>> The covered metrics per iteration are:
>>>>>   - cycles
>>>>>   - packets
>>>>>   - (rx) batches
>>>>>   - packets/batch
>>>>>   - max. vhostuser qlen
>>>>>   - upcalls
>>>>>   - cycles spent in upcalls
>>>>>
>>>>> This raw recorded data is used threefold:
>>>>>
>>>>> 1. In histograms for each of the following metrics:
>>>>>    - cycles/iteration (log.)
>>>>>    - packets/iteration (log.)
>>>>>    - cycles/packet
>>>>>    - packets/batch
>>>>>    - max. vhostuser qlen (log.)
>>>>>    - upcalls
>>>>>    - cycles/upcall (log)
>>>>>    The histogram bins are divided linearly or logarithmically.
>>>>>
>>>>> 2. A cyclic history of the above statistics for 999 iterations
>>>>>
>>>>> 3. A cyclic history of the cumulative/average values per millisecond
>>>>>    wall clock for the last 1000 milliseconds:
>>>>>    - number of iterations
>>>>>    - avg. cycles/iteration
>>>>>    - packets (Kpps)
>>>>>    - avg. packets/batch
>>>>>    - avg. max vhost qlen
>>>>>    - upcalls
>>>>>    - avg. cycles/upcall
>>>>>
>>>>> The gathered performance metrics can be printed at any time with the
>>>>> new CLI command
>>>>>
>>>>> ovs-appctl dpif-netdev/pmd-perf-show [-nh] [-it iter_len] [-ms ms_len]
>>>>>     [-pmd core | dp]
>>>>>
>>>>> The options are
>>>>>
>>>>> -nh:            Suppress the histograms
>>>>> -it iter_len:   Display the last iter_len iteration stats
>>>>> -ms ms_len:     Display the last ms_len millisecond stats
>>>>> -pmd core:      Display only the specified PMD
>>>>>
>>>>> The performance statistics are reset with the existing
>>>>> dpif-netdev/pmd-stats-clear command.
>>>>>
>>>>> The output always contains the following global PMD statistics,
>>>>> similar to the pmd-stats-show command:
>>>>>
>>>>> Time: 15:24:55.270
>>>>> Measurement duration: 1.008 s
>>>>>
>>>>> pmd thread numa_id 0 core_id 1:
>>>>>
>>>>>   Cycles:            2419034712  (2.40 GHz)
>>>>>   Iterations:            572817  (1.76 us/it)
>>>>>   - idle:                486808  (15.9 % cycles)
>>>>>   - busy:                 86009  (84.1 % cycles)
>>>>>   Packets:              2399607  (2381 Kpps, 848 cycles/pkt)
>>>>>   Datapath passes:      3599415  (1.50 passes/pkt)
>>>>>   - EMC hits:            336472  ( 9.3 %)
>>>>>   - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
>>>>>   - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
>>>>>   - Lost upcalls:             0  ( 0.0 %)
>>>>>
>>>>> Signed-off-by: Jan Scheurich <[email protected]>
>>>>> ---
>>>>>  NEWS                        |   3 +
>>>>>  lib/automake.mk             |   1 +
>>>>>  lib/dp-packet.h             |   1 +
>>>>>  lib/dpif-netdev-perf.c      | 333 +++++++++++++++++++++++++++++++++++++++++++-
>>>>>  lib/dpif-netdev-perf.h      | 239 +++++++++++++++++++++++++++++--
>>>>>  lib/dpif-netdev.c           | 177 +++++++++++++++++++++--
>>>>>  lib/netdev-dpdk.c           |  13 +-
>>>>>  lib/netdev-dpdk.h           |  14 ++
>>>>>  lib/netdev-dpif-unixctl.man | 113 +++++++++++++++
>>>>>  manpages.mk                 |   2 +
>>>>>  vswitchd/ovs-vswitchd.8.in  |  27 +---
>>>>>  vswitchd/vswitch.xml        |  12 ++
>>>>>  12 files changed, 881 insertions(+), 54 deletions(-)
>>>>>  create mode 100644 lib/netdev-dpif-unixctl.man
>>>>>
>>>>> diff --git a/NEWS b/NEWS
>>>>> index 2c28456..743528e 100644
>>>>> --- a/NEWS
>>>>> +++ b/NEWS
>>>>> @@ -44,6 +44,9 @@ Post-v2.8.0
>>>>>            if available (for OpenFlow 1.4+).
>>>>>     - Userspace datapath:
>>>>>       * Output packet batching support.
>>>>> +     * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>>>>> +     * Detailed PMD performance metrics available with new command
>>>>> +         ovs-appctl dpif-netdev/pmd-perf-show
>>>>>     - vswitchd:
>>>>>       * Datapath IDs may now be specified as 0x1 (etc.) instead of 16 digits.
>>>>>       * Configuring a controller, or unconfiguring all controllers, now deletes
>>>>> diff --git a/lib/automake.mk b/lib/automake.mk
>>>>> index 159319f..d07cbe9 100644
>>>>> --- a/lib/automake.mk
>>>>> +++ b/lib/automake.mk
>>>>> @@ -468,6 +468,7 @@ MAN_FRAGMENTS += \
>>>>>   lib/dpctl.man \
>>>>>   lib/memory-unixctl.man \
>>>>>   lib/netdev-dpdk-unixctl.man \
>>>>> + lib/netdev-dpif-unixctl.man \
>>>>>   lib/ofp-version.man \
>>>>>   lib/ovs.tmac \
>>>>>   lib/service.man \
>>>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>>>>> index b4b721c..3d65088 100644
>>>>> --- a/lib/dp-packet.h
>>>>> +++ b/lib/dp-packet.h
>>>>> @@ -697,6 +697,7 @@ struct dp_packet_batch {
>>>>>      size_t count;
>>>>>      bool trunc; /* true if the batch needs truncate. */
>>>>>      struct dp_packet *packets[NETDEV_MAX_BURST];
>>>>> +
>>>>>  };
>>>>>
>>>>>  static inline void
>>>>> diff --git a/lib/dpif-netdev-perf.c b/lib/dpif-netdev-perf.c
>>>>> index f06991a..e0ef15d 100644
>>>>> --- a/lib/dpif-netdev-perf.c
>>>>> +++ b/lib/dpif-netdev-perf.c
>>>>> @@ -15,6 +15,7 @@
>>>>>   */
>>>>>
>>>>>  #include <config.h>
>>>>> +#include <stdint.h>
>>>>>
>>>>>  #include "openvswitch/dynamic-string.h"
>>>>>  #include "openvswitch/vlog.h"
>>>>> @@ -23,10 +24,299 @@
>>>>>
>>>>>  VLOG_DEFINE_THIS_MODULE(pmd_perf);
>>>>>
>>>>> +#ifdef DPDK_NETDEV
>>>>> +static uint64_t
>>>>> +get_tsc_hz(void)
>>>>> +{
>>>>> +    return rte_get_tsc_hz();
>>>>> +}
>>>>> +#else
>>>>> +/* This function is only invoked from PMD threads which depend on DPDK.
>>>>> + * A dummy function is sufficient when building without DPDK_NETDEV. */
>>>>> +static uint64_t
>>>>> +get_tsc_hz(void)
>>>>> +{
>>>>> +    return 1;
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>> +/* Histogram functions. */
>>>>> +
>>>>> +static void
>>>>> +histogram_walls_set_lin(struct histogram *hist, uint32_t min, uint32_t max)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    ovs_assert(min < max);
>>>>> +    for (i = 0; i < NUM_BINS-1; i++) {
>>>>> +        hist->wall[i] = min + (i * (max - min)) / (NUM_BINS - 2);
>>>>> +    }
>>>>> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
>>>>> +}
>>>>> +
>>>>> +static void
>>>>> +histogram_walls_set_log(struct histogram *hist, uint32_t min, uint32_t max)
>>>>> +{
>>>>> +    int i, start, bins, wall;
>>>>> +    double log_min, log_max;
>>>>> +
>>>>> +    ovs_assert(min < max);
>>>>> +    if (min > 0) {
>>>>> +        log_min = log(min);
>>>>> +        log_max = log(max);
>>>>> +        start = 0;
>>>>> +        bins = NUM_BINS - 1;
>>>>> +    } else {
>>>>> +        hist->wall[0] = 0;
>>>>> +        log_min = log(1);
>>>>> +        log_max = log(max);
>>>>> +        start = 1;
>>>>> +        bins = NUM_BINS - 2;
>>>>> +    }
>>>>> +    wall = start;
>>>>> +    for (i = 0; i < bins; i++) {
>>>>> +        /* Make sure each wall is monotonically increasing. */
>>>>> +        wall = MAX(wall, exp(log_min + (i * (log_max - log_min)) / (bins-1)));
>>>>> +        hist->wall[start + i] = wall++;
>>>>> +    }
>>>>> +    if (hist->wall[NUM_BINS-2] < max) {
>>>>> +        hist->wall[NUM_BINS-2] = max;
>>>>> +    }
>>>>> +    hist->wall[NUM_BINS-1] = UINT32_MAX;
>>>>> +}
>>>>> +
>>>>> +uint64_t
>>>>> +histogram_samples(const struct histogram *hist)
>>>>> +{
>>>>> +    uint64_t samples = 0;
>>>>> +
>>>>> +    for (int i = 0; i < NUM_BINS; i++) {
>>>>> +        samples += hist->bin[i];
>>>>> +    }
>>>>> +    return samples;
>>>>> +}
>>>>> +
>>>>> +static void
>>>>> +histogram_clear(struct histogram *hist)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    for (i = 0; i < NUM_BINS; i++) {
>>>>> +        hist->bin[i] = 0;
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static void
>>>>> +history_init(struct history *h)
>>>>> +{
>>>>> +    memset(h, 0, sizeof(*h));
>>>>> +}
>>>>> +
>>>>>  void
>>>>>  pmd_perf_stats_init(struct pmd_perf_stats *s)
>>>>>  {
>>>>> -    memset(s, 0 , sizeof(*s));
>>>>> +    memset(s, 0, sizeof(*s));
>>>>> +    histogram_walls_set_log(&s->cycles, 500, 24000000);
>>>>> +    histogram_walls_set_log(&s->pkts, 0, 1000);
>>>>> +    histogram_walls_set_lin(&s->cycles_per_pkt, 100, 30000);
>>>>> +    histogram_walls_set_lin(&s->pkts_per_batch, 0, 32);
>>>>> +    histogram_walls_set_lin(&s->upcalls, 0, 30);
>>>>> +    histogram_walls_set_log(&s->cycles_per_upcall, 1000, 1000000);
>>>>> +    histogram_walls_set_log(&s->max_vhost_qfill, 0, 512);
>>>>> +    s->start_ms = time_msec();
>>>>> +}
>>>>> +
>>>>> +void
>>>>> +pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>>>> +                              double duration)
>>>>> +{
>>>>> +    uint64_t stats[PMD_N_STATS];
>>>>> +    double us_per_cycle = 1000000.0 / get_tsc_hz();
>>>>> +
>>>>> +    if (duration == 0) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    pmd_perf_read_counters(s, stats);
>>>>> +    uint64_t tot_cycles = stats[PMD_CYCLES_ITER_IDLE] +
>>>>> +                          stats[PMD_CYCLES_ITER_BUSY];
>>>>> +    uint64_t packets = stats[PMD_STAT_RECV];
>>>>> +    uint64_t passes = stats[PMD_STAT_RECV] +
>>>>> +                      stats[PMD_STAT_RECIRC];
>>>>> +    uint64_t upcalls = stats[PMD_STAT_MISS];
>>>>> +    uint64_t upcall_cycles = stats[PMD_CYCLES_UPCALL];
>>>>> +    uint64_t tot_iter = histogram_samples(&s->pkts);
>>>>> +    uint64_t idle_iter = s->pkts.bin[0];
>>>>> +    uint64_t busy_iter = tot_iter >= idle_iter ? tot_iter - idle_iter : 0;
>>>>> +
>>>>> +    ds_put_format(str,
>>>>> +            "  Cycles:          %12"PRIu64"  (%.2f GHz)\n"
>>>>> +            "  Iterations:      %12"PRIu64"  (%.2f us/it)\n"
>>>>> +            "  - idle:          %12"PRIu64"  (%4.1f %% cycles)\n"
>>>>> +            "  - busy:          %12"PRIu64"  (%4.1f %% cycles)\n",
>>>>> +            tot_cycles, (tot_cycles / duration) / 1E9,
>>>>> +            tot_iter, tot_cycles * us_per_cycle / tot_iter,
>>>>> +            idle_iter,
>>>>> +            100.0 * stats[PMD_CYCLES_ITER_IDLE] / tot_cycles,
>>>>> +            busy_iter,
>>>>> +            100.0 * stats[PMD_CYCLES_ITER_BUSY] / tot_cycles);
>>>>> +    if (packets > 0) {
>>>>> +        ds_put_format(str,
>>>>> +            "  Packets:         %12"PRIu64"  (%.0f Kpps, %.0f 
>>>>> cycles/pkt)\n"
>>>>> +            "  Datapath passes: %12"PRIu64"  (%.2f passes/pkt)\n"
>>>>> +            "  - EMC hits:      %12"PRIu64"  (%4.1f %%)\n"
>>>>> +            "  - Megaflow hits: %12"PRIu64"  (%4.1f %%, %.2f subtbl 
>>>>> lookups/"
>>>>> +                                                                     
>>>>> "hit)\n"
>>>>> +            "  - Upcalls:       %12"PRIu64"  (%4.1f %%, %.1f 
>>>>> us/upcall)\n"
>>>>> +            "  - Lost upcalls:  %12"PRIu64"  (%4.1f %%)\n"
>>>>> +            "\n",
>>>>> +            packets, (packets / duration) / 1000,
>>>>> +            1.0 * stats[PMD_CYCLES_ITER_BUSY] / packets,
>>>>> +            passes, packets ? 1.0 * passes / packets : 0,
>>>>> +            stats[PMD_STAT_EXACT_HIT],
>>>>> +            100.0 * stats[PMD_STAT_EXACT_HIT] / passes,
>>>>> +            stats[PMD_STAT_MASKED_HIT],
>>>>> +            100.0 * stats[PMD_STAT_MASKED_HIT] / passes,
>>>>> +            stats[PMD_STAT_MASKED_HIT]
>>>>> +            ? 1.0 * stats[PMD_STAT_MASKED_LOOKUP] / stats[PMD_STAT_MASKED_HIT]
>>>>> +            : 0,
>>>>> +            upcalls, 100.0 * upcalls / passes,
>>>>> +            upcalls ? (upcall_cycles * us_per_cycle) / upcalls : 0,
>>>>> +            stats[PMD_STAT_LOST],
>>>>> +            100.0 * stats[PMD_STAT_LOST] / passes);
>>>>> +    } else {
>>>>> +        ds_put_format(str,
>>>>> +                "  Packets:         %12"PRIu64"\n"
>>>>> +                "\n",
>>>>> +                0UL);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +void
>>>>> +pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    ds_put_cstr(str, "Histograms\n");
>>>>> +    ds_put_format(str,
>>>>> +                  "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  %-21s\n",
>>>>> +                  "cycles/it", "packets/it", "cycles/pkt", "pkts/batch",
>>>>> +                  "max vhost qlen", "upcalls/it", "cycles/upcall");
>>>>> +    for (i = 0; i < NUM_BINS-1; i++) {
>>>>> +        ds_put_format(str,
>>>>> +            "   %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
>>>>> +            "  %-9d %-11"PRIu64"  %-9d %-11"PRIu64"  %-9d %-11"PRIu64
>>>>> +            "  %-9d %-11"PRIu64"\n",
>>>>> +            s->cycles.wall[i], s->cycles.bin[i],
>>>>> +            s->pkts.wall[i], s->pkts.bin[i],
>>>>> +            s->cycles_per_pkt.wall[i], s->cycles_per_pkt.bin[i],
>>>>> +            s->pkts_per_batch.wall[i], s->pkts_per_batch.bin[i],
>>>>> +            s->max_vhost_qfill.wall[i], s->max_vhost_qfill.bin[i],
>>>>> +            s->upcalls.wall[i], s->upcalls.bin[i],
>>>>> +            s->cycles_per_upcall.wall[i], s->cycles_per_upcall.bin[i]);
>>>>> +    }
>>>>> +    ds_put_format(str,
>>>>> +                  "   %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s 
>>>>> %-11"PRIu64
>>>>> +                  "  %-9s %-11"PRIu64"  %-9s %-11"PRIu64"  %-9s 
>>>>> %-11"PRIu64
>>>>> +                  "  %-9s %-11"PRIu64"\n",
>>>>> +                  ">", s->cycles.bin[i],
>>>>> +                  ">", s->pkts.bin[i],
>>>>> +                  ">", s->cycles_per_pkt.bin[i],
>>>>> +                  ">", s->pkts_per_batch.bin[i],
>>>>> +                  ">", s->max_vhost_qfill.bin[i],
>>>>> +                  ">", s->upcalls.bin[i],
>>>>> +                  ">", s->cycles_per_upcall.bin[i]);
>>>>> +    if (s->totals.iterations > 0) {
>>>>> +        ds_put_cstr(str,
>>>>> +                    "-----------------------------------------------------"
>>>>> +                    "-----------------------------------------------------"
>>>>> +                    "------------------------------------------------\n");
>>>>> +        ds_put_format(str,
>>>>> +                      "   %-21s  %-21s  %-21s  %-21s  %-21s  %-21s  
>>>>> %-21s\n",
>>>>> +                      "cycles/it", "packets/it", "cycles/pkt", 
>>>>> "pkts/batch",
>>>>> +                      "vhost qlen", "upcalls/it", "cycles/upcall");
>>>>> +        ds_put_format(str,
>>>>> +                      "   %-21"PRIu64"  %-21.5f  %-21"PRIu64
>>>>> +                      "  %-21.5f  %-21.5f  %-21.5f  %-21"PRIu32"\n",
>>>>> +                      s->totals.cycles / s->totals.iterations,
>>>>> +                      1.0 * s->totals.pkts / s->totals.iterations,
>>>>> +                      s->totals.pkts
>>>>> +                          ? s->totals.busy_cycles / s->totals.pkts : 0,
>>>>> +                      s->totals.batches
>>>>> +                          ? 1.0 * s->totals.pkts / s->totals.batches : 0,
>>>>> +                      1.0 * s->totals.max_vhost_qfill / s->totals.iterations,
>>>>> +                      1.0 * s->totals.upcalls / s->totals.iterations,
>>>>> +                      s->totals.upcalls
>>>>> +                          ? s->totals.upcall_cycles / s->totals.upcalls : 0);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +void
>>>>> +pmd_perf_format_iteration_history(struct ds *str, struct pmd_perf_stats *s,
>>>>> +                                  int n_iter)
>>>>> +{
>>>>> +    struct iter_stats *is;
>>>>> +    size_t index;
>>>>> +    int i;
>>>>> +
>>>>> +    if (n_iter == 0) {
>>>>> +        return;
>>>>> +    }
>>>>> +    ds_put_format(str, "   %-17s   %-10s   %-10s   %-10s   %-10s   "
>>>>> +                  "%-10s   %-10s   %-10s\n",
>>>>> +                  "tsc", "cycles", "packets", "cycles/pkt", "pkts/batch",
>>>>> +                  "vhost qlen", "upcalls", "cycles/upcall");
>>>>> +    for (i = 1; i <= n_iter; i++) {
>>>>> +        index = (s->iterations.idx + HISTORY_LEN - i) % HISTORY_LEN;
>>>>> +        is = &s->iterations.sample[index];
>>>>> +        ds_put_format(str,
>>>>> +                      "   %-17"PRIu64"   %-11"PRIu64"  %-11"PRIu32
>>>>> +                      "  %-11"PRIu64"  %-11"PRIu32"  %-11"PRIu32
>>>>> +                      "  %-11"PRIu32"  %-11"PRIu32"\n",
>>>>> +                      is->timestamp,
>>>>> +                      is->cycles,
>>>>> +                      is->pkts,
>>>>> +                      is->pkts ? is->cycles / is->pkts : 0,
>>>>> +                      is->batches ? is->pkts / is->batches : 0,
>>>>> +                      is->max_vhost_qfill,
>>>>> +                      is->upcalls,
>>>>> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +void
>>>>> +pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s, int n_ms)
>>>>> +{
>>>>> +    struct iter_stats *is;
>>>>> +    size_t index;
>>>>> +    int i;
>>>>> +
>>>>> +    if (n_ms == 0) {
>>>>> +        return;
>>>>> +    }
>>>>> +    ds_put_format(str,
>>>>> +                  "   %-12s   %-10s   %-10s   %-10s   %-10s"
>>>>> +                  "   %-10s   %-10s   %-10s   %-10s\n",
>>>>> +                  "ms", "iterations", "cycles/it", "Kpps", "cycles/pkt",
>>>>> +                  "pkts/batch", "vhost qlen", "upcalls", 
>>>>> "cycles/upcall");
>>>>> +    for (i = 1; i <= n_ms; i++) {
>>>>> +        index = (s->milliseconds.idx + HISTORY_LEN - i) % HISTORY_LEN;
>>>>> +        is = &s->milliseconds.sample[index];
>>>>> +        ds_put_format(str,
>>>>> +                      "   %-12"PRIu64"   %-11"PRIu32"  %-11"PRIu64
>>>>> +                      "  %-11"PRIu32"  %-11"PRIu64"  %-11"PRIu32
>>>>> +                      "  %-11"PRIu32"  %-11"PRIu32"  %-11"PRIu32"\n",
>>>>> +                      is->timestamp,
>>>>> +                      is->iterations,
>>>>> +                      is->iterations ? is->cycles / is->iterations : 0,
>>>>> +                      is->pkts,
>>>>> +                      is->pkts ? is->busy_cycles / is->pkts : 0,
>>>>> +                      is->batches ? is->pkts / is->batches : 0,
>>>>> +                      is->iterations
>>>>> +                          ? is->max_vhost_qfill / is->iterations : 0,
>>>>> +                      is->upcalls,
>>>>> +                      is->upcalls ? is->upcall_cycles / is->upcalls : 0);
>>>>> +    }
>>>>>  }
>>>>>
>>>>>  void
>>>>> @@ -51,10 +341,49 @@ pmd_perf_read_counters(struct pmd_perf_stats *s,
>>>>>      }
>>>>>  }
>>>>>
>>>>> +/* This function is executed in the context of the PMD at the start of
>>>>> + * a new iteration when requested through pmd_perf_stats_clear(). */
>>>>>  void
>>>>> -pmd_perf_stats_clear(struct pmd_perf_stats *s)
>>>>> +pmd_perf_stats_clear__(struct pmd_perf_stats *s)
>>>>>  {
>>>>>      for (int i = 0; i < PMD_N_STATS; i++) {
>>>>>          atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
>>>>>      }
>>>>> +    memset(&s->current, 0 , sizeof(struct iter_stats));
>>>>> +    memset(&s->totals, 0 , sizeof(struct iter_stats));
>>>>> +    histogram_clear(&s->cycles);
>>>>> +    histogram_clear(&s->pkts);
>>>>> +    histogram_clear(&s->cycles_per_pkt);
>>>>> +    histogram_clear(&s->upcalls);
>>>>> +    histogram_clear(&s->cycles_per_upcall);
>>>>> +    histogram_clear(&s->pkts_per_batch);
>>>>> +    histogram_clear(&s->max_vhost_qfill);
>>>>> +    history_init(&s->iterations);
>>>>> +    history_init(&s->milliseconds);
>>>>> +    s->start_ms = time_msec();
>>>>> +    s->milliseconds.sample[0].timestamp = s->start_ms;
>>>>> +    /* Clearing finished. */
>>>>> +    s->clear = false;
>>>>> +}
>>>>> +
>>>>> +/* This function must be called from outside the PMD thread to safely
>>>>> + * clear the PMD stats at the start of the next iteration. It blocks the
>>>>> + * caller until the stats are cleared. */
>> [[BO'M]] Do we need the "it blocks until..." constraint? This is only called
>> via the CLI, so once the flag is set the PMD will clear the stats on its
>> next iteration, and they should be long cleared before the next CLI
>> operation comes along to view them. In that case we can completely remove
>> the nanosleep below; see the sketch after this paragraph. I don't know if
>> 'clear' needs any atomic protection; I suspect volatile would be enough.
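>>
>> As a minimal sketch of what I mean (assuming a plain flag really is enough
>> here, which is the open question above), the clear request would become
>> fire-and-forget:
>>
>> void
>> pmd_perf_stats_clear(struct pmd_perf_stats *s)
>> {
>>     /* Request the PMD to clear its stats in pmd_perf_start_iteration();
>>      * no need to wait for it to complete. */
>>     s->clear = true;
>> }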
>>>>> +void
>>>>> +pmd_perf_stats_clear(struct pmd_perf_stats *s)
>>>>> +{
>>>>> +    /* Request the PMD to clear its stats in pmd_perf_start_iteration(). 
>>>>> */
>>>>> +    s->clear = true;
>>>>> +    /* Wait a number of milliseconds for the stats to be cleared. */
>>>>> +    while (s->clear) {
>>>>
>>>> The handler thread will wait forever in case we have a PMD thread without
>>>> rx queues. Also, we're under '&dp_netdev_mutex' here, so all configuration
>>>> changes will get stuck.
>>>
>>> Is it ok if I state in the comment that it's the caller's responsibility
>>> to make sure the PMD thread is executing when calling the function? I can
>>> also add a maximum waiting time (e.g. 1 second) to avoid an eternal
>>> deadlock, as sketched below.
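>>>
>>> A minimal sketch of the bounded wait (the 1 second bound is just the
>>> example value from above):
>>>
>>>     /* Wait up to ~1 second for the PMD to clear the stats. */
>>>     for (int i = 0; s->clear && i < 1000; i++) {
>>>         xnanosleep(1000 * 1000);    /* 1 ms per attempt. */
>>>     }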
>>>
>>> Let's discuss further down if the invocation is safe.
>>>
>>>>
>>>>> +        xnanosleep(1000 * 1000);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +/* This function can be called from anywhere to clear the stats
>>>>> + * of the non-pmd thread. */
>>>>> +void
>>>>> +non_pmd_perf_stats_clear(struct pmd_perf_stats *s)
>>>>> +{
>>>>> +    pmd_perf_stats_clear__(s);
>>>>>  }
>>>>> diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h
>>>>> index 5993c25..7a89c40 100644
>>>>> --- a/lib/dpif-netdev-perf.h
>>>>> +++ b/lib/dpif-netdev-perf.h
>>>>> @@ -38,10 +38,18 @@
>>>>>  extern "C" {
>>>>>  #endif
>>>>>
>>>>> -/* This module encapsulates data structures and functions to maintain PMD
>>>>> - * performance metrics such as packet counters, execution cycles. It
>>>>> - * provides a clean API for dpif-netdev to initialize, update and read and
>>>>> +/* This module encapsulates data structures and functions to maintain basic PMD
>>>>> + * performance metrics such as packet counters, execution cycles as well as
>>>>> + * histograms and time series recording for more detailed PMD metrics.
>>>>> + *
>>>>> + * It provides a clean API for dpif-netdev to initialize, update and read and
>>>>>   * reset these metrics.
>>>>> + *
>>>>> + * The basic set of PMD counters is implemented as atomic_uint64_t variables
>>>>> + * to guarantee correct reads also on 32-bit systems.
>>>>> + *
>>>>> + * The detailed PMD performance metrics are only supported on 64-bit systems
>>>>> + * with atomic 64-bit read and store semantics for plain uint64_t counters.
>>>>>   */
>>>>>
>>>>>  /* Set of counter types maintained in pmd_perf_stats. */
>>>>> @@ -66,6 +74,7 @@ enum pmd_stat_type {
>>>>>      PMD_STAT_SENT_BATCHES,  /* Number of batches sent. */
>>>>>      PMD_CYCLES_ITER_IDLE,   /* Cycles spent in idle iterations. */
>>>>>      PMD_CYCLES_ITER_BUSY,   /* Cycles spent in busy iterations. */
>>>>> +    PMD_CYCLES_UPCALL,      /* Cycles spent processing upcalls. */
>>>>>      PMD_N_STATS
>>>>>  };
>>>>>
>>>>> @@ -81,18 +90,78 @@ struct pmd_counters {
>>>>>      uint64_t zero[PMD_N_STATS];         /* Value at last _clear().  */
>>>>>  };
>>>>>
>>>>> +/* Data structure to collect statistical distribution of an integer
>>>>> + * measurement type in form of a histogram. The wall[] array contains the
>>>>> + * inclusive upper boundaries of the bins, while the bin[] array contains
>>>>> + * the actual counters per bin. The histogram walls are typically set
>>>>> + * automatically using the functions provided below. */
>>>>> +
>>>>> +#define NUM_BINS 32             /* Number of histogram bins. */
>>>>> +
>>>>> +struct histogram {
>>>>> +    uint32_t wall[NUM_BINS];
>>>>> +    uint64_t bin[NUM_BINS];
>>>>> +};
>>>>> +
>>>>> +/* Data structure to record detailed PMD execution metrics per iteration
>>>>> + * for a history period of up to HISTORY_LEN iterations in a circular
>>>>> + * buffer. Also used to record up to HISTORY_LEN millisecond
>>>>> + * averages/totals of these metrics. */
>>>>> +
>>>>> +struct iter_stats {
>>>>> +    uint64_t timestamp;         /* TSC or millisecond. */
>>>>> +    uint64_t cycles;            /* Number of TSC cycles spent in it/ms. */
>>>>> +    uint64_t busy_cycles;       /* Cycles spent in busy iterations in ms. */
>>>>> +    uint32_t iterations;        /* Iterations in ms. */
>>>>> +    uint32_t pkts;              /* Packets processed in iteration/ms. */
>>>>> +    uint32_t upcalls;           /* Number of upcalls in iteration/ms. */
>>>>> +    uint32_t upcall_cycles;     /* Cycles spent in upcalls in iteration/ms. */
>>>>> +    uint32_t batches;           /* Number of rx batches in iteration/ms. */
>>>>> +    uint32_t max_vhost_qfill;   /* Maximum fill level encountered in it/ms. */
>>>>> +};
>>>>> +
>>>>> +#define HISTORY_LEN 1000        /* Length of recorded history
>>>>> +                                   (iterations and ms). */
>>>>> +#define DEF_HIST_SHOW 20        /* Default number of history samples to
>>>>> +                                   display. */
>>>>> +
>>>>> +struct history {
>>>>> +    uint64_t idx;               /* Next slot in history. */
>>>>> +    struct iter_stats sample[HISTORY_LEN];
>>>>> +};
>>>>> +
>>>>>  /* Container for all performance metrics of a PMD.
>>>>>   * Part of the struct dp_netdev_pmd_thread. */
>>>>>
>>>>>  struct pmd_perf_stats {
>>>>> -    /* Start of the current PMD iteration in TSC cycles.*/
>>>>> -    uint64_t start_it_tsc;
>>>>> +    /* Set by CLI thread to order clearing of PMD stats. */
>>>>> +    volatile atomic_bool clear;
>>>>
>>>> I think, volatile + atomic is too much.
>>>> Also, you're not using atomic operations on this variable.
>>>
>>> So would you prefer volatile bool or atomic_bool?
>>>
>>> I assume that bool is atomic on all architectures, so use of atomic_bool was
>>> more a kind of documenting that.
>>> But then I should also use atomic operations to be consistent. See also the
>>> other atomic_bool below.
>>>
>>>>
>>>>> +    /* Start of the current performance measurement period. */
>>>>> +    uint64_t start_ms;
>>>>>      /* Latest TSC time stamp taken in PMD. */
>>>>>      uint64_t last_tsc;
>>>>> +    /* Used to space certain checks in time. */
>>>>> +    uint64_t next_check_tsc;
>>>>>      /* If non-NULL, outermost cycle timer currently running in PMD. */
>>>>>      struct cycle_timer *cur_timer;
>>>>>      /* Set of PMD counters with their zero offsets. */
>>>>>      struct pmd_counters counters;
>>>>> +    /* Statistics of the current iteration. */
>>>>> +    struct iter_stats current;
>>>>> +    /* Totals for the current millisecond. */
>>>>> +    struct iter_stats totals;
>>>>> +    /* Histograms for the PMD metrics. */
>>>>> +    struct histogram cycles;
>>>>> +    struct histogram pkts;
>>>>> +    struct histogram cycles_per_pkt;
>>>>> +    struct histogram upcalls;
>>>>> +    struct histogram cycles_per_upcall;
>>>>> +    struct histogram pkts_per_batch;
>>>>> +    struct histogram max_vhost_qfill;
>>>>> +    /* Iteration history buffer. */
>>>>> +    struct history iterations;
>>>>> +    /* Millisecond history buffer. */
>>>>> +    struct history milliseconds;
>>>>>  };
>>>>>
>>>>>  /* Support for accurate timing of PMD execution on TSC clock cycle level.
>>>>> @@ -175,8 +244,15 @@ cycle_timer_stop(struct pmd_perf_stats *s,
>>>>>      return now - timer->start;
>>>>>  }
>>>>>
>>>>> +/* Functions to initialize and reset the PMD performance metrics. */
>>>>> +
>>>>>  void pmd_perf_stats_init(struct pmd_perf_stats *s);
>>>>>  void pmd_perf_stats_clear(struct pmd_perf_stats *s);
>>>>> +void non_pmd_perf_stats_clear(struct pmd_perf_stats *s);
>>>>> +void pmd_perf_stats_clear__(struct pmd_perf_stats *s);
>>>>> +
>>>>> +/* Functions to read and update PMD counters. */
>>>>> +
>>>>>  void pmd_perf_read_counters(struct pmd_perf_stats *s,
>>>>>                              uint64_t stats[PMD_N_STATS]);
>>>>>
>>>>> @@ -199,32 +275,175 @@ pmd_perf_update_counter(struct pmd_perf_stats *s,
>>>>>      atomic_store_relaxed(&s->counters.n[counter], tmp);
>>>>>  }
>>>>>
>>>>> +/* Functions to manipulate a sample history. */
>>>>> +
>>>>> +static inline void
>>>>> +histogram_add_sample(struct histogram *hist, uint32_t val)
>>>>> +{
>>>>> +    /* TODO: Can do better with binary search? */
>>>>> +    for (int i = 0; i < NUM_BINS-1; i++) {
>>>>> +        if (val <= hist->wall[i]) {
>>>>> +            hist->bin[i]++;
>>>>> +            return;
>>>>> +        }
>>>>> +    }
>>>>> +    hist->bin[NUM_BINS-1]++;
>>>>> +}
>>>>> +
>>>>> +uint64_t histogram_samples(const struct histogram *hist);
>>>>> +
>>>>> +static inline struct iter_stats *
>>>>> +history_current(struct history *h)
>>>>> +{
>>>>> +    return &h->sample[h->idx];
>>>>> +}
>>>>> +
>>>>> +static inline struct iter_stats *
>>>>> +history_next(struct history *h)
>>>>> +{
>>>>> +    struct iter_stats *next;
>>>>> +
>>>>> +    h->idx++;
>>>>> +    if (h->idx == HISTORY_LEN) {
>>>>> +        h->idx = 0;
>>>>> +    }
>>>>> +    next = &h->sample[h->idx];
>>>>> +    memset(next, 0, sizeof(*next));
>>>>> +    return next;
>>>>> +}
>>>>> +
>>>>> +static inline struct iter_stats *
>>>>> +history_store(struct history *h, struct iter_stats *is)
>>>>> +{
>>>>> +    if (is) {
>>>>> +        h->sample[h->idx] = *is;
>>>>> +    }
>>>>> +    /* Advance the history pointer */
>>>>> +    return history_next(h);
>>>>> +}
>>>>> +
>>>>> +/* Functions recording PMD metrics per iteration. */
>>>>> +
>>>>>  static inline void
>>>>>  pmd_perf_start_iteration(struct pmd_perf_stats *s)
>>>>>  {
>>>>> +    if (s->clear) {
>>>>> +        /* Clear the PMD stats before starting next iteration. */
>>>>> +        pmd_perf_stats_clear__(s);
>>>>> +    }
>>>>> +    /* Initialize the current interval stats. */
>>>>> +    memset(&s->current, 0, sizeof(struct iter_stats));
>>>>>      if (OVS_LIKELY(s->last_tsc)) {
>>>>>          /* We assume here that last_tsc was updated immediately prior at
>>>>>           * the end of the previous iteration, or just before the first
>>>>>           * iteration. */
>>>>> -        s->start_it_tsc = s->last_tsc;
>>>>> +        s->current.timestamp = s->last_tsc;
>>>>>      } else {
>>>>>          /* In case last_tsc has never been set before. */
>>>>> -        s->start_it_tsc = cycles_counter_update(s);
>>>>> +        s->current.timestamp = cycles_counter_update(s);
>>>>>      }
>>>>>  }
>>>>>
>>>>>  static inline void
>>>>> -pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets)
>>>>> +pmd_perf_end_iteration(struct pmd_perf_stats *s, int rx_packets,
>>>>> +                       int tx_packets, bool full_metrics)
>>>>>  {
>>>>> -    uint64_t cycles = cycles_counter_update(s) - s->start_it_tsc;
>>>>> +    uint64_t now_tsc = cycles_counter_update(s);
>>>>> +    struct iter_stats *cum_ms;
>>>>> +    uint64_t cycles, cycles_per_pkt = 0;
>>>>>
>>>>> -    if (rx_packets > 0) {
>>>>> +    if (OVS_UNLIKELY(s->current.timestamp == 0)) {
>>>>> +        /* Stats were cleared during the ongoing iteration. */
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    cycles = now_tsc - s->current.timestamp;
>>>>> +    s->current.cycles = cycles;
>>>>> +    s->current.pkts = rx_packets;
>>>>> +
>>>>> +    if (rx_packets + tx_packets > 0) {
>>>>>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_BUSY, cycles);
>>>>>      } else {
>>>>>          pmd_perf_update_counter(s, PMD_CYCLES_ITER_IDLE, cycles);
>>>>>      }
>>>>> +    /* Add iteration samples to histograms. */
>>>>> +    histogram_add_sample(&s->cycles, cycles);
>>>>> +    histogram_add_sample(&s->pkts, rx_packets);
>>>>> +
>>>>> +    if (!full_metrics) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    s->counters.n[PMD_CYCLES_UPCALL] += s->current.upcall_cycles;
>>>>> +
>>>>> +    if (rx_packets > 0) {
>>>>> +        cycles_per_pkt = cycles / rx_packets;
>>>>> +        histogram_add_sample(&s->cycles_per_pkt, cycles_per_pkt);
>>>>> +    }
>>>>> +    if (s->current.batches > 0) {
>>>>> +        histogram_add_sample(&s->pkts_per_batch,
>>>>> +                             rx_packets / s->current.batches);
>>>>> +    }
>>>>> +    histogram_add_sample(&s->upcalls, s->current.upcalls);
>>>>> +    if (s->current.upcalls > 0) {
>>>>> +        histogram_add_sample(&s->cycles_per_upcall,
>>>>> +                             s->current.upcall_cycles / s->current.upcalls);
>>>>> +    }
>>>>> +    histogram_add_sample(&s->max_vhost_qfill, s->current.max_vhost_qfill);
>>>>> +
>>>>> +    /* Add iteration samples to millisecond stats. */
>>>>> +    cum_ms = history_current(&s->milliseconds);
>>>>> +    cum_ms->iterations++;
>>>>> +    cum_ms->cycles += cycles;
>>>>> +    if (rx_packets > 0) {
>>>>> +        cum_ms->busy_cycles += cycles;
>>>>> +    }
>>>>> +    cum_ms->pkts += s->current.pkts;
>>>>> +    cum_ms->upcalls += s->current.upcalls;
>>>>> +    cum_ms->upcall_cycles += s->current.upcall_cycles;
>>>>> +    cum_ms->batches += s->current.batches;
>>>>> +    cum_ms->max_vhost_qfill += s->current.max_vhost_qfill;
>>>>> +
>>>>> +    /* Store in iteration history. */
>>>>> +    history_store(&s->iterations, &s->current);
>>>>> +    if (now_tsc > s->next_check_tsc) {
>>>>> +        /* Check if ms is completed and store in milliseconds history. */
>>>>> +        uint64_t now = time_msec();
>>>>> +        if (now != cum_ms->timestamp) {
>>>>> +            /* Add ms stats to totals. */
>>>>> +            s->totals.iterations += cum_ms->iterations;
>>>>> +            s->totals.cycles += cum_ms->cycles;
>>>>> +            s->totals.busy_cycles += cum_ms->busy_cycles;
>>>>> +            s->totals.pkts += cum_ms->pkts;
>>>>> +            s->totals.upcalls += cum_ms->upcalls;
>>>>> +            s->totals.upcall_cycles += cum_ms->upcall_cycles;
>>>>> +            s->totals.batches += cum_ms->batches;
>>>>> +            s->totals.max_vhost_qfill += cum_ms->max_vhost_qfill;
>>>>> +            cum_ms = history_next(&s->milliseconds);
>>>>> +            cum_ms->timestamp = now;
>>>>> +        }
>>>>> +        s->next_check_tsc = now_tsc + 10000;
>>>>> +    }
>>>>>  }
>>>>>
>>>>> +/* Functions for formatting the output of commands. */
>>>>> +
>>>>> +struct pmd_perf_params {
>>>>> +    int command_type;
>>>>> +    bool histograms;
>>>>> +    size_t iter_hist_len;
>>>>> +    size_t ms_hist_len;
>>>>> +};
>>>>> +
>>>>> +void pmd_perf_format_overall_stats(struct ds *str, struct pmd_perf_stats *s,
>>>>> +                                   double duration);
>>>>> +void pmd_perf_format_histograms(struct ds *str, struct pmd_perf_stats *s);
>>>>> +void pmd_perf_format_iteration_history(struct ds *str,
>>>>> +                                       struct pmd_perf_stats *s,
>>>>> +                                       int n_iter);
>>>>> +void pmd_perf_format_ms_history(struct ds *str, struct pmd_perf_stats *s,
>>>>> +                                int n_ms);
>>>>> +
>>>>>  #ifdef  __cplusplus
>>>>>  }
>>>>>  #endif
>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>> index 48a8ebb..f5931bf 100644
>>>>> --- a/lib/dpif-netdev.c
>>>>> +++ b/lib/dpif-netdev.c
>>>>> @@ -49,6 +49,7 @@
>>>>>  #include "id-pool.h"
>>>>>  #include "latch.h"
>>>>>  #include "netdev.h"
>>>>> +#include "netdev-provider.h"
>>>>>  #include "netdev-vport.h"
>>>>>  #include "netlink.h"
>>>>>  #include "odp-execute.h"
>>>>> @@ -281,6 +282,8 @@ struct dp_netdev {
>>>>>
>>>>>      /* Probability of EMC insertions is a factor of 'emc_insert_min'.*/
>>>>>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE) atomic_uint32_t emc_insert_min;
>>>>> +    /* Enable collection of PMD performance metrics. */
>>>>> +    ATOMIC(bool) pmd_perf_metrics;
>>>>
>>>> atomic_bool ?
>>>
>>> Is atomic_bool here an overkill too?
>>>
>>>>
>>>>>
>>>>>      /* Protects access to ofproto-dpif-upcall interface during 
>>>>> revalidator
>>>>>       * thread synchronization. */
>>>>> @@ -712,6 +715,8 @@ static inline bool emc_entry_alive(struct emc_entry *ce);
>>>>>  static void emc_clear_entry(struct emc_entry *ce);
>>>>>
>>>>>  static void dp_netdev_request_reconfigure(struct dp_netdev *dp);
>>>>> +static inline bool
>>>>> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd);
>>>>>
>>>>>  static void
>>>>>  emc_cache_init(struct emc_cache *flow_cache)
>>>>> @@ -795,7 +800,8 @@ get_dp_netdev(const struct dpif *dpif)
>>>>>  enum pmd_info_type {
>>>>>      PMD_INFO_SHOW_STATS,  /* Show how cpu cycles are spent. */
>>>>>      PMD_INFO_CLEAR_STATS, /* Set the cycles count to 0. */
>>>>> -    PMD_INFO_SHOW_RXQ     /* Show poll-lists of pmd threads. */
>>>>> +    PMD_INFO_SHOW_RXQ,    /* Show poll lists of pmd threads. */
>>>>> +    PMD_INFO_PERF_SHOW,   /* Show pmd performance details. */
>>>>>  };
>>>>>
>>>>>  static void
>>>>> @@ -886,6 +892,44 @@ pmd_info_show_stats(struct ds *reply,
>>>>>                    stats[PMD_CYCLES_ITER_BUSY], total_packets);
>>>>>  }
>>>>>
>>>>> +static void
>>>>> +pmd_info_show_perf(struct ds *reply,
>>>>> +                   struct dp_netdev_pmd_thread *pmd,
>>>>> +                   struct pmd_perf_params *par)
>>>>> +{
>>>>> +    if (pmd->core_id != NON_PMD_CORE_ID) {
>>>>> +        char *time_str =
>>>>> +                xastrftime_msec("%H:%M:%S.###", time_wall_msec(), true);
>>>>> +        long long now = time_msec();
>>>>> +        double duration = (now - pmd->perf_stats.start_ms) / 1000.0;
>>>>> +
>>>>> +        ds_put_cstr(reply, "\n");
>>>>> +        ds_put_format(reply, "Time: %s\n", time_str);
>>>>> +        ds_put_format(reply, "Measurement duration: %.3f s\n", duration);
>>>>> +        ds_put_cstr(reply, "\n");
>>>>> +        format_pmd_thread(reply, pmd);
>>>>> +        ds_put_cstr(reply, "\n");
>>>>> +        pmd_perf_format_overall_stats(reply, &pmd->perf_stats, duration);
>>>>> +        if (pmd_perf_metrics_enabled(pmd)) {
>>>>> +            if (par->histograms) {
>>>>> +                ds_put_cstr(reply, "\n");
>>>>> +                pmd_perf_format_histograms(reply, &pmd->perf_stats);
>>>>> +            }
>>>>> +            if (par->iter_hist_len > 0) {
>>>>> +                ds_put_cstr(reply, "\n");
>>>>> +                pmd_perf_format_iteration_history(reply, 
>>>>> &pmd->perf_stats,
>>>>> +                        par->iter_hist_len);
>>>>> +            }
>>>>> +            if (par->ms_hist_len > 0) {
>>>>> +                ds_put_cstr(reply, "\n");
>>>>> +                pmd_perf_format_ms_history(reply, &pmd->perf_stats,
>>>>> +                        par->ms_hist_len);
>>>>> +            }
>>>>> +        }
>>>>> +        free(time_str);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>  static int
>>>>>  compare_poll_list(const void *a_, const void *b_)
>>>>>  {
>>>>> @@ -1088,9 +1132,15 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>>>>>          if (type == PMD_INFO_SHOW_RXQ) {
>>>>>              pmd_info_show_rxq(&reply, pmd);
>>>>>          } else if (type == PMD_INFO_CLEAR_STATS) {
>>>>> -            pmd_perf_stats_clear(&pmd->perf_stats);
>>>>> +            if (pmd->core_id == NON_PMD_CORE_ID) {
>>>>> +                non_pmd_perf_stats_clear(&pmd->perf_stats);
>>>>> +            } else {
>>>>> +                pmd_perf_stats_clear(&pmd->perf_stats);
>>>>> +            }
>>>>
>>>> Can we call clear() only for PMD threads here? We're not counting stats
>>>> for non-pmd anyway.
>>>
>>> The non-pmd thread does have the basic set of PMD counters for the
>>> purpose of pmd-stats-show which need to be cleared at pmd-stats-clear.
>>>
>>> Coming back to the discussion of invoking the blocking
>>> pmd_perf_stats_clear(&pmd->perf_stats) here:
>>>
>>> This code fragment is executed in the following loop in
>>> dpif_netdev_pmd_info():
>>>
>>>     sorted_poll_thread_list(dp, &pmd_list, &n);
>>>     for (size_t i = 0; i < n; i++) {
>>>         struct dp_netdev_pmd_thread *pmd = pmd_list[i];
>>>         if (!pmd) {
>>>             break;
>>>         }
>>>         if (filter_on_pmd && pmd->core_id != core_id) {
>>>             continue;
>>>         }
>>>         if (type == PMD_INFO_SHOW_RXQ) {
>>>             pmd_info_show_rxq(&reply, pmd);
>>>         } else if (type == PMD_INFO_CLEAR_STATS) {
>>>             if (pmd->core_id == NON_PMD_CORE_ID) {
>>>                 non_pmd_perf_stats_clear(&pmd->perf_stats);
>>>             } else {
>>>                 pmd_perf_stats_clear(&pmd->perf_stats);
>>>             }
>>>         } else if (type == PMD_INFO_SHOW_STATS) {
>>>             pmd_info_show_stats(&reply, pmd);
>>>         } else if (type == PMD_INFO_PERF_SHOW) {
>>>             pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
>>>         }
>>>     }
>>>     free(pmd_list);
>>>
>>> We are iterating over a sorted list of pmd pointers from the cmap
>>> dp->poll_threads under the protection of the global dp_netdev_mutex lock.
>>> Each pmd struct in this cmap (except for the NON_PMD_CORE_ID) is running
>>> a pmd thread because the thread is created immediately after inserting the
>>> pmd into dp->poll_threads in reconfigure_pmd_threads() and ended
>>> immediately before removing the pmd from that cmap in
>>> dp_netdev_del_pmd(). Can either of this happen while the
>>> dp_netdev_mutext is held. If not then there is not even a potential race
>>> condition.
>>>
>>> So I think the we can be sure that the pmd thread is running when calling
>>> pmd_perf_stats_clear() here. Do you agree?
> 
> The main issue here is that the PMD thread could sleep and never call
> pmd_perf_start_iteration(). The PMD thread could wait for a reload inside
> the following loop if it has no rx queues to poll:
> 
> 4258     if (!poll_cnt) {
> 4259         while (seq_read(pmd->reload_seq) == pmd->last_reload_seq) {
> 4260             seq_wait(pmd->reload_seq, pmd->last_reload_seq);
> 4261             poll_block();
> 4262         }
> 4263         lc = UINT_MAX;
> 4264     }
> 
> And the main thread will wait forever.
> 
> 
> About volatiles and atomics:
> 'volatile' only forces the compiler to use actual loads/stores. But it
> doesn't protect from operations' reordering by the compiler, and definitely
> not from memory operations' reordering on the CPU.
> 
> Let's look at the pmd_perf_stats_clear__() function:
> 
> 365 void
> 366 pmd_perf_stats_clear__(struct pmd_perf_stats *s)
> 367 {
> 368     for (int i = 0; i < PMD_N_STATS; i++) {
> 369         atomic_read_relaxed(&s->counters.n[i], &s->counters.zero[i]);
> 370     }
> 371     memset(&s->current, 0 , sizeof(struct iter_stats));
> 372     memset(&s->totals, 0 , sizeof(struct iter_stats));
> 373     histogram_clear(&s->cycles);
> 374     histogram_clear(&s->pkts);
> 375     histogram_clear(&s->cycles_per_pkt);
> 376     histogram_clear(&s->upcalls);
> 377     histogram_clear(&s->cycles_per_upcall);
> 378     histogram_clear(&s->pkts_per_batch);
> 379     histogram_clear(&s->max_vhost_qfill);
> 380     history_init(&s->iterations);
> 381     history_init(&s->milliseconds);
> 382     s->start_ms = time_msec();
> 383     s->milliseconds.sample[0].timestamp = s->start_ms;
> 384     s->log_begin_it = UINT64_MAX;
> 385     s->log_end_it = UINT64_MAX;
> 386     /* Clearing finished. */
> 387     s->clear = false;
> 388 }
> 
> Let's assume that 'clear' is volatile.
> 's->clear = false;' is the last operation in the code, but the compiler is
> able to move it anywhere in this function. This will break the logic.
> To avoid this reordering a compiler barrier is required, but it will not
> help with CPU reordering. In this case, for non-x86 architectures, a real
> SMP write barrier is required.
> A mutex would be a much better solution.
> 
> Let's assume that 'clear' is atomic.
> An atomic will provide protection from compiler reordering just like
> volatile. But to prevent reordering on the CPU we'll have to use
> release-acquire memory ordering for the atomic operations. Isn't that
> already a mutex?
> 
> Conclusion:
> Since you want to protect your shared data from access by other threads,
> you have to use a mutex or not protect it at all (atomics and volatiles are
> effectively useless for this use case unless you're combining them with
> memory barriers or heavy memory-ordering qualifiers).
> Given that you read the stats directly (non-atomically and without volatile
> access) you will never have accurate data anyway. From that point of view,
> trying to protect the data clearing just to get slightly more accurate
> stats-show output looks strange.
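>
> To illustrate, a minimal sketch of such a release-acquire handshake on the
> 'clear' flag using the ovs-atomic API (illustration only, not code from the
> patch):
>
>     /* CLI thread: publish the clear request. Everything written before
>      * this store is visible to the PMD after its acquire read. */
>     atomic_store_explicit(&s->clear, true, memory_order_release);
>
>     /* PMD thread, at the start of the next iteration: */
>     bool clear;
>     atomic_read_explicit(&s->clear, &clear, memory_order_acquire);
>     if (clear) {
>         pmd_perf_stats_clear__(s);
>         /* Make the cleared stats visible before acknowledging. */
>         atomic_store_explicit(&s->clear, false, memory_order_release);
>     }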
> 
> Some reading:
> https://www.kernel.org/doc/Documentation/process/volatile-considered-harmful.rst
> 
> 
> About the performance impact of relaxed atomic operations:
> On most modern architectures, relaxed load/store operations on 32-bit
> integers in aligned memory simply become ordinary operations without any
> memory fences. This means that an rcu_get operation like get_vid() costs
> nothing.
> IMHO, it's clear that we should use atomic_bool for 'pmd_perf_metrics' to
> avoid any issues with exotic architectures. The compiler will do all the
> dirty work for us without performance impact in the general case.
> 
> 
>>>
>>>>
>>>>>          } else if (type == PMD_INFO_SHOW_STATS) {
>>>>>              pmd_info_show_stats(&reply, pmd);
>>>>> +        } else if (type == PMD_INFO_PERF_SHOW) {
>>>>> +            pmd_info_show_perf(&reply, pmd, (struct pmd_perf_params *)aux);
>>>>>          }
>>>>>      }
>>>>>      free(pmd_list);
>>>>> @@ -1100,6 +1150,48 @@ dpif_netdev_pmd_info(struct unixctl_conn *conn, int argc, const char *argv[],
>>>>>      unixctl_command_reply(conn, ds_cstr(&reply));
>>>>>      ds_destroy(&reply);
>>>>>  }
>>>>> +
>>>>> +static void
>>>>> +pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
>>>>> +                          const char *argv[],
>>>>> +                          void *aux OVS_UNUSED)
>>>>> +{
>>>>> +    struct pmd_perf_params par;
>>>>> +    long int it_hist = 0, ms_hist = 0;
>>>>> +    par.histograms = true;
>>>>> +
>>>>> +    while (argc > 1) {
>>>>> +        if (!strcmp(argv[1], "-nh")) {
>>>>> +            par.histograms = false;
>>>>> +            argc -= 1;
>>>>> +            argv += 1;
>>>>> +        } else if (!strcmp(argv[1], "-it") && argc > 2) {
>>>>> +            it_hist = strtol(argv[2], NULL, 10);
>>>>> +            if (it_hist < 0) {
>>>>> +                it_hist = 0;
>>>>> +            } else if (it_hist > HISTORY_LEN) {
>>>>> +                it_hist = HISTORY_LEN;
>>>>> +            }
>>>>> +            argc -= 2;
>>>>> +            argv += 2;
>>>>> +        } else if (!strcmp(argv[1], "-ms") && argc > 2) {
>>>>> +            ms_hist = strtol(argv[2], NULL, 10);
>>>>> +            if (ms_hist < 0) {
>>>>> +                ms_hist = 0;
>>>>> +            } else if (ms_hist > HISTORY_LEN) {
>>>>> +                ms_hist = HISTORY_LEN;
>>>>> +            }
>>>>> +            argc -= 2;
>>>>> +            argv += 2;
>>>>> +        } else {
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +    par.iter_hist_len = it_hist;
>>>>> +    par.ms_hist_len = ms_hist;
>>>>> +    par.command_type = PMD_INFO_PERF_SHOW;
>>>>> +    dpif_netdev_pmd_info(conn, argc, argv, &par);
>>>>> +}
>>>>>
>>>>>  static int
>>>>>  dpif_netdev_init(void)
>>>>> @@ -1117,6 +1209,12 @@ dpif_netdev_init(void)
>>>>>      unixctl_command_register("dpif-netdev/pmd-rxq-show", "[-pmd
>>> core] [dp]",
>>>>>                               0, 3, dpif_netdev_pmd_info,
>>>>>                               (void *)&poll_aux);
>>>>> +    unixctl_command_register("dpif-netdev/pmd-perf-show",
>>>>> +                             "[-nh] [-it iter-history-len]"
>>>>> +                             " [-ms ms-history-len]"
>>>>> +                             " [-pmd core | dp]",
>>>>> +                             0, 7, pmd_perf_show_cmd,
>>>>> +                             NULL);
>>>>>      unixctl_command_register("dpif-netdev/pmd-rxq-rebalance", "[dp]",
>>>>>                               0, 1, dpif_netdev_pmd_rebalance,
>>>>>                               NULL);
>>>>> @@ -3003,6 +3101,18 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>>>>>          }
>>>>>      }
>>>>>
>>>>> +    bool perf_enabled = smap_get_bool(other_config, "pmd-perf-metrics", false);
>>>>> +    bool cur_perf_enabled;
>>>>> +    atomic_read_relaxed(&dp->pmd_perf_metrics, &cur_perf_enabled);
>>>>> +    if (perf_enabled != cur_perf_enabled) {
>>>>> +        atomic_store_relaxed(&dp->pmd_perf_metrics, perf_enabled);
>>>>> +        if (perf_enabled) {
>>>>> +            VLOG_INFO("PMD performance metrics collection enabled");
>>>>> +        } else {
>>>>> +            VLOG_INFO("PMD performance metrics collection disabled");
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>>      return 0;
>>>>>  }
>>>>>
>>>>> @@ -3139,6 +3249,20 @@ dp_netdev_rxq_set_cycles(struct dp_netdev_rxq *rx,
>>>>>     atomic_store_relaxed(&rx->cycles[type], cycles);
>>>>>  }
>>>>>
>>>>> +static inline bool
>>>>> +pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)
>>>>> +{
>>>>> +    /* If stores and reads of 64-bit integers are not atomic, the
>>>>> +     * full PMD performance metrics are not available as locked
>>>>> +     * access to 64 bit integers would be prohibitively expensive. */
>>>>> +    if (sizeof(uint64_t) > sizeof(void *)) {
>>>>> +        return false;
>>>>> +    }
>>>>> +    bool pmd_perf_enabled;
>>>>> +    atomic_read_relaxed(&pmd->dp->pmd_perf_metrics, &pmd_perf_enabled);
>>>>> +    return pmd_perf_enabled;
>>>>> +}
>>>>> +
>>>>>  static void
>>>>>  dp_netdev_rxq_add_cycles(struct dp_netdev_rxq *rx,
>>>>>                           enum rxq_cycles_counter_type type,
>>>>> @@ -3247,10 +3371,11 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>>>>                             struct dp_netdev_rxq *rxq,
>>>>>                             odp_port_t port_no)
>>>>>  {
>>>>> +    struct pmd_perf_stats *s = &pmd->perf_stats;
>>>>>      struct dp_packet_batch batch;
>>>>>      struct cycle_timer timer;
>>>>>      int error;
>>>>> -    int batch_cnt = 0, output_cnt = 0;
>>>>> +    int batch_cnt = 0;
>>>>>      uint64_t cycles;
>>>>>
>>>>>      /* Measure duration for polling and processing rx burst. */
>>>>> @@ -3264,15 +3389,34 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>>>>          /* At least one packet received. */
>>>>>          *recirc_depth_get() = 0;
>>>>>          pmd_thread_ctx_time_update(pmd);
>>>>> -
>>>>>          batch_cnt = batch.count;
>>>>> +        if (pmd_perf_metrics_enabled(pmd)) {
>>>>> +            /* Update batch histogram. */
>>>>> +            s->current.batches++;
>>>>> +            histogram_add_sample(&s->pkts_per_batch, batch_cnt);
>>>>> +            /* Update the maximum Rx queue fill level. */
>>>>> +            int dev_type = netdev_dpdk_get_type(
>>>>> +                                netdev_rxq_get_netdev(rxq->rx));
>>>>> +            if (dev_type == DPDK_DEV_VHOST) {
>>>>> +                /* Check queue fill level for vhostuser ports. */
>>>>> +                uint32_t qfill = batch_cnt;
>>>>> +                if (OVS_UNLIKELY(batch_cnt == NETDEV_MAX_BURST)) {
>>>>> +                    /* Likely more packets in rxq. */
>>>>> +                    qfill += netdev_rxq_length(rxq->rx);
>>>>> +                }
>>>>> +                if (qfill > s->current.max_vhost_qfill) {
>>>>> +                    s->current.max_vhost_qfill = qfill;
>>>>> +                }
>>>>> +            }
>>>>> +        }
>>>>
>>>>
>>>> Oh. This is still ugly.
>>>> My suggestion:
>>>> Remove the implementation of 'netdev_rxq_length' for all ports except
>>>> vhostuser*. After that you'll be able to check the type of netdev just
>>>> by checking the result of this function, and move all the netdev-dpdk
>>>> internals back to netdev-dpdk:
>>>> -----------------------------------------------------------------------
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>> index b493216..83446f7 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -3400,17 +3400,14 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>>>              s->current.batches++;
>>>>              histogram_add_sample(&s->pkts_per_batch, batch_cnt);
>>>>              /* Update the maximum Rx queue fill level. */
>>>> -            int dev_type = netdev_dpdk_get_type(
>>>> -                                netdev_rxq_get_netdev(rxq->rx));
>>>> -            if (dev_type == DPDK_DEV_VHOST) {
>>>> -                /* Check queue fill level for vhostuser ports. */
>>>> -                uint32_t qfill = batch_cnt;
>>>> -                if (OVS_UNLIKELY(batch_cnt == NETDEV_MAX_BURST)) {
>>>> -                    /* Likely more packets in rxq. */
>>>> -                    qfill += netdev_rxq_length(rxq->rx);
>>>> -                }
>>>> -                if (qfill > s->current.max_vhost_qfill) {
>>>> -                    s->current.max_vhost_qfill = qfill;
>>>> +            if (OVS_UNLIKELY(batch_cnt == NETDEV_MAX_BURST)) {
>>>> +                /* Likely more packets in rxq. */
>>>> +                int res = netdev_rxq_length(rxq->rx);
>>>> +
>>>> +                /* Currently, netdev_rxq_length() implemented only for vhost
>>>> +                 * ports. So, we don't need to check netdev type here. */
>>>> +                if (res > 0 && batch_cnt + res > s->current.max_vhost_qfill) {
>>>> +                    s->current.max_vhost_qfill = batch_cnt + res;
>>>> +                    s->current.max_vhost_qfill = batch_cnt + res;
>>>>                  }
>>>>              }
>>>>          }
>>>> -----------------------------------------------------------------------
>>>>
>>>> 'netdev_rxq_length()' is not used anywhere else anyway and probably never
>>>> will be.
>>>>
>>>> Suggestion 2:
>>>> How about renaming "max_vhost_qfill" --> "max_qfill"? This would make the
>>>> code more generic. In this case we could remove the explanations from the
>>>> code above.
>>>>
>>>
>>> I am afraid this little "ugliness" is unavoidable.It is the explicit 
>>> purpose of the
>>> PMD performance metrics and supervision feature to monitor only the
>>> vhostuser queue length:
>>> 1. Physical port rx queues are much longer (2K default vs 128 with stock
>>> Qemu and virtio drivers) so mixing them would obfuscate the key data we
>>> are interested in.
>>> 2. There is little value in supervising phy port rx queues for excessive
>>> queue fill level, as the NICs provide rx drop counters, in contrast to
>>> vhostuser (until, hopefully, a future virtio standard provides a drop
>>> counter and virtio drivers use it to report dropped tx packets).
>>> 3. Removing the explicit netdev type check in dpif-netdev as you suggest
>>> would base the correct working of the PMD performance metrics and
>>> supervision on design decisions taken in the netdev complex. Somebody
>>> implementing rxq_length() for the dpdk_class or dpdk_ring_class for
>>> whatever reason would break PMD performance metrics inadvertently. I
>>> don't think we want such implicit coupling and we don't want to specifically
>>> reserve this netdev rxq operation for the purpose of PMD metrics either.
> 
> We can leave a warning inside dpdk_class that describes the situation, to
> prevent breakage if someone wants to implement rxq_length() for it.
> But I don't see any reasonable use case for that.
> 
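
Such a warning could be as simple as a comment next to the class
definition, e.g. (hypothetical wording and placement):

        /* In netdev-dpdk.c, next to the dpdk_class definition:
         *
         * Deliberately no rxq_length() implementation for physical DPDK
         * ports: dpif-netdev's PMD performance metrics use a working
         * netdev_rxq_length() to identify vhostuser ports.  Implementing
         * it here would silently break that logic. */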


Another option is to use
        
        if (!strncmp(netdev_get_type(netdev_rxq_get_netdev(rxq->rx)),
                     "dpdkvhost", 9)) {
                ...
        }

Checking a 9-char prefix should not take long.
Could you please check the performance of this solution?
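
If the per-iteration strncmp() turns out to be measurable, the
classification could also be done once when the rxq is configured and
cached, so that the hot loop only tests a flag. A rough sketch (the
'is_vhost' field is a hypothetical addition to struct dp_netdev_rxq):

        /* At rxq setup time, e.g. in port_reconfigure(): */
        rxq->is_vhost = !strncmp(
            netdev_get_type(netdev_rxq_get_netdev(rxq->rx)),
            "dpdkvhost", 9);

        /* Per iteration in dp_netdev_process_rxq_port(): */
        if (rxq->is_vhost) {
                /* account rxq length */
        }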

We can also add a field like numeric_type to the netdev class and
fill it from a global enum defined in netdev-provider.h near the
netdev class definitions, and implement something like
netdev_get_numeric_type():


enum netdev_numeric_type {
        NETDEV_TYPE_BSD,
        NETDEV_TYPE_WINDOWS,
        NETDEV_TYPE_LINUX,
        NETDEV_TYPE_INTERNAL,
        NETDEV_TYPE_TAP,
        NETDEV_TYPE_DPDK,
        NETDEV_TYPE_DPDK_RING,
        NETDEV_TYPE_DPDK_VHOST,
        NETDEV_TYPE_DPDK_VHOST_CLIENT,
        NETDEV_TYPE_TUNNEL_GENEVE,
        NETDEV_TYPE_TUNNEL_GRE,
        NETDEV_TYPE_TUNNEL_VXLAN,
        NETDEV_TYPE_TUNNEL_LISP,
        NETDEV_TYPE_TUNNEL_STT,
};

struct netdev_class {
        ...
        const char *type;
        enum netdev_numeric_type numeric_type;
        ...
};

enum netdev_numeric_type
netdev_get_numeric_type(const struct netdev *netdev)
{
    return netdev->netdev_class->numeric_type;
}


static const struct netdev_class dpdk_vhost_client_class =
    NETDEV_DPDK_CLASS(
        "dpdkvhostuserclient",
        NETDEV_TYPE_DPDK_VHOST_CLIENT,
        NULL,
        netdev_dpdk_vhost_client_construct,
        netdev_dpdk_vhost_destruct,
        ....
    );


And use

        enum netdev_numeric_type n_type =
            netdev_get_numeric_type(netdev_rxq_get_netdev(rxq->rx));
        if (n_type == NETDEV_TYPE_DPDK_VHOST
            || n_type == NETDEV_TYPE_DPDK_VHOST_CLIENT) {
                /* account rxq length */
        }
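
Compared with the string check above, the per-burst cost here is a
single pointer dereference plus an integer compare, but at the price of
touching every netdev_class definition in the tree to add the new field.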

>>>
>> [[BO'M]] I was going to agree with the point about not breaking the netdev 
>> abstraction here. But those points are convincing that in this case we do 
>> need to peek behind the abstract interface and only call rxq_length for 
>> vhost queues.
>>
>>>>> +        /* Process packet batch. */
>>>>>          dp_netdev_input(pmd, &batch, port_no);
>>>>>
>>>>>          /* Assign processing cycles to rx queue. */
>>>>>          cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
>>>>>          dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_PROC_CURR, cycles);
>>>>>
>>>>> -        output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
>>>>> +        dp_netdev_pmd_flush_output_packets(pmd, false);
>>>>>      } else {
>>>>>          /* Discard cycles. */
>>>>>          cycle_timer_stop(&pmd->perf_stats, &timer);
>>>>> @@ -3286,7 +3430,7 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>>>>
>>>>>      pmd->ctx.last_rxq = NULL;
>>>>>
>>>>> -    return batch_cnt + output_cnt;
>>>>> +    return batch_cnt;
>>>>>  }
>>>>>
>>>>>  static struct tx_port *
>>>>> @@ -4119,22 +4263,23 @@ reload:
>>>>>
>>>>>      cycles_counter_update(s);
>>>>>      for (;;) {
>>>>> -        uint64_t iter_packets = 0;
>>>>> +        uint64_t rx_packets = 0, tx_packets = 0;
>>>>>
>>>>>          pmd_perf_start_iteration(s);
>>>>> +
>>>>>          for (i = 0; i < poll_cnt; i++) {
>>>>>              process_packets =
>>>>>                  dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
>>>>>                                             poll_list[i].port_no);
>>>>> -            iter_packets += process_packets;
>>>>> +            rx_packets += process_packets;
>>>>>          }
>>>>>
>>>>> -        if (!iter_packets) {
>>>>> +        if (!rx_packets) {
>>>>>              /* We didn't receive anything in the process loop.
>>>>>               * Check if we need to send something.
>>>>>               * There was no time updates on current iteration. */
>>>>>              pmd_thread_ctx_time_update(pmd);
>>>>> -            iter_packets += dp_netdev_pmd_flush_output_packets(pmd, false);
>>>>> +            tx_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>>>>>          }
>>>>>
>>>>>          if (lc++ > 1024) {
>>>>> @@ -4153,7 +4298,8 @@ reload:
>>>>>                  break;
>>>>>              }
>>>>>          }
>>>>> -        pmd_perf_end_iteration(s, iter_packets);
>>>>> +        pmd_perf_end_iteration(s, rx_packets, tx_packets,
>>>>> +                               pmd_perf_metrics_enabled(pmd));
>>>>>      }
>>>>>
>>>>>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>>>>> @@ -5050,6 +5196,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>>>>>      struct match match;
>>>>>      ovs_u128 ufid;
>>>>>      int error;
>>>>> +    uint64_t cycles = cycles_counter_update(&pmd->perf_stats);
>>>>>
>>>>>      match.tun_md.valid = false;
>>>>>      miniflow_expand(&key->mf, &match.flow);
>>>>> @@ -5103,6 +5250,14 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>>>>>          ovs_mutex_unlock(&pmd->flow_mutex);
>>>>>          emc_probabilistic_insert(pmd, key, netdev_flow);
>>>>>      }
>>>>> +    if (pmd_perf_metrics_enabled(pmd)) {
>>>>> +        /* Update upcall stats. */
>>>>> +        cycles = cycles_counter_update(&pmd->perf_stats) - cycles;
>>>>> +        struct pmd_perf_stats *s = &pmd->perf_stats;
>>>>> +        s->current.upcalls++;
>>>>> +        s->current.upcall_cycles += cycles;
>>>>> +        histogram_add_sample(&s->cycles_per_upcall, cycles);
>>>>> +    }
>>>>>      return error;
>>>>>  }
>>>>>
>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>> index 4200556..7a8fdc2 100644
>>>>> --- a/lib/netdev-dpdk.c
>>>>> +++ b/lib/netdev-dpdk.c
>>>>> @@ -36,6 +36,7 @@
>>>>>  #include <rte_mbuf.h>
>>>>>  #include <rte_meter.h>
>>>>>  #include <rte_pci.h>
>>>>> +#include <rte_version.h>
>>>>>  #include <rte_vhost.h>
>>>>>  #include <rte_version.h>
>>>>>
>>>>> @@ -188,11 +189,6 @@ enum { DPDK_RING_SIZE = 256 };
>>>>>  BUILD_ASSERT_DECL(IS_POW2(DPDK_RING_SIZE));
>>>>>  enum { DRAIN_TSC = 200000ULL };
>>>>>
>>>>> -enum dpdk_dev_type {
>>>>> -    DPDK_DEV_ETH = 0,
>>>>> -    DPDK_DEV_VHOST = 1,
>>>>> -};
>>>>> -
>>>>>  /* Quality of Service */
>>>>>
>>>>>  /* An instance of a QoS configuration.  Always associated with a particular
>>>>> @@ -843,6 +839,13 @@ netdev_dpdk_cast(const struct netdev *netdev)
>>>>>      return CONTAINER_OF(netdev, struct netdev_dpdk, up);
>>>>>  }
>>>>>
>>>>> +enum dpdk_dev_type
>>>>> +netdev_dpdk_get_type(const struct netdev *netdev)
>>>>> +{
>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>> +    return dev->type;
>>>>> +}
>>>>> +
>>>>>  static struct netdev *
>>>>>  netdev_dpdk_alloc(void)
>>>>>  {
>>>>> diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
>>>>> index b7d02a7..2b357db 100644
>>>>> --- a/lib/netdev-dpdk.h
>>>>> +++ b/lib/netdev-dpdk.h
>>>>> @@ -22,11 +22,18 @@
>>>>>  #include "openvswitch/compiler.h"
>>>>>
>>>>>  struct dp_packet;
>>>>> +struct netdev;
>>>>> +
>>>>> +enum dpdk_dev_type {
>>>>> +    DPDK_DEV_ETH = 0,
>>>>> +    DPDK_DEV_VHOST = 1,
>>>>> +};
>>>>>
>>>>>  #ifdef DPDK_NETDEV
>>>>>
>>>>>  void netdev_dpdk_register(void);
>>>>>  void free_dpdk_buf(struct dp_packet *);
>>>>> +enum dpdk_dev_type netdev_dpdk_get_type(const struct netdev *netdev);
>>>>>
>>>>>  #else
>>>>>
>>>>> @@ -41,6 +48,13 @@ free_dpdk_buf(struct dp_packet *buf OVS_UNUSED)
>>>>>      /* Nothing */
>>>>>  }
>>>>>
>>>>> +static inline enum dpdk_dev_type
>>>>> +netdev_dpdk_get_type(const struct netdev *netdev OVS_UNUSED)
>>>>> +{
>>>>> +    /* Nothing to do. Return zero to keep the compiler happy. */
>>>>> +    return DPDK_DEV_ETH;
>>>>> +}
>>>>> +
>>>>>  #endif
>>>>>
>>>>>  #endif /* netdev-dpdk.h */
>>>>> diff --git a/lib/netdev-dpif-unixctl.man b/lib/netdev-dpif-unixctl.man
>>>>> new file mode 100644
>>>>> index 0000000..53f4c51
>>>>> --- /dev/null
>>>>> +++ b/lib/netdev-dpif-unixctl.man
>>>>> @@ -0,0 +1,113 @@
>>>>> +.SS "DPIF-NETDEV COMMANDS"
>>>>> +These commands are used to expose internal information (mostly statistics)
>>>>> +about the "dpif-netdev" userspace datapath. If there is only one datapath
>>>>> +(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
>>>>> +argument can be omitted. By default the commands present data for all pmd
>>>>> +threads in the datapath. By specifying the "-pmd core" option one can filter
>>>>> +the output for a single pmd in the datapath.
>>>>> +.
>>>>> +.IP "\fBdpif-netdev/pmd-stats-show\fR [\fB-pmd\fR \fIcore\fR]
>>> [\fIdp\fR]"
>>>>> +Shows performance statistics for one or all pmd threads of the datapath
>>>>> +\fIdp\fR. The special thread "main" sums up the statistics of every non
>>> pmd
>>>>> +thread.
>>>>> +
>>>>> +The sum of "emc hits", "masked hits" and "miss" is the number of
>>>>> +packet lookups performed by the datapath. Beware that a recirculated packet
>>>>> +experiences one additional lookup per recirculation, so there may be
>>>>> +more lookups than forwarded packets in the datapath.
>>>>> +
>>>>> +Cycles are counted using the TSC or similar facilities (when available on
>>>>> +the platform). The duration of one cycle depends on the processing platform.
>>>>> +
>>>>> +"idle cycles" refers to cycles spent in PMD iterations not forwarding any
>>>>> +any packets. "processing cycles" refers to cycles spent in PMD iterations
>>>>> +forwarding at least one packet, including the cost for polling, 
>>>>> processing
>>> and
>>>>> +transmitting said packets.
>>>>> +
>>>>> +To reset these counters use \fBdpif-netdev/pmd-stats-clear\fR.
>>>>> +.
>>>>> +.IP "\fBdpif-netdev/pmd-perf-show\fR [\fB-nh\fR] [\fB-it\fR
>>> \fIiter_len\fR] \
>>>>> +[\fB-ms\fR \fIms_len\fR] [\fB-pmd\fR \fIcore\fR] [\fIdp\fR]"
>>>>> +Shows detailed performance metrics for one or all pmds threads of the
>>>>> +user space datapath.
>>>>> +
>>>>> +The collection of detailed statistics can be controlled by a new
>>>>> +configuration parameter "other_config:pmd-perf-metrics". By default it
>>>>> +is disabled. The run-time overhead, when enabled, is in the order of 1%.
>>>>> +
>>>>> +The covered metrics per iteration are:
>>>>> +  - used cycles
>>>>> +  - forwarded packets
>>>>> +  - number of rx batches
>>>>> +  - packets/rx batch
>>>>> +  - max. vhostuser queue fill level
>>>>> +  - number of upcalls
>>>>> +  - cycles spent in upcalls
>>>>> +
>>>>> +This raw recorded data is used threefold:
>>>>> +
>>>>> +1. In histograms for each of the following metrics:
>>>>> +   - cycles/iteration (logarithmic)
>>>>> +   - packets/iteration (logarithmic)
>>>>> +   - cycles/packet
>>>>> +   - packets/batch
>>>>> +   - max. vhostuser qlen (logarithmic)
>>>>> +   - upcalls
>>>>> +   - cycles/upcall (logarithmic)
>>>>> +   The histogram bins are divided linearly or logarithmically.
>>>>> +
>>>>> +2. A cyclic history of the above metrics for 1024 iterations
>>>>> +
>>>>> +3. A cyclic history of the cumulative/average values per millisecond
>>>>> +   wall clock for the last 1024 milliseconds:
>>>>> +   - number of iterations
>>>>> +   - avg. cycles/iteration
>>>>> +   - packets (Kpps)
>>>>> +   - avg. packets/batch
>>>>> +   - avg. max vhost qlen
>>>>> +   - upcalls
>>>>> +   - avg. cycles/upcall
>>>>> +
>>>>> +The command options are
>>>>> +
>>>>> +    \fB-nh\fR:                  Suppress the histograms
>>>>> +    \fB-it\fR \fIiter_len\fR:   Display the last iter_len iteration stats
>>>>> +    \fB-ms\fR \fIms_len\fR:     Display the last ms_len millisecond stats
>>>>> +
>>>>> +The output always contains the following global PMD statistics:
>>>>> +
>>>>> +Time: 15:24:55.270
>>>>> +.br
>>>>> +Measurement duration: 1.008 s
>>>>> +
>>>>> +pmd thread numa_id 0 core_id 1:
>>>>> +
>>>>> +  Cycles:            2419034712  (2.40 GHz)
>>>>> +  Iterations:            572817  (1.76 us/it)
>>>>> +  - idle:                486808  (15.9 % cycles)
>>>>> +  - busy:                 86009  (84.1 % cycles)
>>>>> +  Packets:              2399607  (2381 Kpps, 848 cycles/pkt)
>>>>> +  Datapath passes:      3599415  (1.50 passes/pkt)
>>>>> +  - EMC hits:            336472  ( 9.3 %)
>>>>> +  - Megaflow hits:      3262943  (90.7 %, 1.00 subtbl lookups/hit)
>>>>> +  - Upcalls:                  0  ( 0.0 %, 0.0 us/upcall)
>>>>> +  - Lost upcalls:             0  ( 0.0 %)
>>>>> +
>>>>> +Here "Packets" actually reflects the number of packets forwarded by the
>>>>> +datapath. "Datapath passes" matches the number of packet lookups as
>>>>> +reported by the \fBdpif-netdev/pmd-stats-show\fR command.
>>>>> +
>>>>> +To reset the counters and start a new measurement use
>>>>> +\fBdpif-netdev/pmd-stats-clear\fR.
>>>>> +.
>>>>> +.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
>>>>> +Resets to zero the per pmd thread performance numbers shown by the
>>>>> +\fBdpif-netdev/pmd-stats-show\fR and \fBdpif-netdev/pmd-perf-show\fR commands.
>>>>> +It will NOT reset datapath or bridge statistics, only the values shown by
>>>>> +the above commands.
>>>>> +.
>>>>> +.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fB-pmd\fR \fIcore\fR]
>>> [\fIdp\fR]"
>>>>> +For one or all pmd threads of the datapath \fIdp\fR show the list of
>>> queue-ids
>>>>> +with port names, which this thread polls.
>>>>> +.
>>>>> +.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
>>>>> +Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
>>>>> diff --git a/manpages.mk b/manpages.mk
>>>>> index 351155f..9af2fa8 100644
>>>>> --- a/manpages.mk
>>>>> +++ b/manpages.mk
>>>>> @@ -256,6 +256,7 @@ vswitchd/ovs-vswitchd.8: \
>>>>>   lib/dpctl.man \
>>>>>   lib/memory-unixctl.man \
>>>>>   lib/netdev-dpdk-unixctl.man \
>>>>> + lib/netdev-dpif-unixctl.man \
>>>>>   lib/service.man \
>>>>>   lib/ssl-bootstrap.man \
>>>>>   lib/ssl.man \
>>>>> @@ -272,6 +273,7 @@ lib/daemon.man:
>>>>>  lib/dpctl.man:
>>>>>  lib/memory-unixctl.man:
>>>>>  lib/netdev-dpdk-unixctl.man:
>>>>> +lib/netdev-dpif-unixctl.man:
>>>>>  lib/service.man:
>>>>>  lib/ssl-bootstrap.man:
>>>>>  lib/ssl.man:
>>>>> diff --git a/vswitchd/ovs-vswitchd.8.in b/vswitchd/ovs-vswitchd.8.in
>>>>> index 80e5f53..7e4714a 100644
>>>>> --- a/vswitchd/ovs-vswitchd.8.in
>>>>> +++ b/vswitchd/ovs-vswitchd.8.in
>>>>> @@ -256,32 +256,7 @@ type).
>>>>>  ..
>>>>>  .so lib/dpctl.man
>>>>>  .
>>>>> -.SS "DPIF-NETDEV COMMANDS"
>>>>> -These commands are used to expose internal information (mostly statistics)
>>>>> -about the ``dpif-netdev'' userspace datapath. If there is only one datapath
>>>>> -(as is often the case, unless \fBdpctl/\fR commands are used), the \fIdp\fR
>>>>> -argument can be omitted.
>>>>> -.IP "\fBdpif-netdev/pmd-stats-show\fR [\fIdp\fR]"
>>>>> -Shows performance statistics for each pmd thread of the datapath \fIdp\fR.
>>>>> -The special thread ``main'' sums up the statistics of every non pmd thread.
>>>>> -The sum of ``emc hits'', ``masked hits'' and ``miss'' is the number of
>>>>> -packets received by the datapath.  Cycles are counted using the TSC or similar
>>>>> -facilities (when available on the platform).  To reset these counters use
>>>>> -\fBdpif-netdev/pmd-stats-clear\fR. The duration of one cycle depends on the
>>>>> -measuring infrastructure. ``idle cycles'' refers to cycles spent polling
>>>>> -devices but not receiving any packets. ``processing cycles'' refers to cycles
>>>>> -spent polling devices and successfully receiving packets, plus the cycles
>>>>> -spent processing said packets.
>>>>> -.IP "\fBdpif-netdev/pmd-stats-clear\fR [\fIdp\fR]"
>>>>> -Resets to zero the per pmd thread performance numbers shown by the
>>>>> -\fBdpif-netdev/pmd-stats-show\fR command.  It will NOT reset datapath or
>>>>> -bridge statistics, only the values shown by the above command.
>>>>> -.IP "\fBdpif-netdev/pmd-rxq-show\fR [\fIdp\fR]"
>>>>> -For each pmd thread of the datapath \fIdp\fR shows list of queue-ids with
>>>>> -port names, which this thread polls.
>>>>> -.IP "\fBdpif-netdev/pmd-rxq-rebalance\fR [\fIdp\fR]"
>>>>> -Reassigns rxqs to pmds in the datapath \fIdp\fR based on their current usage.
>>>>> -.
>>>>> +.so lib/netdev-dpif-unixctl.man
>>>>>  .so lib/netdev-dpdk-unixctl.man
>>>>>  .so ofproto/ofproto-dpif-unixctl.man
>>>>>  .so ofproto/ofproto-unixctl.man
>>>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>>>> index 61fb7b1..3aa8e8e 100644
>>>>> --- a/vswitchd/vswitch.xml
>>>>> +++ b/vswitchd/vswitch.xml
>>>>> @@ -375,6 +375,18 @@
>>>>>          </p>
>>>>>        </column>
>>>>>
>>>>> +      <column name="other_config" key="pmd-perf-metrics"
>>>>> +              type='{"type": "boolean"}'>
>>>>> +        <p>
>>>>> +          Enables recording of detailed PMD performance metrics for analysis
>>>>> +          and trouble-shooting. This can have a performance impact in the
>>>>> +          order of 1%.
>>>>> +        </p>
>>>>> +        <p>
>>>>> +          Defaults to false but can be changed at any time.
>>>>> +        </p>
>>>>> +      </column>
>>>>> +
>>>>>        <column name="other_config" key="n-handler-threads"
>>>>>                type='{"type": "integer", "minInteger": 1}'>
>>>>>          <p>
>>>>>
> 
> 