On 14.08.2017 16:12, Jan Scheurich wrote:
>>> From earlier in-house trials we know we need to target flush times of 50
>>> us or less, so we clearly need better time resolution. Sub-ms timing in PMD
>>> should be based on TSC cycles, which are already kept in the pmd struct.
>>> Could you provide a corresponding patch for performance testing?
>>
>> I don't think that TSC is suitable in this case. Some reasons:
>>
>> * non-PMD threads are able to float across cpu cores.
>> * Turbo-boost can be enabled or frequency can be adjusted manually after
>> DPDK init.
>> * TSC cycles are only calculated if DPDK is enabled.
>>
>> Currently TSC is only used for statistics, which don't need to be very
>> precise. For real features we need more accurate time accounting.
>>
>> I believe that CLOCK_MONOTONIC is able to provide at least microsecond
>> granularity on most systems. We just need to add one more wrapper
>> function like 'time_usec()' to lib/timeval.
>>
>
> We have tested the effect of turbo mode on TSC and there is none. The TSC
> frequency remains at the nominal clock speed, no matter if the core is
> clocked down or up. So, I believe for PMD threads (where performance matters)
> TSC would be an adequate and efficient clock.
It's highly platform-dependent, and testing on a few systems doesn't
guarantee anything.
On the other hand, POSIX guarantees the monotonic characteristics of
CLOCK_MONOTONIC.
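
If in doubt about a particular system, a standalone snippet like this
(not part of any patch) prints the advertised CLOCK_MONOTONIC resolution:

    #include <stdio.h>
    #include <time.h>

    int
    main(void)
    {
        struct timespec res;

        if (clock_getres(CLOCK_MONOTONIC, &res)) {
            perror("clock_getres");
            return 1;
        }
        printf("CLOCK_MONOTONIC resolution: %ld ns\n", (long) res.tv_nsec);
        return 0;
    }
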
>
> On PMDs I am a bit concerned about the overhead/latency introduced with the
> clock_gettime() system call, but I haven't done any measurements to check the
> actual impact. Have you?
Have you seen my incremental patches?
There is no overhead, because we're just replacing 'time_msec' with 'time_usec'.
No difference except converting timespec to usec instead of msec.
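
For illustration, a standalone sketch of the wrapper (the real helper
would live in lib/timeval.c next to time_msec() and reuse the existing
monotonic-clock plumbing there):

    #include <time.h>

    /* Microsecond counterpart to time_msec(); the only difference is
     * the conversion factor applied to the timespec. */
    long long
    time_usec(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long) ts.tv_sec * 1000 * 1000 + ts.tv_nsec / 1000;
    }
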
>
> If we go for CLOCK_MONOTONIC in microsecond resolution, we should make sure
> that the clock is read not more than once every iteration (and cache the
> us value as now in the pmd ctx struct as suggested in your other patch). But
> then for consistency also the XPS feature should use the PMD time in us
> resolution.
Again, please, look at my incremental patches.
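
In short, the approach there is to read the clock once per iteration and
cache it in the pmd context, e.g. (a rough sketch; the field and function
names below are illustrative, not necessarily what the patches use):

    /* Called once at the top of each polling iteration; output batching
     * and XPS then consume the same cached microsecond timestamp. */
    static void
    pmd_thread_ctx_time_update(struct dp_netdev_pmd_thread *pmd)
    {
        pmd->ctx.now = time_usec();
    }
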
>
> For non-PMD thread we could actually skip time-based output batching
> completely. The packet rates and the frequency of calls to dpif_netdev_run()
> in the main ovs-vswitchd thread are so low that time-based flushing doesn't
> seem to make much sense.
>
> Below you can find an alternative incremental patch on top of your RFC 4/4
> that uses TSC on PMD. We will be comparing the two alternatives for
> performance both with non-PMD guests (iperf3) as well as PMD guests (DPDK
> testpmd).
In your version you need to move all the output_batching related code
under #ifdef DPDK_NETDEV, because it will break userspace networking if
compiled without DPDK and output-max-latency != 0.
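
Roughly what I mean, as an illustrative sketch based on your diff (without
DPDK, cycles_counter() always returns 0, so a TSC-based deadline is
meaningless there):

    #ifdef DPDK_NETDEV
        if (!dp_packet_batch_is_empty(&p->output_pkts)
            && (force || p->flush_time <= pmd->last_cycles)) {
            output_cnt += dp_netdev_pmd_flush_output_on_port(pmd, p, now);
        }
    #else
        /* No TSC without DPDK; flush queued packets unconditionally. */
        if (!dp_packet_batch_is_empty(&p->output_pkts)) {
            output_cnt += dp_netdev_pmd_flush_output_on_port(pmd, p, now);
        }
    #endif
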
>
> BR, Jan
>
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 0d78ae4..8285786 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -265,7 +265,7 @@ struct dp_netdev {
> struct hmap ports;
> struct seq *port_seq; /* Incremented whenever a port changes. */
>
> - /* The time that a packet can wait in output batch for sending. */
> + /* Time in cycles that a packet can wait in output batch for sending. */
> atomic_uint32_t output_max_latency;
>
> /* Meters. */
> @@ -508,7 +508,7 @@ struct tx_port {
> int qid;
> long long last_used;
> struct hmap_node node;
> - long long output_time;
> + long long flush_time; /* Time in TSC cycles to flush output buffer. */
> struct dp_packet_batch output_pkts;
> };
>
> @@ -622,6 +622,7 @@ struct dpif_netdev {
> uint64_t last_port_seq;
> };
>
> +static inline unsigned long cycles_per_microsecond(void);
> static int get_port_by_number(struct dp_netdev *dp, odp_port_t port_no,
> struct dp_netdev_port **portp)
> OVS_REQUIRES(dp->port_mutex);
> @@ -2921,11 +2922,12 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
> uint32_t output_max_latency, cur_max_latency;
>
> output_max_latency = smap_get_int(other_config, "output-max-latency",
> - DEFAULT_OUTPUT_MAX_LATENCY);
> + DEFAULT_OUTPUT_MAX_LATENCY)
> + * cycles_per_microsecond();
> atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
> if (output_max_latency != cur_max_latency) {
> atomic_store_relaxed(&dp->output_max_latency, output_max_latency);
> - VLOG_INFO("Output maximum latency set to %"PRIu32" ms",
> + VLOG_INFO("Output maximum latency set to %"PRIu32" cycles",
> output_max_latency);
> }
>
> @@ -3091,6 +3093,16 @@ cycles_counter(void)
> #endif
> }
>
> +static inline unsigned long
> +cycles_per_microsecond(void)
> +{
> +#ifdef DPDK_NETDEV
> + return rte_get_tsc_hz() / (1000 * 1000);
> +#else
> + return 0;
> +#endif
> +}
> +
> /* Fake mutex to make sure that the calls to cycles_count_* are balanced */
> extern struct ovs_mutex cycles_counter_fake_mutex;
>
> @@ -3171,7 +3183,7 @@ dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>
> HMAP_FOR_EACH (p, node, &pmd->send_port_cache) {
> if (!dp_packet_batch_is_empty(&p->output_pkts)
> - && (force || p->output_time <= now)) {
> + && (force || p->flush_time <= pmd->last_cycles)) {
> output_cnt += dp_netdev_pmd_flush_output_on_port(pmd, p, now);
> }
> }
> @@ -3196,7 +3208,6 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>
> batch_cnt = batch.count;
> dp_netdev_input(pmd, &batch, port_no, now);
> - output_cnt = dp_netdev_pmd_flush_output_packets(pmd, now, false);
> } else if (error != EAGAIN && error != EOPNOTSUPP) {
> static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>
> @@ -3732,7 +3743,6 @@ dpif_netdev_run(struct dpif *dpif)
> struct dp_netdev_pmd_thread *non_pmd;
> uint64_t new_tnl_seq;
> int process_packets;
> - bool need_to_flush = true;
>
> ovs_mutex_lock(&dp->port_mutex);
> non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID);
> @@ -3751,21 +3761,16 @@ dpif_netdev_run(struct dpif *dpif)
> cycles_count_intermediate(non_pmd, process_packets
> ? PMD_CYCLES_PROCESSING
> : PMD_CYCLES_IDLE);
> - if (process_packets) {
> - need_to_flush = false;
> - }
> }
> }
> }
> - if (need_to_flush) {
> - /* We didn't receive anything in the process loop.
> - * Check if we need to send something. */
> - process_packets = dp_netdev_pmd_flush_output_packets(non_pmd,
> - 0, false);
> - cycles_count_intermediate(non_pmd, process_packets
> - ? PMD_CYCLES_PROCESSING
> - : PMD_CYCLES_IDLE);
> - }
> +
> + /* Check if queued packets need to be flushed. */
> + process_packets =
> + dp_netdev_pmd_flush_output_packets(non_pmd, 0, false);
> + cycles_count_intermediate(non_pmd, process_packets
> + ? PMD_CYCLES_PROCESSING
> + : PMD_CYCLES_IDLE);
>
> cycles_count_end(non_pmd, PMD_CYCLES_IDLE);
> dpif_netdev_xps_revalidate_pmd(non_pmd, time_msec(), false);
> @@ -3946,7 +3951,6 @@ reload:
> cycles_count_start(pmd);
> for (;;) {
> int process_packets;
> - bool need_to_flush = true;
>
> for (i = 0; i < poll_cnt; i++) {
> process_packets =
> @@ -3955,20 +3959,14 @@ reload:
> cycles_count_intermediate(pmd,
> process_packets ? PMD_CYCLES_PROCESSING
> : PMD_CYCLES_IDLE);
> - if (process_packets) {
> - need_to_flush = false;
> - }
> }
>
> - if (need_to_flush) {
> - /* We didn't receive anything in the process loop.
> - * Check if we need to send something. */
> - process_packets = dp_netdev_pmd_flush_output_packets(pmd,
> - 0, false);
> - cycles_count_intermediate(pmd,
> - process_packets ? PMD_CYCLES_PROCESSING
> - : PMD_CYCLES_IDLE);
> - }
> + /* Check if queued packets need to be flushed. */
> + process_packets =
> + dp_netdev_pmd_flush_output_packets(pmd, 0, false);
> + cycles_count_intermediate(pmd,
> + process_packets ? PMD_CYCLES_PROCESSING
> + : PMD_CYCLES_IDLE);
>
> if (lc++ > 1024) {
> bool reload;
> @@ -4593,7 +4591,7 @@ dp_netdev_add_port_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
>
> tx->port = port;
> tx->qid = -1;
> - tx->output_time = 0LL;
> + tx->flush_time = 0LL;
> dp_packet_batch_init(&tx->output_pkts);
>
> hmap_insert(&pmd->tx_ports, &tx->node, hash_port_no(tx->port->port_no));
> @@ -5276,11 +5274,12 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
> dp_netdev_pmd_flush_output_on_port(pmd, p, now);
> }
>
> - if (dp_packet_batch_is_empty(&p->output_pkts)) {
> + if (dp_packet_batch_is_empty(&p->output_pkts) &&
> + !dp_packet_batch_is_empty(packets_)) {
> uint32_t cur_max_latency;
>
> atomic_read_relaxed(&dp->output_max_latency, &cur_max_latency);
> - p->output_time = now + cur_max_latency;
> + p->flush_time = pmd->last_cycles + cur_max_latency;
> pmd->n_output_batches++;
> }
>
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 23930f0..8ad1d00 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -345,9 +345,9 @@
> </column>
>
> <column name="other_config" key="output-max-latency"
> - type='{"type": "integer", "minInteger": 0, "maxInteger":
> 1000}'>
> + type='{"type": "integer", "minInteger": 0, "maxInteger":
> 1000000}'>
> <p>
> - Specifies the time in milliseconds that a packet can wait in output
> + Specifies the time in microseconds that a packet can wait in output
> batch for sending i.e. amount of time that packet can spend in an
> intermediate output queue before sending to netdev.
> This option can be used to configure balance between throughput
>
>