On Wed, 2 Dec 2015 16:48:17 +0000
"Dumitrescu, Cristian" <cristian.dumitrescu at intel.com> wrote:

> 
> 
> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen at networkplumber.org]
> > Sent: Sunday, November 29, 2015 8:47 PM
> > To: Dumitrescu, Cristian <cristian.dumitrescu at intel.com>
> > Cc: dev at dpdk.org; Stephen Hemminger <stephen at networkplumber.org>
> > Subject: [PATCH 3/3] rte_sched: eliminate floating point in calculating byte
> > clock
> > 
> > The old code was doing a floating point divide for each rte_dequeue()
> > which is very expensive. Change to using fixed point scaled inverse
> > multiply. To maintain equivalent precision, scaled math is used.
> > The application ABI is the same.
> > 
> > This improved performance from 5Gbit/sec to 10 Gbit/sec when configured
> > for 10 Gbit/sec rate.
> > 
> > There was some feedback from Cristian that he wanted a better
> > solution and was going to give one, but none was provided.
> > For 2.2 this is a better solution than existing code, if someone
> > has a better version I would love to see it.
> > 
> > Signed-off-by: Stephen Hemminger <stephen at networkplumber.org>
> > ---
> >  lib/librte_sched/rte_sched.c | 23 ++++++++++++++++++-----
> >  1 file changed, 18 insertions(+), 5 deletions(-)
> > 
> > diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
> > index 16acd6b..cfae136 100644
> > --- a/lib/librte_sched/rte_sched.c
> > +++ b/lib/librte_sched/rte_sched.c
> > @@ -47,6 +47,7 @@
> >  #include "rte_bitmap.h"
> >  #include "rte_sched_common.h"
> >  #include "rte_approx.h"
> > +#include "rte_reciprocal.h"
> > 
> >  #ifdef __INTEL_COMPILER
> >  #pragma warning(disable:2259) /* conversion may lose significant bits */
> > @@ -62,6 +63,11 @@
> >  #define RTE_SCHED_PIPE_INVALID                UINT32_MAX
> >  #define RTE_SCHED_BMP_POS_INVALID             UINT32_MAX
> > 
> > +/* Scaling for cycles_per_byte calculation
> > + * Chosen so that minimum rate is 480 bit/sec
> > + */
> > +#define RTE_SCHED_TIME_SHIFT                     8
> 
> Stephen, can you please elaborate why we need to shift the dividend at all 
> and why the shift value was picked as 8? Is 8 a hard constraint? How does 
> this affect the scheduling precision/accuracy?

The shift value is a tradeoff for scaled math. The bigger the shift
the finer the resolution, but at the risk of overflow in the cycles_per_byte.
The value was chosen as a tradeoff based on current CPU clock rate (TSC)
and minimum rate.

> > +
> >  struct rte_sched_subport {
> >     /* Token bucket (TB) */
> >     uint64_t tb_time; /* time of last update */
> > @@ -215,7 +221,7 @@ struct rte_sched_port {
> >     uint64_t time_cpu_cycles;     /* Current CPU time measured in CPU
> > cyles */
> >     uint64_t time_cpu_bytes;      /* Current CPU time measured in bytes
> > */
> >     uint64_t time;                /* Current NIC TX time measured in bytes 
> > */
> > -   double cycles_per_byte;       /* CPU cycles per byte */
> > +   struct rte_reciprocal inv_cycles_per_byte; /* CPU cycles per byte */
> > 
> >     /* Scheduling loop detection */
> >     uint32_t pipe_loop;
> > @@ -610,7 +616,7 @@ struct rte_sched_port *
> >  rte_sched_port_config(struct rte_sched_port_params *params)
> >  {
> >     struct rte_sched_port *port = NULL;
> > -   uint32_t mem_size, bmp_mem_size, n_queues_per_port, i;
> > +   uint32_t mem_size, bmp_mem_size, n_queues_per_port, i,
> > cycles_per_byte;
> > 
> >     /* Check user parameters. Determine the amount of memory to
> > allocate */
> >     mem_size = rte_sched_port_get_memory_footprint(params);
> > @@ -661,7 +667,10 @@ rte_sched_port_config(struct
> > rte_sched_port_params *params)
> >     port->time_cpu_cycles = rte_get_tsc_cycles();
> >     port->time_cpu_bytes = 0;
> >     port->time = 0;
> > -   port->cycles_per_byte = ((double) rte_get_tsc_hz()) / ((double)
> > params->rate);
> > +
> > +   cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT)
> > +           / params->rate;
> > +   port->inv_cycles_per_byte = rte_reciprocal_value(cycles_per_byte);
> > 
> >     /* Scheduling loop detection */
> >     port->pipe_loop = RTE_SCHED_PIPE_INVALID;
> > @@ -2088,11 +2097,15 @@ rte_sched_port_time_resync(struct
> > rte_sched_port *port)
> >  {
> >     uint64_t cycles = rte_get_tsc_cycles();
> >     uint64_t cycles_diff = cycles - port->time_cpu_cycles;
> > -   double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
> > +   uint64_t bytes_diff;
> > +
> > +   /* Compute elapsed time in bytes */
> > +   bytes_diff = rte_reciprocal_divide(cycles_diff <<
> > RTE_SCHED_TIME_SHIFT,
> > +                                      port->inv_cycles_per_byte);
> > 
> >     /* Advance port time */
> >     port->time_cpu_cycles = cycles;
> > -   port->time_cpu_bytes += (uint64_t) bytes_diff;
> > +   port->time_cpu_bytes += bytes_diff;
> >     if (port->time < port->time_cpu_bytes)
> >             port->time = port->time_cpu_bytes;
> > 
> > --
> > 2.1.4
> 
> Can you provide some insight into how you tested this code and the test 
> vectors you used?

We tested with 10 gbit link and range of rates from 10k bit up to 10 gbit.

Reply via email to