Hi Harry,

Thank you for your reply. Please see the inline comments below. Also, I have 
also added the Tested-by tag after the Signed-off-by tag if that's okay with 
you.

Best,
Malvika

> -----Original Message-----
> From: Van Haaren, Harry <[email protected]>
> Sent: Thursday, June 27, 2019 7:26 AM
> To: Malvika Gupta <[email protected]>
> Cc: nd <[email protected]>; Yanqin Wei (Arm Technology China)
> <[email protected]>; [email protected]
> Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic
> scalar functions
>
>
> Sorry for double post - if you're happy with the changes please add a Tested-
> by: tag?
>
>
> > -----Original Message-----
> > From: Van Haaren, Harry
> > Sent: Thursday, June 27, 2019 1:07 PM
> > To: 'Malvika Gupta' <[email protected]>
> > Cc: nd <[email protected]>; Yanqin Wei (Arm Technology China)
> > <[email protected]>; [email protected]
> > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> > generic scalar functions
> >
> > > -----Original Message-----
> > > From: Malvika Gupta [mailto:[email protected]]
> > > Sent: Tuesday, June 25, 2019 8:19 PM
> > > To: Van Haaren, Harry <[email protected]>
> > > Cc: nd <[email protected]>; Yanqin Wei (Arm Technology China)
> > <[email protected]>;
> > > [email protected]
> > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> > > generic scalar functions
> > >
> > > Hi Harry,
> >
> > Hi Malvika,
> >
> >
> > > I tested your patch on two ARM machines, namely ThunderX2 and
> Octeon-Tx.
> > >
> > > On Octeon-Tx (running gcc version 7.4.0), a performance improvement
> > > of 9-
> > 10%
> > > was seen. On the ThunderX2 (running gcc version 8.3.0), a
> > > performance improvement of 14-15% was observed. For both machines, I
> > > disabled the EMC and tested with two traffic patterns of 64B packet
> > > size, specifically UDP/IPv4/Ethernet and IPv4/Ethernet (Raw IP).
> >
> > Great - thanks for testing and reporting back performance, appreciated!
> >
> >
> > > Please let me know if you require any more details.
> > > Thanks,
> > > Malvika
> >
> > It would be good to know which miniflow fingerprints are commonly
> > deployed with OVS.
> >
> > If we (the OVS community) had more input in terms of benchmarks that
> > OVS deployments use, then we can optimize against those more (as was
> > mentioned at OVS Conf
> > '18 too :)
> > https://youtu.be/5-MDlpUIOBE?list=PLaJlRa-
> xItwCzuAL3mP6n02vmXab4Bwu-&t
> > =1471
> >
> > I haven't had much success with getting benchmarks yet - but if you
> > have something like a use-case or benchmark to test against, sharing
> > it with OVS Community would be great!

I am not very familiar with the miniflow fingerprints that are commonly 
deployed with OVS. I am fairly new to contributing in the OVS community and 
would probably require more investigation and knowledge before I can provide 
some feedback about this. Nonetheless, I appreciate you raising this point with 
me; it's given me something to think about and look into.

> > -Harry
> >
> >
> > > > -----Original Message-----
> > > > From: Malvika Gupta
> > > > Sent: Wednesday, June 19, 2019 1:34 PM
> > > > To: Van Haaren, Harry <[email protected]>
> > > > Cc: Yanqin Wei (Arm Technology China) <[email protected]>; nd
> > > > <[email protected]>; [email protected]
> > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> > generic
> > > > scalar functions
> > > >
> > > > Hi Harry
> > > >
> > > > Thanks for your reply. I have some last few things I wanted to clarify.
> > > Please
> > > > see my inline comments below.
> > > >
> > > > > -----Original Message-----
> > > > > From: Van Haaren, Harry <[email protected]>
> > > > > Sent: Wednesday, June 19, 2019 11:22 AM
> > > > > To: Malvika Gupta <[email protected]>
> > > > > Cc: Yanqin Wei (Arm Technology China) <[email protected]>; nd
> > > > > <[email protected]>; [email protected]
> > > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add
> > > > > specialized generic scalar functions
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Malvika Gupta [mailto:[email protected]]
> > > > > > Sent: Wednesday, June 19, 2019 5:17 PM
> > > > > > To: Van Haaren, Harry <[email protected]>
> > > > > > Cc: Yanqin Wei (Arm Technology China) <[email protected]>;
> nd
> > > > > > <[email protected]>; [email protected]
> > > > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add
> > > > > > specialized generic scalar functions
> > > > > >
> > > > > > Hi Harry,
> > > > >
> > > > > Hi Malvika,
> > > > >
> > > > >
> > > > > > I would like to test your patch on ARM platforms and report
> > > > > > any performance improvement. Can you please tell me the test
> > > > > > vectors you used to match the subtables that call the
> > > > > > specialized lookup functions, i.e.,
> > > > > > dpcls_subtable_lookup_mf_u0w5_u1w1,
> > > > > > dpcls_subtable_lookup_mf_u0w4_u1w1, and
> > > > > > dpcls_subtable_lookup_mf_u0w4_u1w0? And how did you make
> that
> > > > > > distinction with traffic patterns, i.e., which traffic pattern
> > > > > > would match with
> > > > > which subtable lookup?
> > > > >
> > > > > I was just running Eth/IPv4/UDP data, and using OvS to do Phy to
> > > > > Phy forwarding.
> > > > >
> > > > > ./utilities/ovs-ofctl add-flow br0 in_port=1,action=output:2
> > > > > ./utilities/ovs- ofctl add-flow br0 in_port=2,action=output:1
> > > > >
> > > > > This is an extremely simple setup, however it is enough to
> > > > > showcase value of the patches here.
> > > > >
> > > > > By changing traffic to Eth/IPv4 and Eth alone, you can change
> > > > > the # of miniflow items, and hit the different u0_w5_u1_w0 cases
> > > > > :)
> > > > >
> > > >
> > > > 1. Are you also sending combinations of traffic flows at the same time?
> > > Like
> > > > UDP/IPv4/Eth is one traffic flow, raw IP could be another and Eth
> > packets
> > > > only is a third flow. Did you also test by say, sending these 3
> > > > flows
> > > together
> > > > which would all hit different subtables or just one kind of
> > > > traffic flow
> > > at a
> > > > time?
> > > > 2. Just to clarify, packet size is 64B, right?
> > > >
> > > > I am sorry if these seem trivial but I just wanted to have a clear
> > > > understanding of the testing environment.
> > > > Thanks!
> > > >
> > > > > If you have suggestions for other subtables to special case,
> > > > > please post the u0_wX_u1_wY values that you'd like to see, and
> > > > > I'll add them
> > to
> > > > the patchset.
> > > > >
> > > > >
> > > > > > Thank you,
> > > > > > Malvika Gupta
> > > > >
> > > > > Testing welcomed, thanks! -Harry
> > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: [email protected] <ovs-dev-
> > > > > > > [email protected]> On Behalf Of Harry van Haaren
> > > > > > > Sent: Wednesday, May 8, 2019 11:13 PM
> > > > > > > To: [email protected]
> > > > > > > Cc: [email protected]
> > > > > > > Subject: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add
> > > > > > > specialized generic
> > > > > > scalar
> > > > > > > functions
> > > > > > >
> > > > > > > This commit adds a number of specialized functions, that
> > > > > > > handle common miniflow fingerprints. This enables compiler
> > > > > > > optimization, resulting in
> > > > > > higher
> > > > > > > performance. Below a quick description of how this
> > > > > > > optimization actually works;
> > > > > > >
> > > > > > > "Specialized functions" are "instances" of the generic
> > > > > > > implementation, but the compiler is given extra context when
> > > > > > > compiling. In the case of
> > > > > > iterating
> > > > > > > miniflow datastructures, the most interesting value to
> > > > > > > enable compile time optimizations is the loop trip count per unit.
> > > > > > >
> > > > > > > In order to create a specialized function, there is a
> > > > > > > generic
> > > > > > implementation,
> > > > > > > which uses a for() loop without the compiler knowing the
> > > > > > > loop trip count
> > > > > > at
> > > > > > > compile time. The loop trip count is passed in as an
> > > > > > > argument to the
> > > > > > function:
> > > > > > >
> > > > > > > uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t
> > > > > > > loop_count)
> > > > > {
> > > > > > >     for(uint32_t i = 0; i < loop_count; i++)
> > > > > > >         // do work
> > > > > > > }
> > > > > > >
> > > > > > > In order to "specialize" the function, we call the generic
> > > > > > > implementation
> > > > > > with
> > > > > > > hard-coded numbers - these are compile time constants!
> > > > > > >
> > > > > > > uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t
> > > > > > > loop_count)
> > > > > {
> > > > > > >     // use hard coded constant for compile-time constant-
> > propogation
> > > > > > >     return miniflow_impl_generic(mf, 5); }
> > > > > > >
> > > > > > > Given the compiler is aware of the loop trip count at
> > > > > > > compile time, it can perform an optimization known as
> > > > > > > "constant
> > > propogation".
> > > > > > > Combined with inlining of the miniflow_impl_generic()
> > > > > > > function, the compiler is now
> > > > > > enabled
> > > > > > > to *compile time* unroll the loop 5x, and produce "flat" code.
> > > > > > >
> > > > > > > The last step to using the specialized functions is to
> > > > > > > utilize a
> > > > > > > function-
> > > > > > pointer
> > > > > > > to choose the specialized (or generic) implementation.
> > > > > > > The selection of the function pointer is performed at
> > > > > > > subtable creation
> > > > > > time,
> > > > > > > when miniflow fingerprint of the subtable is known. This
> > > > > > > technique is
> > > > > > known
> > > > > > > as "multiple dispatch" in some literature, as it uses
> > > > > > > multiple items of information (miniflow bit counts) to
> > > > > > > select the dispatch
> > > > function.
> > > > > > >
> > > > > > > By pointing the function pointer at the optimized
> > > > > > > implementation, OvS benefits from the compile time optimizations
> at runtime.
> > > > > > >
> > > > > > > Signed-off-by: Harry van Haaren <[email protected]>

Tested-by: Malvika Gupta <[email protected]>

> > > > > > >
> > > > > > > ---
> > > > > > >
> > > > > > > v8:
> > > > > > > - Rework to use blocks_cache from the dpcls instance, to
> > > > > > > avoid
> > > variable
> > > > > > >   lenght arrays in the data-path.
> > > > > > > ---
> > > > > > >  lib/dpif-netdev-lookup-generic.c | 83
> > > > > > > +++++++++++++++++++++++++++++-
> > > > > --
> > > > > > >  lib/dpif-netdev.c                |  9 +++-
> > > > > > >  lib/dpif-netdev.h                |  8 +++
> > > > > > >  3 files changed, 90 insertions(+), 10 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/dpif-netdev-lookup-generic.c
> > > > > > > b/lib/dpif-netdev-lookup- generic.c index
> > > > > > > 901d28ff0..ba2d024cc
> > > > > > > 100644
> > > > > > > --- a/lib/dpif-netdev-lookup-generic.c
> > > > > > > +++ b/lib/dpif-netdev-lookup-generic.c
> > > > > > > @@ -100,11 +100,11 @@ netdev_flow_key_flatten(const struct
> > > > > > > netdev_flow_key * restrict key,
> > > > > > >
> > > > > > >      /* Unit 0 flattening */
> > > > > > >      netdev_flow_key_flatten_unit(&pkt_blocks[0],
> > > > > > > -                            &tbl_blocks[0],
> > > > > > > -                            &mf_masks[0],
> > > > > > > -                            &blocks_scratch[0],
> > > > > > > -                            pkt_bits_u0,
> > > > > > > -                            u0_count);
> > > > > > > +                                 &tbl_blocks[0],
> > > > > > > +                                 &mf_masks[0],
> > > > > > > +                                 &blocks_scratch[0],
> > > > > > > +                                 pkt_bits_u0,
> > > > > > > +                                 u0_count);
> > > > > > >
> > > > > > >      /* Unit 1 flattening:
> > > > > > >       * Move the pointers forward in the arrays based on u0
> > offsets,
> > > > NOTE:
> > > > > > > @@ -225,7 +225,74 @@ dpcls_subtable_lookup_generic(struct
> > > > > > > dpcls_subtable *subtable,
> > > > > > >                                const struct netdev_flow_key
> > *keys[],
> > > > > > >                                struct dpcls_rule **rules)  {
> > > > > > > -        return lookup_generic_impl(subtable, blocks_scratch,
> > > keys_map,
> > > > > > keys,
> > > > > > > -                                   rules, subtable-
> > > >mf_bits_set_unit0,
> > > > > > > -                                   subtable->mf_bits_set_unit1);
> > > > > > > +    /* Here the runtime subtable->mf_bits counts are used,
> > > > > > > + which forces
> > > > > > the
> > > > > > > +     * compiler to iterate normal for() loops. Due to this
> > > > > > > + limitation in
> > > > > > the
> > > > > > > +     * compilers available optimizations, this function has
> > > > > > > + lower
> > > > > > performance
> > > > > > > +     * than the below specialized functions.
> > > > > > > +     */
> > > > > > > +    return lookup_generic_impl(subtable, blocks_scratch,
> > > > > > > + keys_map, keys,
> > > > > > > rules,
> > > > > > > +                               subtable->mf_bits_set_unit0,
> > > > > > > +
> > > > > > > + subtable->mf_bits_set_unit1); }
> > > > > > > +
> > > > > > > +static uint32_t
> > > > > > > +dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable
> > > > > *subtable,
> > > > > > > +                                   uint64_t *blocks_scratch,
> > > > > > > +                                   uint32_t keys_map,
> > > > > > > +                                   const struct
> > > > > > > + netdev_flow_key
> > > *keys[],
> > > > > > > +                                   struct dpcls_rule **rules) {
> > > > > > > +    /* hard coded bit counts - enables compile time loop
> > unrolling,
> > > and
> > > > > > > +     * generating of optimized code-sequences due to loop
> > unrolled
> > > > code.
> > > > > > > +     */
> > > > > > > +    return lookup_generic_impl(subtable, blocks_scratch,
> > > > > > > +keys_map, keys,
> > > > > > > rules,
> > > > > > > +                               5, 1); }
> > > > > > > +
> > > > > > > +static uint32_t
> > > > > > > +dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable
> > > > > *subtable,
> > > > > > > +                                   uint64_t *blocks_scratch,
> > > > > > > +                                   uint32_t keys_map,
> > > > > > > +                                   const struct
> > > > > > > + netdev_flow_key
> > > *keys[],
> > > > > > > +                                   struct dpcls_rule **rules) {
> > > > > > > +    return lookup_generic_impl(subtable, blocks_scratch,
> > > > > > > +keys_map, keys,
> > > > > > > rules,
> > > > > > > +                               4, 1); }
> > > > > > > +
> > > > > > > +static uint32_t
> > > > > > > +dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable
> > > > > *subtable,
> > > > > > > +                                   uint64_t *blocks_scratch,
> > > > > > > +                                   uint32_t keys_map,
> > > > > > > +                                   const struct
> > > > > > > + netdev_flow_key
> > > *keys[],
> > > > > > > +                                   struct dpcls_rule **rules) {
> > > > > > > +    return lookup_generic_impl(subtable, blocks_scratch,
> > > > > > > +keys_map, keys,
> > > > > > > rules,
> > > > > > > +                               4, 0); }
> > > > > > > +
> > > > > > > +/* Probe function to lookup an available specialized function.
> > > > > > > + * If capable to run the requested miniflow fingerprint,
> > > > > > > +this function returns
> > > > > > > + * the most optimal implementation for that miniflow fingerprint.
> > > > > > > + * @retval FunctionAddress A valid function to handle the
> > > > > > > +miniflow bit pattern
> > > > > > > + * @retval 0 The requested miniflow is not supported here,
> > > > > > > +NULL is returned  */ dpcls_subtable_lookup_func
> > > > > > > +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t
> > > > > > > +u1_bits)
> > {
> > > > > > > +    dpcls_subtable_lookup_func f = NULL;
> > > > > > > +
> > > > > > > +    if (u0_bits == 5 && u1_bits == 1) {
> > > > > > > +        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
> > > > > > > +    } else if (u0_bits == 4 && u1_bits == 1) {
> > > > > > > +        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
> > > > > > > +    } else if (u0_bits == 4 && u1_bits == 0) {
> > > > > > > +        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    if (f) {
> > > > > > > +        VLOG_INFO("Subtable using Generic Optimized for u0
> > > > > > > + %d,
> > > > u1 %d\n",
> > > > > > > +                  u0_bits, u1_bits);
> > > > > > > +    }
> > > > > > > +    return f;
> > > > > > >  }
> > > > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > > > > > 33b93cfdf..23fc5b7a6
> > > > > > > 100644
> > > > > > > --- a/lib/dpif-netdev.c
> > > > > > > +++ b/lib/dpif-netdev.c
> > > > > > > @@ -7623,8 +7623,13 @@ dpcls_create_subtable(struct dpcls
> > > > > > > *cls, const struct netdev_flow_key *mask)
> > > > > > >          cls->blocks_scratch_size = blocks_required_per_pkt;
> > > > > > >      }
> > > > > > >
> > > > > > > -    /* Assign the generic lookup - this works with any miniflow
> > > > > > fingerprint */
> > > > > > > -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> > > > > > > +    /* Probe for an optimmized variant */
> > > > > > > +    subtable->lookup_func =
> > > > > > > + dpcls_subtable_generic_probe(unit0,
> > > > > > > + unit1);
> > > > > > > +
> > > > > > > +    /* If not set, assign generic lookup - this works with
> > > > > > > + any miniflow
> > > > > > */
> > > > > > > +    if (!subtable->lookup_func) {
> > > > > > > +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> > > > > > > +    }
> > > > > > >
> > > > > > >      cmap_insert(&cls->subtables_map, &subtable->cmap_node,
> > > > > > > mask-
> > > > > >hash);
> > > > > > >      /* Add the new subtable at the end of the pvector (with
> > > > > > > no hits
> > > > > > > yet)
> > > > > > */ diff
> > > > > > > --git a/lib/dpif-netdev.h b/lib/dpif-netdev.h index
> > > > > > > 9263256a9..123eabad7
> > > > > > > 100644
> > > > > > > --- a/lib/dpif-netdev.h
> > > > > > > +++ b/lib/dpif-netdev.h
> > > > > > > @@ -70,6 +70,14 @@ typedef uint32_t
> > > > > > > (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
> > > > > > >                  const struct netdev_flow_key *keys[],
> > > > > > >                  struct dpcls_rule **rules);
> > > > > > >
> > > > > > > +/* Probe function to select a specialized version of the
> > > > > > > +generic lookup
> > > > > > > + * implementation. This provides performance benefit due to
> > > > > > > +compile-time
> > > > > > > + * optimizations such as loop-unrolling. These are enabled
> > > > > > > +by the compile-time
> > > > > > > + * constants in the specific function implementations.
> > > > > > > + */
> > > > > > > +dpcls_subtable_lookup_func
> > > > > > > +dpcls_subtable_generic_probe(uint32_t u0_bit_count,
> > > > > > > +uint32_t u1_bit_count);
> > > > > > > +
> > > > > > >  /* Prototype for generic lookup func, using same code path
> > > > > > > as before */ uint32_t  dpcls_subtable_lookup_generic(struct
> > > > > > > dpcls_subtable *subtable,
> > > > > > > --
> > > > > > > 2.17.1
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > dev mailing list
> > > > > > > [email protected]
> > > > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > > > > > IMPORTANT NOTICE: The contents of this email and any
> > > > > > attachments are confidential and may also be privileged. If
> > > > > > you are not the intended recipient, please notify the sender
> > > > > > immediately and do not disclose the contents to any other
> > > > > > person, use it for any purpose, or store or copy the information in
> any medium. Thank you.
> > > IMPORTANT NOTICE: The contents of this email and any attachments are
> > > confidential and may also be privileged. If you are not the intended
> > > recipient, please notify the sender immediately and do not disclose
> > > the contents to any other person, use it for any purpose, or store
> > > or copy the information in any medium. Thank you.
IMPORTANT NOTICE: The contents of this email and any attachments are 
confidential and may also be privileged. If you are not the intended recipient, 
please notify the sender immediately and do not disclose the contents to any 
other person, use it for any purpose, or store or copy the information in any 
medium. Thank you.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to