On Mon, May 18, 2020 at 9:12 AM Van Haaren, Harry
<[email protected]> wrote:
>
> > -----Original Message-----
> > From: William Tu <[email protected]>
> > Sent: Monday, May 18, 2020 3:58 PM
> > To: Van Haaren, Harry <[email protected]>
> > Cc: [email protected]; [email protected]
> > Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
> > implementation
> >
> > On Wed, May 06, 2020 at 02:06:09PM +0100, Harry van Haaren wrote:
> > > This commit adds an AVX-512 dpcls lookup implementation.
> > > It uses the AVX-512 SIMD ISA to perform multiple miniflow
> > > operations in parallel.
> > >
> > > To run this implementation, the "avx512f" and "bmi2" ISAs are
> > > required. These ISA checks are performed at runtime while
> > > probing the subtable implementation. If a CPU does not provide
> > > both "avx512f" and "bmi2", then this code does not execute.
> > >
> > > The avx512 code is built as a separate static library, with added
> > > CFLAGS to enable the required ISA features. By building only this
> > > static library with avx512 enabled, it is ensured that the main OVS
> > > core library is *not* using avx512, and that OVS continues to run
> > > as before on CPUs that do not support avx512.
> > >
> > > The approach taken in this implementation is to use the
> > > gather instruction to access the packet miniflow, allowing
> > > any miniflow blocks to be loaded into an AVX-512 register.
> > > This maximises the usefulness of the register, and hence this
> > > implementation handles any subtable with up to 8 miniflow bits.
> > >
> > > Note that specialization of these avx512 lookup routines
> > > still provides performance value, as the hashing of the
> > > resulting data is performed in scalar code, and compile-time
> > > loop unrolling occurs when specialized to miniflow bits.
> > >
> >
> > Hi Harry,
> >
> > I haven't tried running the code because my machine only
> > supports avx2. There are some minor issues such as indentation,
> > but I read through it with the example below and I think it's correct.
>
> Thanks for the review! I'll post replies inline for context.
>
> Note that the Software Development Emulator (SDE) tool enables emulation of
> the AVX512 ISA. Full details are provided at the link below; using it would
> let you run the AVX512 DPCLS implementation itself, should you want to test
> it locally:
> https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html
>
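> As a hedged example (the chip flag and paths depend on your SDE version and
> build layout), an invocation could look roughly like:
>
>     sde64 -skx -- ./vswitchd/ovs-vswitchd <usual vswitchd args>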
>
> > Given that you have to do a lot of preparation (e.g. popcount, creating
> > bit_masks, broadcast, etc.) before using avx instructions, do you
> > have any performance numbers? I didn't see any from ovsconf 18 or 19.
> > Is using avx512 much better than avx2?
>
> Correct, there is some "pre-work" to do before the miniflow manipulation
> itself. Note that much of the more complex work (e.g. miniflow bitmask
> generation for the subtable) is done at subtable instantiation time, instead
> of on the critical path. Also the popcount lookup table is "static const",
> which will turn into a single AVX512 load at runtime.
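>
> As a hedged sketch (names hypothetical, this is not the exact patch code),
> the per-block masks could be built once per subtable along these lines:
>
>     #include <stdint.h>
>
>     /* For the i-th set bit in a subtable miniflow map, store a mask of all
>      * lower bit positions. At lookup time, ANDing a packet's map with this
>      * mask and popcounting the result yields the index of that block in
>      * the packet's miniflow data array. The real code covers both map
>      * units (u0 and u1); this sketch handles a single 64-bit map. */
>     static void
>     mf_masks_build(uint64_t map, uint64_t *masks)
>     {
>         int i = 0;
>
>         while (map) {
>             uint64_t lowest_bit = map & -map;  /* isolate lowest set bit */
>
>             masks[i++] = lowest_bit - 1;       /* all bits below it */
>             map &= map - 1;                    /* clear lowest set bit */
>         }
>     }
>
> This is consistent with the worked example further below, where a subtable
> map of 0b1000,0000 yields mf_masks[0] = 0b0111,1111.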
>
> AVX512 provides some very useful features, which are used throughout the code
> below. In particular, the AVX512 "k-mask" feature allows the developer to
> switch off a lane in the SIMD register (this is sometimes referred to as a
> predication mask). Using these "k-masks" avoids needing more instructions
> later to "merge" results back together (as SSE or AVX2 code would have to do).
> Example: "mask_set1_epi64" allows setting a specific value into the "lanes"
> given by the k-mask, and results in an AVX512 register with those contents.
>
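> To make that concrete, here is a minimal standalone sketch (illustrative
> only, not the patch code; it assumes bit_count_u0 == 3):
>
>     #include <immintrin.h>
>     #include <stdint.h>
>
>     /* Lanes 0..2 are broadcast u0_bits, lanes 3..7 get u1_bits, merged in
>      * a single masked instruction. SSE or AVX2 code would need an extra
>      * blend/merge step to produce the same register contents. */
>     static __m512i
>     broadcast_u0_u1(uint64_t u0_bits, uint64_t u1_bits)
>     {
>         const __mmask8 u1_mask = (__mmask8) (0xff << 3);
>         __m512i v_u0 = _mm512_set1_epi64(u0_bits);
>
>         return _mm512_mask_set1_epi64(v_u0, u1_mask, u1_bits);
>     }
>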
> There are also new instructions in AVX512 which provide even more powerful
> ISA; for example, the "AVX512VPOPCNTDQ" CPUID feature provides a vectorized
> popcount, which can be used instead of the "_mm512_popcnt_epi64_manual()"
> helper function. Enabling the AVX512 VPOPCNT instruction is planned in future
> patches to OVS. Details of the instruction are available on the intrinsics
> guide:
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_popcnt_epi64&expand=4368
>
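> For reference, a hedged sketch of one well-known way to implement such a
> manual per-u64-lane popcount, using the nibble lookup-table approach (this
> particular variant also needs AVX512BW for the byte shuffle and sum; the
> helper in the patch may differ in detail):
>
>     #include <immintrin.h>
>
>     static inline __m512i
>     popcnt_epi64_sketch(__m512i v)
>     {
>         /* 16-entry LUT of per-nibble popcounts, repeated in every lane. */
>         const __m512i lut = _mm512_broadcast_i32x4(_mm_setr_epi8(
>             0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4));
>         const __m512i nib = _mm512_set1_epi8(0x0f);
>         __m512i lo = _mm512_and_si512(v, nib);
>         __m512i hi = _mm512_and_si512(_mm512_srli_epi64(v, 4), nib);
>         __m512i byte_cnt = _mm512_add_epi8(_mm512_shuffle_epi8(lut, lo),
>                                            _mm512_shuffle_epi8(lut, hi));
>
>         /* Horizontally add the eight byte counts within each u64 lane. */
>         return _mm512_sad_epu8(byte_cnt, _mm512_setzero_si512());
>     }
>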
> Finally, although the code can seem a bit verbose, most _mm512_xxx_yyy()
> intrinsics result in a single instruction. This means that although the code
> looks "big", the resulting instruction stream is often extremely densely
> packed. Combine that with the fact that the implementation is focused on
> using instructions to deliver the maximum amount of required compute without
> any waste, and it can result in very high performance :)
>
> Regarding performance numbers, unfortunately I don't have official numbers to
> state here. As an approximation (caveats such as "depends on exact usage"
> apply), for about the same packet rate, the CPU cycles spent in DPCLS are
> roughly halved in the AVX512 version compared to the scalar version.
>
> <snip lots of patch contents>
>
> > I think the below function is the most difficult one.
> > I wonder if there is a better way to make it easier to understand,
> > e.g. breaking it into subfunctions or utility functions?
>
> My experience has been that breaking it up into smaller snippets causes me to
> lose sight of the big picture. Code like the below is typically not written
> in one pass but in a more iterative process. Seeing the desired register
> contents is valuable, and knowing the context and state of nearby registers
> can often suggest new optimizations or strength reduction of existing code.
>
> Clearly commenting the reason for the compute, and sometimes how it is
> computed, is the best-known method for writing maintainable SIMD code. This
> method is also used in DPDK for its PMDs, for example the i40e driver's SIMD
> rx codepath:
> http://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_avx2.c#n221
>
>
> > I ended up using an example from slide 2 of your talk here:
> > https://www.openvswitch.org/support/ovscon2019/day1/1108-next_steps_sw_datapath_hvh.pdf
> > and the API document here
> > https://software.intel.com/sites/landingpage/IntrinsicsGuide/
>
> Aha, you've found the colorful instruction set architecture guide :)
> There is another which presents the data-movement more graphically,
> I'll mention it, but advise using the IntrinsicsGuide as linked above, as it
> is the official resource and is maintained and up to date. The graphical
> webpage is here: https://www.officedaytime.com/simd512e/simd.html
>
>
>
> > > +static inline uint32_t ALWAYS_INLINE
> > > +avx512_lookup_impl(struct dpcls_subtable *subtable,
> > > +                   uint32_t keys_map,
> > > +                   const struct netdev_flow_key *keys[],
> > > +                   struct dpcls_rule **rules,
> > > +                   const uint32_t bit_count_u0,
> > > +                   const uint32_t bit_count_u1)
> > > +{
> > > +    const uint32_t bit_count_total = bit_count_u0 + bit_count_u1;
> > > +    int i;
> > > +    uint32_t hashes[NETDEV_MAX_BURST];
> > > +    const uint32_t n_pkts = __builtin_popcountll(keys_map);
> > > +    ovs_assert(NETDEV_MAX_BURST >= n_pkts);
> > > +
> > > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[NETDEV_MAX_BURST * 8];
> > > +
> > > +    const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0];
> > > +    const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1];
> > > +    ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0);
> > > +    ovs_assert(__builtin_popcountll(tbl_u1) == bit_count_u1);
> > > +
> > > +    /* Load subtable blocks for masking later */
> > > +    const uint64_t *tbl_blocks = miniflow_get_values(&subtable->mask.mf);
> > > +    const __m512i v_tbl_blocks = _mm512_loadu_si512(&tbl_blocks[0]);
> > > +
> > > +    /* Load pre-created subtable masks for each block in subtable */
> > > +    const __mmask8 bit_count_total_mask = (1 << bit_count_total) - 1;
> > > +    const __m512i v_mf_masks = _mm512_maskz_loadu_epi64(bit_count_total_mask,
> > > +                                                        subtable->mf_masks);
> > > +
> > > +    ULLONG_FOR_EACH_1 (i, keys_map) {
> > > +        const uint64_t pkt_mf_u0_bits = keys[i]->mf.map.bits[0];
> > > +        const uint64_t pkt_mf_u0_pop = __builtin_popcountll(pkt_mf_u0_bits);
> > > +
> > > +        /* Pre-create register with *PER PACKET* u0 offset */
> > > +        const __mmask8 u1_bcast_mask = (UINT8_MAX << bit_count_u0);
> > > +        const __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_mask,
> > > +                                                                pkt_mf_u0_pop);
> > > +
> > > +        /* Broadcast u0, u1 bitmasks to 8x u64 lanes */
> > > +        __m512i v_u0 = _mm512_set1_epi64(pkt_mf_u0_bits);
> > > +        __m512i v_pkt_bits = _mm512_mask_set1_epi64(v_u0, u1_bcast_mask,
> > > +                                         keys[i]->mf.map.bits[1]);
> > > +
> > > +        /* Bitmask by pre-created masks */
> > > +        __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_mf_masks);
> > > +
> > > +        /* Manual AVX512 popcount for u64 lanes */
> > > +        __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
> > > +
> > > +        /* Offset popcounts for u1 with pre-created offset register */
> > > +        __m512i v_indexes = _mm512_add_epi64(v_popcnts, v_idx_u0_offset);
> > > +
> > > +        /* Gather u64 on single packet, merge with zero reg, up to 8 blocks */
> > > +        const __m512i v_zeros = _mm512_setzero_si512();
> > > +        const uint64_t *pkt_data = miniflow_get_values(&keys[i]->mf);
> > > +        __m512i v_all_blocks = _mm512_mask_i64gather_epi64(v_zeros,
> > > +                                 bit_count_total_mask, v_indexes, pkt_data, 8);
> > indent
>
> Thanks!
>
> > > +        /* Zero out bits that pkt doesn't have:
> > > +         * - 2x pext() to extract bits from packet miniflow as needed by TBL
> > > +         * - Shift u1 over by bit_count of u0, OR to create zero bitmask
> > > +         */
> > > +         uint64_t u0_to_zero = _pext_u64(keys[i]->mf.map.bits[0], tbl_u0);
> > > +         uint64_t u1_to_zero = _pext_u64(keys[i]->mf.map.bits[1], tbl_u1);
> > > +         uint64_t zero_mask = (u1_to_zero << bit_count_u0) | u0_to_zero;
> > indentation: remove one space
>
> Will fix.
>
>
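> As a side note for other readers: _pext_u64() is the BMI2 parallel bit
> extract, which gathers the source bits selected by the mask into the
> contiguous low bits of the result. A tiny hedged example with made-up
> values:
>
>     #include <immintrin.h> /* _pext_u64(); compile with -mbmi2 */
>     #include <stdint.h>
>
>     static uint64_t
>     pext_demo(void)
>     {
>         uint64_t bits = 0x84; /* packet miniflow map has bits 2 and 7 set */
>         uint64_t mask = 0x80; /* subtable only uses bit 7                 */
>
>         return _pext_u64(bits, mask); /* == 0x1 */
>     }
>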
> > > +        /* Mask blocks using AND with subtable blocks, use k-mask to zero
> > > +         * where lanes as required for this packet.
> > > +         */
> > > +        __m512i v_masked_blocks = _mm512_maskz_and_epi64(zero_mask,
> > > +                                                v_all_blocks, v_tbl_blocks);
> > > +
> > > +        /* Store to blocks cache, full cache line aligned */
> > > +        _mm512_storeu_si512(&block_cache[i * 8], v_masked_blocks);
> > > +    }
> > > +
> > > +    /* Hash the now linearized blocks of packet metadata. */
> > > +    ULLONG_FOR_EACH_1 (i, keys_map) {
> > > +        uint64_t *block_ptr = &block_cache[i * 8];
> > > +        uint32_t hash = hash_add_words64(0, block_ptr, bit_count_total);
> > > +        hashes[i] = hash_finish(hash, bit_count_total * 8);
> > > +    }
> > > +
> > > +    /* Lookup: this returns a bitmask of packets where the hash table had
> > > +     * an entry for the given hash key. Presence of a hash key does not
> > > +     * guarantee matching the key, as there can be hash collisions.
> > > +     */
> > > +    uint32_t found_map;
> > > +    const struct cmap_node *nodes[NETDEV_MAX_BURST];
> > > +    found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
> > > +
> > > +    /* Verify that packet actually matched rule. If not found, a hash
> > > +     * collision has taken place, so continue searching with the next node.
> > > +     */
> > > +    ULLONG_FOR_EACH_1 (i, found_map) {
> > > +        struct dpcls_rule *rule;
> > > +
> > > +        CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> > > +            const uint32_t cidx = i * 8;
> > > +            uint32_t match = netdev_rule_matches_key(rule, bit_count_total,
> > > +                                                     &block_cache[cidx]);
> > > +            if (OVS_LIKELY(match)) {
> > > +                rules[i] = rule;
> > > +                subtable->hit_cnt++;
> > > +                goto next;
> > > +            }
> > > +        }
> > > +
> > > +        /* None of the found rules was a match.  Clear the i-th bit to
> > > +         * search for this key in the next subtable. */
> > > +        ULLONG_SET0(found_map, i);
> > > +    next:
> > > +        ;                     /* Keep Sparse happy. */
> > > +    }
> > > +
> > > +    return found_map;
> > > +}
> >
> > If someone is interested, the example below, together with the slides,
> > helps in understanding the above function.
>
> Wow - nice work! Impressive and interesting to see the code taken apart
> and reduced to its logical behavior like this.
>
>
> > diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c
> > index 52348041bd00..f84a95423cf8 100644
> > --- a/lib/dpif-netdev-lookup-avx512-gather.c
> > +++ b/lib/dpif-netdev-lookup-avx512-gather.c
> > @@ -93,56 +93,77 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
> >
> >      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[NETDEV_MAX_BURST * 8];
> >
> > -    const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0];
> > -    const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1];
> > -    ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0);
> > -    ovs_assert(__builtin_popcountll(tbl_u1) == bit_count_u1);
> > +    const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0]; //1000,0000
> > +    const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1]; //0100,0000
> > +    ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0); //1
> > +    ovs_assert(__builtin_popcountll(tbl_u1) == bit_count_u1); //1
> > +    // bit_count_total = 2
> >
> >      /* Load subtable blocks for masking later */
> > -    const uint64_t *tbl_blocks = miniflow_get_values(&subtable->mask.mf);
> > -    const __m512i v_tbl_blocks = _mm512_loadu_si512(&tbl_blocks[0]);
> > +    const uint64_t *tbl_blocks = miniflow_get_values(&subtable->mask.mf); //point to ipv4 mask
> > +    const __m512i v_tbl_blocks = _mm512_loadu_si512(&tbl_blocks[0]); //point to ipv4 mask
> >
> >      /* Load pre-created subtable masks for each block in subtable */
> > -    const __mmask8 bit_count_total_mask = (1 << bit_count_total) - 1;
> > -    const __m512i v_mf_masks = _mm512_maskz_loadu_epi64(bit_count_total_mask,
> > +    const __mmask8 bit_count_total_mask = (1 << bit_count_total) - 1; // (1 << 2) - 1 = 0x3
> > +    const __m512i v_mf_masks = _mm512_maskz_loadu_epi64(bit_count_total_mask /* 0x3 */,
> >                                                          subtable->mf_masks);
> > +    // subtable->mf_masks[0] = 0b01111111
> > +    // subtable->mf_masks[1] = 0b00111111
> > +    // v_mf_masks = [0,0,0,0,0,0, 0b00111111, 0b01111111]
> >
> > -    ULLONG_FOR_EACH_1 (i, keys_map) {
> > -        const uint64_t pkt_mf_u0_bits = keys[i]->mf.map.bits[0];
> > -        const uint64_t pkt_mf_u0_pop = __builtin_popcountll(pkt_mf_u0_bits);
> > +    ULLONG_FOR_EACH_1 (i, keys_map) { // for each packet in batch
> > +        const uint64_t pkt_mf_u0_bits = keys[i]->mf.map.bits[0]; //0b1000,0100
> > +        const uint64_t pkt_mf_u0_pop = __builtin_popcountll(pkt_mf_u0_bits); //2
> >
> >          /* Pre-create register with *PER PACKET* u0 offset */
> > -        const __mmask8 u1_bcast_mask = (UINT8_MAX << bit_count_u0);
> > +        const __mmask8 u1_bcast_mask = (UINT8_MAX << bit_count_u0); //(0xff << 1) = 0xfe
> >          const __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_mask,
> >                                                                  pkt_mf_u0_pop);
> > +        //v_idx_u0_offset = [2,2,2,2,2,2,2,0]
> >
> >          /* Broadcast u0, u1 bitmasks to 8x u64 lanes */
> > -        __m512i v_u0 = _mm512_set1_epi64(pkt_mf_u0_bits);
> > -        __m512i v_pkt_bits = _mm512_mask_set1_epi64(v_u0, u1_bcast_mask,
> > -                                         keys[i]->mf.map.bits[1]);
> > +        __m512i v_u0 = _mm512_set1_epi64(pkt_mf_u0_bits); // [0b10000100,0b10000100,0b10000100, ...]
> > +
> > +        __m512i v_pkt_bits = _mm512_mask_set1_epi64(v_u0, u1_bcast_mask /*0xfe*/,
> > +                                         keys[i]->mf.map.bits[1] /* 0b01100000 */);
> > +        // [0b01100000, 0b01100000, 0b01100000, 0b01100000, 0b01100000, 0b01100000, 0b01100000, 0b10000100]
> >
> > -        /* Bitmask by pre-created masks */
> > +
> > +        /* Bitmask by pre-created masks. */
> >          __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_mf_masks);
> > +        // v_masks = [0,0,0,0,0,0, 0b00100000,0b00000100]
> >
> >          /* Manual AVX512 popcount for u64 lanes */
> >          __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
> > +        // v_popcnts = [0,0,0,0,0,0,1,1]
> >
> >          /* Offset popcounts for u1 with pre-created offset register */
> >          __m512i v_indexes = _mm512_add_epi64(v_popcnts, v_idx_u0_offset);
> > +        // v_indexes = [0,0,0,0,0,0,3,1]
> >
> >          /* Gather u64 on single packet, merge with zero reg, up to 8 blocks */
> >          const __m512i v_zeros = _mm512_setzero_si512();
> >          const uint64_t *pkt_data = miniflow_get_values(&keys[i]->mf);
> > +        // pkt_data = ipv4_src, ipv4_dst, mac_src, vlan_tci
> > +
> >          __m512i v_all_blocks = _mm512_mask_i64gather_epi64(v_zeros,
> > -                                 bit_count_total_mask, v_indexes, pkt_data, 8);
> > +                                 bit_count_total_mask /* 0x3 */,
> > +                                 v_indexes, pkt_data, 8);
> > +        //v_all_blocks: use v_index[0]=1*8 , v_index[1]=3*8 to gather data
> > +        //v_all_blocks = [0,0,0,0,0,0, ipv4_dst, vlan_tci]
> >
> >          /* Zero out bits that pkt doesn't have:
> >           * - 2x pext() to extract bits from packet miniflow as needed by TBL
> >           * - Shift u1 over by bit_count of u0, OR to create zero bitmask
> >           */
> > -         uint64_t u0_to_zero = _pext_u64(keys[i]->mf.map.bits[0], tbl_u0);
> > -         uint64_t u1_to_zero = _pext_u64(keys[i]->mf.map.bits[1], tbl_u1);
> > +         uint64_t u0_to_zero = _pext_u64(keys[i]->mf.map.bits[0] /* 0b1000,0100 */,
> > +                                         tbl_u0 /* 0b1000,0000 */);
> > +         // u0_to_zero = 0b00000001
> > +         uint64_t u1_to_zero = _pext_u64(keys[i]->mf.map.bits[1] /* 0b0110,0000 */,
> > +                                         tbl_u1 /* 0b0100,0000 */);
> > +         // u1_to_zero = 0b00000001
> >           uint64_t zero_mask = (u1_to_zero << bit_count_u0) | u0_to_zero;
> > +        // 0b00000011
> >
> >          /* Mask blocks using AND with subtable blocks, use k-mask to zero
> >           * where lanes as required for this packet.
> >
> > ---
Hi Harry,

I managed to find a machine with avx512 in google cloud and did some
performance testing. I saw lower performance when enabling avx512,
so I believe I did something wrong. Do you mind having a look?

1) first, a compile error fix:
diff --git a/lib/dpif-netdev-lookup.c b/lib/dpif-netdev-lookup.c
index b22a26b8c8a2..5c71096c10c5 100644
--- a/lib/dpif-netdev-lookup.c
+++ b/lib/dpif-netdev-lookup.c
@@ -1,5 +1,6 @@

 #include <config.h>
+#include <errno.h>
 #include "dpif-netdev-lookup.h"

 #include "openvswitch/vlog.h"
---

2) cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq
pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch
invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle
avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap
clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1
xsaves arat md_clear arch_capabilities

3) start ovs, enable the avx512 lookup, and set up traffic gen
 ovs-appctl dpif-netdev/subtable-lookup-set avx512_gather 5
 ovs-vsctl add-port br0 tg0 -- set int tg0 type=dpdk options:dpdk-devargs=vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1

4) dp flows with miniflow info
root@instance-3:~/ovs# ovs-appctl dpctl/dump-flows -m
flow-dump from pmd on cpu core: 0
ufid:caf11111-2e15-418c-a7d4-b4ec377593ca,
skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(tg0),packet_type(ns=0,id=0),eth(src=42:01:0a:b6:00:02,dst=42:01:0a:b6:00:01),eth_type(0x0800),ipv4(src=10.182.0.2/0.0.0.0,dst=76.21.95.192/0.0.0.0,proto=6/0,tos=0x10/0,ttl=64/0,frag=no),tcp(src=22/0,dst=62190/0),tcp_flags(0/0),
packets:0, bytes:0, used:never, dp:ovs, actions:drop,
dp-extra-info:miniflow_bits(5,1)
ufid:78cc1751-3a81-4dba-900c-b3507d965bdc,
skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(tg0),packet_type(ns=0,id=0),eth(src=42:01:0a:b6:00:02,dst=42:01:0a:b6:00:01),eth_type(0x0800),ipv4(src=10.182.0.2/0.0.0.0,dst=169.254.169.254/0.0.0.0,proto=6/0,tos=0/0,ttl=64/0,frag=no),tcp(src=51650/0,dst=80/0),tcp_flags(0/0),
packets:0, bytes:0, used:never, dp:ovs, actions:drop,
dp-extra-info:miniflow_bits(5,1)

5) pmd-stat-show
root@instance-3:~/ovs# ovs-appctl dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 0:
  packets received: 19838528
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 0
  smc hits: 0
  megaflow hits: 0
  avg. subtable lookups per megaflow hit: 0.00  (---> this doesn't
look right ....)
  miss with success upcall: 78
  miss with failed upcall: 19838418
  avg. packets per output batch: 2.00
  idle cycles: 0 (0.00%)
  processing cycles: 103431787838 (100.00%)
  avg cycles per packet: 5213.68 (103431787838/19838528)
  avg processing cycles per packet: 5213.68 (103431787838/19838528)

6) gdb output also doesn't look right; I didn't see any avx512 instructions
(gdb) b avx512_lookup_impl
Breakpoint 2 at 0x55e92342a8df: avx512_lookup_impl. (4 locations)
Dump of assembler code for function dpcls_avx512_gather_skx_mf_5_1:
96     const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0];
   0x000055e92342a8df <+31>: mov    0x30(%rdi),%r8
97     const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1];
   0x000055e92342a8e3 <+35>: mov    0x38(%rdi),%r9
98     ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0);
   0x000055e92342a8f6 <+54>: xor    %eax,%eax
   0x000055e92342a8f8 <+56>: popcnt %r8,%rax
   0x000055e92342a8fd <+61>: cmp    $0x5,%eax
   0x000055e92342a900 <+64>: jne    0x55e92342abc3
<dpcls_avx512_gather_skx_mf_5_1+771>
   0x000055e92342abc3 <+771>: lea    0x277b0e(%rip),%rdx        # 0x55e9236a26d8
   0x000055e92342abca <+778>: lea    0x277ccf(%rip),%rsi        #
0x55e9236a28a0 <__func__.43755>
   0x000055e92342abd1 <+785>: lea    0x277b30(%rip),%rdi        # 0x55e9236a2708
   0x000055e92342abd8 <+792>: callq  0x55e9233a71e0 <ovs_assert_failure>
99     ovs_assert(__builtin_popcountll(tbl_u1) == bit_count_u1);
   0x000055e92342a906 <+70>: xor    %eax,%eax
   0x000055e92342a908 <+72>: popcnt %r9,%rax
   0x000055e92342a90d <+77>: cmp    $0x1,%eax
   0x000055e92342a910 <+80>: jne    0x55e92342abdd
<dpcls_avx512_gather_skx_mf_5_1+797>
   0x000055e92342a916 <+86>: mov    %rcx,%r12
   0x000055e92342abdd <+797>: lea    0x277b54(%rip),%rdx        # 0x55e9236a2738
   0x000055e92342abe4 <+804>: lea    0x277cb5(%rip),%rsi        #
0x55e9236a28a0 <__func__.43755>
   0x000055e92342abeb <+811>: lea    0x277b76(%rip),%rdi        # 0x55e9236a2768
   0x000055e92342abf2 <+818>: callq  0x55e9233a71e0 <ovs_assert_failure>
100
101     /* Load subtable blocks for masking later */
102     const uint64_t *tbl_blocks = miniflow_get_values(&subtable->mask.mf);
103     const __m512i v_tbl_blocks = _mm512_loadu_si512(&tbl_blocks[0]);
104
105     /* Load pre-created subtable masks for each block in subtable */
106     const __mmask8 bit_count_total_mask = (1 << bit_count_total) - 1;
107     const __m512i v_mf_masks =
_mm512_maskz_loadu_epi64(bit_count_total_mask,
108                                                         subtable->mf_masks);
109
110     ULLONG_FOR_EACH_1 (i, keys_map) {
111         const uint64_t pkt_mf_u0_bits = keys[i]->mf.map.bits[0];
   0x000055e92342a98a <+202>: movslq %ecx,%rax
   0x000055e92342a990 <+208>: mov    (%rdx,%rax,8),%r11
   0x000055e92342a999 <+217>: mov    0x8(%r11),%r10
112         const uint64_t pkt_mf_u0_pop = __builtin_popcountll(pkt_mf_u0_bits);
113
114         /* Pre-create register with *PER PACKET* u0 offset */
115         const __mmask8 u1_bcast_mask = (UINT8_MAX << bit_count_u0);
116         const __m512i v_idx_u0_offset =
_mm512_maskz_set1_epi64(u1_bcast_mask,
   0x000055e92342a994 <+212>: xor    %eax,%eax
   0x000055e92342a99d <+221>: popcnt %r10,%rax

Thanks!
William