Hi Ian,

Further comments are inline.
> -----Original Message----- > From: Stokes, Ian <[email protected]> > Sent: Wednesday 16 June 2021 12:03 > To: Ferriter, Cian <[email protected]>; [email protected]; Van > Haaren, Harry <[email protected]> > Cc: [email protected] > Subject: RE: [ovs-dev] [v12 04/16] dpif-avx512: Add ISA implementation of > dpif. > > > Hi Ian, > > > > Thanks for the review. My responses are inline. > > > > > -----Original Message----- > > > From: Stokes, Ian <[email protected]> > > > Sent: Tuesday 1 June 2021 19:59 > > > To: Ferriter, Cian <[email protected]>; [email protected]; Van > > Haaren, Harry <[email protected]> > > > Cc: [email protected] > > > Subject: RE: [ovs-dev] [v12 04/16] dpif-avx512: Add ISA implementation of > > dpif. > > > > > > > This commit adds the AVX512 implementation of DPIF functionality, > > > > specifically the dp_netdev_input_outer_avx512 function. This function > > > > only handles outer (no re-circulations), and is optimized to use the > > > > AVX512 ISA for packet batching and other DPIF work. > > > > > > > > Sparse is not able to handle the AVX512 intrinsics, causing compile > > > > time failures, so it is disabled for this file. > > > > > > > > Signed-off-by: Harry van Haaren <[email protected]> > > > > Co-authored-by: Cian Ferriter <[email protected]> > > > > Signed-off-by: Cian Ferriter <[email protected]> > > > > > > Thanks for the patch Harry/Cian, still testing this to a degree but > > > questions > > below on initial thoughts. > > > > > > > > > > > --- > > > > > > > > v8: > > > > - Fixup AVX512 mask to uint32_t conversion compilation warning. 
> > > > --- > > > > lib/automake.mk | 5 +- > > > > lib/dpif-netdev-avx512.c | 265 +++++++++++++++++++++++++++++++ > > > > lib/dpif-netdev-private-dfc.h | 8 + > > > > lib/dpif-netdev-private-dpif.h | 32 ++++ > > > > lib/dpif-netdev-private-thread.h | 11 +- > > > > lib/dpif-netdev-private.h | 25 +++ > > > > lib/dpif-netdev.c | 70 ++++++-- > > > > 7 files changed, 400 insertions(+), 16 deletions(-) > > > > create mode 100644 lib/dpif-netdev-avx512.c > > > > create mode 100644 lib/dpif-netdev-private-dpif.h > > > > > > > > diff --git a/lib/automake.mk b/lib/automake.mk > > > > index 0bef0cc69..5fab8ba4f 100644 > > > > --- a/lib/automake.mk > > > > +++ b/lib/automake.mk > > > > @@ -33,11 +33,13 @@ lib_libopenvswitchavx512_la_CFLAGS = \ > > > > -mavx512f \ > > > > -mavx512bw \ > > > > -mavx512dq \ > > > > +-mbmi \ > > > > > > Can I ask what's needed in bmi that was not already included in bmi2? Just > > curiosity. > > > > > > > So in the dp_netdev_input_outer_avx512() function (the AVX512 DPIF > > implementation), ' _blsr_u64' is used. > > It's used twice to reset (set to 0) the lowest bit in a variable. 
> > > > More info on '_blsr_u64': > > https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_blsr_u64& > > expand=463 > > > > > > -mbmi2 \ > > > > -fPIC \ > > > > $(AM_CFLAGS) > > > > lib_libopenvswitchavx512_la_SOURCES = \ > > > > -lib/dpif-netdev-lookup-avx512-gather.c > > > > +lib/dpif-netdev-lookup-avx512-gather.c \ > > > > +lib/dpif-netdev-avx512.c > > > > lib_libopenvswitchavx512_la_LDFLAGS = \ > > > > -static > > > > endif > > > > @@ -113,6 +115,7 @@ lib_libopenvswitch_la_SOURCES = \ > > > > lib/dpif-netdev.h \ > > > > lib/dpif-netdev-private-dfc.h \ > > > > lib/dpif-netdev-private-dpcls.h \ > > > > +lib/dpif-netdev-private-dpif.h \ > > > > lib/dpif-netdev-private-flow.h \ > > > > lib/dpif-netdev-private-hwol.h \ > > > > lib/dpif-netdev-private-thread.h \ > > > > diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c > > > > new file mode 100644 > > > > index 000000000..91f51c479 > > > > --- /dev/null > > > > +++ b/lib/dpif-netdev-avx512.c > > > > @@ -0,0 +1,265 @@ > > > > +/* > > > > + * Copyright (c) 2021 Intel Corporation. > > > > + * > > > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > > > + * you may not use this file except in compliance with the License. > > > > + * You may obtain a copy of the License at: > > > > + * > > > > + * http://www.apache.org/licenses/LICENSE-2.0 > > > > + * > > > > + * Unless required by applicable law or agreed to in writing, software > > > > + * distributed under the License is distributed on an "AS IS" BASIS, > > > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > > > > implied. > > > > + * See the License for the specific language governing permissions and > > > > + * limitations under the License. > > > > + */ > > > > + > > > > +#ifdef __x86_64__ > > > > +/* Sparse cannot handle the AVX512 instructions. */ > > > > > > So is this a limitation with sparse currently? 
Do you know if there are > > > any plans > > for support in sparse for AVX512 in the future? > > > > > > > Yes, unfortunately this is a limitation with sparse. I'm not sure if this > > will be > > added in the future. > > > OK, important to keep in mind so for any future AVX512 work that it needs to > be excluded with Sparse. > > Would be interesting to get line of sight on whether a solution in the sparse > community in the future, not a blocker here I would say > for the moment. > > > > > +#if !defined(__CHECKER__) > > > > + > > > > +#include <config.h> > > > > + > > > > +#include "dpif-netdev.h" > > > > +#include "dpif-netdev-perf.h" > > > > + > > > > +#include "dpif-netdev-private.h" > > > > +#include "dpif-netdev-private-dpcls.h" > > > > +#include "dpif-netdev-private-flow.h" > > > > +#include "dpif-netdev-private-thread.h" > > > > + > > > > +#include "dp-packet.h" > > > > +#include "netdev.h" > > > > + > > > > +#include "immintrin.h" > > > > + > > > > +/* Structure to contain per-packet metadata that must be attributed to > > > > the > > > > + * dp netdev flow. This is unfortunate to have to track per packet, > > > > however > > > > + * it's a bit awkward to maintain them in a performant way. This > > > > structure > > > > + * helps to keep two variables on a single cache line per packet. > > > > + */ > > > > +struct pkt_flow_meta { > > > > + uint16_t bytes; > > > > + uint16_t tcp_flags; > > > > +}; > > > > + > > > > +/* Structure of heap allocated memory for DPIF internals. 
*/ > > > > +struct dpif_userdata { > > > > + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) > > > > + struct netdev_flow_key keys[NETDEV_MAX_BURST]; > > > > + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) > > > > + struct netdev_flow_key *key_ptrs[NETDEV_MAX_BURST]; > > > > + OVS_ALIGNED_VAR(CACHE_LINE_SIZE) > > > > + struct pkt_flow_meta pkt_meta[NETDEV_MAX_BURST]; > > > > +}; > > > > + > > > > +int32_t > > > > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, > > > > + struct dp_packet_batch *packets, > > > > + odp_port_t in_port) > > > > +{ > > > > + /* Allocate DPIF userdata. */ > > > > + if (OVS_UNLIKELY(!pmd->netdev_input_func_userdata)) { > > > > + pmd->netdev_input_func_userdata = > > > > + xmalloc_pagealign(sizeof(struct dpif_userdata)); > > > > + } > > > > + > > > > + struct dpif_userdata *ud = pmd->netdev_input_func_userdata; > > > > + struct netdev_flow_key *keys = ud->keys; > > > > + struct netdev_flow_key **key_ptrs = ud->key_ptrs; > > > > + struct pkt_flow_meta *pkt_meta = ud->pkt_meta; > > > > + > > > > + /* The AVX512 DPIF implementation handles rules in a way that is > > > > optimized > > > > + * for reducing data-movement between HWOL/EMC/SMC and DPCLS. > > > > This is > > > > + * achieved by separating the rule arrays. Bitmasks are kept for > > > > each > > > > + * packet, indicating if it matched in the HWOL/EMC/SMC array or > > > > DPCLS > > > > + * array. Later the two arrays are merged by AVX-512 expand > > instructions. > > > > + */ > > > > + > > > > + /* Stores the computed output: a rule pointer for each packet. */ > > > > + struct dpcls_rule *rules[NETDEV_MAX_BURST]; > > > > + struct dpcls_rule *dpcls_rules[NETDEV_MAX_BURST]; > > > > > > At first glimpse there's not much difference between rules and dpcls_rules > > above, maybe clearer through the code later but one line > > > comment on each to explain their use would be beneficial. > > > > > > > Good point. Let's make this clear. 
I'll add comments in the next version to > > highlight that "*rules[]" is for storing rule pointers from HWOL/EMC/SMC and > > "*dpcls_rules[]" is for rule pointers from DPCLS. > > > > > > + uint32_t dpcls_key_idx = 0; > > > > + > > > > + for (uint32_t i = 0; i < NETDEV_MAX_BURST; i += 8) { > > > > > > Magic number 8 above. As you using it to index into the array of rules I > > > can see > > why you would use it in this form if it was a once off. > > > > > > But from a quick glimpse of the code I can see "8" being used to index in > > > the > > arrays multiple times, I'd suggest a define equal to 8 at > > > the beginning of the file in this case instead. > > > > > > > Unless you feel strongly about the #define I prefer having the "8" number > > at the > > code locations where it's used rather than having to remember the value of a > > macro called " UINT64_PER_ZMM" or something similar. > > > > Let me know what you think. > > So if it's once off I think it's ok, but if it's used more than once in > different areas of the code I'd prefer to see a Define. > > If you are aware of other areas in the code that will use it I would suggest > using the define. > I've refactored the code to use a hash define here. I'll send in the next version. > > > > > > + _mm512_storeu_si512(&rules[i], _mm512_setzero_si512()); > > > > + _mm512_storeu_si512(&dpcls_rules[i], _mm512_setzero_si512()); > > > > + } > > > > > > So from above the first operation is to set all elements in rules and > > > dpcls_rules > > to 0, as they were allocated values may not have been > > > set to zero already correct? > > > > > > > Correct, this is a faster way of initializing the entire rules and > > dpcls_rules arrays > > to 0. > > > > > > + > > > > + /* Prefetch each packet's metadata. 
*/ > > > > + const size_t batch_size = dp_packet_batch_size(packets); > > > > + for (int i = 0; i < batch_size; i++) { > > > > + struct dp_packet *packet = packets->packets[i]; > > > > + OVS_PREFETCH(dp_packet_data(packet)); > > > > + pkt_metadata_prefetch_init(&packet->md); > > > > + } > > > > + > > > > + /* Check if EMC or SMC are enabled. */ > > > > + struct dfc_cache *cache = &pmd->flow_cache; > > > > + const uint32_t emc_enabled = pmd->ctx.emc_insert_min != 0; > > > > + const uint32_t smc_enabled = pmd->ctx.smc_enable_db; > > > > > > I assume there is a lock on the pmd being called here (i.e. before > > dp_netdev_input_outer_avx512 is called) so as to avoid these pmd > > > values being changed while this check is occurring? > > > > > > > I think there's no lock on the pmd. These "pmd->ctx.*mc" values are set by > > the > > pmd thread in pmd_thread_main() and are private to each thread. Then > > pmd_thread_main() calls into either scalar or AVX512 DPIFs and reads the > > values. They can't be changed by another thread in the meantime. They are > > only > > set by the thread which will use the values. Hopefully that makes sense. > > > > OK that makes sense. > > > > > + > > > > + uint32_t emc_hits = 0; > > > > + uint32_t smc_hits = 0; > > > > + > > > > + /* A 1 bit in this mask indicates a hit, so no DPCLS lookup on the > > > > pkt. */ > > > > + uint32_t hwol_emc_smc_hitmask = 0; > > > > + > > > > + /* Perform first packet interation. */ > > > > > > Minor typo above, iteration. > > > > > > > Good catch, I'll fix this in the next version. > > > > > > + uint32_t lookup_pkts_bitmask = (1ULL << batch_size) - 1; > > > > + uint32_t iter = lookup_pkts_bitmask; > > > > + while (iter) { > > > > + uint32_t i = __builtin_ctz(iter); > > > > + iter = _blsr_u64(iter); > > > > > > Having some trouble understanding the logic above for iter, i and > > lookup_pkts_bitmask.
> > > > From what I can tell, lookup_pkts_bitmask will represent the number of > > packets in a batch that require a lookup? > > > > > > Set iter equal to the lookup_pkts_bitmask (assuming lookup_pkts_bitmask > > > will > > be required later so must be unchanged until then). > > > > > > > Yes, all the above is true. > > > > > For "i" set it to the number of trailing zeroes following the LSB in > > > iter. So at > > this point, does each trailing zero represent a packet yet > > > to be extracted? Are you expecting that it has already hit the > > > EMC/SMC/HWOL > > here or is this for a packet that has not hit any of these > > > yet? > > > > > > > Each trailing zero represents a packet that has already been extracted and > > looked up with HWOL/EMC/SMC if any of those are enabled. "i" is for all > > packets > > that have entered this AVX512 DPIF. > > > > > Also __builtin_ctz can return undefined, I'm thinking is there a case to > > > be > > handled here for that situation? > > > > > > > I think this isn't an issue. From https://gcc.gnu.org/onlinedocs/gcc/Other- > > Builtins.html: > > Built-in Function: int __builtin_ctz (unsigned int x) > > Returns the number of trailing 0-bits in x, starting at the least > > significant bit > > position. If x is 0, the result is undefined. > > > > We check whether "iter" is 0 as part of the loop "while (iter)". So we > > shouldn't > > get an undefined result. > > > > Yes, you're correct. > > > > Finally set iter to _blsr_u64(iter), as this is the only change that I > > > saw that > > would influence iter to break the while loop, could you > > > explain the expected operation here? > > > > > > From the intrinsic guide on _blsr_u64: > > > > > > Copy all bits from unsigned 64-bit integer a to dst, and reset (set to 0) > > > the bit in > > dst that corresponds to the lowest set bit in a.
> > > > > > From the initial comment before this block, I assumed this is a once off > > operation on the first packet, but is it the case it happens for all > > > packets? > > > > > > > The expected operation here is that we have a number of bits set in "iter" > > representing packets that we want to process (HWOL->MFEX->EMC->SMC). As > > each packet is processed, we clear (set to 0) the bit representing that > > packet > > using "_blsr_u64()". The "__builtin_ctz()" will give us the correct index > > into the > > "packets", "pkt_meta", "keys" and "rules" arrays. > > > > Let's walk through the below code with a 4 packet batch example. Let's > > represent "iter" in binary. > > > > while (iter) { > > uint32_t i = __builtin_ctz(iter); > > iter = _blsr_u64(iter); > > > > Replace the "while" with an "if" and write out each iteration > > if (iter) { // iter = 1111 > > uint32_t i = __builtin_ctz(iter); // i = 0 > > iter = _blsr_u64(iter); // iter = 1110 > > // do all processing (HWOL->MFEX->EMC->SMC) > > } > > if (iter) { // iter = 1110 > > uint32_t i = __builtin_ctz(iter); // i = 1 > > iter = _blsr_u64(iter); // iter = 1100 > > // do all processing (HWOL->MFEX->EMC->SMC) > > } > > if (iter) { // iter = 1100 > > uint32_t i = __builtin_ctz(iter); // i = 2 > > iter = _blsr_u64(iter); // iter = 1000 > > // do all processing (HWOL->MFEX->EMC->SMC) > > } > > if (iter) { // iter = 1000 > > uint32_t i = __builtin_ctz(iter); // i = 3 > > iter = _blsr_u64(iter); // iter = 0000 > > // do all processing (HWOL->MFEX->EMC->SMC) > > } > > if (iter) { // iter = 0000 > > // the if check fails, this isn't reached, move on to DPCLS code. > > } > > > > Hopefully that makes sense. > > > > That's really helpful, can I make a suggestion to provide a summarized > example of this as a comment before the function itself? > As this is the expected behavior it will make it easier to test > against/modify if needs be in future.
It does not have to be as detailed > As above (maybe even provide the example for 1100). I think long term it would > help with maintainability. > I've added a comment with this explanation, thanks for the suggestion! > > > > + > > > > + /* Initialize packet md and do miniflow extract. */ > > > > + struct dp_packet *packet = packets->packets[i]; > > > > + pkt_metadata_init(&packet->md, in_port); > > > > + struct netdev_flow_key *key = &keys[i]; > > > > + miniflow_extract(packet, &key->mf); > > > > + > > > > + /* Cache TCP and byte values for all packets. */ > > > > + pkt_meta[i].bytes = dp_packet_size(packet); > > > > + pkt_meta[i].tcp_flags = miniflow_get_tcp_flags(&key->mf); > > > > + > > > > + key->len = netdev_flow_key_size(miniflow_n_values(&key->mf)); > > > > + key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet, > > > > &key->mf); > > > > + > > > > + struct dp_netdev_flow *f = NULL; > > > > + > > > > + if (emc_enabled) { > > > > + f = emc_lookup(&cache->emc_cache, key); > > > > + > > > > + if (f) { > > > > + rules[i] = &f->cr; > > > > + emc_hits++; > > > > + hwol_emc_smc_hitmask |= (1 << i); > > > > > > So HWOL may be disabled, but the assumption here would be that if it is > > enabled you would have a hit, is that correct? > > > > > > > We implement HWOL in a similar way to the scalar DPIF. We > > don't check recirculation depth since the AVX512 DPIF is for outer packets > > only. > > Otherwise, the checks are the same as with the scalar DPIF. So basically the > > checks are: > > 1. Does the packet have a flow mark? > > 2. If so, does the flow mark match with an actual flow? > > > > So there is no assumption that there will be a hit. We just check whether > > there is > > a hit. > > Understood. > > > > > I'm wondering is there a case with this logic that you have a traffic > > > type that > > we have a hit for in EMC/SMC but that possibly is not > > > supported by HWOL and as such you may not have a hit?
> > > > > > > > Yes, this is possible. When there is no HWOL hit, we will continue to EMC > > and > > SMC lookups. These will be performed if they are enabled respectively. > > > > Thanks for clarifying. > > > > > + continue; > > > > + } > > > > + }; > > > > > > Is the semicolon a typo above? > > > > > > > Good catch, I'll remove this in the next version. > > > > > > + > > > > + if (smc_enabled && !f) { > > > > + f = smc_lookup_single(pmd, packet, key); > > > > + if (f) { > > > > + rules[i] = &f->cr; > > > > + smc_hits++; > > > > + hwol_emc_smc_hitmask |= (1 << i); > > > > + continue; > > > > + } > > > > + } > > > > + > > > > + /* The flow pointer was not found in HWOL/EMC/SMC, so add it > > > > to the > > > > + * dpcls input keys array for batch lookup later. > > > > + */ > > > > + key_ptrs[dpcls_key_idx] = &keys[i]; > > > > + dpcls_key_idx++; > > > > + } > > > > + > > > > + > > > > + /* DPCLS handles any packets missed by HWOL/EMC/SMC. It operates on > > > > the > > > > + * key_ptrs[] for input miniflows to match, storing results in the > > > > + * dpcls_rules[] array. > > > > + */ > > > > + if (dpcls_key_idx > 0) { > > > > + struct dpcls *cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port); > > > > + if (OVS_UNLIKELY(!cls)) { > > > > + return -1; > > > > + } > > > > + int any_miss = !dpcls_lookup(cls, > > > > + (const struct netdev_flow_key **) > > > > key_ptrs, > > > > + dpcls_rules, dpcls_key_idx, NULL); > > > > > > Is there a reason you've used int any_miss rather than a bool above? > > > > > > > This is just personal preference. I'll change this to use a bool in the > > next version. > > > > > dpcls_lookup returns bool anyway, true if all entries are found, > > > otherwise false > > so it could avoid the ! operand also on the call to > > > dpcls_lookup no? > > > > > > > We call "dpcls_lookup" in a similar way to the scalar DPIF here, with the > > "!" > > operand to represent "any_miss" and branch on this afterwards. Hopefully > > that's > > OK.
> > I think it is, just had to re-read on first glimpse. > > > > > > > + if (OVS_UNLIKELY(any_miss)) { > > > > + return -1; > > > > + } > > > > + > > > > + /* Merge DPCLS rules and HWOL/EMC/SMC rules. */ > > > > + uint32_t dpcls_idx = 0; > > > > + for (int i = 0; i < NETDEV_MAX_BURST; i += 8) { > > > > + /* Indexing here is somewhat complicated due to DPCLS > > > > output rule > > > > + * load index depending on the hitmask of HWOL/EMC/SMC. > > > > More > > > > + * packets from HWOL/EMC/SMC bitmask means less DPCLS rules > > are > > > > + * used. > > > > + */ > > > > + __m512i v_cache_rules = _mm512_loadu_si512(&rules[i]); > > > > + __m512i v_merged_rules = > > > > + _mm512_mask_expandloadu_epi64(v_cache_rules, > > > > + > > > > ~hwol_emc_smc_hitmask, > > > > + > > > > &dpcls_rules[dpcls_idx]); > > > To clarify above, where is the destination for the memory load here? Is it > > v_merged_rules? > > > > > > > We are storing the result of the memory load to "v_merged_rules". > > > > > Also is it the mask is interacting with the dpcls_rules[dpcls_idx] ? i.e. > > > for each > > bit set in mask take the value from dpcls_rules, but > > > when the bit is not set take the value from v_cache_rules, just trying to > > > get my > > head around the operation of how the merged rules > > > would look. > > > > > > > Yes, you are correct. This is where we are merging some elements of the > > "dpcls_rules" array and some elements of the "rules" array. For the > > "_mm512_mask_expandloadu_epi64()", we are loading from "dpcls_rules" or > > "v_cache_rules" (which is the "rules" array loaded into an AVX512 zmm > > register). > > When a bit in the mask is set, we take that "dpcls_rule". Otherwise, we use > > the > > "rule". > > > > > > + _mm512_storeu_si512(&rules[i], v_merged_rules); > > > > + > > > > + /* Update DPCLS load index and bitmask for HWOL/EMC/SMC > > > > hits. 
> > > > + * There are 8 output pointer per register, subtract the > > > > + * HWOL/EMC/SMC lanes equals the number of DPCLS rules > > > > consumed. > > > > + */ > > > > + uint32_t hitmask_FF = (hwol_emc_smc_hitmask & 0xFF); > > > > + dpcls_idx += 8 - __builtin_popcountll(hitmask_FF); > > > > + hwol_emc_smc_hitmask = (hwol_emc_smc_hitmask >> 8); > > > > + } > > > > + } > > > > + > > > > + /* At this point we don't return error anymore, so commit stats > > > > here. */ > > > > + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RECV, > > > > batch_size); > > > > + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT, > > > > emc_hits); > > > > + pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, > > > > smc_hits); > > > > + pmd_perf_update_counter(&pmd->perf_stats, > > > > PMD_STAT_MASKED_HIT, > > > > + dpcls_key_idx); > > > > + pmd_perf_update_counter(&pmd->perf_stats, > > > > PMD_STAT_MASKED_LOOKUP, > > > > + dpcls_key_idx); > > > > + > > > > + /* Initialize the "Action Batch" for each flow handled below. */ > > > > + struct dp_packet_batch action_batch; > > > > + action_batch.trunc = 0; > > > > + action_batch.do_not_steal = false; > > > > + > > > > + while (lookup_pkts_bitmask) { > > > > + uint32_t rule_pkt_idx = __builtin_ctz(lookup_pkts_bitmask); > > > > + uint64_t needle = (uintptr_t) rules[rule_pkt_idx]; > > > > + > > > > + /* Parallel compare 8 flow* 's to the needle, create a > > > > bitmask. */ > > > > + uint32_t batch_bitmask = 0; > > > > + for (uint32_t j = 0; j < NETDEV_MAX_BURST; j += 8) { > > > > + /* Pre-calculate store addr */ > > > > > > Minor, period missing above in comment. > > > > > > > Good catch, I'll fix this in the next version. > > > > > > + uint32_t num_pkts_in_batch = > > > > __builtin_popcountll(batch_bitmask); > > > > > > So num_pkts_in_batch will always be 0 on first iteration as batch_bitmask > > > will > > be equal to zero? > > > > > > > Correct. 
> > > > > A little unsure here of the ordering but I guess the key is to update the > > > bitmask > > for the next iteration? > > > > > > > The popcount of "batch_bitmask" looks silly for the first iteration of the > > for > > loop, but makes sense for the subsequent iterations of the for loop. We > > calculate "num_pkts_in_batch" to get the correct index to use for the > > "action_batch" of packets. > > > > Sure, that confirms what I thought. > > > > > + void *store_addr = > > > > &action_batch.packets[num_pkts_in_batch]; > > > > + > > > > + /* Search for identical flow* in burst, update bitmask. */ > > > > + __m512i v_needle = _mm512_set1_epi64(needle); > > > > + __m512i v_hay = _mm512_loadu_si512(&rules[j]); > > > > + __mmask8 k_cmp_bits = _mm512_cmpeq_epi64_mask(v_needle, > > > > v_hay); > > > > + uint32_t cmp_bits = k_cmp_bits; > > > > + batch_bitmask |= cmp_bits << j; > > > > + > > > > + /* Compress and store the batched packets. */ > > > > + struct dp_packet **packets_ptrs = &packets->packets[j]; > > > > + __m512i v_pkt_ptrs = _mm512_loadu_si512(packets_ptrs); > > > > + _mm512_mask_compressstoreu_epi64(store_addr, cmp_bits, > > > > v_pkt_ptrs); > > > > + } > > > > + > > > > + /* Strip all packets in this batch from the > > > > lookup_pkts_bitmask. */ > > > > + lookup_pkts_bitmask &= (~batch_bitmask); > > > > + action_batch.count = __builtin_popcountll(batch_bitmask); > > > > + > > > > + /* Loop over all packets in this batch, to gather the byte and > > > > tcp_flag > > > > + * values, and pass them to the execute function. It would be > > > > nice to > > > > + * optimize this away, however it is not easy to refactor in > > > > dpif.
> > > > + */ > > > > + uint32_t bytes = 0; > > > > + uint16_t tcp_flags = 0; > > > > + uint32_t bitmask_iter = batch_bitmask; > > > > + for (int i = 0; i < action_batch.count; i++) { > > > > + uint32_t idx = __builtin_ctzll(bitmask_iter); > > > > + bitmask_iter = _blsr_u64(bitmask_iter); > > > > + > > > > + bytes += pkt_meta[idx].bytes; > > > > + tcp_flags |= pkt_meta[idx].tcp_flags; > > > > + } > > > > + > > > > + dp_netdev_batch_execute(pmd, &action_batch, > > > > rules[rule_pkt_idx], > > > > + bytes, tcp_flags); > > > > + } > > > > + > > > > + return 0; > > > > +} > > > > + > > > > +#endif > > > > +#endif > > > > diff --git a/lib/dpif-netdev-private-dfc.h > > > > b/lib/dpif-netdev-private-dfc.h > > > > index 52349a3fc..bd18bd3fd 100644 > > > > --- a/lib/dpif-netdev-private-dfc.h > > > > +++ b/lib/dpif-netdev-private-dfc.h > > > > @@ -81,6 +81,9 @@ extern "C" { > > > > #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \ > > > > DEFAULT_EM_FLOW_INSERT_INV_PROB) > > > > > > > > +/* Forward declaration for SMC function prototype. */ > > > > > > A little vague, maybe extend with " prototypes that require access to > > dp_netdev_pmd_thread " > > > > > > > I'll add more detail in the next version, like this: > > /* Forward declaration for SMC function prototype that requires access to > > * 'struct dp_netdev_pmd_thread'. */ > > > > Sounds good. > > > > > +struct dp_netdev_pmd_thread; > > > > + > > > > struct emc_entry { > > > > struct dp_netdev_flow *flow; > > > > struct netdev_flow_key key; /* key.hash used for emc hash value. 
> > > > */ > > > > @@ -237,6 +240,11 @@ emc_lookup(struct emc_cache *cache, const struct > > > > netdev_flow_key *key) > > > > return NULL; > > > > } > > > > > > > > +struct dp_netdev_flow * > > > > +smc_lookup_single(struct dp_netdev_pmd_thread *pmd, > > > > + struct dp_packet *packet, > > > > + struct netdev_flow_key *key); > > > > + > > > > #ifdef __cplusplus > > > > } > > > > #endif > > > > diff --git a/lib/dpif-netdev-private-dpif.h > > > > b/lib/dpif-netdev-private-dpif.h > > > > new file mode 100644 > > > > index 000000000..2fd7cc400 > > > > --- /dev/null > > > > +++ b/lib/dpif-netdev-private-dpif.h > > > > @@ -0,0 +1,32 @@ > > > > +/* > > > > + * Copyright (c) 2021 Intel Corporation. > > > > + * > > > > + * Licensed under the Apache License, Version 2.0 (the "License"); > > > > + * you may not use this file except in compliance with the License. > > > > + * You may obtain a copy of the License at: > > > > + * > > > > + * http://www.apache.org/licenses/LICENSE-2.0 > > > > + * > > > > + * Unless required by applicable law or agreed to in writing, software > > > > + * distributed under the License is distributed on an "AS IS" BASIS, > > > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > > > > implied. > > > > + * See the License for the specific language governing permissions and > > > > + * limitations under the License. > > > > + */ > > > > + > > > > +#ifndef DPIF_NETDEV_PRIVATE_DPIF_H > > > > +#define DPIF_NETDEV_PRIVATE_DPIF_H 1 > > > > + > > > > +#include "openvswitch/types.h" > > > > + > > > > +/* Forward declarations to avoid including files. */ > > > > +struct dp_netdev_pmd_thread; > > > > +struct dp_packet_batch; > > > > + > > > > +/* Available implementations for dpif work. */ > > > > +int32_t > > > > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd, > > > > + struct dp_packet_batch *packets, > > > > + odp_port_t in_port); > > > > > > Just on the function name dp_netdev_input_outer_avx512 Above. 
> > > > > > Is it likely to see a separate inner equivalent in the future? And would > > > it be a > > separate function such as > > > dp_netdev_input_inner_avx512 or would it just be on function handling both > > cases? > > > > > > > We want to add support for recirculation, or handling of the inner headers > > of > > packets to provide more performance benefits. Whether it will be separate > > functions is a good question. We'll need to look closer when it comes to > > actually > > adding support for inner packets and find the appropriate solution. > > > > I guess naming this with "outer" hopefully makes sense for now since it only > > handles a packets first pass through OVS. > > Sure, outer makes sense for the moment, could be updated at a later stage > depending on the implementation. > > > > > > > + > > > > +#endif /* netdev-private.h */ > > > > diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private- > > > > thread.h > > > > index 01a28a681..c0c94c566 100644 > > > > --- a/lib/dpif-netdev-private-thread.h > > > > +++ b/lib/dpif-netdev-private-thread.h > > > > @@ -45,14 +45,19 @@ struct dp_netdev_pmd_thread_ctx { > > > > struct dp_netdev_rxq *last_rxq; > > > > /* EMC insertion probability context for the current processing > > > > cycle. */ > > > > uint32_t emc_insert_min; > > > > + /* Enable the SMC cache from ovsdb config. */ > > > > + bool smc_enable_db; > > > > }; > > > > > > > > /* Forward declaration for typedef. */ > > > > struct dp_netdev_pmd_thread; > > > > > > > > -typedef void (*dp_netdev_input_func)(struct dp_netdev_pmd_thread > > > > *pmd, > > > > - struct dp_packet_batch *packets, > > > > - odp_port_t port_no); > > > > +/* Typedef for DPIF functions. > > > > + * Returns a bitmask of packets to handle, possibly including > > > > upcall/misses. 
> > > > + */ > > > > +typedef int32_t (*dp_netdev_input_func)(struct dp_netdev_pmd_thread > > > > *pmd, > > > > + struct dp_packet_batch > > > > *packets, > > > > + odp_port_t port_no); > > > > > > > > /* PMD: Poll modes drivers. PMD accesses devices via polling to > > > > eliminate > > > > * the performance overhead of interrupt processing. Therefore netdev > > > > can > > > > diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h > > > > index d7b6fd7ec..0315b5bf6 100644 > > > > --- a/lib/dpif-netdev-private.h > > > > +++ b/lib/dpif-netdev-private.h > > > > @@ -31,4 +31,29 @@ > > > > #include "dpif-netdev-private-dfc.h" > > > > #include "dpif-netdev-private-thread.h" > > > > > > > > +/* Allow other implementations to lookup the DPCLS instances. */ > > > > +struct dpcls * > > > > +dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd, > > > > + odp_port_t in_port); > > > > + > > > > +/* Allow other implementations to call dpcls_lookup() for subtable > > > > search. > > > > */ > > > > +bool > > > > +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[], > > > > + struct dpcls_rule **rules, const size_t cnt, > > > > + int *num_lookups_p); > > > > + > > > > +/* Allow other implementations to execute actions on a batch. */ > > > > +void > > > > +dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd, > > > > + struct dp_packet_batch *packets, > > > > + struct dpcls_rule *rule, > > > > + uint32_t bytes, > > > > + uint16_t tcp_flags); > > > > + > > > > +/* Available implementations for dpif work. 
*/
> > > > +int32_t
> > > > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
> > > > +                             struct dp_packet_batch *packets,
> > > > +                             odp_port_t in_port);
> > > > +
> > > > #endif /* netdev-private.h */
> > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > > index bec984643..5ed61d08b 100644
> > > > --- a/lib/dpif-netdev.c
> > > > +++ b/lib/dpif-netdev.c
> > > > @@ -185,10 +185,6 @@ static uint32_t dpcls_subtable_lookup_reprobe(struct dpcls *cls);
> > > > static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
> > > >                          const struct netdev_flow_key *mask);
> > > > static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
> > > > -static bool dpcls_lookup(struct dpcls *cls,
> > > > -                         const struct netdev_flow_key *keys[],
> > > > -                         struct dpcls_rule **rules, size_t cnt,
> > > > -                         int *num_lookups_p);
> > > >
> > > > /* Set of supported meter flags */
> > > > #define DP_SUPPORTED_METER_FLAGS_MASK \
> > > > @@ -485,7 +481,7 @@ static void dp_netdev_execute_actions(struct dp_netdev_pmd_thread *pmd,
> > > >                                       const struct flow *flow,
> > > >                                       const struct nlattr *actions,
> > > >                                       size_t actions_len);
> > > > -static void dp_netdev_input(struct dp_netdev_pmd_thread *,
> > > > +static int32_t dp_netdev_input(struct dp_netdev_pmd_thread *,
> > > >                             struct dp_packet_batch *, odp_port_t port_no);
> > > > static void dp_netdev_recirculate(struct dp_netdev_pmd_thread *,
> > > >                                   struct dp_packet_batch *);
> > > > @@ -557,7 +553,7 @@ dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
> > > >                                bool purge);
> > > > static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
> > > >                                       struct tx_port *tx);
> > > > -static inline struct dpcls *
> > > > +inline struct dpcls *
> > > > dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> > > >                            odp_port_t in_port);
> > > >
> > > > @@ -1922,7 +1918,7 @@ void dp_netdev_flow_unref(struct dp_netdev_flow *flow)
> > > >     }
> > > > }
> > > >
> > > > -static inline struct dpcls *
> > > > +inline struct dpcls *
> > > > dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> > > >                            odp_port_t in_port)
> > > > {
> > > > @@ -2722,7 +2718,7 @@ dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
> > > >                           int *lookup_num_p)
> > > > {
> > > >     struct dpcls *cls;
> > > > -   struct dpcls_rule *rule;
> > > > +   struct dpcls_rule *rule = NULL;
> > > >     odp_port_t in_port = u32_to_odp(MINIFLOW_GET_U32(&key->mf,
> > > >                                                      in_port.odp_port));
> > > >     struct dp_netdev_flow *netdev_flow = NULL;
> > > > @@ -4236,7 +4232,10 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
> > > >     }
> > > >
> > > >     /* Process packet batch. */
> > > > -   pmd->netdev_input_func(pmd, &batch, port_no);
> > > > +   int32_t ret = pmd->netdev_input_func(pmd, &batch, port_no);
> > > > +   if (ret) {
> > > > +       dp_netdev_input(pmd, &batch, port_no);
> > > > +   }
> > > >
> > > >     /* Assign processing cycles to rx queue. */
> > > >     cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
> > > > @@ -5254,6 +5253,8 @@ dpif_netdev_run(struct dpif *dpif)
> > > >             non_pmd->ctx.emc_insert_min = 0;
> > > >         }
> > > >
> > > > +       non_pmd->ctx.smc_enable_db = dp->smc_enable_db;
> > > > +
> > > >         for (i = 0; i < port->n_rxq; i++) {
> > > >
> > > >             if (!netdev_rxq_enabled(port->rxqs[i].rx)) {
> > > > @@ -5525,6 +5526,8 @@ reload:
> > > >             pmd->ctx.emc_insert_min = 0;
> > > >         }
> > > >
> > > > +       pmd->ctx.smc_enable_db = pmd->dp->smc_enable_db;
> > > > +
> > > >         process_packets = dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
> > > >                                                      poll_list[i].port_no);
> > > > @@ -6419,6 +6422,24 @@ packet_batch_per_flow_execute(struct packet_batch_per_flow *batch,
> > > >                               actions->actions, actions->size);
> > > > }
> > > >
> > > > +void
> > > > +dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd,
> > > > +                        struct dp_packet_batch *packets,
> > > > +                        struct dpcls_rule *rule,
> > > > +                        uint32_t bytes,
> > > > +                        uint16_t tcp_flags)
> > > > +{
> > > > +    /* Gets action* from the rule. */
> > > > +    struct dp_netdev_flow *flow = dp_netdev_flow_cast(rule);
> > > > +    struct dp_netdev_actions *actions = dp_netdev_flow_get_actions(flow);
> > > > +
> > > > +    dp_netdev_flow_used(flow, dp_packet_batch_size(packets), bytes,
> > > > +                        tcp_flags, pmd->ctx.now / 1000);
> > > > +    const uint32_t steal = 1;
> > > > +    dp_netdev_execute_actions(pmd, packets, steal, &flow->flow,
> > > > +                              actions->actions, actions->size);
> > > > +}
> > > > +
> > > > static inline void
> > > > dp_netdev_queue_batches(struct dp_packet *pkt,
> > > >                         struct dp_netdev_flow *flow, uint16_t tcp_flags,
> > > > @@ -6523,6 +6544,30 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
> > > >     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT,
> > > >                             n_smc_hit);
> > > > }
> > > >
> > > > +struct dp_netdev_flow *
> > > > +smc_lookup_single(struct dp_netdev_pmd_thread *pmd,
> > > > +                  struct dp_packet *packet,
> > > > +                  struct netdev_flow_key *key)
> > > > +{
> > > > +    const struct cmap_node *flow_node = smc_entry_get(pmd, key->hash);
> > > > +
> > > > +    if (OVS_LIKELY(flow_node != NULL)) {
> > > > +        struct dp_netdev_flow *flow = NULL;
> > > > +
> > > > +        CMAP_NODE_FOR_EACH (flow, node, flow_node) {
> > > > +            /* Since we don't have per-port megaflow to check the port
> > > > +             * number, we need to verify that the input ports match. */
> > > > +            if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, key) &&
> > > > +                flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) {
> > > > +
> > > > +                return (void *) flow;
> > > > +            }
> > > > +        }
> > > > +    }
> > > > +
> > > > +    return NULL;
> > > > +}
> > > > +
> > > > /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
> > > >  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
> > > >  * miniflow is copied into 'keys' and the packet pointer is moved at the
> > > > @@ -6928,12 +6973,13 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
> > > >     }
> > > > }
> > > >
> > > > -static void
> > > > +static int32_t
> > > > dp_netdev_input(struct dp_netdev_pmd_thread *pmd,
> > > >                 struct dp_packet_batch *packets,
> > > >                 odp_port_t port_no)
> > > > {
> > > >     dp_netdev_input__(pmd, packets, false, port_no);
> > > > +   return 0;
> > >
> > > Always returning 0, I think this would change at a later stage in the
> > > patch series, no?
> >
> > This won't change later in the patch series. The "return 0;" is added so
> > that the scalar DPIF API matches the newly introduced API for DPIF
> > functions defined by "dp_netdev_input_func" in
> > "lib/dpif-netdev-private-dpif.h". Other DPIF implementations might return
> > a nonzero value to indicate that some packets were not handled, e.g.
> > because of misses in that DPIF implementation. "dp_netdev_input" must
> > handle all the cases that other implementations might not, so it always
> > returns 0.
> >
> > > BR
> > > Ian
> >
> > Thanks again for the review Ian.
>
> No problem, looking forward to the next revision.
>
> Regards
> Ian
>
> > > > }
> > > >
> > > > static void
> > > > @@ -8374,7 +8420,7 @@ netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl,
> > > >
> > > > /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
> > > >  * in 'mask' the values in 'key' and 'target' are the same. */
> > > > -bool
> > > > +inline bool ALWAYS_INLINE
> > > > dpcls_rule_matches_key(const struct dpcls_rule *rule,
> > > >                        const struct netdev_flow_key *target)
> > > > {
> > > > @@ -8400,7 +8446,7 @@ dpcls_rule_matches_key(const struct dpcls_rule *rule,
> > > >  * priorities, instead returning any rule which matches the flow.
> > > >  *
> > > >  * Returns true if all miniflows found a corresponding rule. */
> > > > -static bool
> > > > +bool
> > > > dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
> > > >              struct dpcls_rule **rules, const size_t cnt,
> > > >              int *num_lookups_p)
> > > > --
> > > > 2.31.1
> > > >
> > > > _______________________________________________
> > > > dev mailing list
> > > > [email protected]
> > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
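[Editor's note] The return-value convention discussed in the thread (0 means "batch fully handled", nonzero means "fall back to the scalar DPIF") can be sketched in miniature as below. This is not the actual OVS code; all struct and function names here are illustrative stand-ins for `netdev_input_func` and `dp_netdev_input`.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch of the pluggable-DPIF convention: an input function
 * returns 0 when it fully handled the batch, and nonzero to ask the
 * caller to fall back to the scalar path, which must always succeed. */

struct pkt_batch {
    int count;      /* packets in the batch */
    int handled;    /* packets actually processed */
};

typedef int32_t (*input_fn)(struct pkt_batch *batch);

/* Scalar path: handles every case, so it always returns 0,
 * just like dp_netdev_input() in the patch. */
static int32_t input_scalar(struct pkt_batch *batch)
{
    batch->handled = batch->count;
    return 0;
}

/* Hypothetical optimized path: pretend it only supports batches of up
 * to 32 packets and signals a miss otherwise. */
static int32_t input_optimized(struct pkt_batch *batch)
{
    if (batch->count > 32) {
        return 1;   /* "Not handled, please fall back." */
    }
    batch->handled = batch->count;
    return 0;
}

/* Mirrors the hunk in dp_netdev_process_rxq_port():
 *     int32_t ret = pmd->netdev_input_func(pmd, &batch, port_no);
 *     if (ret) { dp_netdev_input(pmd, &batch, port_no); }
 */
static void process_batch(input_fn fn, struct pkt_batch *batch)
{
    int32_t ret = fn(batch);
    if (ret) {
        input_scalar(batch);
    }
}
```

Because the scalar path is the fallback of last resort, its unconditional `return 0;` is what lets the dispatcher above terminate after at most one retry.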
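[Editor's note] Earlier in the thread, the `-mbmi` flag is explained by the use of `_blsr_u64` to reset (set to 0) the lowest set bit of a variable. A hedged sketch of that operation follows, written with the portable expression `x & (x - 1)` so it compiles without BMI support; with `-mbmi`, the `_blsr_u64()` intrinsic from `<immintrin.h>` performs the same operation in a single BLSR instruction. The iteration pattern shown is only an illustration of how such bit loops are typically used, not the patch's actual loop.

```c
#include <stdint.h>

/* Portable equivalent of _blsr_u64(x): clear the lowest set bit.
 * Subtracting 1 flips the lowest set bit and all bits below it;
 * ANDing with the original keeps everything above untouched. */
static uint64_t reset_lowest_set_bit(uint64_t x)
{
    return x & (x - 1);
}

/* Walks a bit-mask one set bit per iteration, the usual way a
 * hit-mask of packets is consumed. */
static int count_set_bits(uint64_t mask)
{
    int n = 0;
    while (mask) {
        n++;
        mask = reset_lowest_set_bit(mask);
    }
    return n;
}
```

For example, `reset_lowest_set_bit(0xB)` turns binary `1011` into `1010`: only the lowest set bit is cleared.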
