> -----Original Message----- > From: William Tu <[email protected]> > Sent: Tuesday, July 7, 2020 7:50 PM > To: Van Haaren, Harry <[email protected]> > Cc: ovs-dev <[email protected]>; Stokes, Ian <[email protected]>; > Ilya > Maximets <[email protected]>; Federico Iezzi <[email protected]> > Subject: Re: [PATCH v6 0/6] DPCLS Subtable ISA Optimization > > On Mon, Jul 6, 2020 at 4:57 AM Van Haaren, Harry > <[email protected]> wrote: > > > > > -----Original Message----- > > > From: William Tu <[email protected]> > > > Sent: Sunday, July 5, 2020 3:06 PM > > > To: Van Haaren, Harry <[email protected]> > > > Cc: ovs-dev <[email protected]>; Stokes, Ian <[email protected]>; > Ilya > > > Maximets <[email protected]>; Federico Iezzi <[email protected]> > > > Subject: Re: [PATCH v6 0/6] DPCLS Subtable ISA Optimization > > > > > > On Thu, Jul 2, 2020 at 10:42 AM Harry van Haaren > > > <[email protected]> wrote: > > > > > > > > v6 work done: > > > > - Fix as --64 unrecognized option > > > > - Fix build issues with avx512 library changes > > > > - Fix files left in build dir after distclean > > > > - Fix CPU arch dependant RTE_CPUFLAG_ usage > > > > > > > > Thanks William & Ian for review & help on v5. > > > > All known issues are fixed and working here. > > > > <snip cover letter details> > > > > > > Harry van Haaren (6): > > > > dpif-netdev: implement subtable lookup validation. > > > > dpif-netdev: add subtable lookup prio set command. > > > > dpif-netdev: add subtable-lookup-prio-get command. > > > > dpdk: enable cpu feature detection. > > > > dpif-lookup: add avx512 gather implementation. > Hi Harry, > > Another thing I noticed is that for the avx512 implementation, the > current series > only enables the three cases > DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5, 1) > DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 1) > DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4, 0)
These three are the "specialized" functions for particular miniflow fingerprints. By miniflow fingerprint I mean the (x, y) combination of bit counts, eg (5,1). By "specialized" I mean that a compile time, the compiler is going to generate a separate function for that fingerprint. Due to constant propagation, the compiler can use the (5,1) information at compile time to do loop unrolling, and reduce complexity of the generated code, resulting in improved performance. There is also an "_any" version of the AVX512 implementation. This is a blanket "cover all" version of the function, with loops in the implementation to handle various (x, y) fingerprints. Note that the current implementation of the AVX512 code has a limitation to (x + y) <= 8. This is purely an implementation limitation, and can be increased if desired. > I did the following approach to collect how the current OVS uses the > miniflow bitmap. > apply this patch > diff --git a/tests/ofproto-macros.at b/tests/ofproto-macros.at > index b2b17eed335a..c9513213270d 100644 > --- a/tests/ofproto-macros.at > +++ b/tests/ofproto-macros.at > @@ -312,7 +312,8 @@ m4_divert_pop([PREPARE_TESTS]) > # > # OVS_VSWITCHD_STOP(["/expected error/d"]) > m4_define([OVS_VSWITCHD_STOP], > - [AT_CHECK([check_logs $1]) > + [ovs-appctl dpctl/dump-flows -m; > + AT_CHECK([check_logs $1]) > OVS_APP_EXIT_AND_WAIT([ovs-vswitchd]) > OVS_APP_EXIT_AND_WAIT([ovsdb-server])]) > > Run the test > $ make check TESTSUITEFLAGS='-j4 -d' Oh cool! I didn't think of using the unit-tests as a source of common miniflow fingerprints. I like this - awesome. > Collect the miniflow_bits > $ grep -r miniflow tests/testsuite.dir/* | sed -n > 's/.*\(miniflow_bits.*\)/\1/p' > miniflow.txt > $ sort miniflow.txt | uniq > miniflow_bits(11,1) = 12 > miniflow_bits(4,0) = 4 > miniflow_bits(4,1) = 5 > miniflow_bits(4,2) = 6 > miniflow_bits(4,3) = 7 > miniflow_bits(5,0) = 5 > miniflow_bits(5,1) = 6 > miniflow_bits(5,10) = 15 > miniflow_bits(5,2) = 7 > miniflow_bits(5,3) = 8 > miniflow_bits(6,0) = 6 > miniflow_bits(8,1) = 9 > miniflow_bits(9,0) = 9 > miniflow_bits(9,4) = 13 > miniflow_bits(9,5) = 14 > > Does miniflow_bits(x, y) and (x + y) must be less than or equal to 8 here? For the AVX512 implementation today, (x + y) must be <= 8 correct. As above, we can increase that to cover a larger quantity of subtables. Based on your output (now annotated) above, 6 are > 8, while 9 tables are covered today. If we expand to <= 16 blocks, then all subtables in the unit tests would be covered. We will put such AVX512 modifications on the backlog for upstream in OVS 2.15. In addition, I think we should specialize all subtables in the unit tests, as they are common/tested/real subtable fingerprints. This can be done for both scalar and AVX512 specialized functions. Adding specialized functions is very low risk as there is no manual code optimization - we rely on the compiler to do the hard (optimizing) work :) > Thanks > William Nice idea - great miniflow fingerprint results! -Harry > > > > docs/dpdk/bridge: add datapath performance section. > > > > > > > > > > The v6 looks good to me. It would be great if there is another pair of > > > eyes on this series. > > > It's a little hard for me to understand the avx512 gather > > > implementation (patch5), > > > but the idea of using autovalidator and it passes our test cases give > > > me more confidence > > > that it is correct. > > > > > > Acked-by: William Tu <[email protected]> > > > > I'm happy to see the Autovalidator concept in practice working well, > > it is likely a development/testing model to re-use in future for similar > > types of work to ensure consistency and validity of SIMD code. > > > > Thanks for your multiple reviews of the series over the last months. > > -Harry _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
