Hey All,

[OT: Apologies for a missing indent; some HTML mixup occurred somewhere, now plain-text email again.]

>From: Federico Iezzi <[email protected]>
>Sent: Wednesday, May 20, 2020 5:13 PM
>To: William Tu <[email protected]>
>Cc: Van Haaren, Harry <[email protected]>; [email protected];
>[email protected]
>Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
>implementation
>
>On Wed, 20 May 2020 at 15:32, William Tu <[email protected]> wrote:
>On Wed, May 20, 2020 at 3:35 AM Federico Iezzi <[email protected]> wrote:
>> On Wed, 20 May 2020 at 12:20, Van Haaren, Harry <[email protected]>
>> wrote:
>>>
>>> > -----Original Message-----
>>> > From: William Tu <[email protected]>
>>> > Sent: Wednesday, May 20, 2020 1:12 AM
>>> > To: Van Haaren, Harry <[email protected]>
>>> > Cc: [email protected]; [email protected]
>>> > Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
>>> > implementation
>>> >
>>> > On Mon, May 18, 2020 at 9:12 AM Van Haaren, Harry
>>> > <[email protected]> wrote:
>>> > >
>>> > > > -----Original Message-----
>>> > > > From: William Tu <[email protected]>
>>> > > > Sent: Monday, May 18, 2020 3:58 PM
>>> > > > To: Van Haaren, Harry <[email protected]>
>>> > > > Cc: [email protected]; [email protected]
>>> > > > Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
>>> > > > implementation
>>> > > >
>>> > > > On Wed, May 06, 2020 at 02:06:09PM +0100, Harry van Haaren wrote:
>>> > > > > This commit adds an AVX-512 dpcls lookup implementation.
>>> > > > > It uses the AVX-512 SIMD ISA to perform multiple miniflow
>>> > > > > operations in parallel.
>>>
>>> <snip lots of code/patch contents for readability>
>>>
>>> > Hi Harry,
>>> >
>>> > I managed to find a machine with avx512 in google cloud and did some
>>> > performance testing. I saw lower performance when enabling avx512,
>>
>>
>> AVX512 instruction path lowers the clock speed well below the base frequency
>> [1].
>> Aren't you killing the PMD performance while improving the lookup ones?
>>
>> [1]
>> https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/2nd-gen-xeon-scalable-spec-update.pdf
>> (see page 20)

Thanks for raising your question; likely there are others with similar
questions. It will be good to discuss it here and to present the logic and
design taken in these OVS patches for enabling AVX512.

From a frequency perspective, there is a misconception that AVX512 will always
cause the worst-case degradation. In fact, the frequency differs based on which
instructions are executing. This does make it more complicated, however there
are rules here, and those rules provide us SW developers with best practices.
I've added my colleague Edwin on CC, who is much more familiar with the AVX512
frequency topic and can provide more detail.

From an OVS software developer perspective, these were the design decisions
that made AVX512 enabling work:

AVX512 provides a very powerful compute ISA, so to optimize with it we must
use that compute efficiently. This patchset achieves "flattening" of a packet
miniflow data structure, based on the miniflow of the subtable to match on. In
short, it implements in SIMD the tuple-space search required for DPCLS
wildcarded lookup. The instruction count reduction is large, and that is what
ultimately leads to the performance improvements.

Given a DPCLS implementation with AVX512, we must consider the other work done
on that thread; you correctly point out that other work (e.g. DPDK PMDs) also
executes on that core. My experience has been that performance goes up,
including DPDK PMD rx and tx: the overall rate of work done increases. Given
OVS can spend significant amounts of time in DPCLS itself, even with a
potential slowdown of the PMD code the net result is very likely still a
performance improvement.

Finally, the design itself is very flexible: each deployment of OVS can test
if and how much the AVX512 code path improves real-world performance, and
enable it based on that.
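
To make the "flattening" idea above a bit more concrete, here is a minimal
scalar sketch in C. It is not the code from the patchset: the struct mini
type and the flatten() and rule_matches() helpers are hypothetical,
simplified stand-ins, and the layout (a bitmap of present 64-bit blocks plus
a packed value array) is an assumption made for illustration. The loop walks
the blocks the subtable mask cares about one at a time; a SIMD version would
aim to handle several such blocks per instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified miniflow: bit i of 'map' is set when 64-bit
 * block i is present; present blocks are stored packed in 'values' in
 * ascending bit order. */
struct mini {
    uint64_t map;
    const uint64_t *values;
};

/* Portable population count (clears one set bit per iteration). */
static unsigned popcount64(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Flatten 'pkt' according to the blocks selected by the subtable mask.
 * Writes one masked block per bit set in mask->map into 'out' and returns
 * the number of blocks written. Blocks absent from the packet read as 0. */
static size_t flatten(const struct mini *pkt, const struct mini *mask,
                      uint64_t *out, size_t out_len)
{
    size_t n = 0;
    uint64_t bits = mask->map;

    assert(popcount64(bits) <= out_len);
    while (bits) {
        uint64_t lowest = bits & (~bits + 1);   /* isolate lowest set bit */
        uint64_t blk = 0;

        if (pkt->map & lowest) {
            /* Index into the packed array is the number of present
             * blocks below this one. */
            blk = pkt->values[popcount64(pkt->map & (lowest - 1))];
        }
        out[n] = blk & mask->values[n];
        n++;
        bits &= bits - 1;                       /* clear lowest set bit */
    }
    return n;
}

/* A rule matches when every flattened block equals the rule's stored,
 * already-masked key block. */
static int rule_matches(const uint64_t *flat, const uint64_t *key, size_t n)
{
    uint64_t diff = 0;

    for (size_t i = 0; i < n; i++) {
        diff |= flat[i] ^ key[i];
    }
    return diff == 0;
}
```

A caller would run flatten() once per packet and subtable, then compare the
result against candidate rules with rule_matches(). The per-block popcount
and indexed load is the part that is branchy and serial in scalar code, which
is where a reduction in instruction count would come from.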

>Thanks for sharing the link.
>Does that mean if OVS PMD uses avx512 on one core, then all the other cores's
>frequency will be lower?
>
>Only where avx512 instructions are executed the clock is reduced to cope with
>the thermals
>I'm not sure if there is a situation where avx512 code is executed only on
>specific PMDs, if that happens is bad as some may PMD be faster/slower (see
>below)
>Kinda like when dynamic turbo boost is enabled and some pmd go faster because
>of the higher clock
>
>
>There are some discussion here:
>https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/
>
>Wow, quite interesting. Thanks!
>
>
>My take is that overall down clocking will happen, but application
>will get better performance.
>
>Indeed the part of the code wrote for avx512 goes much faster, the rest, stay
>on the normal path and will go slow due to the reduced clock.
>Those are different use-cases and programs but see Cannon Lake Anandtech
>review regarding what AVX512 can deliver
>
>###
>When we crank on the AVX2 and AVX512, there is no stopping the Cannon Lake
>chip here. At a score of 4519, it beats a full 18-core Core i9-7980XE
>processor running in non-AVX.
>https://www.anandtech.com/show/13405/intel-10nm-cannon-lake-and-core-i3-8121u-deep-dive-review/9
>###
>
>Indeed you have to expect much-improved performance from it, the question is
>how much non-avx512 code will slow down
>See also this one ->
>https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

There's a lot of information out there, some of it very detailed, and it is
worth reading. Ultimately, though, it is very unlikely that somebody has
tested your exact configuration or deployment, particularly since this OVS
patchset has only been on the mailing list for the past few weeks.

I welcome $ perf top output like that in William's email, showing the CPU
percentages spent in DPCLS; the more real-world data, the better for showing
the value of AVX512 in DPCLS.
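
On the question of how much the non-AVX512 code slowing down matters, a
back-of-the-envelope model may help frame the measurements. The sketch below
is only that: net_speedup() is a hypothetical helper, and its three inputs
(the share of cycles spent in DPCLS, the cycle reduction inside DPCLS, and
the ratio of reduced to original clock) are placeholders to be filled in from
your own perf data, not numbers from this thread.

```c
#include <assert.h>

/* Relative throughput (after / before) of one PMD thread.
 *
 *   dpcls_share   - fraction of cycles spent in DPCLS before the change,
 *                   e.g. read from perf top (0.0 .. 1.0)
 *   dpcls_speedup - factor by which DPCLS cycles shrink (2.0 = half)
 *   freq_ratio    - new core frequency divided by old (1.0 = no change)
 *
 * All three are assumptions supplied by the caller. */
static double net_speedup(double dpcls_share, double dpcls_speedup,
                          double freq_ratio)
{
    assert(dpcls_share >= 0.0 && dpcls_share <= 1.0);
    assert(dpcls_speedup > 0.0 && freq_ratio > 0.0);

    /* Cycles per packet after the change, relative to 1.0 before. */
    double cycles = (1.0 - dpcls_share) + dpcls_share / dpcls_speedup;

    /* Time is cycles divided by frequency; throughput is its inverse. */
    return freq_ratio / cycles;
}
```

A result above 1.0 means the thread does more work overall despite the lower
clock; the break-even point is where freq_ratio equals the relative cycle
count. The model ignores memory stalls and any effect on other cores, so
treat it as a way to sanity-check measurements rather than to replace them.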

Regards,
-Harry
