Hi Ilya, Thanks a lot for such a wonderful review of the series. Replies inline.
> -----Original Message----- > From: Ilya Maximets <[email protected]> > Sent: Tuesday, April 12, 2022 5:05 PM > To: Amber, Kumar <[email protected]>; [email protected] > Cc: [email protected]; [email protected]; [email protected]; > Ferriter, Cian <[email protected]>; Stokes, Ian <[email protected]>; > [email protected]; Van Haaren, Harry <[email protected]> > Subject: Re: [PATCH v9 0/4] IPv4 Hashing AVX512 Optimizations > > On 4/6/22 16:38, Kumar Amber wrote: > > Hashing Optimization are also included which can further improve > > performance by approximately 10%. > > Hi, Amber, others. That's an interesting patch set. > I got my hands on an AVX512-capable system for a few hours, so I was able > to play with it. A few observations: > > 1. Patch set does improve the performance in my VM-OVS-VM > setup by ~3.5 %. I suppose, your 10% was achieved for > a PHY-OVS-PHY setup, right? > Yes correct. > 2. The hash function itself seems very generic, so it's not > necessary for it to be strictly part of the avx512 code. > I have done that in next series. > 3. Patch #3 doesn't seem to give any performance boost. > At least in my testing, but it creates an assumption that > optimized miniflow_extract always calculates the hash, > which is not great from the generalization point of view > and may harm future attempts to decouple parsing and > datapath implementations. > Removed. > 4. Patches #2 and #4 are just for testing purposes. > Will send the script changes separately. > 5. It appears to be that any function call to a function that > is not part of the avx512 code is extremely expensive. > Mostly because compiler injects VZEROUPPER instructions > just before them all the time. And these are very heavy > if called per-packet. AFAIU, that is the cost of switching > between avx and non-avx instructions. > Did you notice the same? > I can look at it and confirm but your observation is correct. > 6. Reading entries from the miniflow is heavier than reading > from the packet itself for some reason. I didn't dig > very deep here. > Yes we noticed the same and also causes load before stores issues When loading from mfex thus we decided to do it based on packets. > 7. All switch branches in mfex_avx512_process() has exactly > the same code. I suppose, that's a leftover from the > IPv6 series. > Not exactly if you notice each branch has a different offset for tcp flags. > 8. In-packet offsets are magic numbers, which is not great. > I think, they can be easily defined as a sum of structure > offsets, without any runtime penalties. > Thanks a lot for improving the patch, we didn’t think like in a generic way. > 9. 'hash_len' is not really a hash length, it's a length of > the miniflow. Calculation of the 'key->len' vs direct > set doesn't seem to make any difference performance-wise. > It would be 1 or 2 cycles only . > > What I tried: > > Since the hash function itself is independent from the avx512 code, I > assumed that the same optimization can be applied to > a generic code path. So I attempted to generalize the function, > move it to the common header and use inside the generic miniflow_extract(). > See the quick'n'dirty patch I used for > ipv4 udp traffic below. That gave a great result: the generic code became > faster by 5-9% in my VM-OVS-VM setup! And I'm guessing that PHY-OVS-PHY > should be even better. > > The next thing I tried is to replace the mfex_5tuple_hash_ipv4() call with the > generalized function, also removing key->hash and key->len assignments > from it and surrounding code, because I also reverted patch #3. At this point > I didn't notice any significant performance difference for the avx512 > implementation in comparison with the original patch set. If there is some > difference, the generalized function can likely be tweaked a bit. > > Conclusions: > > Same optimization can be applied efficiently regardless of the > datapath/miniflow_extract implementation by using the same common code > without any extra sacrifices, but significantly improving the maintainability > of > the solution. > > Maintainability can be achieved by using actual parsed packet fields and > header offsets without any magic numbers. There is no need for > autovalidation if the hash function is the same in all cases. There is no > need > to use custom fuzzing, if we can add checks to existing fuzzers for oss-fuzz > and test-flow. > > What do you think? > > Below is the code snippet from my experiments (supposed to be applied on > top of the first patch of the set, only ipv4/udp case modified): > I have sent a v10 with the suggested changes . http://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/ > Best regards, Ilya Maximets. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
