Hi Ilya,

Thanks a lot for such a wonderful review of the series.
Replies inline.

> -----Original Message-----
> From: Ilya Maximets <[email protected]>
> Sent: Tuesday, April 12, 2022 5:05 PM
> To: Amber, Kumar <[email protected]>; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> Ferriter, Cian <[email protected]>; Stokes, Ian <[email protected]>;
> [email protected]; Van Haaren, Harry <[email protected]>
> Subject: Re: [PATCH v9 0/4] IPv4 Hashing AVX512 Optimizations
> 
> On 4/6/22 16:38, Kumar Amber wrote:
> > Hashing Optimization are also included which can further improve
> > performance by approximately 10%.
> 
> Hi, Amber, others.  That's an interesting patch set.
> I got my hands on an AVX512-capable system for a few hours, so I was able
> to play with it.  A few observations:
> 
> 1. Patch set does improve the performance in my VM-OVS-VM
>    setup by ~3.5 %.  I suppose, your 10% was achieved for
>    a PHY-OVS-PHY setup, right?
> 

Yes correct.

> 2. The hash function itself seems very generic, so it's not
>    necessary for it to be strictly part of the avx512 code.
>
I have done that in next series.
 
> 3. Patch #3 doesn't seem to give any performance boost.
>    At least in my testing, but it creates an assumption that
>    optimized miniflow_extract always calculates the hash,
>    which is not great from the generalization point of view
>    and may harm future attempts to decouple parsing and
>    datapath implementations.
> 

Removed.

> 4. Patches #2 and #4 are just for testing purposes.
>

Will send the script changes separately.
 
> 5. It appears to be that any function call to a function that
>    is not part of the avx512 code is extremely expensive.
>    Mostly because compiler injects VZEROUPPER instructions
>    just before them all the time.  And these are very heavy
>    if called per-packet.  AFAIU, that is the cost of switching
>    between avx and non-avx instructions.
>    Did you notice the same?
> 

I can look at it and confirm but your observation is correct.

> 6. Reading entries from the miniflow is heavier than reading
>    from the packet itself for some reason.  I didn't dig
>    very deep here.
>

Yes we noticed the same and also causes load before stores issues
When loading from mfex thus we decided to do it based on packets.
 
> 7. All switch branches in mfex_avx512_process() has exactly
>    the same code.  I suppose, that's a leftover from the
>    IPv6 series.
> 
Not exactly if you notice each branch has a different offset for tcp flags.

> 8. In-packet offsets are magic numbers, which is not great.
>    I think, they can be easily defined as a sum of structure
>    offsets, without any runtime penalties.
> 
Thanks a lot for improving the patch, we didn’t think like in a generic way.

> 9. 'hash_len' is not really a hash length, it's a length of
>     the miniflow.  Calculation of the 'key->len' vs direct
>     set doesn't seem to make any difference performance-wise.
>
It would be 1 or 2 cycles only .
 
> 
> What I tried:
> 
> Since the hash function itself is independent from the avx512 code, I
> assumed that the same optimization can be applied to
> a generic code path.   So I attempted to generalize the function,
> move it to the common header and use inside the generic miniflow_extract().
> See the quick'n'dirty patch I used for
> ipv4 udp traffic below.  That gave a great result: the generic code became
> faster by 5-9% in my VM-OVS-VM setup!  And I'm guessing that PHY-OVS-PHY
> should be even better.
> 
> The next thing I tried is to replace the mfex_5tuple_hash_ipv4() call with the
> generalized function, also removing key->hash and key->len assignments
> from it and surrounding code, because I also reverted patch #3.  At this point
> I didn't notice any significant performance difference for the avx512
> implementation in comparison with the original patch set.  If there is some
> difference, the generalized function can likely be tweaked a bit.
> 
> Conclusions:
> 
> Same optimization can be applied efficiently regardless of the
> datapath/miniflow_extract implementation by using the same common code
> without any extra sacrifices, but significantly improving the maintainability 
> of
> the solution.
> 
> Maintainability can be achieved by using actual parsed packet fields and
> header offsets without any magic numbers.  There is no need for
> autovalidation if the hash function is the same in all cases.  There is no 
> need
> to use custom fuzzing, if we can add checks to existing fuzzers for oss-fuzz
> and test-flow.
> 
> What do you think?
> 
> Below is the code snippet from my experiments (supposed to be applied on
> top of the first patch of the set, only ipv4/udp case modified):
> 

I have sent a v10 with the suggested changes .

http://patchwork.ozlabs.org/project/openvswitch/patch/[email protected]/

> Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to