On 14 Jul 2022, at 16:18, Van Haaren, Harry wrote:

>> -----Original Message-----
>> From: Eelco Chaudron <[email protected]>
>> Sent: Thursday, July 14, 2022 2:25 PM
>> To: Van Haaren, Harry <[email protected]>
>> Cc: [email protected]; [email protected]; Amber, Kumar
>> <[email protected]>; Pai G, Sunil <[email protected]>; Finn, Emma
>> <[email protected]>; Stokes, Ian <[email protected]>
>> Subject: Re: [PATCH v10 10/10] odp-execute: Add ISA implementation of 
>> set_masked
>> IPv4 action
>>
>>> From: Emma Finn <[email protected]>
>>>
>>> This commit adds support for the AVX512 implementation of the
>>> ipv4_set_addrs action as well as an AVX512 implementation of
>>> updating the checksums.
>
> <snip>
>
>>> +        /* Update the IP checksum based on updated IP values. */
>>> +        uint16_t delta = avx512_ipv4_update_csum(v_res, v_packet);
>>> +        uint32_t new_csum = old_csum + delta;
>>> +        delta = csum_finish(new_csum);
>>> +
>>> +        /* Insert new checksum. */
>>> +        v_res = _mm256_insert_epi16(v_res, delta, 5);
>>> +
>>> +        /* If ip_src or ip_dst has been modified, L4 checksum needs to
>>> +         * be updated too. */
>>> +        if (mask->ipv4_src || mask->ipv4_dst) {
>>> +
>>> +            uint16_t delta_checksum = avx512_l4_update_csum(v_packet, 
>>> v_res);
>>> +
>>
>> Wondering if all this AVX code being executed really is faster than 
>> recalc_csum32(uh-
>>> udp_csum, old_addr, new_addr)?
>
> Ultimately, measuring is worth more than talking about it. In our 
> measurements here,
> yes absolutely it is, our measurements are available in the cover letter of 
> the patchset.

I was not referring to the entire AVX implementation, but only the checksum 
update for the L4 portion.

> Note that the code here is compute-bound, its juggling values between 
> registers, and
> with XMM/YMM registers, SIMD IPC of 3 can be achieved. That means that in 
> theory,
> the SIMD code executes ~3 intrinsics *per cycle*, but in practice the IPC is 
> often *more*
> due to interleaved scalar code, and Out-of-Order execution capabilities of 
> the CPU.
>
> Although the code is verbose (lots of typing) the resulting instruction 
> stream is generally
> optimized very well by the compiler, and reduced to very small, dense and hot 
> loops.

So we might be fine here with the AVX overhead, was just curious here if we 
could further speed up.

> I recommend using "perf top" to investigate the hotspots, for those unaware 
> of tools
> and methods, a DPDK Userspace presentation covers exactly this using OVS 
> DPCLS as
> the examples code! https://youtu.be/ZmwOKR5JyPk

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to