On 14 Jul 2022, at 16:18, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Eelco Chaudron <[email protected]>
>> Sent: Thursday, July 14, 2022 2:25 PM
>> To: Van Haaren, Harry <[email protected]>
>> Cc: [email protected]; [email protected]; Amber, Kumar
>> <[email protected]>; Pai G, Sunil <[email protected]>; Finn, Emma
>> <[email protected]>; Stokes, Ian <[email protected]>
>> Subject: Re: [PATCH v10 10/10] odp-execute: Add ISA implementation of
>> set_masked
>> IPv4 action
>>
>>> From: Emma Finn <[email protected]>
>>>
>>> This commit adds support for the AVX512 implementation of the
>>> ipv4_set_addrs action as well as an AVX512 implementation of
>>> updating the checksums.
>
> <snip>
>
>>> + /* Update the IP checksum based on updated IP values. */
>>> + uint16_t delta = avx512_ipv4_update_csum(v_res, v_packet);
>>> + uint32_t new_csum = old_csum + delta;
>>> + delta = csum_finish(new_csum);
>>> +
>>> + /* Insert new checksum. */
>>> + v_res = _mm256_insert_epi16(v_res, delta, 5);
>>> +
>>> + /* If ip_src or ip_dst has been modified, L4 checksum needs to
>>> + * be updated too. */
>>> + if (mask->ipv4_src || mask->ipv4_dst) {
>>> +
>>> + uint16_t delta_checksum = avx512_l4_update_csum(v_packet,
>>> v_res);
>>> +
>>
>> Wondering if all this AVX code being executed really is faster than
>> recalc_csum32(uh-
>>> udp_csum, old_addr, new_addr)?
>
> Ultimately, measuring is worth more than talking about it. In our
> measurements here,
> yes absolutely it is, our measurements are available in the cover letter of
> the patchset.
I was not referring to the entire AVX implementation, but only the checksum
update for the L4 portion.
> Note that the code here is compute-bound, its juggling values between
> registers, and
> with XMM/YMM registers, SIMD IPC of 3 can be achieved. That means that in
> theory,
> the SIMD code executes ~3 intrinsics *per cycle*, but in practice the IPC is
> often *more*
> due to interleaved scalar code, and Out-of-Order execution capabilities of
> the CPU.
>
> Although the code is verbose (lots of typing) the resulting instruction
> stream is generally
> optimized very well by the compiler, and reduced to very small, dense and hot
> loops.
So we might be fine here with the AVX overhead, was just curious here if we
could further speed up.
> I recommend using "perf top" to investigate the hotspots, for those unaware
> of tools
> and methods, a DPDK Userspace presentation covers exactly this using OVS
> DPCLS as
> the examples code! https://youtu.be/ZmwOKR5JyPk
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev