Re: [PATCH] net/crc: add 4x folding loop for x86 SSE implementation

Shreesh Adiga Thu, 11 Jun 2026 20:02:28 -0700

On Thu, Jun 11, 2026 at 10:36 PM Stephen Hemminger <
[email protected]> wrote:


> On Tue,  9 Jun 2026 13:27:12 +0530
> Shreesh Adiga <[email protected]> wrote:
>
> > Add a 64-byte loop that maintains 4 fold registers and processes
> > 64 bytes at a time. The 4 fold registers are then reduced to a single
> > 16-byte fold, similar to the AVX512 implementation. This technique is
> > described in the paper by Intel:
> > "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
> > Instruction"
> >
> > This results in a roughly 50% performance improvement due to better ILP
> > for large input sizes such as 1024 bytes.
> >
> > Signed-off-by: Shreesh Adiga <[email protected]>
> > ---
>
> Looks good, applied to next-net.
>
> A couple of nits from a more detailed AI review that you might still want
> to look at:
>
> The current crc_autotest does not exercise the new 64-byte CRC16 path.
> Its CRC32 vectors are 1512 and 348 bytes, so the CRC32 4x loop is
> covered — but the largest CRC16 vector is 32 bytes, all three CRC16
> tests being ≤32. So the new CRC16 rk1_rk2 (64-byte fold) constants ship
> untested in CI. My exhaustive test confirms they're correct, but a
> future regression there wouldn't be caught. Suggest adding a CRC16
> vector ≥64 bytes, ideally a non-multiple of 64 (e.g. 80 or 100) so it
> hits the 4x loop, the single-fold tail, and the partial-bytes path
> together.
>
> In partial_bytes the comment /* k = rk1 & rk2 */ is now stale:
> after the patch, k holds rk3_rk4 on every path reaching it.
> The comment was not introduced by this patch, but the patch is what
> made it wrong; worth fixing in passing.
>
I've submitted a couple of follow-up patches that should address the above:
https://patches.dpdk.org/project/dpdk/patch/[email protected]/
https://patches.dpdk.org/project/dpdk/patch/[email protected]/
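
As an illustration of the vector-length suggestion above, here is a minimal
sketch in C. It is not the code from the patch or from the follow-up patches.
The names ref_crc16, len_split, split_len and self_check are hypothetical, the
CRC parameters (reflected polynomial 0x8408, initial value 0xFFFF, final
inversion) are an assumption that should be checked against the scalar
reference before use, and the length split is a simplified model that may not
match the exact loop boundaries of the SSE implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time reflected CRC-16. ASSUMPTION: polynomial 0x1021 (0x8408 when
 * reflected), initial value 0xFFFF, final inversion. Useful only as a slow
 * reference for computing the expected value of a new test vector. */
static uint16_t
ref_crc16(const uint8_t *data, size_t len)
{
	uint16_t crc = 0xFFFF;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++) {
			if (crc & 1)
				crc = (uint16_t)((crc >> 1) ^ 0x8408);
			else
				crc = (uint16_t)(crc >> 1);
		}
	}
	return (uint16_t)~crc;
}

/* Simplified model of how a buffer length could be consumed: whole 64-byte
 * blocks (4x loop), then whole 16-byte blocks (single fold), then leftover
 * bytes. HYPOTHETICAL: the real code may draw the boundaries differently. */
struct len_split {
	size_t blocks64;
	size_t folds16;
	size_t tail;
};

static struct len_split
split_len(size_t len)
{
	struct len_split s;

	s.blocks64 = len / 64;
	s.folds16 = (len % 64) / 16;
	s.tail = len % 16;
	return s;
}

/* Small self-check showing how the two helpers might be used together. */
static void
self_check(void)
{
	struct len_split s = split_len(100);

	/* In this model a 100-byte vector touches all three stages. */
	assert(s.blocks64 >= 1 && s.folds16 >= 1 && s.tail >= 1);
	/* The reference CRC must be deterministic for the same input. */
	assert(ref_crc16((const uint8_t *)"abc", 3) ==
	       ref_crc16((const uint8_t *)"abc", 3));
}
```

Under this simplified model a 100-byte vector gives one 64-byte block, two
16-byte folds and a 4-byte tail, whereas an 80-byte vector leaves no tail, so
a length that is not a multiple of 16 may be the safer choice if the
partial-bytes path is to be covered as well.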
