Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

John Naylor Mon, 09 Dec 2024 03:40:05 -0800

On Sat, Dec 7, 2024 at 10:16 PM Devulapalli, Raghuveer
<[email protected]> wrote:


> > There is another technique that computes CRC on 3 separate chunks and
> > combines them at the end, so about 3x faster on large-enough chunks.
> > That's the way used for the Arm proposal [0], coincidentally also citing a 
> > white
> > paper from Intel, but as Dimitry pointed out in that thread, its link has 
> > apparently
> > disappeared. Raghuveer, do you know about this, and is there another link
> > available?
> >
> > http://www.intel.com/content/dam/www/public/us/en/documents/white-
> > papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf
>
>  I am not aware of this paper. Let me poke a few people internally and get 
> back to you on this.

Thanks! I have a portable PoC of how this works, but I'll save that
for another thread, since it's not Intel (or Arm) specific.

> > The cut off point in one implementation is only 144 bytes [1] , which is 
> > maybe not
> > as small as we'd like, but is quite a bit smaller than 256. That seems 
> > better suited
> > to our workloads, and more portable. I have a *brand-new* laptop with an 
> > Intel
> > chip, and IIUC it doesn't support AVX-512 because it uses a big-little 
> > architecture.
> > I also understand that Sierra Forrest (a server product line) will be all 
> > little cores
> > with no AVX-512 support, so I'm not sure why the proposal here requires AVX-
> > 512.
>
> AVX-512 is present all of Intel main P-core based Xeon and AMD's Zen4 and 
> Zen5. Sierra Forest contains the SSE and AVX/AVX2 family ISA but AFAIK 
> AVX/AVX2 does not contain any CRC32C specific instructions. See:

CRC32C was added in SSE 4.2, so it's quite old. The AVX-512 intrinsics
used in the patch are not CRC-specific, if I understand correctly.

My point was, it seems Intel still considers AVX-512 as optional, so
we can't count on it being present even in future chips. That's why
I'm interested in alternatives, at least as a first step. If we can
get 3x throughput, the calculation might bend up low enough in the
profile that going to 6x might not be noticeable (not sure).

> > > There a very frequent call computing COMP_CRC32C over just 20 bytes,
> > > while holding a crucial lock.  If we were to do introduce something
> > > like this
> > > AVX-512 algorithm, it'd probably be worth to dispatch differently in
> > > case of compile-time known small lengths.
> >
> > I know you've read an earlier version of the patch and realized that it 
> > wouldn't
> > help here, but we could probably dispatch differently regardless, although 
> > it may
> > only be worth it if we can inline the instructions. Since we technically 
> > only need to
> > wait for xl_prev, I believe we could push the computation of the other 12 
> > bytes to
> > before acquiring the lock, then only execute a single instruction on 
> > xl_prev to
> > complete the CRC computation. Is there any reason why we couldn't do that,
> > assuming we have a clean way to make that portable? That would mean that the
> > CRCs between major versions would be different, but I think we don't 
> > guarantee
> > that anyway.
>
> Not sure about that. This is not my expertise and I might need a little time 
> to figure this out. Unfortunately, I am on travel with limited internet 
> connection for the next 6 weeks. I will only be able to address this when I 
> get back. Is this a blocker for the patch or is this something we can address 
> as a revision?

This is orthogonal and is not related to the patch, since it doesn't
affect 8 and 20-byte paths, only 256 and greater.

--
John Naylor
Amazon Web Services

Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

Reply via email to