On Sat, Dec 7, 2024 at 10:16 PM Devulapalli, Raghuveer <raghuveer.devulapa...@intel.com> wrote:
> > There is another technique that computes CRC on 3 separate chunks and > > combines them at the end, so about 3x faster on large-enough chunks. > > That's the way used for the Arm proposal [0], coincidentally also citing a > > white > > paper from Intel, but as Dimitry pointed out in that thread, its link has > > apparently > > disappeared. Raghuveer, do you know about this, and is there another link > > available? > > > > http://www.intel.com/content/dam/www/public/us/en/documents/white- > > papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf > > I am not aware of this paper. Let me poke a few people internally and get > back to you on this. Thanks! I have a portable PoC of how this works, but I'll save that for another thread, since it's not Intel (or Arm) specific. > > The cut off point in one implementation is only 144 bytes [1] , which is > > maybe not > > as small as we'd like, but is quite a bit smaller than 256. That seems > > better suited > > to our workloads, and more portable. I have a *brand-new* laptop with an > > Intel > > chip, and IIUC it doesn't support AVX-512 because it uses a big-little > > architecture. > > I also understand that Sierra Forrest (a server product line) will be all > > little cores > > with no AVX-512 support, so I'm not sure why the proposal here requires AVX- > > 512. > > AVX-512 is present all of Intel main P-core based Xeon and AMD's Zen4 and > Zen5. Sierra Forest contains the SSE and AVX/AVX2 family ISA but AFAIK > AVX/AVX2 does not contain any CRC32C specific instructions. See: CRC32C was added in SSE 4.2, so it's quite old. The AVX-512 intrinsics used in the patch are not CRC-specific, if I understand correctly. My point was, it seems Intel still considers AVX-512 as optional, so we can't count on it being present even in future chips. That's why I'm interested in alternatives, at least as a first step. If we can get 3x throughput, the calculation might bend up low enough in the profile that going to 6x might not be noticeable (not sure). > > > There a very frequent call computing COMP_CRC32C over just 20 bytes, > > > while holding a crucial lock. If we were to do introduce something > > > like this > > > AVX-512 algorithm, it'd probably be worth to dispatch differently in > > > case of compile-time known small lengths. > > > > I know you've read an earlier version of the patch and realized that it > > wouldn't > > help here, but we could probably dispatch differently regardless, although > > it may > > only be worth it if we can inline the instructions. Since we technically > > only need to > > wait for xl_prev, I believe we could push the computation of the other 12 > > bytes to > > before acquiring the lock, then only execute a single instruction on > > xl_prev to > > complete the CRC computation. Is there any reason why we couldn't do that, > > assuming we have a clean way to make that portable? That would mean that the > > CRCs between major versions would be different, but I think we don't > > guarantee > > that anyway. > > Not sure about that. This is not my expertise and I might need a little time > to figure this out. Unfortunately, I am on travel with limited internet > connection for the next 6 weeks. I will only be able to address this when I > get back. Is this a blocker for the patch or is this something we can address > as a revision? This is orthogonal and is not related to the patch, since it doesn't affect 8 and 20-byte paths, only 256 and greater. -- John Naylor Amazon Web Services