> Sorry for going back so far, but this thread was pointed out to me, and this 
> aspect
> of the design could use some more discussion:
> 
> + * pg_crc32c_avx512(): compute the crc32c of the buffer, where the
> + * buffer length must be at least 256, and a multiple of 64. Based
> 
> There is another technique that computes CRC on 3 separate chunks and
> combines them at the end, so about 3x faster on large-enough chunks.
> That's the way used for the Arm proposal [0], coincidentally also citing a 
> white
> paper from Intel, but as Dimitry pointed out in that thread, its link has 
> apparently
> disappeared. Raghuveer, do you know about this, and is there another link
> available?
> 
> http://www.intel.com/content/dam/www/public/us/en/documents/white-
> papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf

 I am not aware of this paper. Let me poke a few people internally and get back 
to you on this. 

> The cut off point in one implementation is only 144 bytes [1] , which is 
> maybe not
> as small as we'd like, but is quite a bit smaller than 256. That seems better 
> suited
> to our workloads, and more portable. I have a *brand-new* laptop with an Intel
> chip, and IIUC it doesn't support AVX-512 because it uses a big-little 
> architecture.
> I also understand that Sierra Forrest (a server product line) will be all 
> little cores
> with no AVX-512 support, so I'm not sure why the proposal here requires AVX-
> 512.

AVX-512 is present all of Intel main P-core based Xeon and AMD's Zen4 and Zen5. 
Sierra Forest contains the SSE and AVX/AVX2 family ISA but AFAIK AVX/AVX2 does 
not contain any CRC32C specific instructions. See:

1) 
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=pclmul&ig_expand=754&techs=AVX_ALL
2) 
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=754&techs=AVX_ALL&text=crc32

> 
> > There a very frequent call computing COMP_CRC32C over just 20 bytes,
> > while holding a crucial lock.  If we were to do introduce something
> > like this
> > AVX-512 algorithm, it'd probably be worth to dispatch differently in
> > case of compile-time known small lengths.
> 
> I know you've read an earlier version of the patch and realized that it 
> wouldn't
> help here, but we could probably dispatch differently regardless, although it 
> may
> only be worth it if we can inline the instructions. Since we technically only 
> need to
> wait for xl_prev, I believe we could push the computation of the other 12 
> bytes to
> before acquiring the lock, then only execute a single instruction on xl_prev 
> to
> complete the CRC computation. Is there any reason why we couldn't do that,
> assuming we have a clean way to make that portable? That would mean that the
> CRCs between major versions would be different, but I think we don't guarantee
> that anyway.

Not sure about that. This is not my expertise and I might need a little time to 
figure this out. Unfortunately, I am on travel with limited internet connection 
for the next 6 weeks. I will only be able to address this when I get back. Is 
this a blocker for the patch or is this something we can address as a revision? 
 
Raghuveer
 

Reply via email to