Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

John Naylor Tue, 28 Jan 2025 22:46:47 -0800

On Sat, Jan 25, 2025 at 3:35 AM Devulapalli, Raghuveer
<raghuveer.devulapa...@intel.com> wrote:


> > #1 - The choice of AVX-512. There is no such thing as a "CRC instruction 
> > operating
> > on 8 bytes", and the proposed algorithm is a multistep process using 
> > carryless
> > multiplication and requiring at least 256 bytes of input. The Chromium 
> > sources
> > cited as the source for this patch also contain an implementation using 
> > 128-bit
> > instructions, and which only requires at least 64 bytes of input. Is there 
> > a reason
> > that not tested or proposed as well? That would be much easier to 
> > read/maintain,
> > work on more systems, and might give a speed boost on smaller inputs. These 
> > are
> > useful properties to have.
> >
> > https://github.com/chromium/chromium/blob/main/third_party/zlib/crc32_simd
> > .c#L215
>
> Agreed. postgres already has the SSE42 version pg_comp_crc32c_sse42, but I 
> didn’t
> realize it uses the crc32 instruction which processes only 8 bytes at a time. 
> This can
> certainly be upgraded to process 64bytes at a time and should be faster. 
> Since most
> of the AVX-512 stuff is almost ready, I propose to do this in a follow up 
> patch immediately.

It doesn't make sense to me that more limited/difficult hardware
support (and more complex coding for that) and a larger input
threshold should be a prerequisite for something that doesn't have
these disadvantages.

> Let me know if you disagree. The AVX512 version processes 256 bytes at a time 
> and will
> most certainly be faster than the improved SSE42 version, which is why the 
> chromium
> library has both AVX512 and SSE42.

It looks like chromium simply vendored the zlib library. Input
destined for compression is always going to be "large". That's not
true in general for our use case, and we mentioned that fact seven
months ago, when Andres said upthread [1]: "This is extremely workload
dependent, it's not hard to find workloads with lots of very small
record and very few big ones...". Given that feedback, it would have
made a lot of sense to mention the 64-byte alternative back then,
especially since it's the exact same pclmull algorithm based on the
same paper, and is found in the same zlib .c file, but for some reason
that was not done.

More broadly, the best strategy is to start with the customer and work
backward to the technology. It's more risky to pick the technology
upfront and try to find ways to use it. My goal here is to help you
make the right tradeoffs. Here's my view:

1. If we can have a relatively low input size threshold for
improvement, it's possibly worth a bit of additional complexity in
configure and run-time checks. There is a complicating factor in
testing that though: the latency of carryless multiplication
instructions varies drastically on different microarchitectures.
2. If we can improve large inputs in a simple fashion, with no
additional hardware support, that's worth doing in any case.
3. Complex hardware support (6 CPUIDs!) that only works on large
inputs (a minority of workloads) looks to be the worst of both worlds
and it's not the tradeoff we should make.

Further, we verified upthread that Intel's current and near-future
product line includes server chips (some with over 100 cores, so not
exactly low-end) that don't support AVX-512 at all. I have no idea how
common they will be, but they will certainly be found in cloud
datacenters somewhere. Shouldn't we have an answer for them as well?

> > #2 - The legal status of the algorithm from following Intel white paper, 
> > which is
> > missing from its original location, archived here:
> >
> > https://web.archive.org/web/20220802143127/https://www.intel.com/content/
> > dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-
> > instruction-paper.pdf
> >
> > https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel-
> > asm_64.S
> >
> > ...so I'm unclear if these patents are applicable to software 
> > implementations.
> > They also seem to be expired, but I am not a lawyer.
> > Could you look into this please? Even if we do end up with AVX-512, this 
> > would be
> > a good fallback.
>
> Given that SSE42 is pretty much available in all x86 processors at this 
> point, do we need a
> fallback C version specially after we improve the SSE42 version.

I know you had extended time off work, but I've already shared my
findings and explained my reasoning [2]. The title of the paper is
"Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction",
so unsurprisingly it does improve the SSE42 version. With a few dozen
lines of code, I can get ~3x speedup on page-sized inputs. At the very
least we want to use this technique on Arm [3], and the only blocker
now is the question regarding the patents. I'm interested to hear the
response on this.

[1] 
https://www.postgresql.org/message-id/20240612201135.kk77tiqcux77lgev%40awork3.anarazel.de
[2] 
https://www.postgresql.org/message-id/CANWCAZbr4sO1bPoS%2BE%3DiRWnrBZp7zUKZEJk39KYt_Pu9%2BX1-SQ%40mail.gmail.com
[3] https://commitfest.postgresql.org/51/4620/

--
John Naylor
Amazon Web Services

Re: Proposal for Updating CRC32C with AVX-512 Algorithm.

Reply via email to