On Sat, Jan 25, 2025 at 3:35 AM Devulapalli, Raghuveer <raghuveer.devulapa...@intel.com> wrote:
> > #1 - The choice of AVX-512. There is no such thing as a "CRC instruction > > operating > > on 8 bytes", and the proposed algorithm is a multistep process using > > carryless > > multiplication and requiring at least 256 bytes of input. The Chromium > > sources > > cited as the source for this patch also contain an implementation using > > 128-bit > > instructions, and which only requires at least 64 bytes of input. Is there > > a reason > > that not tested or proposed as well? That would be much easier to > > read/maintain, > > work on more systems, and might give a speed boost on smaller inputs. These > > are > > useful properties to have. > > > > https://github.com/chromium/chromium/blob/main/third_party/zlib/crc32_simd > > .c#L215 > > Agreed. postgres already has the SSE42 version pg_comp_crc32c_sse42, but I > didn’t > realize it uses the crc32 instruction which processes only 8 bytes at a time. > This can > certainly be upgraded to process 64bytes at a time and should be faster. > Since most > of the AVX-512 stuff is almost ready, I propose to do this in a follow up > patch immediately. It doesn't make sense to me that more limited/difficult hardware support (and more complex coding for that) and a larger input threshold should be a prerequisite for something that doesn't have these disadvantages. > Let me know if you disagree. The AVX512 version processes 256 bytes at a time > and will > most certainly be faster than the improved SSE42 version, which is why the > chromium > library has both AVX512 and SSE42. It looks like chromium simply vendored the zlib library. Input destined for compression is always going to be "large". That's not true in general for our use case, and we mentioned that fact seven months ago, when Andres said upthread [1]: "This is extremely workload dependent, it's not hard to find workloads with lots of very small record and very few big ones...". Given that feedback, it would have made a lot of sense to mention the 64-byte alternative back then, especially since it's the exact same pclmull algorithm based on the same paper, and is found in the same zlib .c file, but for some reason that was not done. More broadly, the best strategy is to start with the customer and work backward to the technology. It's more risky to pick the technology upfront and try to find ways to use it. My goal here is to help you make the right tradeoffs. Here's my view: 1. If we can have a relatively low input size threshold for improvement, it's possibly worth a bit of additional complexity in configure and run-time checks. There is a complicating factor in testing that though: the latency of carryless multiplication instructions varies drastically on different microarchitectures. 2. If we can improve large inputs in a simple fashion, with no additional hardware support, that's worth doing in any case. 3. Complex hardware support (6 CPUIDs!) that only works on large inputs (a minority of workloads) looks to be the worst of both worlds and it's not the tradeoff we should make. Further, we verified upthread that Intel's current and near-future product line includes server chips (some with over 100 cores, so not exactly low-end) that don't support AVX-512 at all. I have no idea how common they will be, but they will certainly be found in cloud datacenters somewhere. Shouldn't we have an answer for them as well? > > #2 - The legal status of the algorithm from following Intel white paper, > > which is > > missing from its original location, archived here: > > > > https://web.archive.org/web/20220802143127/https://www.intel.com/content/ > > dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32- > > instruction-paper.pdf > > > > https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel- > > asm_64.S > > > > ...so I'm unclear if these patents are applicable to software > > implementations. > > They also seem to be expired, but I am not a lawyer. > > Could you look into this please? Even if we do end up with AVX-512, this > > would be > > a good fallback. > > Given that SSE42 is pretty much available in all x86 processors at this > point, do we need a > fallback C version specially after we improve the SSE42 version. I know you had extended time off work, but I've already shared my findings and explained my reasoning [2]. The title of the paper is "Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction", so unsurprisingly it does improve the SSE42 version. With a few dozen lines of code, I can get ~3x speedup on page-sized inputs. At the very least we want to use this technique on Arm [3], and the only blocker now is the question regarding the patents. I'm interested to hear the response on this. [1] https://www.postgresql.org/message-id/20240612201135.kk77tiqcux77lgev%40awork3.anarazel.de [2] https://www.postgresql.org/message-id/CANWCAZbr4sO1bPoS%2BE%3DiRWnrBZp7zUKZEJk39KYt_Pu9%2BX1-SQ%40mail.gmail.com [3] https://commitfest.postgresql.org/51/4620/ -- John Naylor Amazon Web Services