Hi John, I added dispatch logic for the new pclmul version on top of your v5 (which seems outdated now and so I will refrain from posting a patch here). Could you take a look at the approach here to see if this makes sense? The logic in the pg_comp_crc32c_choose function is the main change and seems a lot cleaner to me.
See https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/ Raghuveer > -----Original Message----- > From: John Naylor <johncnaylo...@gmail.com> > Sent: Monday, February 17, 2025 10:40 PM > To: Nathan Bossart <nathandboss...@gmail.com> > Cc: Devulapalli, Raghuveer <raghuveer.devulapa...@intel.com>; pgsql- > hack...@lists.postgresql.org; Shankaran, Akash <akash.shanka...@intel.com> > Subject: Re: Improve CRC32C performance on SSE4.2 > > On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart <nathandboss...@gmail.com> > wrote: > > > > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote: > > > I tried using branching for the runtime check, and this looks like > > > the way to go: > > > - Existing -msse4.2 builders will still call directly, but inside > > > the function there is a length check and only for long input will it > > > do a runtime check for pclmul. > > > - This smooths the way for -msse4.2 (and the equivalent on Arm) to > > > inline calls with short constant input (e.g. WAL insert lock), > > > although I've not done that here. > > > - This can be a simple starting point for consolidating runtime > > > checks, as was proposed for popcount in the AVX-512 CRC thread, but > > > with branching my model was Andres' sketch here: > > > > > > https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76on > > > e%40awork3.anarazel.de > > > > Oh, nifty. IIUC this should help avoid quite a bit of overhead even > > before adding the PCLMUL stuff since it removes the function pointers > > for the runtime-check builds. > > I figured the same, but it doesn't seem to help on the microbenchmark. > (I also tried the pg_waldump benchmark you used, but I couldn't get it > working so > I'm probably missing a step.) > > master: > > 20 > latency average = 3.958 ms > latency average = 3.860 ms > latency average = 3.857 ms > 32 > latency average = 5.921 ms > latency average = 5.166 ms > latency average = 5.128 ms > 48 > latency average = 6.384 ms > latency average = 6.022 ms > latency average = 5.991 ms > > v6: > > 20 > latency average = 4.084 ms > latency average = 3.879 ms > latency average = 3.896 ms > 32 > latency average = 5.533 ms > latency average = 5.536 ms > latency average = 5.573 ms > 48 > latency average = 6.201 ms > latency average = 6.205 ms > latency average = 6.111 ms > > > While this needn't block this patch set, I do find the dispatch code > > to be pretty complicated. Maybe we can improve that in the future by > > using macros to auto-generate much of it. My concern here is less > > about this particular patch set and more about the long term > > maintainability as we add more and more stuff like it, each with its > > own tangled web of build and dispatch rules. > > I think the runtime dispatch is fairly simple for this case. To my taste, the > worse > maintainability hazard is the building. To make that better, I'd do this: > - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them > unconditionally for each platform > - Initialize the runtime info by CPU platform and not other symbols where > possible > (I guess anything needing AVX-512 will still be a mess) > - Put all hardware CRC .c files into a single file pg_crc32c_hw.c. > Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which intrinsic to use > by platform and have a separate pg_crc32c_hw_impl.h header that has the > simple loop with these macros. That header would for now only be included into > pg_crc32c_hw.c. > > The first two could be done as part of this patch or soon after. The third is > a bit > more invasive, but seems like a necessary prerequisite for inlining on small > constant input, to keep that from being a mess. > > There's another potential irritation for maintenance too: I spent too much > time > only adding pg_cpucap_initialize() to frontend main() functions that need it. > I > should have just added them to every binary. > We don't add new programs often, but it's still less automatic than the > function > pointer way, so I'm open to other ideas. > > Attached v7 to fix CI failures. > -- > John Naylor > Amazon Web Services