Hi John, 

I added dispatch logic for the new pclmul version on top of your v5 (which 
seems outdated now and so I will refrain from posting a patch here). Could you 
take a look at the approach here to see if this makes sense?  The logic in the 
pg_comp_crc32c_choose function is the main change and seems a lot cleaner to 
me. 

See 
https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/
 

Raghuveer

> -----Original Message-----
> From: John Naylor <johncnaylo...@gmail.com>
> Sent: Monday, February 17, 2025 10:40 PM
> To: Nathan Bossart <nathandboss...@gmail.com>
> Cc: Devulapalli, Raghuveer <raghuveer.devulapa...@intel.com>; pgsql-
> hack...@lists.postgresql.org; Shankaran, Akash <akash.shanka...@intel.com>
> Subject: Re: Improve CRC32C performance on SSE4.2
> 
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart <nathandboss...@gmail.com>
> wrote:
> >
> > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > > I tried using branching for the runtime check, and this looks like
> > > the way to go:
> > > - Existing -msse4.2 builders will still call directly, but inside
> > > the function there is a length check and only for long input will it
> > > do a runtime check for pclmul.
> > > - This smooths the way for -msse4.2 (and the equivalent on Arm) to
> > > inline calls with short constant input (e.g. WAL insert lock),
> > > although I've not done that here.
> > > - This can be a simple starting point for consolidating runtime
> > > checks, as was proposed for popcount in the AVX-512 CRC thread, but
> > > with branching my model was Andres' sketch here:
> > >
> > > https://www.postgresql.org/message-id/20240731023918.ixsfbeuub6e76on
> > > e%40awork3.anarazel.de
> >
> > Oh, nifty.  IIUC this should help avoid quite a bit of overhead even
> > before adding the PCLMUL stuff since it removes the function pointers
> > for the runtime-check builds.
> 
> I figured the same, but it doesn't seem to help on the microbenchmark.
> (I also tried the pg_waldump benchmark you used, but I couldn't get it 
> working so
> I'm probably missing a step.)
> 
> master:
> 
> 20
> latency average = 3.958 ms
> latency average = 3.860 ms
> latency average = 3.857 ms
> 32
> latency average = 5.921 ms
> latency average = 5.166 ms
> latency average = 5.128 ms
> 48
> latency average = 6.384 ms
> latency average = 6.022 ms
> latency average = 5.991 ms
> 
> v6:
> 
> 20
> latency average = 4.084 ms
> latency average = 3.879 ms
> latency average = 3.896 ms
> 32
> latency average = 5.533 ms
> latency average = 5.536 ms
> latency average = 5.573 ms
> 48
> latency average = 6.201 ms
> latency average = 6.205 ms
> latency average = 6.111 ms
> 
> > While this needn't block this patch set, I do find the dispatch code
> > to be pretty complicated.  Maybe we can improve that in the future by
> > using macros to auto-generate much of it.  My concern here is less
> > about this particular patch set and more about the long term
> > maintainability as we add more and more stuff like it, each with its
> > own tangled web of build and dispatch rules.
> 
> I think the runtime dispatch is fairly simple for this case. To my taste, the 
> worse
> maintainability hazard is the building. To make that better, I'd do this:
> - Rename the CRC choose*.c files to pg_cpucap_{x86,arm}.c and build them
> unconditionally for each platform
> - Initialize the runtime info by CPU platform and not other symbols where 
> possible
> (I guess anything needing AVX-512 will still be a mess)
> - Put all hardware CRC .c files into a single file pg_crc32c_hw.c.
> Define PG_CRC_HW8/4/2/1 macros in pg_crc32c.h that tell which intrinsic to use
> by platform and have a separate pg_crc32c_hw_impl.h header that has the
> simple loop with these macros. That header would for now only be included into
> pg_crc32c_hw.c.
> 
> The first two could be done as part of this patch or soon after. The third is 
> a bit
> more invasive, but seems like a necessary prerequisite for inlining on small
> constant input, to keep that from being a mess.
> 
> There's another potential irritation for maintenance too: I spent too much 
> time
> only adding pg_cpucap_initialize() to frontend main() functions that need it. 
> I
> should have just added them to every binary.
> We don't add new programs often, but it's still less automatic than the 
> function
> pointer way, so I'm open to other ideas.
> 
> Attached v7 to fix CI failures.
> --
> John Naylor
> Amazon Web Services

Reply via email to