On Thu, 2025-05-08 at 05:23 +0700, John Naylor wrote: > Right. That's also true of PowerPC -- perhaps that's a different team > than yours?
indeed that's another team. I can ping them but there is not much more I can do. > This case is a bit different, since Arm can compute hardware CRC on > any input size. The fast path here is only guaranteed to be taken at > inputs of 79 bytes or bigger, with the 64-byte main loop and 16-byte > alignment. For larger inputs, an indirect call shouldn't matter, so > to > keep it simple I'd recommend having only the runtime check. And for > smaller inputs call sb8 directly -- this will avoid the indirect call > which in turn just jumps to sb8. This is important because the server > computes CRC on a 20-byte input for the WAL record header while > holding a lock. I worked on the code and got it working on 16 byte chunks so the WAL writes will directly benefit from this. I will try to calculate and add the polynomials for the smaller chunks. Maybe we where wrong and we will still see a speed improvement, also for the smaller pieces. However that is something that I cannot tackle immediately. I'll come back to this in the summer. > Something like: > > #define COMP_CRC32C(crc, data, len) \ > ((crc) = pg_comp_crc32c_dispatch((crc), (data), (len))) > > static inline > pg_crc32c > pg_comp_crc32c_dispatch(pg_crc32c crc, const void *data, size_t len) > { > /* > if (len < VX_MIN_LEN + VX_ALIGN_MASK) > { > /* > * For inputs small enough that we may not take the vector path, > * just call sb8 directly. > */ > return pg_comp_crc32c_sb8(crc, data, len);; > } > else > /* Otherwise, use a runtime check for S390_VX instructions. */ > return pg_comp_crc32c(crc, data, len); > } Is static inline generally preferred here or should I do something like: #define COMP_CRC32C(crc, data, len) \ ((crc) = (len) < MAX_S390X_CRC ? \ pg_comp_crc32c_sb8((crc),(data),(len)) : \ pg_comp_crc32c((crc), (data), (len))) > If this behaves like Linux on other platforms, it'll be copy-on-write > when forking, and only the children will set this variable on first > call. Is there some documented reason we need special treatment here? yes for forks i think there is no problem. But I don't know what will happen when two cpus will try to read/set the value. The value is definitely not updated across both CPU caches. So my guess would be that there is a high chance that the _choose code is called multiple times(IF there really are multiple threads trying to calculate a crc at the same time AND the pointer is not already initialized). Indeed, because it makes no difference if the code is executed multiple times its totally fine to not use atomic store here. So I'll remove the code and the checks from the build system again. > Does the build still fall back to sb8 with a warning, or does it > fail? > It'd simpler if the guard against certain clang versions went into > the > choose function, so it can fall back to sb8. My current implementation will silently fall back to sb8 and not compile any of the s390x code. Other possible strategies are: - print a warning and fall back to sb0 - fail the compile and inform the user to switch compiler or to disable crc32vx in the build system - move all the checks into the .c file and to this with with macros indeed for zlib-ng we went the way to put all the checks into the code and fail the build if compiled with a known broken compiler. My arguments to not do the same in postgres are: - postgres uses sb8 on s390x already so by NOT changing it to another algorithm we do not change the expected behavior. - imho, this is something that the build system should check and take care of, not the code that's getting compiled. What is the preferred postgres way here? > While playing around with that, the above doesn't compile on gcc 14, > but 13 and 15 work -- maybe a bug report is in order? Thy for reporting I'll look into that! Many thanks for all other comments. I'll work on the code to incorporate all remarks. -- Eduard Stefes <eduard.ste...@ibm.com>