On Fri, Feb 24, 2017 at 7:47 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> > It might be worth looking into using the CRC CPU instruction to reduce this
>> > overhead, like we do for the WAL checksums. Since that is a different
>> > algorithm it would be a compatibility break and we would need to support
>> > the old algorithm for upgraded clusters..
>>
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on table. A
>> low hanging fruit would be to do CPU detection like the CRC case and
>> enable a SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
>
> Uh, I thought already did compile-time testing for SSE4.1 and used them
> if present. Why do you say "with default compiles there is a lot of
> performance left on table?"

Compile-time checks don't help because the compiled binary could be run
on a different host that does not have SSE4.1 (extremely unlikely as
that is at this point in time). A runtime check is done for WAL
checksums, which use a special CRC32 instruction. Block checksums
predate that and use a different algorithm that was picked because it
could be accelerated with vectorized execution on non-Intel
architectures. We just never got around to adding runtime checks for
the architecture to enable this speedup.

The attached test runs 1M iterations of the checksum about 3x faster
when compiled with SSE4.1 and vectorization, and about 4x faster if
AVX2 is added into the mix.

Test:
gcc $CFLAGS -Isrc/include -DN=1000000 testchecksum.c -o testchecksum && time ./testchecksum

Results:
CFLAGS="-O2": 2.364s
CFLAGS="-O2 -msse4.1 -ftree-vectorize": 0.752s
CFLAGS="-O2 -mavx2 -ftree-vectorize": 0.552s

That 0.552s is 15GB/s per core on a 3-year-old laptop.

Regards,
Ants Aasma
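
For illustration, a runtime check of the kind described above might look roughly like the sketch below. It is not the PostgreSQL implementation: checksum_body, checksum_generic, checksum_sse41 and choose_checksum are hypothetical names, the loop is only a stand-in for the real block checksum, and the code assumes GCC or Clang (for __attribute__((target)) and __builtin_cpu_supports). On non-x86 builds it simply falls back to the generic variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8   /* independent partial sums, so the loop can be vectorized */

typedef uint32_t (*checksum_fn) (const uint32_t *words, size_t nwords);

/* Hypothetical stand-in loop; NOT the algorithm in checksum_impl.h. */
static inline __attribute__((always_inline)) uint32_t
checksum_body(const uint32_t *words, size_t nwords)
{
    uint32_t sums[LANES] = {0};
    uint32_t result = 0;

    assert(nwords % LANES == 0);
    for (size_t i = 0; i < nwords; i += LANES)
        for (int j = 0; j < LANES; j++)
            sums[j] = (sums[j] ^ words[i + j]) * 16777619u;
    for (int j = 0; j < LANES; j++)
        result ^= sums[j];
    return result;
}

/* Built with the default compiler flags. */
static uint32_t
checksum_generic(const uint32_t *words, size_t nwords)
{
    return checksum_body(words, nwords);
}

#if defined(__x86_64__) || defined(__i386__)
/* Same source, but SSE4.1 is enabled for this one function only. */
__attribute__((target("sse4.1")))
static uint32_t
checksum_sse41(const uint32_t *words, size_t nwords)
{
    return checksum_body(words, nwords);
}
#endif

/* Pick an implementation at run time, based on the CPU of the current host. */
static checksum_fn
choose_checksum(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.1"))
        return checksum_sse41;
#endif
    return checksum_generic;
}
```

A caller would invoke choose_checksum() once at startup, keep the returned pointer, and call through it for every page. Both variants compute the same value, so the choice affects only speed, not on-disk compatibility.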

#include "postgres.h"
#include "storage/checksum_impl.h"

/* N, the iteration count, is supplied at compile time via -DN=... */
int
main(void)
{
    char page[8192] = {0};
    uint32 i, sum = 0;

    for (i = 0; i < N; i++)
        sum ^= pg_checksum_page(page, i);

    return 0;
}
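
The per-core throughput figure follows from the test's parameters: N iterations over an 8192-byte page, divided by the wall-clock time. A small sketch of that arithmetic, assuming 1 GB = 10^9 bytes and ignoring process startup time (which `time` also counts); throughput_gb_per_s is a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes checksummed per second, expressed in GB/s (1 GB = 1e9 bytes). */
static double
throughput_gb_per_s(uint64_t iterations, uint64_t page_bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double) (iterations * page_bytes) / seconds / 1e9;
}
```

With the reported timings, throughput_gb_per_s(1000000, 8192, 0.552) comes out just under 15 GB/s for the AVX2 build, and the same call with 2.364 gives roughly 3.5 GB/s for plain -O2.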

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers