On Fri, Feb 24, 2017 at 7:47 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Sat, Jan 21, 2017 at 09:02:25PM +0200, Ants Aasma wrote:
>> > It might be worth looking into using the CRC CPU instruction to reduce this
>> > overhead, like we do for the WAL checksums. Since that is a different
>> > algorithm it would be a compatibility break and we would need to support 
>> > the
>> > old algorithm for upgraded clusters..
>> We looked at that when picking the algorithm. At that point it seemed
>> that CRC CPU instructions were not universal enough to rely on them.
>> The algorithm we ended up on was designed to be fast on SIMD hardware.
>> Unfortunately on x86-64 that required SSE4.1 integer instructions, so
>> with default compiles there is a lot of performance left on table. A
>> low hanging fruit would be to do CPU detection like the CRC case and
>> enable a SSE4.1 optimized variant when those instructions are
>> available. IIRC it was actually a lot faster than the naive hardware
>> CRC that is used for WAL and about on par with interleaved CRC.
> Uh, I thought already did compile-time testing for SSE4.1 and used them
> if present.  Why do you say "with default compiles there is a lot of
> performance left on table?"

Compile time checks don't help because the compiled binary could be
run on a different host that does not have SSE4.1 (as extremely
unlikely as it is at this point of time). A runtime check is done for
WAL checksums that use a special CRC32 instruction. Block checksums
predate that and use a different algorithm that was picked because it
could be accelerated with vectorized execution on non-Intel
architectures. We just never got around to adding runtime checks for
the architecture to enable this speedup.

The attached test runs 1M iterations of the checksum about 3x faster
when compiled with SSE4.1 and vectorization, 4x if AVX2 is added into
the mix.

gcc $CFLAGS -Isrc/include -DN=1000000 testchecksum.c -o testchecksum
&& time ./testchecksum

CFLAGS="-O2": 2.364s
CFLAGS="-O2 -msse4.1 -ftree-vectorize": 0.752s
CFLAGS="-O2 -mavx2 -ftree-vectorize": 0.552s

That 0.552s is 15GB/s per core on a 3 year old laptop.

Ants Aasma
#include "postgres.h"
#include "storage/checksum_impl.h"

void main() {
	char page[8192] = {0};
	uint32 i, sum = 0;

	for (i = 0; i < N; i++)
		sum ^= pg_checksum_page(page, i);
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to