Hi,
Comparing the current SSE4.2 implementation of the CRC32C algorithm in
Postgres to an optimized AVX-512 algorithm [0], we observed significant gains.
For buffers of 256 bytes or more, the AVX-512 version delivered ~6.6X the
throughput on average, measured on 3 different Intel products. Details below.
The AVX-512 algorithm in C is a port of the ISA-L library [1] assembler code.
Workload call size distribution details (write heavy):
* Average was approximately 1,010 bytes per call
* ~80% of the calls were under 256 bytes
* ~20% of the calls were greater than or equal to 256 bytes up to the max
buffer size of 8192
The 256-byte threshold is important because, for smaller buffers, it makes
sense to fall back to the existing implementation: the AVX-512 algorithm needs
a minimum of 256 bytes to operate.
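
The fallback described above can be sketched in C as follows. This is only an
illustration under stated assumptions: the names crc32c_dispatch,
choose_crc32c_path, and AVX512_MIN_LEN are hypothetical and not taken from the
patch, and both code paths are stood in for by a portable bitwise CRC32C loop
rather than the real SSE4.2 or AVX-512 code.

```c
#include <stddef.h>
#include <stdint.h>

/* From the mail: the AVX-512 algorithm needs at least 256 bytes. */
#define AVX512_MIN_LEN 256

typedef enum { PATH_SSE42, PATH_AVX512 } crc_path;

/* Portable bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Placeholder for both real code paths; it is NOT the SSE4.2 or AVX-512 code. */
static uint32_t crc32c_bitwise(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Hypothetical helper: pick the code path from the buffer length alone. */
crc_path choose_crc32c_path(size_t len)
{
    return (len < AVX512_MIN_LEN) ? PATH_SSE42 : PATH_AVX512;
}

/* Hypothetical dispatcher: short buffers keep using the existing path. */
uint32_t crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *) data;

    if (choose_crc32c_path(len) == PATH_SSE42)
        return crc32c_bitwise(crc, p, len);   /* stand-in: existing SSE4.2 code */
    return crc32c_bitwise(crc, p, len);       /* stand-in: AVX-512 code */
}
```

For example, crc32c_dispatch(0xFFFFFFFF, "123456789", 9) ^ 0xFFFFFFFF yields
the standard CRC-32C check value 0xE3069283 through the short-buffer path.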
Using the above workload data distribution:
* at 0% calls < 256 bytes, an 841% improvement on average for crc32c
  functionality was observed.
* at 50% calls < 256 bytes, a 758% improvement on average for crc32c
  functionality was observed.
* at 90% calls < 256 bytes, a 44% improvement on average for crc32c
  functionality was observed.
* at 97.6% calls < 256 bytes, the workload's crc32c performance breaks even.
* at 100% calls < 256 bytes, a 14% regression is seen when using the AVX-512
  implementation.
The results above are averages over 3 machines, measured on Intel Sapphire
Rapids bare metal and on two AWS EC2 instances: Intel Sapphire Rapids
(m7i.2xlarge) and Intel Ice Lake (m6i.2xlarge).
Summary Data (Sapphire Rapids bare metal, AWS m6i-2xl, and AWS m7i-2xl):
+---------------------+-------------------+-------------------+-------------------+--------------------+
| Rates in Bytes/us   |    Bare Metal     |    AWS m6i-2xl    |    AWS m7i-2xl    |                    |
| (Larger is Better)  +---------+---------+---------+---------+---------+---------+ Overall Multiplier |
|                     | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 |                    |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Numbers 256-8192    |  12,046 |  83,196 |   7,471 |  39,965 |  11,867 |  84,589 |        6.62        |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Numbers 64 - 255    |  16,865 |  15,909 |   9,209 |   7,363 |  12,496 |  10,046 |        0.86        |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
                                                    | Weighted Multiplier [*]     |        1.44        |
                                                    +-----------------------------+--------------------+
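
The multipliers in the table can be reproduced with a short calculation. The
sketch below is an assumption about the arithmetic, not the script used to
produce the table: it treats each overall multiplier as the ratio of summed
AVX-512 rates to summed SSE 4.2 rates across the three machines, and blends
the two multipliers linearly by call fraction. The function names are
hypothetical.

```c
#include <stddef.h>

/* Ratio of summed AVX-512 rates to summed SSE 4.2 rates across machines.
 * Assumption: this is how the "Overall Multiplier" column was formed. */
double overall_multiplier(const double *sse, const double *avx, size_t n)
{
    double sse_sum = 0.0;
    double avx_sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        sse_sum += sse[i];
        avx_sum += avx[i];
    }
    return avx_sum / sse_sum;
}

/* Linear blend: frac_small is the fraction of calls under 256 bytes,
 * m_small and m_large are the multipliers for small and large buffers. */
double weighted_multiplier(double frac_small, double m_small, double m_large)
{
    return frac_small * m_small + (1.0 - frac_small) * m_large;
}

/* Fraction of small calls at which the blended multiplier equals 1.0,
 * obtained by solving f * m_small + (1 - f) * m_large = 1 for f. */
double break_even_fraction(double m_small, double m_large)
{
    return (m_large - 1.0) / (m_large - m_small);
}
```

With the table's values, this gives roughly 6.62 and 0.86 for the two rows,
about 1.44 at the 90%/10% weighting, and a break-even near 97.6% of calls
under 256 bytes.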
Perf data showed no evidence of AVX-512 frequency throttling; the frequency
stayed steady during the test.
Feedback on this proposed improvement is appreciated. Some questions:
1) This AVX-512 ISA-L derived code is under the BSD 3-Clause license [2]. Is
this compatible with the PostgreSQL License [3]? They both appear to be very
permissive licenses, but I am not an expert on licenses.
2) Is there a preferred benchmark I should run to test this change?
If licensing is a non-issue, I can post the initial patch along with my
Postgres benchmark function patch for further review.
Thanks,
Paul
[0] https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
[1] https://github.com/intel/isa-l
[2] https://opensource.org/license/bsd-3-clause
[3] https://opensource.org/license/postgresql
[*] Weights used were 90% of requests less than 256 bytes, 10% greater than or
equal to 256 bytes.