INTENTION
Inspired by the effort to integrate JIT compilation for executor acceleration,
I think that selected simple algorithms working on array-oriented data can be
drastically accelerated by using SIMD instructions on modern hardware.
I want to introduce this style of programming with hex_encode as an example
(a sketch of the scalar baseline follows this list), because it:
- operates on arrays (bytea)
- uses a simple algorithm
- partially limits performance in some situations (e.g. pg_dump)
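For reference, PostgreSQL's established implementation processes one input
byte at a time through a lookup table; a simplified sketch (not the verbatim
source in src/backend/utils/adt/encode.c):

    #include <stddef.h>

    static const char hextbl[] = "0123456789abcdef";

    /* simplified scalar hex_encode: one input byte -> two output chars */
    static size_t
    scalar_hex_encode(const char *src, size_t len, char *dst)
    {
        const char *end = src + len;

        while (src < end)
        {
            unsigned char c = (unsigned char) *src++;

            *dst++ = hextbl[c >> 4];        /* high nibble prints first */
            *dst++ = hextbl[c & 0xF];
        }
        return len * 2;                     /* number of output chars */
    }

Every byte needs two table loads and two stores here; a SIMD version replaces
them with one wide load, an in-register lookup and two wide stores.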
IMPLEMENTATION GUIDELINES
The main goal is to accelerate common cases on the most common hardware by
exploiting all the resources the hardware delivers.
The following guidelines led me to a first implementation:
- restrict to 64-bit architectures
These are the dominant server architectures and provide the necessary data
formats with the corresponding registers and instructions
- start with Intel x86-64 SIMD instructions:
This is by far the most widely used platform, available for development and in
practical use
- don't restrict the concept to Intel x86-64 only, so that people with more
experience on other architectures can later jump in and implement comparable
algorithms
- fall back to the established PostgreSQL implementation in unsuitable cases
or on user request (GUC); see the dispatcher sketch after the implementation
details
- implement leaf functions/procedures in assembly language
These consist mostly of a central loop that calls no subroutines and does no
additional branching
- code for maximum hardware usage instead of elegant programming
Once tested, the simple algorithm works as advertised and replaces most
execution parts of the standard C implementation
- keep the footprint isolated by integrating only into the specific subroutine
(here hex_encode)
This ensures that the requirements for fast execution are met (e.g.
buffer sizes) and no repeated checks are needed as in a library use case.
- try to keep both vector execution ports doing useful work at all times by
avoiding latency stalls
- try to access memory in a linear fashion (reading from the input buffer,
writing to the output buffer) to avoid internal cache problems
- focus optimization on the most advanced SIMD instruction set: AVX512
This provides the most advanced instructions and quite a lot of large
registers, which help with hiding latencies
- if possible, provide fallback implementations for older SIMD standards (e.g.
AVX2 or SSE2); a sketch of such a vector core follows this list
This is useful on many older server and client processors, but because those
have fewer registers, latency hiding and fully loaded execution queues cannot
be completely achieved.
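To illustrate the core technique, here is a hedged sketch in C intrinsics of
an SSE-level vector kernel (the patch itself implements the loops in NASM
assembly; SSSE3 is assumed for _mm_shuffle_epi8): each byte is split into its
two nibbles, and both are translated through an in-register lookup table, so
16 input bytes become 32 output characters per round.

    #include <immintrin.h>
    #include <stdint.h>

    /* encode 16 input bytes into 32 hex chars (SSSE3) */
    static void
    hex16(const uint8_t *src, uint8_t *dst)
    {
        const __m128i tbl  = _mm_setr_epi8('0', '1', '2', '3', '4', '5',
                                           '6', '7', '8', '9', 'a', 'b',
                                           'c', 'd', 'e', 'f');
        const __m128i mask = _mm_set1_epi8(0x0f);

        __m128i in = _mm_loadu_si128((const __m128i *) src);
        __m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), mask); /* high nibbles */
        __m128i lo = _mm_and_si128(in, mask);                    /* low nibbles */

        hi = _mm_shuffle_epi8(tbl, hi);   /* nibble -> ASCII, table in a register */
        lo = _mm_shuffle_epi8(tbl, lo);

        /* the high nibble prints first, so interleave hi/lo bytewise */
        _mm_storeu_si128((__m128i *) dst,        _mm_unpacklo_epi8(hi, lo));
        _mm_storeu_si128((__m128i *) (dst + 16), _mm_unpackhi_epi8(hi, lo));
    }

On AVX2 and AVX512 the same pattern widens to 32 and 64 input bytes per load,
and the larger register files leave room for the interleaved dependency chains
described in the implementation details below.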
IMPLEMENTATION DETAILS
- The loops implementing the algorithm are written in NASM assembler:
NASM is actively maintained, has many output formats, follows the Intel
syntax, has all current instructions implemented and is fast.
- The loops are mostly independent of the operating system, so every OS for
which NASM has an object output format is supported:
This includes Linux and Windows as the most important ones
- The algorithms use advanced techniques (constant and temporary registers) to
avoid most unnecessary memory accesses:
The assembly implementation gives full control over the registers (unlike
intrinsics)
- Multiple dependency chains work interleaved to minimize latencies:
Their code is often interspersed and uses almost all available registers.
- Some instructions (moves, zeroing) are handled outside the processor's
execution ports:
These consume no execution cycles on a port, but their latency still has to
be considered.
- Some vector instructions (multiply-add, for example) have latencies of 5
cycles:
This means that after such an instruction is issued, the processor has to
wait 5 cycles until the result can be used in the same dependency chain. To
hide this and keep both vector execution ports (p0 and p5) busy, 9 other
instructions working on other streams of the algorithm are needed in between:
two ports times 5 cycles gives 10 issue slots, one of which is occupied by
the instruction itself.
- All loops are implemented as separate C-callable functions (following the
OS calling convention):
They are all leaf functions, calling no other subroutines.
- The decision which implementation is chosen is made at the caller side by a
special dispatcher routine (see the sketch after this list):
The caller handles the architectural capabilities (available instruction
sets) and knows the required work: there is often a minimum amount of work
needed for efficiently calling a given implementation.
- Loops should run at least 2-4 times to compensate for the initialization
overhead:
This implies a certain minimum work count depending on the specific SIMD
implementation
- The loops terminate after detecting an error (e.g. invalid input data) and
return the amount of work successfully completed:
The standard linear implementation then takes over with the already
established error handling.
- The loops work optimally with some extra output buffer space at the end, so
the last round is allowed to overshoot:
Nonetheless the correct amount of work is returned to the caller, and one
vector size of output buffer following the real result is zeroed out
(currently disabled!)
- The loop may preload some data beyond the end of the input buffer, but
ensures that the following page boundary is never crossed, to avoid any
access violation (see the boundary-check sketch below):
This does no harm to the memory system because the output buffer has a
supplemental buffer at the end, but it could be changed to leave the tail
handling to the standard implementation if deemed unsupportable (as for now).
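To make the dispatch and fallback protocol concrete, here is a hedged C
sketch; the GUC name (enable_simd_hex), the function names, the minimum-length
threshold and the use of GCC's __builtin_cpu_supports are illustrative
assumptions, not the patch as posted:

    #include <stdbool.h>
    #include <stddef.h>

    /* illustrative GUC, assumed to be registered elsewhere */
    extern bool enable_simd_hex;

    /* NASM leaf functions; each returns the number of input bytes it
       completed, which may be less than len on early termination */
    extern size_t hex_encode_avx512(const char *src, size_t len, char *dst);
    extern size_t hex_encode_avx2(const char *src, size_t len, char *dst);

    static size_t
    hex_encode_dispatch(const char *src, size_t len, char *dst)
    {
        size_t  done = 0;

        /* vector paths only pay off above a minimum amount of work,
           enough loop rounds to amortize the setup (see above) */
        if (enable_simd_hex && len >= 256)
        {
            if (__builtin_cpu_supports("avx512bw"))
                done = hex_encode_avx512(src, len, dst);
            else if (__builtin_cpu_supports("avx2"))
                done = hex_encode_avx2(src, len, dst);
        }

        /* the scalar baseline (see the first sketch) handles the rest:
           tail bytes, unsupported CPU, disabled GUC or an early exit */
        scalar_hex_encode(src + done, len - done, dst + done * 2);
        return len * 2;
    }

This keeps all capability checks and thresholds on the caller side, so the
assembly loops themselves remain branch-free leaf functions.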
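The page-boundary rule for the preload can be expressed as a simple address
check; a minimal sketch, assuming 4 KiB pages and 64-byte (ZMM-sized) loads,
with illustrative names:

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define VEC_BYTES 64                /* one full-width ZMM load */

    /* A full-width load at p may read past the last valid input byte
       (end - 1), but must stay inside that byte's page: the page itself
       is mapped, anything beyond it may not be. */
    static int
    preload_is_safe(const unsigned char *p, const unsigned char *end)
    {
        uintptr_t page_end = ((uintptr_t) (end - 1) | (PAGE_SIZE - 1)) + 1;

        return (uintptr_t) p + VEC_BYTES <= page_end;
    }

If the check fails, the remaining bytes are left to the standard scalar
implementation, matching the fallback behaviour described above.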
(to be continued...)