Hi Michael,
On 12.12.19 13:05, Michael Crusoe wrote:
> Specifically I'm interested in seeing more of our packages for the
> latest RaspberryPI systems (arm64).
Likewise, I am interested in testing things on ARM. However, remote
debugging is not much fun, and I am busy writing my dissertation anyway.
> Two downsides of using the SIMDE library:
> 1) Doesn't work with raw assembly, only C/C++ compiler intrinsics
> (<emmintrin.h> and friends)
I don't see this as a downside. Embedding intrinsics in the regular
source gives the compiler more room to optimize than opaque assembly does.
> 2) Switching between different types of SIMD (like using SSE fallbacks
> for an SSE2 operation) is done at compile time and not run time.
This is a bummer, but can be solved (see below).
> Questions for you all:
> 1) Is this a good idea?
I think it is a good idea, iff you have a benchmark proving that the
optimizations will improve the runtimes significantly. For instance,
there are a number of different ways to compute the reverse complement.
Using a switch statement is very slow, a lookup table is ten times
faster, and a SIMD approach can give another 7x speedup on top of that [3].
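
To make the first two variants concrete, here is a minimal sketch of a switch-based and a table-based reverse complement. It is not the code from libdna or from the benchmark in [3]; the function names are made up for illustration, unknown characters are mapped to 'N' as an assumption, and the sketch makes no performance claim of its own.

```c
#include <stddef.h>

/* Variant 1: one switch per base (hypothetical helper). */
static char complement_switch(char c)
{
    switch (c) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    default:  return 'N'; /* assumption: anything else becomes 'N' */
    }
}

/* dest must have room for len + 1 bytes. */
void revcomp_switch(char *dest, const char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dest[len - 1 - i] = complement_switch(src[i]);
    dest[len] = '\0';
}

/* Variant 2: 256-entry lookup table, filled on first use.
 * Not thread-safe as written; this is only a sketch. */
static const unsigned char *complement_table(void)
{
    static unsigned char table[256];
    static int ready = 0;
    if (!ready) {
        for (int i = 0; i < 256; i++)
            table[i] = 'N';
        table['A'] = 'T';
        table['C'] = 'G';
        table['G'] = 'C';
        table['T'] = 'A';
        ready = 1;
    }
    return table;
}

/* dest must have room for len + 1 bytes. */
void revcomp_table(char *dest, const char *src, size_t len)
{
    const unsigned char *table = complement_table();
    for (size_t i = 0; i < len; i++)
        dest[len - 1 - i] = (char)table[(unsigned char)src[i]];
    dest[len] = '\0';
}
```

Both functions produce the same output, so a benchmark can swap one for the other and measure only the difference in the per-base lookup.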
> 2) Should we carry these patches if upstream doesn't accept them?
Dunno.
> 3) Any ideas about compiling with different
> -m{avx2,avx,sse4.2,sse4.1,ssse3,sse3,sse2,sse,mmx} settings + simple
> wrapper generation to pick the right executable?
I did that just recently for phylonium [1]. Here is the best approach I
found: Put each optimized function in a separate file and compile each
with its specific -m setting. Additionally, provide a generic
implementation and a single entry-point function. The latter can then
determine at call time which optimized implementation to use via
__builtin_cpu_supports(). With ifuncs, this choice can even be delegated
to dynamic-link time.
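
A rough single-file sketch of the call-time variant follows. It is not the phylonium code: the function names are invented for illustration, and the GCC/Clang target attribute stands in for "a separate file compiled with -mavx2" so that the example fits in one block. Both bodies are the same plain loop; only the instruction set the compiler may use differs.

```c
#include <stddef.h>

/* Only GNU-compatible compilers on x86 get the specialised variant. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Generic implementation, built with the baseline flags. */
static size_t mismatches_generic(const char *a, const char *b, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (a[i] != b[i]);
    return n;
}

#if HAVE_X86_DISPATCH
/* Stand-in for a separate file compiled with -mavx2: the compiler
 * is allowed to use AVX2 instructions inside this function only. */
__attribute__((target("avx2")))
static size_t mismatches_avx2(const char *a, const char *b, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (a[i] != b[i]);
    return n;
}
#endif

typedef size_t mismatches_fn(const char *, const char *, size_t);

/* Pick an implementation based on what the running CPU supports. */
static mismatches_fn *select_mismatches(void)
{
#if HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2"))
        return mismatches_avx2;
#endif
    return mismatches_generic;
}

/* Single entry point: callers never see the variants. */
size_t mismatches(const char *a, const char *b, size_t len)
{
    return select_mismatches()(a, b, len);
}
```

Callers just use mismatches(a, b, len). In real code the selected pointer would probably be cached instead of being looked up on every call.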
The devil is in the details: Hurd and kFreeBSD (and macOS) don't support
ifuncs [2]. __builtin_cpu_supports() needs some help to work in ifunc
resolvers (GCC documents calling __builtin_cpu_init() first there).
Also, you have to disable the shenanigans on non-x86 platforms.
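
These caveats could be handled roughly as in the sketch below. All names are hypothetical, and the preprocessor conditions are my assumption about a reasonable guard (ELF, Linux, x86, compiler reports the ifunc attribute), not a complete portability check. Where ifuncs are unavailable, the same resolver is simply called at call time.

```c
#include <stddef.h>

#ifndef __has_attribute
#define __has_attribute(x) 0 /* older compilers: assume no ifunc */
#endif

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define ON_X86 1
#else
#define ON_X86 0
#endif

/* Assumed guard: only use ifunc on x86 Linux/ELF when the compiler has it. */
#if ON_X86 && defined(__ELF__) && defined(__linux__) && __has_attribute(ifunc)
#define USE_IFUNC 1
#else
#define USE_IFUNC 0
#endif

/* Generic implementation: count G and C characters. */
static size_t count_gc_generic(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == 'G') | (s[i] == 'C');
    return n;
}

#if ON_X86
/* Same loop, but the compiler may use SSE4.2 in this function. */
__attribute__((target("sse4.2")))
static size_t count_gc_sse42(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] == 'G') | (s[i] == 'C');
    return n;
}
#endif

typedef size_t count_gc_fn(const char *, size_t);

/* Resolver. As an ifunc it runs before constructors, so the CPU
 * feature data has to be initialised by hand first. */
static count_gc_fn *resolve_count_gc(void)
{
#if ON_X86
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))
        return count_gc_sse42;
#endif
    return count_gc_generic;
}

#if USE_IFUNC
/* The dynamic linker calls the resolver once and binds the symbol. */
size_t count_gc(const char *s, size_t len)
    __attribute__((ifunc("resolve_count_gc")));
#else
/* Fallback for platforms without ifunc: dispatch at call time. */
size_t count_gc(const char *s, size_t len)
{
    return resolve_count_gc()(s, len);
}
#endif
```

Either way the caller only sees count_gc(), so the ifunc can be dropped on a given platform without touching the call sites.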
I definitely think that a library is the right place for these
optimizations. (That's one of the reasons I started my libdna project.)
If you want to optimize libssw, you can try using my approach and see how
far it gets you. ☺
Best
Fabian
1: https://salsa.debian.org/med-team/phylonium/libs/
2: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=945133
3: https://github.com/kloetzl/libdna/blob/master/bench/Brevcomp.cxx