On 10/29/24 8:21 PM, Waldek Hebisch wrote:
On Wed, Oct 16, 2024 at 04:24:53PM +0800, Qian Yun wrote:
The functions in POLYVEC use arithmetic modulo a prime.
This makes the SIMD algorithm more complex than I thought,
but if implemented correctly, it should easily give
over 10x speedup.
- Qian
On 10/14/24 9:34 PM, Qian Yun wrote:
I've taken a brief look; I think it is possible to speed up
functions in the package POLYVEC with SIMD, via the SBCL package
sb-simd, on machines with AVX2 instructions.
Already SSE2 (which is supposed to be present on all 64-bit PCs)
should give a significant speedup. By comparison, AVX/AVX2 offer
limited gains, basically only on machines which have sufficiently
wide hardware (IIUC only top models). Potential troubles to solve:
The Steam survey shows that 97.49% of machines have AVX and 94.61% have AVX2.
AVX-512 is much less common, only 13%.
Isn't AVX supposed to have twice the throughput of SSE2?
- aligning data to a 16-byte boundary, which is required by normal SSE2
instructions (I did not check what is required by AVX); see the sketch below
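For illustration, a minimal sketch of one way to get such alignment in C
(aligned_alloc is C11; the name poly_alloc is just for this example):

  #include <stdlib.h>
  #include <stdint.h>

  /* Allocate n 32-bit coefficients on a 16-byte boundary so that
     aligned SSE2 loads/stores (MOVDQA) can be used.  C11 requires
     the size passed to aligned_alloc to be a multiple of the
     alignment, hence the rounding up. */
  static uint32_t *poly_alloc(size_t n)
  {
      size_t bytes = (n * sizeof(uint32_t) + 15) & ~(size_t)15;
      return aligned_alloc(16, bytes);
  }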
- extension from 32 bits to 64 bits. Main operations limit
coefficients to 32 bits (actually less than this), but to
prevent overflow we need operations with 64-bit accuracy.
SSE has one operation that returns increased accuracy, but
IIRC this is for 16-bit to 32-bit (we could use such a thing
sometimes, but not always). AVX may be better here. But
for example 'vector_combination' needs extension before the
actual arithmetic (or some way to access the high part of the
product, but IIUC AVX has no way to do this). See the sketch below.
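For what it is worth, SSE2 does also have a 32-bit to 64-bit widening
multiply, PMULUDQ (_mm_mul_epu32), though it only multiplies the even
lanes. A minimal sketch of getting all four 64-bit products from two
vectors of four 32-bit coefficients:

  #include <emmintrin.h>  /* SSE2 intrinsics */

  /* Multiply four unsigned 32-bit coefficients pairwise, keeping the
     full 64-bit products.  PMULUDQ only multiplies lanes 0 and 2 of
     each operand, so the odd lanes are shifted down into even
     position and multiplied separately. */
  static void mul_widen4(__m128i a, __m128i b,
                         __m128i *even,  /* products of lanes 0 and 2 */
                         __m128i *odd)   /* products of lanes 1 and 3 */
  {
      *even = _mm_mul_epu32(a, b);
      *odd  = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                            _mm_srli_epi64(b, 32));
  }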
- a fast way to compute remainders. There are fast ways, but they are
quite low-level and hard to use from higher-level languages.
I saw "Modular SIMD arithmetic in Mathemagix" by J van der Hoeven.
There's algorithm doing modular multiplication entirely in 32-bit.
Our current method stores tmp value in 64bit.
The potential 10x speedup I wrote about is based on the numbers in
this article.
BTW, which "high level" functions use POLYVEC underneath?
I'd like to use them as a benchmark.
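For concreteness, a scalar sketch of one trick from that family:
Shoup's variant with a precomputed quotient (as used e.g. in NTL; the
schemes in the Mathemagix paper differ in details). The 64-bit casts
are only the portable C spelling of a 32x32 high-half multiply, which
is a single instruction in hardware:

  #include <stdint.h>

  /* Multiply t by a fixed constant w modulo p (p < 2^31), using the
     precomputed quotient wp = floor(w * 2^32 / p).  The true value of
     w*t - q*p lies in [0, 2p), which fits in 32 bits, so the low
     halves of the products suffice; no division or remainder
     instruction is needed. */
  static uint32_t mulmod_shoup(uint32_t t, uint32_t w,
                               uint32_t wp, uint32_t p)
  {
      uint32_t q = (uint32_t)(((uint64_t)wp * t) >> 32); /* approximate quotient */
      uint32_t r = w * t - q * p;                        /* exact mod 2^32 */
      return r >= p ? r - p : r;
  }

  /* one-time precomputation: wp = (uint32_t)(((uint64_t)w << 32) / p); */

This pays off when the same w is multiplied against many values of t,
which is presumably the shape of 'vector_combination' (fixed
coefficients applied across whole vectors).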
- Qian
- for polynomial multiplication we probably should reverse one
of the arguments, otherwise the loop is hard to vectorise.
To check what is possible I tried polynomial multiplication code
in C. Using '-O' the C version results in scalar code, but faster
than the sbcl-compiled one. At '-O3' gcc tries to autovectorize the
loop, resulting in slower code. Namely, since there is no guarantee
of alignment, gcc inserts extra code to specially handle the first
few elements, then does the main part in an aligned way, and then a
trailer. Since one of the indices goes in the "wrong direction", gcc
inserts a shuffle instruction. All this slows down the code, so that
in the interesting range it runs at about half the speed of the
scalar code. Reversing one of the arguments should eliminate most of
the slowdown. But the extra handling of the initial part is
problematic too; for this we should get proper alignment of the data.
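To make the reversal concrete, a minimal scalar sketch (poly_mul_rev
and br are illustrative names; reduction mod p is omitted). With the
second argument stored in reverse, both indices in the inner loop
advance by one per iteration, so the compiler sees a plain unit-stride
dot product and needs no shuffles:

  #include <stdint.h>

  /* Multiply a (degree da) by b (degree db), with b supplied in
     reversed layout br[j] = b[db - j].  Then b[k - i] = br[db - k + i],
     and for fixed k both a[i] and br[db - k + i] are read with unit
     stride as i increases. */
  static void poly_mul_rev(const uint32_t *a, int da,
                           const uint32_t *br, int db,
                           uint64_t *c)
  {
      for (int k = 0; k <= da + db; k++) {
          int lo = k > db ? k - db : 0;
          int hi = k < da ? k : da;
          uint64_t s = 0;
          for (int i = lo; i <= hi; i++)
              s += (uint64_t)a[i] * br[db - k + i];
          c[k] = s;
      }
  }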
BTW: My past profile data indicates that most work is done in
'inner_mul' and 'vector_combination'. 'inner_mul' performs
several additions before computing the remainder, so it should be
easier to speed up. 'vector_combination' has only 3 arithmetic
operations before the final remainder, so here the remainder is
critical. One possible way to speed up the remainder is to use
primes of a special form. If p = 2^k - a with small a and
reasonable k, then the remainder can be done in 6 operations
(2 shifts, 2 multiplications and 2 subtractions). On my Core2 the
remainder takes 20 clocks, so 6 operations would be a substantial
improvement. There are other schemes for computing the remainder,
of similar cost.
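For illustration, a minimal sketch of the p = 2^k - a scheme (assuming
0 < k < 64 and an input x < 2^(2k), e.g. a double-width product).
Since 2^k = p + a, each round of q = x >> k; x -= q*p replaces the
high part of x by q*a, and two rounds cost exactly the 2 shifts,
2 multiplications and 2 subtractions counted above; a final
conditional subtraction brings the result into [0, p):

  #include <stdint.h>

  /* Remainder modulo p = 2^k - a with small a, for x < 2^(2k). */
  static uint64_t rem_special(uint64_t x, int k, uint64_t a)
  {
      uint64_t p = ((uint64_t)1 << k) - a;
      x -= (x >> k) * p;   /* now x < (a + 1) * 2^k */
      x -= (x >> k) * p;   /* now x < p + a + a*a   */
      return x >= p ? x - p : x;
  }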