On Wed, Oct 16, 2024 at 04:24:53PM +0800, Qian Yun wrote:
> The functions in POLYVEC use reduction modulo a prime.
> This makes the SIMD algorithm more complex than I thought,
> but if implemented correctly, it should easily give
> over 10x speedup.
> 
> - Qian
> 
> On 10/14/24 9:34 PM, Qian Yun wrote:
> > I've taken a brief look, I think it is possible to speedup
> > functions in package POLYVEC by SIMD, via SBCL package sb-simd,
> > on machines with AVX2 instructions.

Already SSE2 (which is supposed to be present on all 64-bit PCs)
should give a significant speedup.  By comparison, AVX/AVX2 offers
limited gains, basically only on machines which have sufficiently
wide hardware (IIUC only top models).  Potential troubles to solve:

- aligning data to a 16-byte boundary, which is required by normal
  SSE2 instructions (I did not check what AVX requires)
- extension from 32 bits to 64 bits.  Main operations limit
  coefficients to 32 bits (actually less than this), but to
  prevent overflow we need operations with 64-bit accuracy.
  SSE has one operation that returns increased accuracy, but
  IIRC this is for 16-bit to 32-bit (we could use such a thing
  sometimes, but not always).  AVX may be better here.  But
  for example 'vector_combination' needs extension before
  the actual arithmetic (or some way to access the high part
  of the product, but IIUC AVX has no way to do this).
- a fast way to compute remainders.  There are fast ways, but they
  are quite low-level and hard to use from higher-level languages
- for polynomial multiplication we probably should reverse one
  of the arguments, otherwise the loop is hard to vectorise.
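To illustrate the last point, here is a sketch in C (my own illustration
with hypothetical names, not the actual POLYVEC code) of convolution with
one argument pre-reversed, so that both inner-loop indices move forward:

```c
#include <stdint.h>

/* Sketch: multiply polynomials a and b modulo p, where the caller
   passes b already reversed (br[j] = b[n-1-j]).  Then
   b[k-i] = br[(n-1-k) + i], so the inner loop is a plain dot product
   over two contiguous, forward-moving ranges -- the form that
   autovectorizers handle best. */
static void poly_mul_rev(const uint32_t *a, const uint32_t *br,
                         uint32_t *c, int n, uint32_t p)
{
    for (int k = 0; k < 2 * n - 1; k++) {
        int lo  = k < n ? 0 : k - n + 1;
        int hi  = k < n ? k : n - 1;
        int off = n - 1 - k;            /* offset into reversed b */
        uint64_t acc = 0;
        /* Deferring the remainder to after the loop is the same idea
           as in 'inner_mul'.  Caution: for a 31-bit p only a few
           products fit in a 64-bit accumulator, so a real kernel
           would have to reduce periodically. */
        for (int i = lo; i <= hi; i++)
            acc += (uint64_t)a[i] * br[off + i];
        c[k] = (uint32_t)(acc % p);
    }
}
```

For example, (1 + 2x + 3x^2)(4 + 5x + 6x^2) mod 7 is computed by passing
a = {1,2,3} and br = {6,5,4}.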

To check what is possible I tried polynomial multiplication code
in C.  Using '-O' the C version results in scalar code, but faster
than the sbcl-compiled one.  At '-O3' gcc tries to autovectorize the
loop, resulting in slower code.  Namely, since there is no guarantee
of alignment, gcc inserts extra code to specially handle the first
few elements, then does the main part in an aligned way, and then a
trailer.  Since one of the indices goes in the "wrong direction",
gcc inserts a shuffle instruction.  All this slows down the code so
that in the interesting range it runs at about half the speed of the
scalar code.  Reversing one of the arguments should eliminate most
of the slowdown.  But the extra handling of the initial part is
problematic too; for this we should get proper alignment of data.
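One way to get the alignment in C (a sketch under my own assumptions, not
FriCAS code): allocate the coefficient vectors with C11 aligned_alloc and
tell gcc about the alignment, so it can drop the misalignment prologue:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch: allocate a coefficient vector aligned for SSE2 (16 bytes)
   or AVX (32 bytes).  C11 requires the size passed to aligned_alloc
   to be a multiple of the alignment, so round it up. */
static uint32_t *alloc_coeffs(size_t n, size_t align)
{
    size_t bytes = n * sizeof(uint32_t);
    bytes = (bytes + align - 1) / align * align;
    return aligned_alloc(align, bytes);
}

/* With __builtin_assume_aligned (gcc/clang) the compiler knows the
   data is aligned and can skip the scalar code that would otherwise
   handle misaligned leading elements. */
static void scale_mod(uint32_t *v, size_t n, uint32_t s, uint32_t p)
{
    uint32_t *a = __builtin_assume_aligned(v, 32);
    for (size_t i = 0; i < n; i++)
        a[i] = (uint32_t)(((uint64_t)a[i] * s) % p);
}
```

The same effect could be had with posix_memalign or with attribute-aligned
static buffers; the point is only that alignment must be guaranteed at
allocation time for the vectorized main loop to start at element 0.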

BTW: My past profile data indicates that most work is done in
'inner_mul' and 'vector_combination'.  'inner_mul' is performing
several additions before computing remainder so should be easier
to speed up.  'vector_combination' has only 3 arithmetic operations
before final remainder, so here remainder is critical.  One possible
way to speed up remainder is to use primes of special form.
If p = 2^k - a with small a and reasonable k, then the remainder could
be done in 6 operations (2 shifts, 2 multiplications and 2 subtractions).
On my Core2 remainder takes 20 clocks, so 6 operations would be
a substantial improvement.  There are other schemes for computing
remainder, of similar cost.
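A sketch of that scheme (my own illustration, not existing code), taking
the simplest case p = 2^31 - 1, i.e. a = 1, and reducing a full 62-bit
product:

```c
#include <stdint.h>

#define P ((uint32_t)0x7fffffff)   /* p = 2^31 - 1, prime; here a = 1 */

/* Sketch: reduce x < P^2 modulo P = 2^k - a with the 6-operation
   scheme (2 shifts, 2 multiplications, 2 subtractions) plus a final
   conditional subtraction.  Each round computes x - (x >> k)*P,
   which equals (x mod 2^k) + a*(x >> k), so x shrinks from 62 bits
   to about 32 bits and then to at most 2^31. */
static uint32_t mod_p(uint64_t x)
{
    x -= (x >> 31) * P;            /* shift, multiply, subtract */
    x -= (x >> 31) * P;            /* now x <= 2^31 */
    if (x >= P)
        x -= P;                    /* final correction */
    return (uint32_t)x;
}
```

With a = 1 the two multiplications by a degenerate, so a Mersenne prime
is even cheaper; for a general small a the same two rounds cost exactly
the 6 operations counted above.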

-- 
                              Waldek Hebisch
