Re: [fricas-devel] Re: use SIMD to speedup POLYVEC

Waldek Hebisch Fri, 01 Nov 2024 11:16:09 -0700

On Tue, Oct 29, 2024 at 09:41:20PM +0800, Qian Yun wrote:
> 
> 
> On 10/29/24 8:21 PM, Waldek Hebisch wrote:
> > On Wed, Oct 16, 2024 at 04:24:53PM +0800, Qian Yun wrote:
> > > The functions in POLYVEC uses modulus (to a prime).
> > > This makes SIMD algorithm more complex than I though,
> > > but if implemented correctly, it should easily gives
> > > over 10x speedup.
> > > 
> > > - Qian
> > > 
> > > On 10/14/24 9:34 PM, Qian Yun wrote:
> > > > I've taken a brief look, I think it is possible to speedup
> > > > functions in package POLYVEC by SIMD, via SBCL package sb-simd,
> > > > on machines with AVX2 instructions.
> > 
> > Already SSE2 (which is supposed to be present on all 64-bit PC-s)
> > should give significant speedup.  Comparatively, AVX/AVX2 offers
> > limited gains, basicaly only on machines which have sufficiently
> > wide hardware (IIUC only top models).  Potential troubles to solve:
> 
> Steam survey shows that 97.49% machines have AVX and 94.61% have AVX2.
> AVX512 is much less common, only 13%.


My Core 2 has only SSE2, no AVX.  There are faster machines now,
but it is faster than modern low-end machines.
 
> Isn't AVX supposed to have twice throughput than SSE2?

AVX doubles decode efficiency and adds some new instructions.
But throughput depends on width of execution units.  If machine
has 128-bit execution units (or narrower, but I hope no machine
is narrower), then gain compared to SSE@ will be
quit modest (if any).  IIUC quite a lot of AVX machines has
128-bit execution units

> > - aligning data to 16 byte boundary which is required by normal SSE2
> >    instructions (I did not check what is required by AVX)
> > - extention from 32-bits to 64-bits.  Main operations limit
> >    coeffiecints to 32-bits (actually less than this) but to
> >    prevent overflow we need operations with 64-bit accuracy.
> >    SSE has one operation that returns increased accuracy, but
> >    IIRC this is for 16-bit to 32-bit (we could use such thing
> >    sometimes, but not always).  AVX may be better here.  But
> >    for example 'vector_combination' needs extention before
> >    actual arithmetic (or some way to access high part of the
> >    product, but IIUC AVX have no way to do this).
> > - fast way to compute remainders.  There are fast ways but quite
> >    low-level, it is hard to use them from higher-level languages
> 
> I saw "Modular SIMD arithmetic in Mathemagix" by J van der Hoeven.
> There's algorithm doing modular multiplication entirely in 32-bit.
> Our current method stores tmp value in 64bit.

No, AFAICS Joris is extending 32-bit numbers to 64-bit.

> I wrote potentially 10x speedup is based on the numbers in
> this article.

Well, in 'vector_combination' base step is 'z(i) := (a*v(i) + b*w(i)) rem p'.
IIRC time per step on Core 2 is 20 cycles.  Joris is considering
each operation as separate thing, so that is two multiplications
and addition.  Using his timing for operations that is 4.2 cycles
in best case with SSE and 2.5 with AVX.  So there is potential
for significant speedup.

> BTW, which "high level" functions uses POLYVEC underneath?
> I'd like to use that as benchmark.

Original motivation for POLYVEC was guessing package, that
is functions like 'guessHolo', 'guessPRec', etc.  Theoreticaly
modular part is the only important one (other have lower
theoretical complexity), so I  initally worked on modular
part.  However, when modular part was integrated into whole
thing it turned out that in several cases other parts took
more than 90% of execution time.  So remaing work was mostly
to make non-modular parts faster.  Still, sometimes they take
substantial time.

Second use is polynomial GCD over algebraic extentions, which
in turn is used by the integrator.

Currently factorization over prime fields use POLYVEC and similar
modular operations, like multiplication of vector by matrix
or matrix multiplication.

-- 
                              Waldek Hebisch

-- 
You received this message because you are subscribed to the Google Groups 
"FriCAS - computer algebra system" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/fricas-devel/ZyUa1V99BR6xIP_q%40fricas.org.

Re: [fricas-devel] Re: use SIMD to speedup POLYVEC

Reply via email to