ni...@lysator.liu.se (Niels Möller) writes:

  Torbjorn Granlund <t...@gmplib.org> writes:

  > I played with vmlal.u32 on A9 and A15.  Surprisingly, both CPUs are very
  > cooperative in that the accumulation dependency is very shallow.

  Nice. Is the same true for the non-simd umaal instruction?

Things are complicated, I cannot get my head around what's going on.
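For reference when reading the register patterns in the examples, here is a minimal C sketch of what one umaal computes. It models only the architectural result, not the timing. The function name and the pointer-based interface are illustrative assumptions, not anything from GMP.

```c
#include <stdint.h>

/* Illustrative model of ARM "umaal RdLo, RdHi, Rn, Rm":
 *   RdHi:RdLo = Rn * Rm + RdLo + RdHi
 * The 64-bit sum cannot overflow, since
 *   (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1.
 * The name umaal() and its signature are hypothetical helpers. */
static void umaal(uint32_t *rdlo, uint32_t *rdhi, uint32_t rn, uint32_t rm)
{
    uint64_t t = (uint64_t)rn * rm + *rdlo + *rdhi;
    *rdlo = (uint32_t)t;          /* low 32 bits of the result */
    *rdhi = (uint32_t)(t >> 32);  /* high 32 bits of the result */
}
```

Both RdLo and RdHi are read as addends and then written, so an instruction depends on whichever of the two an earlier instruction also wrote. In the examples, the first two operands are RdLo and RdHi, in that order.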
This example with partial overlapping runs at 12 cycles:

	umaal	r2, r1, r14, r14
	umaal	r2, r3, r14, r14
	umaal	r2, r5, r14, r14
	umaal	r2, r7, r14, r14

This example with 100% overlapping runs at 8 cycles:

	umaal	r2, r1, r14, r14
	umaal	r2, r1, r14, r14
	umaal	r2, r1, r14, r14
	umaal	r2, r1, r14, r14

This example with partial overlapping runs at 16 cycles:

	umaal	r2, r1, r14, r14
	umaal	r4, r1, r14, r14
	umaal	r6, r1, r14, r14
	umaal	r8, r1, r14, r14

With completely independent operands, things run at 8 cycles.

-- 
Torbjörn