ni...@lysator.liu.se (Niels Möller) writes:

  Unfortunately not. speed -C -s ... mpn_addmul_2 reported around 14
  cycles, so it's 7 c/l, compared to 2.38 for the current non-simd code.
  If I interpret speed output correctly.

Include addmul_1.3 in the measurements as a sanity check.
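For reference, the per-limb arithmetic behind the figures above can be sketched in portable C. This is only an illustration under stated assumptions: 32-bit limbs (as on ARMv7), hypothetical names `ref_addmul_1`, `ref_addmul_2` and `cycles_per_limb_product`, and a calling convention chosen for the sketch that is not claimed to match GMP's internal mpn_addmul_2. It shows why a 2-limb multiplier does two limb products per loop iteration, so roughly 14 cycles per iteration would correspond to 7 c/l.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* assumption: 32-bit limbs, as on ARMv7 */

/* Hypothetical reference routine: rp[0..n-1] += up[0..n-1] * v.
   Returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* A 32x32 product plus two 32-bit addends always fits in 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Hypothetical reference routine (the sketch's own convention):
   adds up[0..n-1] * (v[0] + v[1] * 2^32) to rp[0..n-1], writes the low
   n+1 limbs of the result to rp[0..n] (rp[n] is overwritten, not read)
   and returns the most significant limb.
   Built from two addmul_1 passes, so it performs 2*n limb products. */
limb_t ref_addmul_2(limb_t *rp, const limb_t *up, size_t n, const limb_t v[2])
{
    rp[n] = ref_addmul_1(rp, up, n, v[0]);     /* pass 1, carry lands at rp[n] */
    return ref_addmul_1(rp + 1, up, n, v[1]);  /* pass 2, shifted by one limb */
}

/* Cycles per limb product, given cycles per loop iteration of an
   addmul_N style routine that handles N multiplier limbs at once. */
double cycles_per_limb_product(double cycles_per_iteration, int multiplier_limbs)
{
    return cycles_per_iteration / multiplier_limbs;
}
```

With these definitions, `cycles_per_limb_product(14.0, 2)` gives 7.0, the number to compare against 2.38 for the non-SIMD loop. A tuned addmul_2 would fuse the two passes so that each rp limb is loaded and stored once, but the count of limb products stays the same.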
  > What about SIMD multiply-accumulate? IIRC, these insns have the same
  > latency and throughput as non-accumulating SIMD multiplies.

  Should look into that (I didn't notice any useful integer
  multiply-accumulate instructions on my first reading of the manual). But
  I suspect you get them on the critical path, and then the relevant
  comparison is to add latency, not mul latency.

IIRC, there is an almost parallel set of SIMD multiply-accumulate insns.

One might need to use a bigger building block, say addmul_4, in order to
deal with accumulation latency.

I did measure SIMD multiply(-accumulate) throughput some months ago and
concluded it was great, at least for A15, but I think it was great also
for A9. I did not measure the other needed insns separately or in a mix
with multiply insns.

It might be the case that SIMD add and SIMD multiply compete for decoding
slots or issue slots. Not an uncommon design tradeoff. In that case,
using multiply-accumulate would be all the more important.

-- 
Torbjörn