Richard Henderson <r...@twiddle.net> writes:

> Perhaps I got the methodology wrong here, but it sure appears as if
> vmlal does not require the addend input until the 4th cycle, producing
> full output on the 5th. This seems to be the easiest way to hide a lot
> of output latency.

I measured a few of your more surprising numbers now, and I agree.
I didn't check the vmlal acc latency, but I recall seeing similarly
helpful behaviour for umlal and umaal.

> I'm not sure quite what's going on with the 3/4 issue rates. I really
> would have expected to see either exactly 1, or very nearly 1/2,
> especially for vadd.

I think you mean 4/3. But that is also an underestimate. With 8-way
unrolling I get a bit more, about 7/5.

> cortex_a15_neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar

There is a tradition among older LISP programmers to use names of about
that length. Preferably using several names with minimal edit distance.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel