Richard Henderson <r...@twiddle.net> writes:

  Perhaps I got the methodology wrong here, but it sure appears as if vmlal does
  not require the addend input until the 4th cycle, producing full output on the
  5th.  This seems to be the easiest way to hide a lot of output latency.
  
I measured a few of your more surprising numbers now, and I agree.

I didn't check the vmlal acc latency, but I recall having seen similar
helpful behaviour for umlal and umaal.
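
For anyone wanting to repeat the comparison, here is a rough sketch of the two
dependency chains involved, written in portable C rather than ARM assembly.
The helper names (mla, chain_acc, chain_mul) are made up for illustration, and
mla only models the arithmetic of a umlal-style 32x32+64 multiply-accumulate.
A real measurement would use hand-written assembly and a cycle counter, since
a compiler is free to rearrange these loops.

```c
#include <stdint.h>

/* Hypothetical stand-in for a umlal-style operation: acc + a*b,
   with a 32x32 -> 64 bit product added to a 64-bit accumulator. */
static inline uint64_t mla(uint64_t acc, uint32_t a, uint32_t b)
{
    return acc + (uint64_t)a * b;
}

/* Chain through the addend only: each step needs the previous sum,
   but the multiplicands do not depend on it.  If the addend is read
   late in the pipeline, this chain should run faster than the full
   result latency would suggest. */
uint64_t chain_acc(uint32_t a, uint32_t b, int n)
{
    uint64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = mla(acc, a, b);
    return acc;
}

/* Chain through a multiplicand: each step multiplies the low half of
   the previous result, so the full result latency is exposed. */
uint64_t chain_mul(uint32_t a, uint32_t b, int n)
{
    uint64_t acc = a;
    for (int i = 0; i < n; i++)
        acc = mla(1, (uint32_t)acc, b);
    return acc;
}
```

Timing n iterations of each chain and dividing by n gives the per-step
latency along that path; the difference between the two is the number of
cycles by which the addend can arrive late.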

  I'm not sure quite what's going on with the 3/4 issue rates.  I really would
  have expected to see either exactly 1, or very nearly 1/2, especially for
  vadd.
  
I think you mean 4/3.  But that is also an underestimate.  With 8-way
unrolling I get a bit more, about 7/5.

  cortex_a15_neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar

There is a tradition among older LISP programmers to use names of about
that length.  Preferably using several names with minimal edit distance.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
