On 02/26/2013 05:24 AM, Torbjorn Granlund wrote:
> We should probably work out the latencies for the interesting
> instructions.  That's not hard to do.

Testing for issue latency like this:

        ldr     r0, =1694100000         @ iteration count
0:
        vmull.u32       q1, d0, d1      @ four independent multiplies:
        vmull.u32       q2, d0, d1      @ same sources, distinct
        vmull.u32       q3, d0, d1      @ destinations
        subs            r0, #1
        vmull.u32       q4, d0, d1
        bne             0b

Output latency like this:

        vmull.u32       q1, d0, d1      @ q1 = {d2,d3}
        vmull.u32       q2, d2, d3      @ reads q1's halves, q2 = {d4,d5}
        vmull.u32       q3, d4, d5      @ reads q2's halves, q3 = {d6,d7}
        vmull.u32       q0, d6, d7      @ reads q3's halves, q0 = {d0,d1},
                                        @ feeding the next iteration

Then dividing the output of "time" by 4, the number of insns under test per
iteration.
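
Spelled out, with f_clk the core clock in Hz:

        cycles/insn ~= elapsed_seconds * f_clk / (1694100000 * 4)

If the iteration count is chosen to match the clock, the seconds printed by
"time" read off directly as cycles per iteration.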

In the issue table I'll list pairs of independent insns, seeing which might be
dual-issuable (a sketch of one such test loop follows the table).

                issue   output  (cycles)
vmull           1       5
vmlal           1       5
vadd.i64 [qd]   3/4     3
vpaddl          3/4     3
vuzp            7/4     4.5
vext            3/4     3
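
The fractional issue rates come from loops shaped like the vmull one above,
with independent operands substituted in; for the vadd.i64 q-form, something
like this (give or take the exact registers):

        ldr     r0, =1694100000
0:
        vadd.i64        q1, q8, q9
        vadd.i64        q2, q10, q11
        vadd.i64        q3, q12, q13
        subs            r0, #1
        vadd.i64        q4, q14, q15
        bne             0b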

On the off-chance that there are various producer/consumer bypasses within a
given functional unit, but perhaps not across functional units, I'll list
output latency in a table.

                        input latency (cycles)
vmull->vuzp             5
vuzp->vpaddl            3
vmlal<->vmlal accum     1
vadd<->vmlal accum      4

Perhaps I got the methodology wrong here, but it sure appears as if vmlal does
not require the addend input until the 4th cycle, producing full output on the
5th.  This seems to be the easiest way to hide a lot of output latency.
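
If that reading is right, a multiply-accumulate chain can issue back-to-back
even though each insn depends on the previous accumulator value.  A minimal
sketch, assuming the accumulator forwarding behaves as measured above:

        vmull.u32       q0, d2, d3      @ start the accumulator
        vmlal.u32       q0, d4, d5      @ each vmlal needs q0 only at its
        vmlal.u32       q0, d6, d7      @ accumulate stage, so the chain
        vmlal.u32       q0, d8, d9      @ sustains one insn per cycle

The full 5-cycle result latency then only has to be paid once, when the final
accumulator value is read out.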

I'm not sure quite what's going on with the 3/4 issue rates.  I really would
have expected to see either exactly 1, or very nearly 1/2, especially for vadd.

I did have a browse through gcc's scheduler description for a15 neon, and it
doesn't quite match up with the numbers I see here.  Relevant entries:

        (define_cpu_unit "ca15_cx_ij, ca15_cx_ik" "cortex_a15_neon")

There are two dispatch units for neon insns, J and K.

        (define_cpu_unit "ca15_cx_ialu1, ca15_cx_ialu2" "cortex_a15_neon")

There are two arithmetic pipelines.

        (define_reservation "ca15_cx_imac" "(ca15_cx_ij+ca15_cx_imac1)")

Multiply-accumulate must issue to J; add-accumulate (e.g. vpadal) and shifts
must issue to K.

        (define_reservation "ca15_cx_perm" "ca15_cx_ij|ca15_cx_ik")
        (define_reservation "ca15_cx_perm_2" "ca15_cx_ij+ca15_cx_ik")
        (define_insn_reservation
          "cortex_a15_neon_bp_simple" 4
          (and (eq_attr "tune" "cortexa15")
               (eq_attr "neon_type"
                      "neon_bp_simple"))
          "ca15_issue3,ca15_ls+ca15_cx_perm_2,ca15_cx_perm")

Permute insns (e.g. vext) must decompose into 3 micro-ops, because they take
both the J and K dispatch units in the first cycle and then either J or K in
the second cycle.

The scheduling description has the look of being auto-generated.  No-one would
write names like
cortex_a15_neon_mul_qdd_64_32_long_qqd_16_ddd_32_scalar_64_32_long_scalar by
hand, on purpose.  Does auto-generating mean it's more or less accurate?

Anyway,


r~