ni...@lysator.liu.se (Niels Möller) writes:

> Hmm, I tried changing all output registers to unique registers (only
> written once in the loop, never read, except as vmlal reads the output
> register before accumulating to it). Do you mean that I need to change
> the *input* registers of all instructions too?

Not if you manage to break all dependencies without doing that. Beating on just one input reg frees up every other register for the write-once coding...
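The effect of write-once coding is easiest to see in a scalar C analogue (a sketch of the general technique, not the NEON loop itself; the function names are mine): a single accumulator forms one long serial dependency chain, while several independent write-once partial sums cut the chain length and let the adds overlap in the pipeline.

```c
#include <stdint.h>

/* Serial version: every add depends on the previous one through `acc`,
   so the adds cannot overlap -- a read-modify-write of one register. */
static uint64_t sum_serial(const uint32_t *a, int n)
{
    uint64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Write-once style: four independent accumulators, combined only at the
   end.  Each dependency chain is a quarter as long, so an out-of-order
   core (or a scheduler-friendly in-order one) can run them in parallel. */
static uint64_t sum_split(const uint32_t *a, int n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* leftover tail */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both functions compute the same sum; only the shape of the dependency graph differs, which is the point of the renaming exercise above.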
> For simplicity, I did this to my addmul_4 loop. Cycle time dropped from
> 4.5 c/l to 4. I.e., 16 cycles needed to execute a loop consisting of 11
> almost completely independent instructions:
>
>   2 vmlal
>   2 vext
>   2 vpaddl
>   2 ld1 (scalar)
>   1 st1 (scalar)
>   1 subs
>   1 bne

Then it's clear some execution unit is not able to keep up with instruction decoding at all, right?

> I then try taking out some instructions (but keeping load, store and
> looping):
>
>   no arithmetic: 1.75 c/l (only load, store, loop overhead)
>   vmlal only:    2.75 c/l (also the same with vmull instead of vmlal)
>   vext only:     2.5 c/l
>   vext+vmlal:    3.5 c/l
>   vpaddl only:   2.0 c/l
>   vpaddl+vext:   3.0 c/l
>   vpaddl+vmlal:  3.0 c/l
>   all:           4.0 c/l
>
> What conclusions can one draw from this exercise? It seems that vext and
> vmlal compete for execution resources, while vpaddl can be done mostly
> in parallel with the other operations.

That's an important conclusion.

> Perhaps one should avoid vext?

(But since these experiments were done on an A9, we shouldn't draw that conclusion at all.)

> If vext is bad also for A15, I'd hope to use the 32 + 64 -> 64 add
> instruction for all (k'ish) column summations, and only when a column is
> about to get ready, add previous carry-in, and then shuffle as needed
> with vext. That might reduce things to one vext per iteration.

Your vmlal performance seems strange. I can run one vmlal every 2nd cycle on my A9, i.e., a c/l of throughput.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
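The deferred-carry column scheme described above can be sketched as a scalar C model (my own illustration, not the NEON code): the inner sum of each column accumulates partial products into a wide accumulator with no carry handling at all, and the carry from the previous column is added exactly once, when the column is about to be finished -- the point where the vector code would need its single vext shuffle. This model uses 16-bit limbs so that a 64-bit column accumulator cannot overflow: each 16x16 product is below 2^32, so the accumulator can absorb about 2^32 of them carry-free.

```c
#include <stdint.h>

/* r[0..an+bn) = a[0..an) * b[0..bn), little-endian limbs in base 2^16.
   Column k collects all products a[i]*b[k-i] carry-free (the vmlal-like
   step); the carry-in is added once per column, just before the limb is
   stored (the point of the single shuffle per iteration). */
static void mul_columns(uint16_t *r, const uint16_t *a, int an,
                        const uint16_t *b, int bn)
{
    uint64_t carry = 0;
    for (int k = 0; k < an + bn - 1; k++) {
        uint64_t col = 0;
        int lo = (k - bn + 1 > 0) ? k - bn + 1 : 0;
        int hi = (k < an - 1) ? k : an - 1;
        for (int i = lo; i <= hi; i++)
            col += (uint32_t) a[i] * b[k - i]; /* carry-free inner sum */
        col += carry;          /* previous column's carry, added once */
        r[k] = (uint16_t) col; /* low 16 bits: the finished limb */
        carry = col >> 16;     /* high part feeds the next column */
    }
    r[an + bn - 1] = (uint16_t) carry;
}
```

For example, with a = b = {0xffff, 0xffff} (i.e., 2^32 - 1), r becomes the four limbs of (2^32 - 1)^2, namely {0x0001, 0x0000, 0xfffe, 0xffff}.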