On Mon, Jun 4, 2012 at 1:44 AM, Richard Sandiford <rdsandif...@googlemail.com> wrote:
> Klaus Pedersen <proje...@gmail.com> writes:
[...]
>> My original fix, that use sane cost for the ACC_REGS: gpr_acc_cost_3.patch
>
> Why sane?  Transfers from and (especially) to HI and LO really are
> expensive on many processors.  Obviously it'd be nice at some point to
> make this legacy code take processor-specific costs into account, but...
At least on my target, moving to and from LO/HI is really cheap. If gcc can
avoid using mult and div while LO/HI is used as scratch, I am fine with that.

But the real reason I say "sane" is that otherwise it is really difficult to
convince gcc to use 2-op madds instead of the 3-op version. Potentially the
2-op madd should have a big benefit over the 3-op version, as it allows the
madd to run in the background until you need the result.

I suspect the ACC_REGS cost is the reason this loop:

    int i;
    long long acc = 0;
    for (i=0; i<16; i++)
        acc += (long long)*a++ * *b++;
    return acc;

doesn't take advantage of the fact that LO/HI is alive:

    addiu   $8,$4,64    # D.1481, a,
    move    $7,$0       # acc,
    move    $6,$0       # acc,
.L2:
    lw      $3,0($4)    # D.1443, MEM[base: a_25, offset: 0B]
    lw      $2,0($5)    # D.1445, MEM[base: b_26, offset: 0B]
    mtlo    $7          # acc
    addiu   $4,$4,4     # a, a,
    addiu   $5,$5,4     # b, b,
    mthi    $6          # acc
    madd    $3,$2       # D.1443, D.1445
    mflo    $7          # acc
    bne     $4,$8,.L2   #, a, D.1481,
    mfhi    $6          # acc

    move    $2,$6       #, acc
    j       $31
    mflo    $3          # tmp2

But for some reason the unrolled version works fine:

    long long acc = 0;
    acc += (long long)*a++ * *b++;
    acc += (long long)*a++ * *b++;
    acc += (long long)*a++ * *b++;
    acc += (long long)*a++ * *b++;
    return acc;

This translates to:

    lw      $6,4($5)    # D.1416, MEM[(const int *)b_5(D) + 4B]
    lw      $7,4($4)    # D.1414, MEM[(const int *)a_2(D) + 4B]
    lw      $3,0($4)    # D.1414, *a_2(D)
    lw      $2,0($5)    # D.1416, *b_5(D)
    mult    $7,$6       # D.1414, D.1416
    lw      $8,8($4)    # D.1414, MEM[(const int *)a_2(D) + 8B]
    madd    $3,$2       # D.1414, D.1416
    lw      $2,8($5)    # D.1416, MEM[(const int *)b_5(D) + 8B]
    lw      $3,12($4)   # D.1414, MEM[(const int *)a_2(D) + 12B]
    madd    $8,$2       # D.1414, D.1416
    lw      $2,12($5)   # D.1416, MEM[(const int *)b_5(D) + 12B]
    madd    $3,$2       # D.1414, D.1416
    mfhi    $6          #
    move    $2,$6       #,
    j       $31
    mflo    $3          # tmp2

(This is beautiful! Check the interleaved loads...)

>
>> --- gcc-4.7-20120526-orig/gcc/config/mips/mips.c 2012-06-03
>> 19:28:02.137960837 +0800
>> +++ gcc-4.7-20120526/gcc/config/mips/mips.c 2012-06-03 19:31:12.587399458
>> +0800
[...]
> ...this says that it is better to use LO as scratch space than spilling
> to memory -- and better by some margin -- which often isn't the case.

By a margin? I cannot imagine a case where spilling is cheaper than using
internal regs. Spilling to memory requires a stack frame and saving and
restoring of regs.

Say I have a 300MHz CPU, 100MHz 16-bit RAM, 3 clocks to first data and a
32-byte cache line. Depending on the cache configuration, the CPU might have
to read the whole cache line first, before the write buffer can drain to it:
3 clocks to first data plus 16 transfers on the 16-bit bus is about 20 cycles
at 100MHz, or about 60 CPU cycles, before the spilled value can be read back.

>
> As Vlad says, the behaviour you're seeing with the second pass isn't
> deliberate.

I am happy to test patches.

BR, Klaus
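
P.S. For what it's worth, here is the rough arithmetic behind the ~60-cycle
figure above as a tiny C sketch. The constants are just the example numbers
from this mail (300MHz CPU, 100MHz 16-bit RAM, 3 clocks to first data, 32-byte
line), not measurements from real hardware, and the model ignores everything
except the line refill itself:

    #include <stdio.h>

    int main (void)
    {
      const int line_bytes = 32;   /* cache line size */
      const int bus_bytes  = 2;    /* 16-bit RAM bus, 2 bytes per transfer */
      const int first_data = 3;    /* clocks until first data arrives */
      const int cpu_mhz    = 300;
      const int ram_mhz    = 100;

      /* Refill one cache line before the write buffer can drain to it:
         3 + 32/2 = 19 RAM cycles, i.e. the "20 cycles at 100MHz" above.  */
      int ram_cycles = first_data + line_bytes / bus_bytes;

      /* Seen from the CPU clock: 19 * 3 = 57, roughly the 60 CPU cycles
         quoted above before the spilled value can be read back.  */
      int cpu_cycles = ram_cycles * (cpu_mhz / ram_mhz);

      printf ("%d RAM cycles, about %d CPU cycles\n", ram_cycles, cpu_cycles);
      return 0;
    }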