On Mon, Jun 4, 2012 at 1:44 AM, Richard Sandiford
<rdsandif...@googlemail.com> wrote:
> Klaus Pedersen <proje...@gmail.com> writes:
[...]
>> My original fix, which uses a sane cost for the ACC_REGS: gpr_acc_cost_3.patch
>
> Why sane?  Transfers from and (especially) to HI and LO really are
> expensive on many processors.  Obviously it'd be nice at some point to
> make this legacy code take processor-specific costs into account, but...

At least on my target, moving to and from LO/HI is really cheap. If gcc
can avoid using mult and div while LO/HI is in use as scratch space, I am
fine with that.

But the real reason I say "sane" is that otherwise it is really difficult to
convince gcc to use the 2-operand madd instead of the 3-operand version.

Potentially, the 2-operand madd has a big advantage over the 3-operand
version, as it allows the multiply-accumulate to run in the background until
you need the result.

I suspect the ACC_REG cost is the reason this loop:
        int i;
        long long acc = 0;
        for (i=0; i<16; i++)
                acc += (long long)*a++ * *b++;
        return acc;

doesn't take advantage of the fact that LO/HI is live across the loop
(see the hand-written sketch after the listing):
        addiu   $8,$4,64         # D.1481, a,
        move    $7,$0    # acc,
        move    $6,$0    # acc,
.L2:
        lw      $3,0($4)         # D.1443, MEM[base: a_25, offset: 0B]
        lw      $2,0($5)         # D.1445, MEM[base: b_26, offset: 0B]
        mtlo    $7       # acc
        addiu   $4,$4,4  # a, a,
        addiu   $5,$5,4  # b, b,
        mthi    $6       # acc
        madd    $3,$2    # D.1443, D.1445
        mflo    $7       # acc
        bne     $4,$8,.L2        #, a, D.1481,
        mfhi    $6       # acc

        move    $2,$6    #, acc
        j       $31
        mflo    $3       # tmp2
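For comparison, here is a hand-written sketch (not compiler output; register
choices and the delay-slot filling are illustrative) of the loop body one
would hope for, with acc kept live in HI/LO across iterations so the
per-iteration mtlo/mthi/mflo/mfhi moves disappear:

.L2:
        lw      $3,0($4)         # *a
        lw      $2,0($5)         # *b
        addiu   $4,$4,4          # a++
        madd    $3,$2            # acc accumulates in HI/LO across iterations
        bne     $4,$8,.L2
        addiu   $5,$5,4          # b++ (branch delay slot)

        mfhi    $2               # acc is read out of HI/LO only once,
        j       $31              # after the loop
        mflo    $3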

But for some reason the unrolled version works fine:
        long long acc = 0;
        acc += (long long)*a++ * *b++;
        acc += (long long)*a++ * *b++;
        acc += (long long)*a++ * *b++;
        acc += (long long)*a++ * *b++;
        return acc;

This translates to:
        lw      $6,4($5)         # D.1416, MEM[(const int *)b_5(D) + 4B]
        lw      $7,4($4)         # D.1414, MEM[(const int *)a_2(D) + 4B]
        lw      $3,0($4)         # D.1414, *a_2(D)
        lw      $2,0($5)         # D.1416, *b_5(D)
        mult    $7,$6    # D.1414, D.1416
        lw      $8,8($4)         # D.1414, MEM[(const int *)a_2(D) + 8B]
        madd    $3,$2    # D.1414, D.1416
        lw      $2,8($5)         # D.1416, MEM[(const int *)b_5(D) + 8B]
        lw      $3,12($4)        # D.1414, MEM[(const int *)a_2(D) + 12B]
        madd    $8,$2    # D.1414, D.1416
        lw      $2,12($5)        # D.1416, MEM[(const int *)b_5(D) + 12B]
        madd    $3,$2    # D.1414, D.1416
        mfhi    $6       #
        move    $2,$6    #,
        j       $31
        mflo    $3       # tmp2

(This is beautiful! Note the interleaved loads...)

>
>> --- gcc-4.7-20120526-orig/gcc/config/mips/mips.c      2012-06-03 19:28:02.137960837 +0800
>> +++ gcc-4.7-20120526/gcc/config/mips/mips.c   2012-06-03 19:31:12.587399458 +0800
[...]
> ...this says that it is better to use LO as scratch space than spilling
> to memory -- and better by some margin -- which often isn't the case.

By some margin? I cannot imagine a case where spilling is cheaper than using
internal registers. Spilling to memory requires a stack frame plus a save and
restore of registers.

Say I have a 300 MHz CPU, 100 MHz 16-bit RAM, 3 clocks to first data, and a
32-byte cache line. Depending on the cache configuration, the CPU might first
have to read in the cache line before the write buffer can drain to it: about
20 cycles at 100 MHz, or 60 CPU cycles, before the spilled value can be read
back.
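
A rough breakdown of that figure (assuming a full line fill over the 16-bit
bus has to complete before the store can land):

        32-byte line / 2 bytes per bus transfer  = 16 transfers
        3 clocks to first data + 16 transfers    ~ 19-20 cycles @ 100 MHz
        20 RAM cycles x (300 MHz / 100 MHz)      = 60 CPU cycles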


>
> As Vlad says, the behaviour you're seeing with the second pass isn't
> deliberate.

I am happy to test patches.


BR,  Klaus
