[Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression

2015-06-23 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

Richard Biener rguenth at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|4.8.5                       |4.9.3

--- Comment #54 from Richard Biener rguenth at gcc dot gnu.org ---
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.


[Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression

2015-05-20 Thread wschmidt at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #53 from Bill Schmidt wschmidt at gcc dot gnu.org ---
I'm not a fan of a tree-level unroller.  It's impossible to make good decisions
about unroll factors that early.  But your second approach sounds quite
promising to me.


[Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression

2015-05-19 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

amker at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #52 from amker at gcc dot gnu.org ---
I don't understand powerpc assembly well, but this looks like the same problem
on aarch64/arm.  Ah, and we are even looking at the same function...

I think this is a general issue caused by an inconsistency between tree-level
ivopts and the RTL-level loop unroller.  To be specific, it is about how we
handle induction variable registers after unrolling.

The core loop on aarch64 with options -O3 -funroll-all-loops -mcpu=cortex-a57
looks like this:

.L3:
add x2, x0, 16
ldr q16, [x17, x0]
add x10, x0, 32
add x9, x0, 48
add x8, x0, 64
ldr q17, [x17, x2]
add x3, x0, 80
add x6, x0, 96
add x5, x0, 112
add w1, w1, 8
ldr q19, [x17, x10]
cmp w1, w14
ldr q18, [x17, x9]
ldr q20, [x17, x8]
ldr q21, [x17, x3]
ldr q22, [x17, x6]
ldr q23, [x17, x5]
str q16, [x18, x0]
add x0, x0, 128
str q17, [x18, x2]
str q19, [x18, x10]
str q18, [x18, x9]
str q20, [x18, x8]
str q21, [x18, x3]
str q22, [x18, x6]
str q23, [x18, x5]
bcc .L3 

The tree ivopt dump is quite neat:

  <bb 6>:
  # ivtmp.16_28 = PHI <ivtmp.16_25(9), 0(5)>
  # ivtmp.19_42 = PHI <ivtmp.19_41(9), 0(5)>
  vect__4.13_62 = MEM[base: vectp_a.12_58, index: ivtmp.19_42, offset: 0B];
  MEM[base: vectp_c.15_63, index: ivtmp.19_42, offset: 0B] = vect__4.13_62;
  ivtmp.16_25 = ivtmp.16_28 + 1;
  ivtmp.19_41 = ivtmp.19_42 + 16;
  if (ivtmp.16_25 < bnd.7_36)
    goto <bb 9>;
  else
    goto <bb 7>;

  ...

  <bb 9>:
  goto <bb 6>;

But after the RTL unroller runs, options like -fsplit-ivs-in-unroller and
-fweb come into play.  Both try to split the long live range of an induction
variable into separate ones.  Eventually, after the following fwprop and IRA
passes, we end up with multiple ivs for each original iv.

I see two possible fixes here.  One is to implement a tree-level unroller
before IVOPT and remove the RTL one.  The RTL one is rather too aggressive,
which is why we don't enable it by default at -O3.
The other is to change how we handle the unrolled iv in the RTL unroller.  It
splits the unrolled iv to avoid a pseudo register with a long live range, since
that may hurt the RTL optimizers.  This assumption may have held before, but it
doesn't seem true to me nowadays, especially for induction variables.  In
tree-level ivopts we already assume that each iv occupies a register, and ivs
are used intensively, so each should live in one single hard register.  For this
specific case, we can factor [base+index] out of the memory references and use
[new_base], [new_base+4], [new_base+8], ... etc. in the unrolled body.  If tree
ivopts chooses the [reg+offset] addressing mode, we only need to generate an
instruction sequence like [reg+offset], [reg+(offset+4)], [reg+(offset+8)], ...
followed by reg = reg + unrolled_times*step.
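
To make the second approach concrete, here is a rough C-level sketch of the two
loop shapes.  It only illustrates the idea and is not what the unroller
literally emits: the function names, the unroll factor of 4 and the element
offsets are all made up for the example, and an optimizing compiler is free to
rewrite either form.

```c
#include <stddef.h>

/* Hypothetical illustration: unrolled by 4 in the shape we get today,
   where every copy of the loop body uses its own index derived from
   the original iv, so each one wants a separate register.  */
void copy_split_ivs(double *c, const double *a, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        size_t i1 = i + 1;   /* split iv #1 */
        size_t i2 = i + 2;   /* split iv #2 */
        size_t i3 = i + 3;   /* split iv #3 */
        c[i]  = a[i];
        c[i1] = a[i1];
        c[i2] = a[i2];
        c[i3] = a[i3];
    }
    for (; i < n; i++)       /* remainder iterations */
        c[i] = a[i];
}

/* Hypothetical illustration of the proposed shape: one pointer iv per
   array, constant offsets in the addresses, and a single update of
   unrolled_times*step per iteration.  */
void copy_single_iv(double *c, const double *a, size_t n)
{
    const double *pa = a;
    double *pc = c;
    size_t blocks = n / 4;
    size_t rest = n % 4;
    for (size_t b = 0; b < blocks; b++) {
        pc[0] = pa[0];       /* [reg+offset]     */
        pc[1] = pa[1];       /* [reg+(offset+1)] */
        pc[2] = pa[2];
        pc[3] = pa[3];
        pa += 4;             /* reg = reg + unrolled_times*step */
        pc += 4;
    }
    for (size_t r = 0; r < rest; r++)   /* remainder iterations */
        pc[r] = pa[r];
}
```

Both functions copy the same data; the difference is only in how many live
index values the loop body needs per iteration.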

Thanks,
bin


[Bug target/29256] [4.8/4.9/5/6 regression] loop performance regression

2015-05-19 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=29256

--- Comment #51 from Jeffrey A. Law law at redhat dot com ---
Configure for the powerpc-linux-gnuspe target with the --enable-e500_double
option:


/home/gcc/GIT-2/gcc/configure powerpc-linux-gnuspe --enable-e500_double

Testcase:
# define N  200
extern double   a[N],c[N];
void tuned_STREAM_Copy()
{
    int j;
    for (j=0; j<N; j++)
        c[j] = a[j];
}


./cc1 -O3 -funroll-loops -funroll-all-loops -fno-tree-loop-distribute-patterns
j.c -I./ -mspe

Results in:

tuned_STREAM_Copy:
stwu 1,-16(1)
lis 7,0x3
lis 8,c@ha
lis 10,a@ha
ori 0,7,0xd090
evstdd 31,8(1)
li 9,0
la 8,c@l(8)
la 10,a@l(10)
mtctr 0
.L2:
evlddx 31,10,9
addi 7,9,8
addi 0,9,16
addi 11,9,24
addi 3,9,32
evstddx 31,8,9
addi 4,9,40
evlddx 31,10,7
addi 5,9,48
addi 6,9,56
evlddx 12,10,6
addi 9,9,64
evstddx 31,8,7
evlddx 7,10,0
evstddx 7,8,0
evlddx 0,10,11
evstddx 0,8,11
evlddx 11,10,3
evstddx 11,8,3
evlddx 3,10,4
evstddx 3,8,4
evlddx 4,10,5
evstddx 4,8,5
evstddx 12,8,6
bdnz .L2
evldd 31,8(1)
addi 1,1,16
blr
.size   tuned_STREAM_Copy, .-tuned_STREAM_Copy

Which looks to me like ivopts has mucked things up badly.