On Fri, Aug 29, 2014 at 10:46 PM, Maciej W. Rozycki
<ma...@codesourcery.com> wrote:
> Hi,
>
>  The loop-19.c test case has regressed from 4.8 to 4.9 and trunk on
> classic FPU Power targets; these failures are now seen:
>
> FAIL: gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )a," 2
> FAIL: gcc.dg/tree-ssa/loop-19.c scan-tree-dump-times optimized "MEM.(base: &|symbol: )c," 2
>
>  However, inspection of the generated code shows that its quality has
> improved: autoincrement rather than indexed addressing is now used in
> the loop, reducing the number of instructions in the loop from 4 to 3
> and also removing one instruction from outside the loop, i.e. (new code):
>
>         .globl tuned_STREAM_Copy
>         .type   tuned_STREAM_Copy, @function
> tuned_STREAM_Copy:
>         lis 8,0x1e
>         lis 10,a-8@ha
>         ori 8,8,33920
>         lis 9,c-8@ha
>         mtctr 8
>         la 10,a-8@l(10)
>         la 9,c-8@l(9)
> .L2:
>         lfdu 0,8(10)
>         stfdu 0,8(9)
>         bdnz .L2
>         blr
>         .size   tuned_STREAM_Copy, .-tuned_STREAM_Copy
>
> vs (old code):
>
>         .globl tuned_STREAM_Copy
>         .type   tuned_STREAM_Copy, @function
> tuned_STREAM_Copy:
>         lis 7,0x1e
>         ori 7,7,33920
>         mtctr 7
>         lis 8,c@ha
>         lis 10,a@ha
>         li 9,0
>         la 8,c@l(8)
>         la 10,a@l(10)
> .L3:
>         lfdx 0,10,9
>         stfdx 0,8,9
>         addi 9,9,8
>         bdnz .L3
>         blr
>         .size   tuned_STREAM_Copy,.-tuned_STREAM_Copy
>
> The only Power targets that still pass this test are e500v2 ones such as
> `-mcpu=8548 -mfloat-gprs=double -mspe=yes -mabi=spe' that use the SPE unit
> for FP operations, because the indexed mode is still used (there's no
> autoincrement addressing mode available for the memory access instructions
> concerned):
>
>         .globl tuned_STREAM_Copy
>         .type   tuned_STREAM_Copy, @function
> tuned_STREAM_Copy:
>         lis 10,0x1e
>         lis 7,c@ha
>         lis 8,a@ha
>         ori 10,10,0x8480
>         li 9,0
>         la 7,c@l(7)
>         la 8,a@l(8)
>         mtctr 10
> .L2:
>         evlddx 10,8,9
>         evstddx 10,7,9
>         addi 9,9,8
>         bdnz .L2
>         blr
>         .size   tuned_STREAM_Copy,.-tuned_STREAM_Copy
>
> [I have removed "-fno-common" from the current test flags for the purpose
> of this consideration to compare apples to apples; 4.8 didn't have it.
> The presence or absence of this flag does not appear to make a difference
> for this test case for Power targets.]
>
>  The obvious reason for the failure is the offset of -8 now seen in the new
> classic FP code, which preinitialises the pointers before entering the loop.
> The initial offset is needed so that it is cancelled by the offset of 8
> used in the loop itself to autoincrement these pointers.  So the new code
> is not only better, but it actually has to use these offsets or
> autoincrement addressing would not work.
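
For illustration only, here is a minimal C sketch of the two loop shapes being compared. It is not the loop-19.c source; the array names `c_indexed` and `c_preinc` and the helper `check_copies` are made up for this example, and the element count is taken from the constant loaded into CTR in the listings. The pre-increment variant does its address arithmetic on integers, since forming a pointer one element before an array is undefined in ISO C. It also shows why the scan no longer matches: once the base starts at `a - 8`, the access is no longer expressed relative to the symbol `a` itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N 2000000   /* 0x1e8480 iterations, as loaded into CTR */

static double a[N], c_indexed[N], c_preinc[N];

/* Old shape: bases stay at the start of the arrays and a separate index
   advances each iteration (lfdx/stfdx followed by addi). */
void copy_indexed(void)
{
    for (size_t i = 0; i != N; i++)
        c_indexed[i] = a[i];
}

/* New shape: both addresses start 8 bytes before their arrays and are
   bumped by 8 before each access (lfdu/stfdu).  The -8 start and the +8
   step cancel on the first iteration, so element 0 is accessed first. */
void copy_preinc(void)
{
    uintptr_t src = (uintptr_t)a - sizeof(double);
    uintptr_t dst = (uintptr_t)c_preinc - sizeof(double);
    const uintptr_t last = (uintptr_t)a + (N - 1) * sizeof(double);

    do {
        src += sizeof(double);
        dst += sizeof(double);
        *(double *)dst = *(double *)src;
    } while (src != last);
}

/* Hypothetical self-check: both shapes must copy the same data. */
void check_copies(void)
{
    for (size_t i = 0; i < N; i++)
        a[i] = (double)i;
    copy_indexed();
    copy_preinc();
    assert(memcmp(c_indexed, c_preinc, sizeof c_indexed) == 0);
}
```

Calling `check_copies()` from a small `main` is enough to confirm that the two shapes are equivalent; only the addressing differs.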
>
>  Therefore I think at this point the test case is invalid for classic FP
> Power, so I propose that we exclude it from testing here, only leaving SPE
> FP Power for whatever value the test case may have for it, and especially
> x86 variants where there's an actual code size penalty for using an
> immediate offset (displacement) in addition to a base register.
>
>  For the record here are the optimization dumps examined by the test case,
> for the old generated code that passes:
>
> ;; Function tuned_STREAM_Copy (tuned_STREAM_Copy, funcdef_no=0, decl_uid=1382, cgraph_uid=0)
>
> tuned_STREAM_Copy ()
> {
>   sizetype ivtmp.10;
>   double _4;
>
>   <bb 2>:
>
>   <bb 3>:
>   # ivtmp.10_8 = PHI <ivtmp.10_2(4), 0(2)>
>   _4 = MEM[symbol: a, index: ivtmp.10_8, offset: 0B];
>   MEM[symbol: c, index: ivtmp.10_8, offset: 0B] = _4;
>   ivtmp.10_2 = ivtmp.10_8 + 8;
>   if (ivtmp.10_2 != 16000000)
>     goto <bb 4>;
>   else
>     goto <bb 5>;
>
>   <bb 4>:
>   goto <bb 3>;
>
>   <bb 5>:
>   return;
>
> }
>
> and for the new code that fails:
>
> ;; Function tuned_STREAM_Copy (tuned_STREAM_Copy, funcdef_no=0, decl_uid=2191, symbol_order=2)
>
> Removing basic block 5
> tuned_STREAM_Copy ()
> {
>   unsigned int ivtmp.13;
>   unsigned int ivtmp.9;
>   double _4;
>   void * _15;
>   void * _16;
>   unsigned int _17;
>
>   <bb 2>:
>   ivtmp.9_11 = (unsigned int) &MEM[(void *)&a + 4294967288B];
>   ivtmp.13_14 = (unsigned int) &MEM[(void *)&c + 4294967288B];
>   _17 = (unsigned int) &MEM[(void *)&a + 15999992B];
>
>   <bb 3>:
>   # ivtmp.9_8 = PHI <ivtmp.9_2(3), ivtmp.9_11(2)>
>   # ivtmp.13_12 = PHI <ivtmp.13_13(3), ivtmp.13_14(2)>
>   ivtmp.9_2 = ivtmp.9_8 + 8;
>   _15 = (void *) ivtmp.9_2;
>   _4 = MEM[base: _15, offset: 0B];
>   ivtmp.13_13 = ivtmp.13_12 + 8;
>   _16 = (void *) ivtmp.13_13;
>   MEM[base: _16, offset: 0B] = _4;
>   if (ivtmp.9_2 != _17)
>     goto <bb 3>;
>   else
>     goto <bb 4>;
>
>   <bb 4>:
>   return;
>
> }
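
The constants in the listings and dumps above can be cross-checked with a little arithmetic. The sketch below is only an illustration under stated assumptions: the function names are made up, a 32-bit address space is assumed (as the `unsigned int` induction variables in the new dump suggest), and the element size of 8 is that of a `double` on these targets.

```c
#include <assert.h>
#include <stdint.h>

enum { ELEMS = 2000000, ELEM_SIZE = 8 };

/* "lis r,0x1e" followed by "ori r,r,33920" (33920 == 0x8480) builds the
   iteration count that mtctr then moves into CTR. */
uint32_t loop_count(void)
{
    return ((uint32_t)0x1e << 16) | 33920u;
}

/* The initial offset of -8 shows up as 4294967288B in the new dump,
   i.e. -8 reduced modulo 2^32. */
uint32_t initial_offset(void)
{
    return (uint32_t)0 - (uint32_t)ELEM_SIZE;
}

/* Old dump: the byte index is compared against the full array size. */
uint32_t old_exit_index(void)
{
    return (uint32_t)ELEMS * ELEM_SIZE;
}

/* New dump: the already incremented source address is compared against
   the address of the last element, &a + 15999992B. */
uint32_t new_exit_offset(void)
{
    return (uint32_t)(ELEMS - 1) * ELEM_SIZE;
}
```

With these definitions, `loop_count()` is 2000000, `initial_offset()` is 4294967288, `old_exit_index()` is 16000000 and `new_exit_offset()` is 15999992, matching the values printed in the assembly and in both dumps.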
>
>  Tested with the following powerpc-gnu-linux multilibs with the respective
> results noted on the right:
>
> -mcpu=603e                                              UNSUPPORTED
> -mcpu=603e -msoft-float                                 UNSUPPORTED
> -mcpu=8540 -mfloat-gprs=single -mspe=yes -mabi=spe      UNSUPPORTED
> -mcpu=8548 -mfloat-gprs=double -mspe=yes -mabi=spe      PASS
> -mcpu=7400 -maltivec -mabi=altivec                      UNSUPPORTED
> -mcpu=e6500 -maltivec -mabi=altivec                     UNSUPPORTED
> -mcpu=e5500 -m64                                        UNSUPPORTED
> -mcpu=e6500 -m64 -maltivec -mabi=altivec                UNSUPPORTED
>
> Original results:
>
> -mcpu=603e                                              FAIL
> -mcpu=603e -msoft-float                                 UNSUPPORTED
> -mcpu=8540 -mfloat-gprs=single -mspe=yes -mabi=spe      UNSUPPORTED
> -mcpu=8548 -mfloat-gprs=double -mspe=yes -mabi=spe      PASS
> -mcpu=7400 -maltivec -mabi=altivec                      FAIL
> -mcpu=e6500 -maltivec -mabi=altivec                     FAIL
> -mcpu=e5500 -m64                                        FAIL
> -mcpu=e6500 -m64 -maltivec -mabi=altivec                FAIL
>
>  OK to apply (for trunk and 4.9)?
>
> 2014-08-30  Maciej W. Rozycki  <ma...@codesourcery.com>
>
>         * gcc.dg/tree-ssa/loop-19.c: Exclude classic FPU Power targets.
>
>   Maciej
>
> gcc-test-power-loop-19.diff
> Index: gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c
> ===================================================================
> --- gcc-fsf-trunk-quilt.orig/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c    2014-08-29 16:45:27.748122597 +0100
> +++ gcc-fsf-trunk-quilt/gcc/testsuite/gcc.dg/tree-ssa/loop-19.c 2014-08-30 02:53:03.658955978 +0100
> @@ -4,7 +4,7 @@
>
>     The testcase comes from PR 29256 (and originally, the stream benchmark).  */
>
> -/* { dg-do compile { target { i?86-*-* || { x86_64-*-* || powerpc_hard_double } } } } */
> +/* { dg-do compile { target { i?86-*-* || { x86_64-*-* || { powerpc_hard_double && { ! powerpc_fprs } } } } } } */
>  /* { dg-require-effective-target nonpic } */
>  /* { dg-options "-O3 -fno-tree-loop-distribute-patterns -fno-prefetch-loop-arrays -fdump-tree-optimized -fno-common" } */
>

Okay.

Thanks, David
