RE: ARM compiler rewriting code to be longer and slower

2009-03-16 Thread Ramana Radhakrishnan
[Resent because of account funnies. Apologies to those who get this twice]

Hi,

  This problem is reported every once in a while, all targets with
 small
  load-immediate instructions suffer from this, especially since GCC
 4.0
  (i.e. since tree-ssa).  But it seems there is just not enough
 interest
  in having it fixed somehow, or someone would have taken care of it by
  now.
 
  I've summed up before how the problem _could_ be fixed, but I can't
  find where.  So here we go again.
 
  This could be solved in CSE by extending the notion of related
  expressions to constants that can be generated from other constants
  by a shift. Alternatively, you could create a simple, separate pass
  that applies CSE's related expressions thing in dominator tree
 walk.
 
 See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for
 handling
 something similar when related expressions differ by a small additive
 constant.  I am planning to finish this and submit it for 4.5.

Wouldn't doing this in CSE only solve the problem within an extended basic
block and not necessarily across the program ? Surely you'd want to do it
globally or am I missing something very basic here ?

Ramana





RE: ARM compiler rewriting code to be longer and slower

2009-03-16 Thread Adam Nemet
Ramana Radhakrishnan writes:
 [Resent because of account funnies. Apologies to those who get this twice]
 
 Hi,
 
   This problem is reported every once in a while, all targets with
  small
   load-immediate instructions suffer from this, especially since GCC
  4.0
   (i.e. since tree-ssa).  But it seems there is just not enough
  interest
   in having it fixed somehow, or someone would have taken care of it by
   now.
  
   I've summed up before how the problem _could_ be fixed, but I can't
   find where.  So here we go again.
  
   This could be solved in CSE by extending the notion of related
   expressions to constants that can be generated from other constants
   by a shift. Alternatively, you could create a simple, separate pass
   that applies CSE's related expressions thing in dominator tree
  walk.
  
  See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for
  handling
  something similar when related expressions differ by a small additive
  constant.  I am planning to finish this and submit it for 4.5.
 
 Wouldn't doing this in CSE only solve the problem within an extended basic
 block and not necessarily across the program ? Surely you'd want to do it
 globally or am I missing something very basic here ?

No, you're not.  There are plans moving some of what's in CSE to a new LCM
(global) pass.  Also note that for a global a pass you clearly need some more
sophisticated cost model for deciding when CSEing is beneficial.  On a
multi-scalar architecture, instructions synthesizing consts sometimes appear
to be free whereas holding a value a in a register for an extended period of
time is not.

Adam


Re: ARM compiler rewriting code to be longer and slower

2009-03-16 Thread Steven Bosscher
On Mon, Mar 16, 2009 at 2:52 PM, Ramana Radhakrishnan
ramana.radhakrish...@arm.com wrote:
 Wouldn't doing this in CSE only solve the problem within an extended basic
 block and not necessarily across the program ? Surely you'd want to do it
 globally or am I missing something very basic here ?

Why so serious^Wsurely?

I think doing this optimization over extended basic blocks would catch
90% of the cases.  The loop-carried form is covered by auto-increment
generation (and yes I know that pass also needs to be improved ;-)

Ciao!
Steven


Re: ARM compiler rewriting code to be longer and slower

2009-03-16 Thread Daniel Berlin
On Mon, Mar 16, 2009 at 12:11 PM, Adam Nemet ane...@caviumnetworks.com wrote:
 Ramana Radhakrishnan writes:
 [Resent because of account funnies. Apologies to those who get this twice]

 Hi,

   This problem is reported every once in a while, all targets with
  small
   load-immediate instructions suffer from this, especially since GCC
  4.0
   (i.e. since tree-ssa).  But it seems there is just not enough
  interest
   in having it fixed somehow, or someone would have taken care of it by
   now.
  
   I've summed up before how the problem _could_ be fixed, but I can't
   find where.  So here we go again.
  
   This could be solved in CSE by extending the notion of related
   expressions to constants that can be generated from other constants
   by a shift. Alternatively, you could create a simple, separate pass
   that applies CSE's related expressions thing in dominator tree
  walk.
 
  See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for
  handling
  something similar when related expressions differ by a small additive
  constant.  I am planning to finish this and submit it for 4.5.

 Wouldn't doing this in CSE only solve the problem within an extended basic
 block and not necessarily across the program ? Surely you'd want to do it
 globally or am I missing something very basic here ?

 No, you're not.  There are plans moving some of what's in CSE to a new LCM
 (global) pass.  Also note that for a global a pass you clearly need some more
 sophisticated cost model for deciding when CSEing is beneficial.  On a
 multi-scalar architecture, instructions synthesizing consts sometimes appear
 to be free whereas holding a value a in a register for an extended period of
 time is not.


Right. You probably want something closer to nigel horspool's
isothermal speculative PRE which takes into account (using
heuristics and profiles) where the best place to put things is based
on costs, instead of LCM, which uses a notion of lifetime optimality

See http://webhome.cs.uvic.ca/~nigelh/pubs.html for Fast
Profile-Based Partial Redundancy Elimination

There was a working implementation of this done for GCC 4.1 that used
profile info and execution counts.
If you are interested, and can hunt down David Pereira (He isn't at
uvic anymore, and i haven't talked to him since so i don't have his
email), he'd probably give you the code :)


Re: ARM compiler rewriting code to be longer and slower

2009-03-15 Thread Ramana Radhakrishnan
Hi Zoltan,

some parts snipped
On Fri, Mar 13, 2009 at 9:16 AM,  zol...@bendor.com.au wrote:

 Note that it is sub-optimal on two counts.

 First, each loading of a constant takes 3 instructions and 3 clocks.
 Storing the constant and fetching it using an ldr also takes 3 clocks but
 only two 32-bit words and identical constants need to be stored only once.
 The speed increase is only true on the ARM7TDMI-S, which has no caches, so
 that's just a minor issue, but the memory saving is true no matter what
 ARM core you have (note that -Os was specified).

 Second, and this is the real problem, if the compiler did not want to be
 overly clever and compiled the code as it was written, then instead of
 loading the constants 4 times, at the cost of 3 instuctions each, it could
 have loaded it only once and then generated the next constants at the cost
 of a single-word, single clock shift. The code would have been rather
 shorter *and* faster, plus some of the jumps could have been eliminated.
 Practically each C statement line (except the braces) corresponds to one
 assembly instruction, so without being clever, just translating what's
 written, it could be done in 20 words instead of 30.

I took a look at this for some time on Friday and I found that the
conditional constant propagation pass has pushed down the value
(tree-ssa-ccp.c). This is done by the CCP pass up in the optimization
pipeline because in general constant propagation is a good idea . In
any case there are a bunch of tree optimizers that identify these and
generally bring in constants into expressions as generally a good
idea. One might argue that constant propagation in general is a good
thing but the problem appears to be that the moment one has an
architecture where costs of loading immediate's is higher than the
cost of simple arithmetic operations the final code generated might
not be the most efficient.


With some more experimentation in the last hour or so I found that for
this particular case, I can get the following code

divby1e9:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
ldr r3, .L7
cmp r0, r3
mov r2, #0
bcc .L2
mov r3, r3, asl #2
cmp r0, r3
rsbcs   r0, r3, r0
addcs   r2, r2, #4
bcs .L2
mov r3, r3, lsr #1
cmp r0, r3
rsbcs   r0, r3, r0
mov r3, r3, lsr #1
movcs   r2, #2
cmp r0, r3
rsbcs   r0, r3, r0
addcs   r2, r2, #1
.L2:
str r2, [r1, #0]
bx  lr
.L8:
.align  2
.L7:
.word   10
.size   divby1e9, .-divby1e9
.ident  GCC: (GNU) 4.4.0 20090313 (experimental) [trunk revision 
143499]


but with the following command line options.

./xgcc -B`pwd` -S -Os newpr.c -fno-tree-ccp -fno-tree-fre
-fno-tree-vrp -fno-tree-dominator-opts -fno-gcse


I'm not sure about the best way to fix this but I've filed this for
the moment as

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468



cheers
Ramana

---
Ramana Radhakrishnan
ARM Ltd.


Re: ARM compiler rewriting code to be longer and slower

2009-03-15 Thread Steven Bosscher
On Sun, Mar 15, 2009 at 11:19 PM, Ramana Radhakrishnan
raman...@gmail.com wrote:
 I'm not sure about the best way to fix this but I've filed this for
 the moment as

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468

This problem is reported every once in a while, all targets with small
load-immediate instructions suffer from this, especially since GCC 4.0
(i.e. since tree-ssa).  But it seems there is just not enough interest
in having it fixed somehow, or someone would have taken care of it by
now.

I've summed up before how the problem _could_ be fixed, but I can't
find where.  So here we go again.

This could be solved in CSE by extending the notion of related
expressions to constants that can be generated from other constants
by a shift. Alternatively, you could create a simple, separate pass
that applies CSE's related expressions thing in dominator tree walk.

Another way to fix this, would be to learn RTL PRE about injured
expressions (like in Knoop's lazy strength reduction). This would be
the hardest but most complete fix.

And finally, you could try to teach the related values patch from
PR20211 (http://gcc.gnu.org/PR20211) about constants related via
shifts, although it isn't very likely that this pass in the form of
the patch will ever make it into the FSF GCC.

Hope this helps,

Ciao!
Steven


Re: ARM compiler rewriting code to be longer and slower

2009-03-15 Thread Adam Nemet
Steven Bosscher stevenb@gmail.com writes:
 On Sun, Mar 15, 2009 at 11:19 PM, Ramana Radhakrishnan
 raman...@gmail.com wrote:
 I'm not sure about the best way to fix this but I've filed this for
 the moment as

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468

 This problem is reported every once in a while, all targets with small
 load-immediate instructions suffer from this, especially since GCC 4.0
 (i.e. since tree-ssa).  But it seems there is just not enough interest
 in having it fixed somehow, or someone would have taken care of it by
 now.

 I've summed up before how the problem _could_ be fixed, but I can't
 find where.  So here we go again.

 This could be solved in CSE by extending the notion of related
 expressions to constants that can be generated from other constants
 by a shift. Alternatively, you could create a simple, separate pass
 that applies CSE's related expressions thing in dominator tree walk.

See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for handling
something similar when related expressions differ by a small additive
constant.  I am planning to finish this and submit it for 4.5.

Adam


Re: ARM compiler rewriting code to be longer and slower

2009-03-12 Thread Ian Lance Taylor
zol...@bendor.com.au writes:

 Is it a problem that is worth being put onto bugzilla or I just have to do
 some trickery to save the compiler from being smarter than it is?

I think this is worth being put into bugzilla.

Ian