RE: ARM compiler rewriting code to be longer and slower
[Resent because of account funnies. Apologies to those who get this twice] Hi, This problem is reported every once in a while, all targets with small load-immediate instructions suffer from this, especially since GCC 4.0 (i.e. since tree-ssa). But it seems there is just not enough interest in having it fixed somehow, or someone would have taken care of it by now. I've summed up before how the problem _could_ be fixed, but I can't find where. So here we go again. This could be solved in CSE by extending the notion of related expressions to constants that can be generated from other constants by a shift. Alternatively, you could create a simple, separate pass that applies CSE's related expressions thing in dominator tree walk. See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for handling something similar when related expressions differ by a small additive constant. I am planning to finish this and submit it for 4.5. Wouldn't doing this in CSE only solve the problem within an extended basic block and not necessarily across the program ? Surely you'd want to do it globally or am I missing something very basic here ? Ramana
RE: ARM compiler rewriting code to be longer and slower
Ramana Radhakrishnan writes: [Resent because of account funnies. Apologies to those who get this twice] Hi, This problem is reported every once in a while, all targets with small load-immediate instructions suffer from this, especially since GCC 4.0 (i.e. since tree-ssa). But it seems there is just not enough interest in having it fixed somehow, or someone would have taken care of it by now. I've summed up before how the problem _could_ be fixed, but I can't find where. So here we go again. This could be solved in CSE by extending the notion of related expressions to constants that can be generated from other constants by a shift. Alternatively, you could create a simple, separate pass that applies CSE's related expressions thing in dominator tree walk. See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for handling something similar when related expressions differ by a small additive constant. I am planning to finish this and submit it for 4.5. Wouldn't doing this in CSE only solve the problem within an extended basic block and not necessarily across the program ? Surely you'd want to do it globally or am I missing something very basic here ? No, you're not. There are plans moving some of what's in CSE to a new LCM (global) pass. Also note that for a global a pass you clearly need some more sophisticated cost model for deciding when CSEing is beneficial. On a multi-scalar architecture, instructions synthesizing consts sometimes appear to be free whereas holding a value a in a register for an extended period of time is not. Adam
Re: ARM compiler rewriting code to be longer and slower
On Mon, Mar 16, 2009 at 2:52 PM, Ramana Radhakrishnan ramana.radhakrish...@arm.com wrote: Wouldn't doing this in CSE only solve the problem within an extended basic block and not necessarily across the program ? Surely you'd want to do it globally or am I missing something very basic here ? Why so serious^Wsurely? I think doing this optimization over extended basic blocks would catch 90% of the cases. The loop-carried form is covered by auto-increment generation (and yes I know that pass also needs to be improved ;-) Ciao! Steven
Re: ARM compiler rewriting code to be longer and slower
On Mon, Mar 16, 2009 at 12:11 PM, Adam Nemet ane...@caviumnetworks.com wrote: Ramana Radhakrishnan writes: [Resent because of account funnies. Apologies to those who get this twice] Hi, This problem is reported every once in a while, all targets with small load-immediate instructions suffer from this, especially since GCC 4.0 (i.e. since tree-ssa). But it seems there is just not enough interest in having it fixed somehow, or someone would have taken care of it by now. I've summed up before how the problem _could_ be fixed, but I can't find where. So here we go again. This could be solved in CSE by extending the notion of related expressions to constants that can be generated from other constants by a shift. Alternatively, you could create a simple, separate pass that applies CSE's related expressions thing in dominator tree walk. See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for handling something similar when related expressions differ by a small additive constant. I am planning to finish this and submit it for 4.5. Wouldn't doing this in CSE only solve the problem within an extended basic block and not necessarily across the program ? Surely you'd want to do it globally or am I missing something very basic here ? No, you're not. There are plans moving some of what's in CSE to a new LCM (global) pass. Also note that for a global a pass you clearly need some more sophisticated cost model for deciding when CSEing is beneficial. On a multi-scalar architecture, instructions synthesizing consts sometimes appear to be free whereas holding a value a in a register for an extended period of time is not. Right. You probably want something closer to nigel horspool's isothermal speculative PRE which takes into account (using heuristics and profiles) where the best place to put things is based on costs, instead of LCM, which uses a notion of lifetime optimality See http://webhome.cs.uvic.ca/~nigelh/pubs.html for Fast Profile-Based Partial Redundancy Elimination There was a working implementation of this done for GCC 4.1 that used profile info and execution counts. If you are interested, and can hunt down David Pereira (He isn't at uvic anymore, and i haven't talked to him since so i don't have his email), he'd probably give you the code :)
Re: ARM compiler rewriting code to be longer and slower
Hi Zoltan, some parts snipped On Fri, Mar 13, 2009 at 9:16 AM, zol...@bendor.com.au wrote: Note that it is sub-optimal on two counts. First, each loading of a constant takes 3 instructions and 3 clocks. Storing the constant and fetching it using an ldr also takes 3 clocks but only two 32-bit words and identical constants need to be stored only once. The speed increase is only true on the ARM7TDMI-S, which has no caches, so that's just a minor issue, but the memory saving is true no matter what ARM core you have (note that -Os was specified). Second, and this is the real problem, if the compiler did not want to be overly clever and compiled the code as it was written, then instead of loading the constants 4 times, at the cost of 3 instuctions each, it could have loaded it only once and then generated the next constants at the cost of a single-word, single clock shift. The code would have been rather shorter *and* faster, plus some of the jumps could have been eliminated. Practically each C statement line (except the braces) corresponds to one assembly instruction, so without being clever, just translating what's written, it could be done in 20 words instead of 30. I took a look at this for some time on Friday and I found that the conditional constant propagation pass has pushed down the value (tree-ssa-ccp.c). This is done by the CCP pass up in the optimization pipeline because in general constant propagation is a good idea . In any case there are a bunch of tree optimizers that identify these and generally bring in constants into expressions as generally a good idea. One might argue that constant propagation in general is a good thing but the problem appears to be that the moment one has an architecture where costs of loading immediate's is higher than the cost of simple arithmetic operations the final code generated might not be the most efficient. With some more experimentation in the last hour or so I found that for this particular case, I can get the following code divby1e9: @ Function supports interworking. @ args = 0, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. ldr r3, .L7 cmp r0, r3 mov r2, #0 bcc .L2 mov r3, r3, asl #2 cmp r0, r3 rsbcs r0, r3, r0 addcs r2, r2, #4 bcs .L2 mov r3, r3, lsr #1 cmp r0, r3 rsbcs r0, r3, r0 mov r3, r3, lsr #1 movcs r2, #2 cmp r0, r3 rsbcs r0, r3, r0 addcs r2, r2, #1 .L2: str r2, [r1, #0] bx lr .L8: .align 2 .L7: .word 10 .size divby1e9, .-divby1e9 .ident GCC: (GNU) 4.4.0 20090313 (experimental) [trunk revision 143499] but with the following command line options. ./xgcc -B`pwd` -S -Os newpr.c -fno-tree-ccp -fno-tree-fre -fno-tree-vrp -fno-tree-dominator-opts -fno-gcse I'm not sure about the best way to fix this but I've filed this for the moment as http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468 cheers Ramana --- Ramana Radhakrishnan ARM Ltd.
Re: ARM compiler rewriting code to be longer and slower
On Sun, Mar 15, 2009 at 11:19 PM, Ramana Radhakrishnan raman...@gmail.com wrote: I'm not sure about the best way to fix this but I've filed this for the moment as http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468 This problem is reported every once in a while, all targets with small load-immediate instructions suffer from this, especially since GCC 4.0 (i.e. since tree-ssa). But it seems there is just not enough interest in having it fixed somehow, or someone would have taken care of it by now. I've summed up before how the problem _could_ be fixed, but I can't find where. So here we go again. This could be solved in CSE by extending the notion of related expressions to constants that can be generated from other constants by a shift. Alternatively, you could create a simple, separate pass that applies CSE's related expressions thing in dominator tree walk. Another way to fix this, would be to learn RTL PRE about injured expressions (like in Knoop's lazy strength reduction). This would be the hardest but most complete fix. And finally, you could try to teach the related values patch from PR20211 (http://gcc.gnu.org/PR20211) about constants related via shifts, although it isn't very likely that this pass in the form of the patch will ever make it into the FSF GCC. Hope this helps, Ciao! Steven
Re: ARM compiler rewriting code to be longer and slower
Steven Bosscher stevenb@gmail.com writes: On Sun, Mar 15, 2009 at 11:19 PM, Ramana Radhakrishnan raman...@gmail.com wrote: I'm not sure about the best way to fix this but I've filed this for the moment as http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39468 This problem is reported every once in a while, all targets with small load-immediate instructions suffer from this, especially since GCC 4.0 (i.e. since tree-ssa). But it seems there is just not enough interest in having it fixed somehow, or someone would have taken care of it by now. I've summed up before how the problem _could_ be fixed, but I can't find where. So here we go again. This could be solved in CSE by extending the notion of related expressions to constants that can be generated from other constants by a shift. Alternatively, you could create a simple, separate pass that applies CSE's related expressions thing in dominator tree walk. See http://gcc.gnu.org/ml/gcc-patches/2009-03/msg00158.html for handling something similar when related expressions differ by a small additive constant. I am planning to finish this and submit it for 4.5. Adam
Re: ARM compiler rewriting code to be longer and slower
zol...@bendor.com.au writes: Is it a problem that is worth being put onto bugzilla or I just have to do some trickery to save the compiler from being smarter than it is? I think this is worth being put into bugzilla. Ian