Re: libgcc: strange optimization
On 02/08/11 13:22, Richard Guenther wrote:
On Tue, Aug 2, 2011 at 2:06 PM, Mikael Pettersson mi...@it.uu.se wrote:
Michael Walle writes:

Hi,

To confirm that try -fno-tree-ter.

lm32-gcc -O1 -fno-tree-ter -S -c test.c generates the following working assembly code:

    f2:
        addi  sp, sp, -4
        sw    (sp+4), ra
        addi  r2, r0, 10
        calli __ashrsi3
        addi  r8, r0, 10
        scall
        lw    ra, (sp+4)
        addi  sp, sp, 4
        b     ra

-fno-tree-ter also unbreaks the ARM test case in PR48863 comment #4.

It's of course only a workaround, not a real fix, as nothing prevents other optimizers from performing the re-scheduling TER does.

I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters.

Or go one step further and deprecate local register variables altogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs).

Richard.

Better still would be to change the specification and implementation of local register variables to only guarantee them at the beginning of ASM statements. At other times they are simply the same as other local variables. Now we have a problem that the register allocator knows how to solve.

In other words, if the user writes

    bar (int y)
    {
      register int x asm ("r0") = y;
      foo ();
      asm volatile ("mov r1, r0");
    }

The compiler will generate

    (set (reg:SI 999 x) (reg:SI y))
    (call foo)
    (set (reg:SI 0 r0) (reg:SI 999 x))
    (asm mov r1, r0)
    (set (reg:SI 999 x) (reg:SI 0 r0))

That is, it inserts appropriate set insns around asm blocks. Of course, the register allocator can try to allocate reg 999 to r0, and if it succeeds, then the sets become dead. But if it fails, then at least the code will continue to execute as intended.

R.
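Editorial sketch, not part of the original message: the "copy in just before the asm" idea can be written by hand today. The version below keeps the value in an ordinary local across the call and binds it to a register variable only immediately before the asm, with the variable passed as an operand. The choice of x86-64 and rbx (instead of the r0 used in the thread), the helper foo(), and the counter foo_calls are all assumptions made for illustration.

```c
#include <assert.h>

int foo_calls;                       /* hypothetical: counts calls to foo() */
static void foo(void) { foo_calls++; }

long bar(long y)
{
    long x = y;   /* ordinary local: may live anywhere across the call */
    foo();
#if defined(__GNUC__) && defined(__x86_64__)
    long seen;
    {
        /* Bound to rbx right before the asm, with no call in between. */
        register long x_in __asm__("rbx") = x;
        /* Naming %rbx in the template is safe only because x_in is an operand. */
        __asm__ __volatile__("mov %%rbx, %0" : "=r"(seen) : "r"(x_in));
    }
    assert(seen == x);   /* sanity check for this sketch */
    return seen;
#else
    return x;            /* portable fallback: no register binding at all */
#endif
}
```

Calling bar(7) should return 7 regardless of what foo() does to call-clobbered registers, because the binding happens after the call.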
Re: libgcc: strange optimization
Richard Earnshaw wrote:

Better still would be to change the specification and implementation of local register variables to only guarantee them at the beginning of ASM statements. At other times they are simply the same as other local variables. Now we have a problem that the register allocator knows how to solve.

This seems to be pretty much the same as my proposal here: http://gcc.gnu.org/ml/gcc/2011-08/msg00064.html

But there was some push-back on requiring additional semantics by some users ...

Bye,
Ulrich

--
Dr. Ulrich Weigand
GNU Toolchain for Linux on System z and Cell BE
ulrich.weig...@de.ibm.com
Re: libgcc: strange optimization
On Tue, 9 Aug 2011, Ulrich Weigand wrote: Richard Earnshaw wrote: Better still would be to change the specification and implementation of local register variables to only guarantee them at the beginning of ASM statements. At other times they are simply the same as other local variables. Now we have a problem that the register allocator knows how to solve. This seems to be pretty much the same as my proposal here: http://gcc.gnu.org/ml/gcc/2011-08/msg00064.html But there was some push-back on requiring additional semantics by some users ... Don't feel bad, at least we seem to have overwhelming consensus on what to do for local asm-declared register variables when they feed asm statements! :) I found an example where I have an asm-declared register that was used not just for the primary asm statement, but I'm ok with those other uses not using the declared register, just as warned by the documentation. (I don't think gcc can better assign another register, but that's beside the point.) brgds, H-P
Re: libgcc: strange optimization
On Tue, 9 Aug 2011, Richard Earnshaw wrote:

Better still would be to change the specification and implementation of local register variables to only guarantee them at the beginning of ASM statements.

Only for those asm statements taking the same asm-register variables as arguments.

At other times they are simply the same as other local variables. Now we have a problem that the register allocator knows how to solve. In other words, if the user writes

    bar (int y)
    {
      register int x asm ("r0") = y;
      foo ();
      asm volatile ("mov r1, r0");
    }

The compiler will generate

    (set (reg:SI 999 x) (reg:SI y))
    (call foo)
    (set (reg:SI 0 r0) (reg:SI 999 x))
    (asm mov r1, r0)
    (set (reg:SI 999 x) (reg:SI 0 r0))

It should rather eliminate the variable x and its assignment, as it isn't used in a way properly conveyed to gcc: the occurrence of the string r0 in the asm should not be considered.

I like Ulrich Weigand's proposal better, not the least because it's how it's already documented to work.

brgds, H-P
Re: libgcc: strange optimization
On Sat, Aug 6, 2011 at 5:00 PM, Paolo Bonzini bonz...@gnu.org wrote:
On 08/04/2011 01:10 PM, Andrew Haley wrote:

It's the sort of thing that gets done in threaded interpreters, where you really need to keep a few pointers in registers and the interpreter itself is a very long function. gcc has always done a dreadful job of register allocation in such cases.

Sure, but what I have seen people use global register variables for this (which means they get taken away from the register allocator).

Not always though, and the x86 has so few registers that using a global register variable is very problematic. I suppose you could compile the threaded interpreter in a file of its own, but I'm not sure that has quite the same semantics as local register variables.

Indeed, local register variables give almost the same benefit as globals with half the burden. The idea is that you don't care about the exact register that holds the contents but, by specifying a callee-save register, GCC will use those instead of memory across calls. This reduces _a lot_ the number of spills.

The problem is that people who care about this stuff very much don't always read gcc@gcc.gnu.org so won't be heard. But in their own world (LISP, Forth) nice features like register variables and labels as values have led to gcc being the preferred compiler for this kind of work.

/me raises hands. For GNU Smalltalk, using

    #if defined(__i386__)
    # define __DECL_REG1 __asm("%esi")
    # define __DECL_REG2 __asm("%edi")
    # define __DECL_REG3 /* no more caller-save regs if PIC is in use! */
    #endif

    #if defined(__x86_64__)
    # define __DECL_REG1 __asm("%r12")
    # define __DECL_REG2 __asm("%r13")
    # define __DECL_REG3 __asm("%rbx")
    #endif

    ...

    register unsigned char *ip __DECL_REG1;
    register OOP * sp __DECL_REG2;
    register intptr_t arg __DECL_REG3;

improves performance by up to 20% if I remember correctly. I can benchmark it if desired.

It does not come for free; in some cases the register allocator does some stupid things due to the hard register declaration. But it gets much better code overall, so who cares about the microoptimization.

Of course, if the register allocator did the right thing, or if I could use simply

    unsigned char *ip __attribute__((__do_not_spill_me__(20)));
    OOP *sp __attribute__((__do_not_spill_me__(10)));
    intptr_t arg __attribute__((__do_not_spill_me__(0)));

that would be just fine.

Like if

    register unsigned char *ip;

would increase spill cost of ip compared to

    unsigned char *ip;

?

It is, after all, a cost issue - forcefully pinning down registers can lead to problems. We'd of course have to somehow preserve the register state of ip for all relevant pseudos (and avoid coalescing with non-register ones).

Richard.

Paolo
Re: libgcc: strange optimization
On 08/08/2011 10:06 AM, Richard Guenther wrote:

Like if register unsigned char *ip; would increase spill cost of ip compared to unsigned char *ip; ?

Remember we're talking about a function with 11000 pseudos and 4000 allocnos (not to mention 1500 basic blocks). You cannot really blame IRA for not doing the right thing. And actually, ip and sp are live everywhere, so there's no hope of reserving a register for them, especially since all x86 callee-save registers have special uses in string functions.

If I understand the huge dumps correctly, the missing part is trying to use callee-save registers for spilling, rather than memory. However, perhaps another way to do it is a specialized region management scheme for large switch statements, treating each switch arm as a separate region?? There are few registers live across the switch, and all of them are used either a lot or almost never (and always in cold blocks).

BTW, here are some measurements on x86-64:

    1) with regalloc hints:    450060432 bytecodes/sec; 12819996 calls/sec
    2) without regalloc hints: 263002439 bytecodes/sec; 9458816 sends/sec

Probably even worse on x86-32. None of -fira-region=all, -fira-region=one, -fira-algorithm=priority had significant changes. In fact, it's pretty much a binary result: I'd expect register allocation results to be either on par with (1) or similar to (2); everything else is mostly noise.

Paolo
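Editorial note, not part of the original message: the two measurements above can be turned into speedup factors with a one-line ratio. The helper name speedup() is hypothetical; only the four throughput figures come from the message.

```c
#include <assert.h>

/* Ratio of two throughput figures, e.g. bytecodes/sec with and without
   the register hints. */
double speedup(double with_hints, double without_hints)
{
    assert(without_hints > 0.0);
    return with_hints / without_hints;
}
```

With the figures quoted above, speedup(450060432.0, 263002439.0) is roughly 1.71 for bytecodes and speedup(12819996.0, 9458816.0) is roughly 1.36 for calls/sends, which is well above the "up to 20%" recalled from memory earlier in the thread.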
Re: libgcc: strange optimization
On 08/04/2011 01:10 PM, Andrew Haley wrote:

It's the sort of thing that gets done in threaded interpreters, where you really need to keep a few pointers in registers and the interpreter itself is a very long function. gcc has always done a dreadful job of register allocation in such cases.

Sure, but what I have seen people use global register variables for this (which means they get taken away from the register allocator).

Not always though, and the x86 has so few registers that using a global register variable is very problematic. I suppose you could compile the threaded interpreter in a file of its own, but I'm not sure that has quite the same semantics as local register variables.

Indeed, local register variables give almost the same benefit as globals with half the burden. The idea is that you don't care about the exact register that holds the contents but, by specifying a callee-save register, GCC will use those instead of memory across calls. This reduces _a lot_ the number of spills.

The problem is that people who care about this stuff very much don't always read gcc@gcc.gnu.org so won't be heard. But in their own world (LISP, Forth) nice features like register variables and labels as values have led to gcc being the preferred compiler for this kind of work.

/me raises hands. For GNU Smalltalk, using

    #if defined(__i386__)
    # define __DECL_REG1 __asm("%esi")
    # define __DECL_REG2 __asm("%edi")
    # define __DECL_REG3 /* no more caller-save regs if PIC is in use! */
    #endif

    #if defined(__x86_64__)
    # define __DECL_REG1 __asm("%r12")
    # define __DECL_REG2 __asm("%r13")
    # define __DECL_REG3 __asm("%rbx")
    #endif

    ...

    register unsigned char *ip __DECL_REG1;
    register OOP * sp __DECL_REG2;
    register intptr_t arg __DECL_REG3;

improves performance by up to 20% if I remember correctly. I can benchmark it if desired.

It does not come for free; in some cases the register allocator does some stupid things due to the hard register declaration. But it gets much better code overall, so who cares about the microoptimization.

Of course, if the register allocator did the right thing, or if I could use simply

    unsigned char *ip __attribute__((__do_not_spill_me__(20)));
    OOP *sp __attribute__((__do_not_spill_me__(10)));
    intptr_t arg __attribute__((__do_not_spill_me__(0)));

that would be just fine.

Paolo
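Editorial sketch, not part of the original message: a toy interpreter loop showing the pattern described above, with the instruction pointer and stack pointer declared as local register variables in callee-saved registers. The opcodes, the run() function and the DECL_REG macros are hypothetical; the register names follow the x86-64 choices in the message. Note that this use (register variables that never feed an asm operand) is exactly the case whose guarantees are being debated in this thread, so the hint may be ignored by the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Register hints apply only to GCC-compatible compilers on x86-64 and
   expand to nothing elsewhere. */
#if defined(__GNUC__) && defined(__x86_64__)
# define DECL_REG1 __asm__("r12")
# define DECL_REG2 __asm__("r13")
#else
# define DECL_REG1
# define DECL_REG2
#endif

enum { OP_PUSH, OP_ADD, OP_HALT };   /* hypothetical toy opcodes */

intptr_t run(const unsigned char *code)
{
    intptr_t stack[16];
    register const unsigned char *ip DECL_REG1 = code;
    register intptr_t *sp DECL_REG2 = stack;

    for (;;) {
        switch (*ip++) {
        case OP_PUSH:                 /* push the next byte as a value */
            assert(sp < stack + 16);
            *sp++ = *ip++;
            break;
        case OP_ADD:                  /* pop two values, push their sum */
            assert(sp >= stack + 2);
            sp--;
            sp[-1] += sp[0];
            break;
        default:                      /* OP_HALT: return top of stack */
            assert(sp > stack);
            return sp[-1];
        }
    }
}
```

A program such as { OP_PUSH, 2, OP_PUSH, 40, OP_ADD, OP_HALT } leaves the sum of the two pushed values on top of the stack.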
Re: libgcc: strange optimization
Hans-Peter Nilsson h...@bitrange.com writes: To make sure, it'd be nice if someone could perhaps grep an entire GNU/Linux-or-other distribution including the kernel for uses of asm-declared *local* registers that don't directly feed into asms and not being the stack-pointer? One frequent candidate is the global pointer. Andreas. -- Andreas Schwab, sch...@redhat.com GPG Key fingerprint = D4E8 DBE3 3813 BB5D FA84 5EC7 45C6 250E 6F00 984E And now for something completely different.
Re: libgcc: strange optimization
On 08/04/2011 01:19 AM, Hans-Peter Nilsson wrote: To make sure, it'd be nice if someone could perhaps grep an entire GNU/Linux-or-other distribution including the kernel for uses of asm-declared *local* registers that don't directly feed into asms and not being the stack-pointer? Or can we get away with just saying that local asm registers haven't had any other documented meaning for the last seven years? It's the sort of thing that gets done in threaded interpreters, where you really need to keep a few pointers in registers and the interpreter itself is a very long function. gcc has always done a dreadful job of register allocation in such cases. Andrew.
Re: libgcc: strange optimization
On Thu, Aug 4, 2011 at 11:50 AM, Andrew Haley a...@redhat.com wrote: On 08/04/2011 01:19 AM, Hans-Peter Nilsson wrote: To make sure, it'd be nice if someone could perhaps grep an entire GNU/Linux-or-other distribution including the kernel for uses of asm-declared *local* registers that don't directly feed into asms and not being the stack-pointer? Or can we get away with just saying that local asm registers haven't had any other documented meaning for the last seven years? It's the sort of thing that gets done in threaded interpreters, where you really need to keep a few pointers in registers and the interpreter itself is a very long function. gcc has always done a dreadful job of register allocation in such cases. Sure, but what I have seen people use global register variables for this (which means they get taken away from the register allocator). Richard. Andrew.
Re: libgcc: strange optimization
On 08/04/2011 10:52 AM, Richard Guenther wrote: On Thu, Aug 4, 2011 at 11:50 AM, Andrew Haley a...@redhat.com wrote: On 08/04/2011 01:19 AM, Hans-Peter Nilsson wrote: To make sure, it'd be nice if someone could perhaps grep an entire GNU/Linux-or-other distribution including the kernel for uses of asm-declared *local* registers that don't directly feed into asms and not being the stack-pointer? Or can we get away with just saying that local asm registers haven't had any other documented meaning for the last seven years? It's the sort of thing that gets done in threaded interpreters, where you really need to keep a few pointers in registers and the interpreter itself is a very long function. gcc has always done a dreadful job of register allocation in such cases. Sure, but what I have seen people use global register variables for this (which means they get taken away from the register allocator). Not always though, and the x86 has so few registers that using a global register variable is very problematic. I suppose you could compile the threaded interpreter in a file of its own, but I'm not sure that has quite the same semantics as local register variables. The problem is that people who care about this stuff very much don't always read gcc@gcc.gnu.org so won't be heard. But in their own world (LISP, Forth) nice features like register variables and labels as values have led to gcc being the preferred compiler for this kind of work. Andrew.
Re: libgcc: strange optimization
On Thu, Aug 4, 2011 at 1:10 PM, Andrew Haley a...@redhat.com wrote: On 08/04/2011 10:52 AM, Richard Guenther wrote: On Thu, Aug 4, 2011 at 11:50 AM, Andrew Haley a...@redhat.com wrote: On 08/04/2011 01:19 AM, Hans-Peter Nilsson wrote: To make sure, it'd be nice if someone could perhaps grep an entire GNU/Linux-or-other distribution including the kernel for uses of asm-declared *local* registers that don't directly feed into asms and not being the stack-pointer? Or can we get away with just saying that local asm registers haven't had any other documented meaning for the last seven years? It's the sort of thing that gets done in threaded interpreters, where you really need to keep a few pointers in registers and the interpreter itself is a very long function. gcc has always done a dreadful job of register allocation in such cases. Sure, but what I have seen people use global register variables for this (which means they get taken away from the register allocator). Not always though, and the x86 has so few registers that using a global register variable is very problematic. I suppose you could compile the threaded interpreter in a file of its own, but I'm not sure that has quite the same semantics as local register variables. The problem is that people who care about this stuff very much don't always read gcc@gcc.gnu.org so won't be heard. But in their own world (LISP, Forth) nice features like register variables and labels as values have led to gcc being the preferred compiler for this kind of work. Well, the uses won't break with the idea - they would simply work like if they were not using local register variables. Richard. Andrew.
Re: libgcc: strange optimization
On Thu, 4 Aug 2011, Andreas Schwab wrote: Hans-Peter Nilsson h...@bitrange.com writes: To make sure, it'd be nice if someone could perhaps grep an entire GNU/Linux-or-other distribution including the kernel for uses of asm-declared *local* registers that don't directly feed into asms and not being the stack-pointer? One frequent candidate is the global pointer. Yes, that too, but it's usually fixed isn't it? What I really meant was not being a fixed register but I don't think many willing to grep a whole distro can tell which registers in which gcc port are fixed and remember to look for uses of -ffixed-reg-. brgds, H-P
Re: libgcc: strange optimization
On 08/04/2011 12:19 PM, Richard Guenther wrote: On Thu, Aug 4, 2011 at 1:10 PM, Andrew Haley a...@redhat.com wrote: On 08/04/2011 10:52 AM, Richard Guenther wrote: On Thu, Aug 4, 2011 at 11:50 AM, Andrew Haley a...@redhat.com wrote: On 08/04/2011 01:19 AM, Hans-Peter Nilsson wrote: To make sure, it'd be nice if someone could perhaps grep an entire GNU/Linux-or-other distribution including the kernel for uses of asm-declared *local* registers that don't directly feed into asms and not being the stack-pointer? Or can we get away with just saying that local asm registers haven't had any other documented meaning for the last seven years? It's the sort of thing that gets done in threaded interpreters, where you really need to keep a few pointers in registers and the interpreter itself is a very long function. gcc has always done a dreadful job of register allocation in such cases. Sure, but what I have seen people use global register variables for this (which means they get taken away from the register allocator). Not always though, and the x86 has so few registers that using a global register variable is very problematic. I suppose you could compile the threaded interpreter in a file of its own, but I'm not sure that has quite the same semantics as local register variables. The problem is that people who care about this stuff very much don't always read gcc@gcc.gnu.org so won't be heard. But in their own world (LISP, Forth) nice features like register variables and labels as values have led to gcc being the preferred compiler for this kind of work. Well, the uses won't break with the idea - they would simply work like if they were not using local register variables. I don't understand this remark. Surely if they work like they were not using local register variables, you'll get dreadful register allocation. But this is a big reason to use gcc. Efficient code really does matter to people writing this kind of thing. Andrew.
Re: libgcc: strange optimization
Richard Guenther wrote: On Tue, Aug 2, 2011 at 3:23 PM, Ian Lance Taylor i...@google.com wrote: Richard Guenther richard.guent...@gmail.com writes: I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Let's just implement those requirements in the compiler itself. Doesn't work for existing code, no? And if thinking new code then I'd rather have explicit dependences (and a way to represent them). Thus, for example asm (scall : : asm(r0) (10), ...) thus, why force new constraints when we already can figure out local register vars by register name? Why not extend the constraint syntax somehow to allow specifying the same effect? Maybe it would be possible to implement this while keeping the syntax of existing code by (re-)defining the semantics of register asm to basically say that: If a variable X is declared as register asm for register Y, and X is later on used as operand to an inline asm, the register allocator will choose register Y to hold that asm operand. (And this is the full specification of register asm semantics, nothing beyond this is guaranteed.) It seems this semantics could be implemented very early on, probably in the frontend itself. The frontend would mark the *asm* statement as using the specified register (there would be no special handling of the *variable* as such, after the frontend is done). The optimizers would then simply be required to pass the asm-statement register annotations though, much like today they pass constraints through. At the point where register allocation decisions are made, those register annotations would then be acted on. Bye, Ulrich -- Dr. Ulrich Weigand GNU Toolchain for Linux on System z and Cell BE ulrich.weig...@de.ibm.com
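Editorial sketch, not part of the original message: the proposed semantics ("if X is declared for register Y and used as an asm operand, the operand is in Y") is what the example below relies on, for both an input and an output. The function name, the use of x86-64 with rbx and r12, and the AT&T-syntax lea instruction are assumptions for illustration; the thread itself uses r0/r1 on other targets.

```c
#include <assert.h>
#include <limits.h>

long add_one_via_regs(long value)
{
    assert(value < LONG_MAX);   /* keep value + 1 representable */
#if defined(__GNUC__) && defined(__x86_64__)
    register long in  __asm__("rbx") = value;  /* input operand must be rbx */
    register long out __asm__("r12");          /* output operand must be r12 */
    /* The template names both registers directly; that is valid only
       because both variables appear as operands of this asm. */
    __asm__ __volatile__("lea 1(%%rbx), %%r12" : "=r"(out) : "r"(in));
    return out;
#else
    return value + 1;           /* portable fallback without inline asm */
#endif
}
```

Under the proposal nothing else about `in` and `out` is guaranteed: outside the asm they behave like ordinary locals.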
Re: libgcc: strange optimization
Ulrich Weigand wrote:
Richard Guenther wrote:
On Tue, Aug 2, 2011 at 3:23 PM, Ian Lance Taylor i...@google.com wrote:
Richard Guenther richard.guent...@gmail.com writes:

I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters.

Let's just implement those requirements in the compiler itself.

Doesn't work for existing code, no? And if thinking new code then I'd rather have explicit dependences (and a way to represent them). Thus, for example

    asm ("scall" : : asm("r0") (10), ...)

thus, why force new constraints when we already can figure out local register vars by register name? Why not extend the constraint syntax somehow to allow specifying the same effect?

Yes, this would be the exact equivalent of

    register int var asm ("r0") = 10;
    ...
    asm ("scall" : : "r" (var), ...)

Maybe it would be possible to implement this while keeping the syntax of existing code by (re-)defining the semantics of register asm to basically say that: If a variable X is declared as register asm for register Y, and X is later on used as operand to an inline asm, the register allocator will choose register Y to hold that asm operand. (And this is the full specification of register asm semantics, nothing beyond this is guaranteed.)

Yes, that's reasonable. As I understand the docs, in code like

    void foo ()
    {
      register int var asm ("r1") = 10;
      asm (";; use r1");
    }

there is nothing that connects var to the asm, and assuming that r1 holds 10 in the asm is a user error.

The only place where the asm attached to a variable needs to have effect is in the inline asm sequences that explicitly refer to the respective variables. If there is no inline asm referencing a local register variable, there is no difference to a non-register auto variable; there could even be a warning that in such a case

    register int var asm ("r1") = 10;

is equivalent to

    int var = 10;

It seems this semantics could be implemented very early on, probably in the frontend itself. The frontend would mark the *asm* statement as using the specified register (there would be no special handling of the *variable* as such, after the frontend is done). The optimizers would then simply be required to pass the asm-statement register annotations though, much like today they pass constraints through. At the point where register allocation decisions are made, those register annotations would then be acted on.

Bye, Ulrich

I wonder why it does not work like that in the current implementation. A local register variable is just like using a similar constraint (with the only difference that in general there is no such constraint, otherwise the developer would use it). A pass like .asmcons could take care of it just the same way it does for constraints, and no optimizer pass would have to bother whether a variable is a local register or not.

This would render local register variables even more functional because no one would need to care if there were implicit library calls or things like that.

Johann
Re: libgcc: strange optimization
On Wed, Aug 3, 2011 at 11:50 AM, Georg-Johann Lay a...@gjlay.de wrote:
Ulrich Weigand wrote:
Richard Guenther wrote:
On Tue, Aug 2, 2011 at 3:23 PM, Ian Lance Taylor i...@google.com wrote:
Richard Guenther richard.guent...@gmail.com writes:

I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters.

Let's just implement those requirements in the compiler itself.

Doesn't work for existing code, no? And if thinking new code then I'd rather have explicit dependences (and a way to represent them). Thus, for example

    asm ("scall" : : asm("r0") (10), ...)

thus, why force new constraints when we already can figure out local register vars by register name? Why not extend the constraint syntax somehow to allow specifying the same effect?

Yes, this would be the exact equivalent of

    register int var asm ("r0") = 10;
    ...
    asm ("scall" : : "r" (var), ...)

Maybe it would be possible to implement this while keeping the syntax of existing code by (re-)defining the semantics of register asm to basically say that: If a variable X is declared as register asm for register Y, and X is later on used as operand to an inline asm, the register allocator will choose register Y to hold that asm operand. (And this is the full specification of register asm semantics, nothing beyond this is guaranteed.)

Yes, that's reasonable. As I understand the docs, in code like

    void foo ()
    {
      register int var asm ("r1") = 10;
      asm (";; use r1");
    }

there is nothing that connects var to the asm, and assuming that r1 holds 10 in the asm is a user error.

The only place where the asm attached to a variable needs to have effect is in the inline asm sequences that explicitly refer to the respective variables. If there is no inline asm referencing a local register variable, there is no difference to a non-register auto variable; there could even be a warning that in such a case

    register int var asm ("r1") = 10;

is equivalent to

    int var = 10;

It seems this semantics could be implemented very early on, probably in the frontend itself. The frontend would mark the *asm* statement as using the specified register (there would be no special handling of the *variable* as such, after the frontend is done). The optimizers would then simply be required to pass the asm-statement register annotations though, much like today they pass constraints through. At the point where register allocation decisions are made, those register annotations would then be acted on.

Bye, Ulrich

I wonder why it does not work like that in the current implementation. A local register variable is just like using a similar constraint (with the only difference that in general there is no such constraint, otherwise the developer would use it). A pass like .asmcons could take care of it just the same way it does for constraints, and no optimizer pass would have to bother whether a variable is a local register or not.

This would render local register variables even more functional because no one would need to care if there were implicit library calls or things like that.

Yes, I like that idea.

Richard.
Re: libgcc: strange optimization
Hi,

On Wed, 3 Aug 2011, Richard Guenther wrote:

Yes, that's reasonable. As I understand the docs, in code like

    void foo ()
    {
      register int var asm ("r1") = 10;
      asm (";; use r1");
    }

there is nothing that connects var to the asm, and assuming that r1 holds 10 in the asm is a user error.

The only place where the asm attached to a variable needs to have effect is in the inline asm sequences that explicitly refer to the respective variables. If there is no inline asm referencing a local register variable, there is no difference to a non-register auto variable; there could even be a warning that in such a case

    register int var asm ("r1") = 10;

is equivalent to

    int var = 10;

This would render local register variables even more functional because no one would need to care if there were implicit library calls or things like that.

Yes, I like that idea.

I do too. Except it doesn't work :) There's a common idiom of accessing registers read-only by declaring local register vars. E.g. to (*grasp*) the stack pointer. There won't be a DEF for that register var, and hence at use-points we couldn't reload any sensible values into those registers (and we really shouldn't clobber the stack pointer in this way).

We could introduce that special semantic only for non-reserved registers, and require no writes to register vars for reserved registers.

Or we could simply do:

    if (any_local_reg_vars)
      optimize = 0;

But I already see people wanting to _do_ optimization also with local reg vars, just not the wrong optimizations ;-/

Ciao, Michael.
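Editorial sketch, not part of the original message: the read-only idiom mentioned above looks roughly like this. The variable has no initializer and is never written, so there is no definition the compiler could reload before a use. The function name current_sp(), the restriction to GCC on x86-64 with rsp, and the fallback that takes the address of a local are all assumptions for illustration; other compilers may not honor the register name outside asm operands.

```c
#include <assert.h>
#include <stdint.h>

uintptr_t current_sp(void)
{
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
    /* No initializer and no write: the variable is only a name for rsp. */
    register uintptr_t sp __asm__("rsp");
    uintptr_t value = sp;
#else
    /* Fallback assumption: the address of a local approximates the
       stack pointer closely enough for illustration. */
    char marker;
    uintptr_t value = (uintptr_t)&marker;
#endif
    assert(value != 0);
    return value;
}
```

The returned value should lie close to the address of any local variable in the caller's frame.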
Re: libgcc: strange optimization
On Wed, Aug 3, 2011 at 3:27 PM, Michael Matz m...@suse.de wrote:

Hi,

On Wed, 3 Aug 2011, Richard Guenther wrote:

Yes, that's reasonable. As I understand the docs, in code like

    void foo ()
    {
      register int var asm ("r1") = 10;
      asm (";; use r1");
    }

there is nothing that connects var to the asm, and assuming that r1 holds 10 in the asm is a user error.

The only place where the asm attached to a variable needs to have effect is in the inline asm sequences that explicitly refer to the respective variables. If there is no inline asm referencing a local register variable, there is no difference to a non-register auto variable; there could even be a warning that in such a case

    register int var asm ("r1") = 10;

is equivalent to

    int var = 10;

This would render local register variables even more functional because no one would need to care if there were implicit library calls or things like that.

Yes, I like that idea.

I do too. Except it doesn't work :) There's a common idiom of accessing registers read-only by declaring local register vars. E.g. to (*grasp*) the stack pointer. There won't be a DEF for that register var, and hence at use-points we couldn't reload any sensible values into those registers (and we really shouldn't clobber the stack pointer in this way).

We could introduce that special semantic only for non-reserved registers, and require no writes to register vars for reserved registers.

Or we could simply do:

    if (any_local_reg_vars)
      optimize = 0;

But I already see people wanting to _do_ optimization also with local reg vars, just not the wrong optimizations ;-/

I'd say we should start rejecting all these bogus constructs by default (maybe accepting them with -fpermissive and then, well, maybe generate some dwim code). That is, local register var decls are only valid with an initializer, they are implicitly constant (you can't re-assign to them). Reserved registers are a no-go (like %esp), either global or local.

Richard.

Ciao, Michael.
Re: libgcc: strange optimization
Richard Guenther wrote: On Wed, Aug 3, 2011 at 3:27 PM, Michael Matz m...@suse.de wrote: Hi, On Wed, 3 Aug 2011, Richard Guenther wrote: Yes, that's reasonable. As I understand the docs, in code like void foo () { register int var asm (r1) = 10; asm (;; use r1); } there is nothing that connects var to the asm and assuming that r1 holds 10 in the asm is a user error. The only place where the asm attached to a variable needs to have effect are the inline asm sequences that explicitly refer to respective variables. If there is no inline asm referencing a local register variable, there is on difference to a non-register auto variable; there could even be a warning that in such a case that register int var asm (r1) = 10; is equivalent to int var = 10; This would render local register variables even more functional because no one needed to care if there were implicit library calls or things like that. Yes, I like that idea. I do too. Except it doesn't work :) There's a common idiom of accessing registers read-only by declaring local register vars. E.g. to (*grasp*) the stack pointer. There won't be a DEF for that register var, and hence at use-points we couldn't reload any sensible values into those registers (and we really shouldn't clobber the stack pointer in this way). We could introduce that special semantic only for non-reserved registers, and require no writes to register vars for reserved registers. Or we could simply do: if (any_local_reg_vars) optimize = 0; But I already see people wanting to _do_ optimization also with local reg vars, just not the wrong optimizations ;-/ Definitely yes. As I wrote above, if you see asm it's not unlikely that it is a piece of performance critical code. I'd say we should start rejecting all these bogus constructs by default (maybe accepting them with -fpermissive and then, well, maybe generate some dwim code). 
That is, local register var decls are only valid with an initializer, they are implicitly constant (you can't re-assign to them). Reserved registers are a no-go (like %esp), either global or local. Would that help? Like in code static inline void foo (int arg) { register const int reg asm ("r1") = arg; asm ("..." :: "r" (reg)); } And with output constraints like "=r,0" or "+r". Or in local blocks: static inline void foo (int arg) { register const int reg asm ("r1") = arg; ... { register const int reg2 asm ("r1") = reg; asm ("..." :: "r" (reg2)); } } Do the current optimizers shred inline asm with ordinary constraints but without local registers? If yes, there is a considerable problem in the optimizers and/or in GCC. If not, why can't local register variables work similarly, i.e. propagate the register information into respective asms and forget about it for the variables? Johann Richard. Ciao, Michael.
Re: libgcc: strange optimization
On 08/03/2011 07:02 AM, Richard Guenther wrote: Reserved registers are a no-go (like %esp), either global or local. Local register variables referring to anything in fixed_regs are trivial to handle -- continue to treat them exactly as we currently do. They won't be clobbered by random code movement because they're fixed. r~
Re: libgcc: strange optimization
On Wed, 3 Aug 2011, Ulrich Weigand wrote: Richard Guenther wrote: asm (scall : : asm(r0) (10), ...) Maybe it would be possible to implement this while keeping the syntax of existing code by (re-)defining the semantics of register asm to basically say that: If a variable X is declared as register asm for register Y, and X is later on used as operand to an inline asm, the register allocator will choose register Y to hold that asm operand. me too: Nice idea! (And this is the full specification of register asm semantics, nothing beyond this is guaranteed.) You'd have to handle global registers differently, and local fixed registers not feeding into asms. For everything else, error or warning. That should be ok, because local asm registers are wonderfully already documented to have that restriction: Local register variables in specific registers do not reserve the registers, except at the point where they are used as input or output operands in an @code{asm} statement and the @code{asm} statement itself is not deleted. So, it's just a small matter of programming to make that happen for real. :-) To make sure, it'd be nice if someone could perhaps grep an entire GNU/Linux-or-other distribution including the kernel for uses of asm-declared *local* registers that don't directly feed into asms and not being the stack-pointer? Or can we get away with just saying that local asm registers haven't had any other documented meaning for the last seven years? It seems this semantics could be implemented very early on, probably in the frontend itself. The frontend would mark the *asm* statement as using the specified register (there would be no special handling of the *variable* as such, after the frontend is done). The optimizers would then simply be required to pass the asm-statement register annotations through, much like today they pass constraints through. At the point where register allocation decisions are made, those register annotations would then be acted on. 
People ask why it's not already like that, probably because they assume the ideal sequence of events. At least the quote above is a late addition (close to seven years now). IIUC, asms and register asms weren't originally tied together and the current implementation with early register tying just happened to work well together, well, that is until the SSA revolution. ;) brgds, H-P
Re: libgcc: strange optimization
On Mon, 1 Aug 2011, Georg-Johann Lay wrote: Michael Walle wrote: Hi list, consider the following test code: static void inline f1(int arg) { register int a1 asm("r8") = 10; register int a2 asm("r1") = arg; asm("scall" : : "r"(a1), "r"(a2)); } void f2(int arg) { f1(arg >> 10); } If you compile this code with 'lm32-gcc -O1 -S -c test.c' (see end of this email), the a1 = 10; assignment is optimized away. Your asm has no output operands and no side effects, with more aggressive optimization the whole asm would disappear. No, for the record that's not supposed to happen for asms *without outputs*. If an @code{asm} has output operands, GCC assumes for optimization purposes the instruction has no side effects except to change the output operands. What you want is maybe something like asm volatile ("scall" : : "r"(a1), "r"(a2)); For the code at hand, the scall should be described to both have an output and be marked volatile, since the system call is a side effect that GCC can't see and might otherwise optimize away if the system call return value is unused. A plain volatile marking as the above should not be necessary, modulo gcc bugs. The real problem is quite worrisome. I don't think a port (lm32) should have to solve it with constraints; the (inline) function parameter *should* cause a non-clobbering temporary to hold any intermediate operations, but it looks as if you'll otherwise have to debug it yourself. brgds, H-P
Re: libgcc: strange optimization
On Mon, 1 Aug 2011, Richard Henderson wrote: On 08/01/2011 01:30 PM, Michael Walle wrote: 1) function inlining 2) deferred argument evaluation 3) because our target has no barrel shifter, (arg >> 10) is emitted as a function call to libgcc's __ashrsi3 (_in place_!) 4) BAM! dead code elimination optimizes r8 assignment away because calli may clobber r1-r10 (caller-saved registers on lm32). I'm afraid the only solution I can think of is to force F1 out-of-line. Or another temporary - but the parameter should already have that effect. brgds, H-P
Re: libgcc: strange optimization
Michael Walle wrote: Hi, That was quick :) Your asm has no output operands and no side effects, with more aggressive optimization the whole asm would disappear. Sorry, that was just a small test file, the original code has output operands. The new test code: static int inline f1(int arg) { register int ret asm("r1"); register int a1 asm("r8") = 10; register int a2 asm("r1") = arg; asm volatile ("scall" : "=r"(ret) : "r"(a1), "r"(a2) : "memory"); return ret; } int f2(int arg1, int arg2) { return f1(arg1 >> 10); } translates to the same assembly: f2: addi sp, sp, -4 sw (sp+4), ra addi r2, r0, 10 calli __ashrsi3 scall lw ra, (sp+4) addi sp, sp, 4 b ra PS. R1 is the return register in the target architecture ABI. I'd guess you ran into http://gcc.gnu.org/onlinedocs/gcc/Local-Reg-Vars.html#Local-Reg-Vars A common pitfall is to initialize multiple call-clobbered registers with arbitrary expressions, where a function call or library call for an arithmetic operator will overwrite a register value from a previous assignment, for example r0 below: register int *p1 asm ("r0") = ...; register int *p2 asm ("r1") = ...; In those cases, a solution is to use a temporary variable for each arbitrary expression. So I'd try to rewrite it as static int inline f1 (int arg0) { int arg = arg0; register int ret asm("r1"); register int a1 asm("r8") = 10; register int a2 asm("r1") = arg; asm volatile ("scall" : "=r"(ret) : "r"(a1), "r"(a2) : "memory"); return ret; } and if that does not help the rather hackish static int inline f1 (int arg0) { int arg = arg0; register int ret asm("r1"); register int a1 asm("r8"); register int a2 asm("r1"); asm ("" : "+r" (arg)); a1 = 10; a2 = arg; asm volatile ("scall" : "=r"(ret) : "r"(a1), "r"(a2) : "memory"); return ret; }
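The temporary-variable advice quoted from the manual can be sketched as follows. This is a minimal illustration, not code from the thread: scall_shifted is a hypothetical name, the asm body is a harmless add standing in for a real system call, and the register names r8 and rdi assume an x86-64 LP64 build with GCC or Clang (any other target takes the plain C fallback). Later messages in the thread report that a temporary did not always prevent the reordering on the GCC versions under discussion, so read this as the documented pattern rather than a guarantee.

```c
#include <assert.h>

/* Hypothetical stand-in for a system-call wrapper: returns 10 + (value >> amount). */
long scall_shifted(long value, int amount)
{
    assert(amount >= 0 && amount < 63);

    /* The arbitrary expression goes into an ordinary temporary first, so any
       helper call or scratch-register use happens before the pinned
       registers are written. */
    long shifted = value >> amount;

#if defined(__GNUC__) && defined(__x86_64__) && defined(__LP64__)
    /* Direct initializations only, placed immediately before the asm. */
    register long a1 __asm__("r8") = 10;
    register long a2 __asm__("rdi") = shifted;
    long ret;

    /* The asm names both variables as operands; that is the point at which
       the manual guarantees they live in r8 and rdi. */
    __asm__ __volatile__("leaq (%1,%2), %0"
                         : "=r"(ret)
                         : "r"(a1), "r"(a2));
    return ret;
#else
    /* Portable fallback so the sketch builds on other targets. */
    return 10 + shifted;
#endif
}
```

Calling scall_shifted(1024, 10) exercises the same shape as f2 in the test case: a shift feeding one of two register-pinned asm operands.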
Re: libgcc: strange optimization
Hans-Peter Nilsson writes: On Mon, 1 Aug 2011, Richard Henderson wrote: On 08/01/2011 01:30 PM, Michael Walle wrote: 1) function inlining 2) deferred argument evaluation 3) because our target has no barrel shifter, (arg >> 10) is emitted as a function call to libgcc's __ashrsi3 (_in place_!) 4) BAM! dead code elimination optimizes r8 assignment away because calli may clobber r1-r10 (caller-saved registers on lm32). I'm afraid the only solution I can think of is to force F1 out-of-line. Or another temporary - but the parameter should already have that effect. It should, but doesn't. See PR48863 for similar breakage on ARM. /Mikael
Re: libgcc: strange optimization
On Tue, Aug 2, 2011 at 10:48 AM, Mikael Pettersson mi...@it.uu.se wrote: Hans-Peter Nilsson writes: On Mon, 1 Aug 2011, Richard Henderson wrote: On 08/01/2011 01:30 PM, Michael Walle wrote: 1) function inlining 2) deferred argument evaluation 3) because our target has no barrel shifter, (arg >> 10) is emitted as a function call to libgcc's __ashrsi3 (_in place_!) 4) BAM! dead code elimination optimizes r8 assignment away because calli may clobber r1-r10 (caller-saved registers on lm32). I'm afraid the only solution I can think of is to force F1 out-of-line. Or another temporary - but the parameter should already have that effect. It should, but doesn't. See PR48863 for similar breakage on ARM. On GIMPLE we see neither the libcall nor those dependencies. Don't use register vars. Richard. /Mikael
Re: libgcc: strange optimization
Richard Guenther wrote: On Tue, Aug 2, 2011 at 10:48 AM, Mikael Pettersson mi...@it.uu.se wrote: Hans-Peter Nilsson writes: On Mon, 1 Aug 2011, Richard Henderson wrote: On 08/01/2011 01:30 PM, Michael Walle wrote: 1) function inlining 2) deferred argument evaluation 3) because our target has no barrel shifter, (arg >> 10) is emitted as a function call to libgcc's __ashrsi3 (_in place_!) 4) BAM! dead code elimination optimizes r8 assignment away because calli may clobber r1-r10 (caller-saved registers on lm32). I'm afraid the only solution I can think of is to force F1 out-of-line. Or another temporary - but the parameter should already have that effect. It should, but doesn't. See PR48863 for similar breakage on ARM. On GIMPLE we see neither the libcall nor those dependencies. Don't use register vars. IMO such code is supposed to work, e.g. in order to write an interface to a non-ABI assembler function. In general this cannot be expressed by means of constraints because that would imply a plethora of different constraints for each register/mode. From the documentation a user will expect that he wrote correct code and is not supposed to bother with GCC internals like implicit library calls or GIMPLE or whatever. Johann Richard.
Re: libgcc: strange optimization
On Tue, Aug 2, 2011 at 12:01 PM, Georg-Johann Lay a...@gjlay.de wrote: Richard Guenther wrote: On Tue, Aug 2, 2011 at 10:48 AM, Mikael Pettersson mi...@it.uu.se wrote: Hans-Peter Nilsson writes: On Mon, 1 Aug 2011, Richard Henderson wrote: On 08/01/2011 01:30 PM, Michael Walle wrote: 1) function inlining 2) deferred argument evaluation 3) because our target has no barrel shifter, (arg >> 10) is emitted as a function call to libgcc's __ashrsi3 (_in place_!) 4) BAM! dead code elimination optimizes r8 assignment away because calli may clobber r1-r10 (caller-saved registers on lm32). I'm afraid the only solution I can think of is to force F1 out-of-line. Or another temporary - but the parameter should already have that effect. It should, but doesn't. See PR48863 for similar breakage on ARM. On GIMPLE we see neither the libcall nor those dependencies. Don't use register vars. IMO such code is supposed to work, e.g. in order to write an interface to a non-ABI assembler function. In general this cannot be expressed by means of constraints because that would imply a plethora of different constraints for each register/mode. From the documentation a user will expect that he wrote correct code and is not supposed to bother with GCC internals like implicit library calls or GIMPLE or whatever. Well. I suppose what is happening is that we expand from register int a1 __asm__ (*edi); register int a2 __asm__ (*eax); int D.2700; <bb 2>: D.2700_2 = arg_1(D) >> 10; a1 = 10; a2 = D.2700_2; __asm__ __volatile__("scall" : : "r" a1, "r" a2); return; and end up TERing D.2700_2 = arg_1(D) >> 10, materializing a libcall after the setting of a1 and before the asm. To confirm that try -fno-tree-ter. I don't see how we can easily avoid this without exposing libcalls at the gimple level. Maybe disable TER if we see any register variable use. Richard. Johann Richard.
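The ordering hazard Richard describes can be sketched with a toy model in plain C. Nothing here is GCC code: regfile, libcall_ashr and the -1 poison value are hypothetical names invented for the illustration, and the model only assumes what the thread states, namely that the shift helper may clobber r1-r10. It shows why the value meant for r8 survives only if the helper call runs before r8 is set; in the real compiler the lost store was additionally deleted as dead code.

```c
#include <assert.h>
#include <string.h>

enum { NUM_REGS = 16 };

/* Toy register file: r[i] stands for register ri. */
typedef struct {
    int r[NUM_REGS];
} regfile;

/* Models a shift helper call: the result comes back in r1 and every other
   register in r1..r10 is overwritten with a poison value. */
static void libcall_ashr(regfile *rf)
{
    assert(rf->r[2] >= 0 && rf->r[2] < 31);
    int result = rf->r[1] >> rf->r[2];
    for (int i = 1; i <= 10; i++)
        rf->r[i] = -1;              /* -1 marks "clobbered by the call" */
    rf->r[1] = result;
}

/* Order after forward substitution: r8 is set first, the helper call is
   materialized afterwards, then the asm would read r8. */
int r8_seen_libcall_after_setup(int arg)
{
    regfile rf;
    memset(&rf, 0, sizeof rf);
    rf.r[8] = 10;                   /* a1 = 10 */
    rf.r[1] = arg;                  /* helper arguments */
    rf.r[2] = 10;
    libcall_ashr(&rf);              /* clobbers r8 */
    return rf.r[8];                 /* what the asm would observe */
}

/* Source order as written: the shift is finished before r8 is set. */
int r8_seen_libcall_before_setup(int arg)
{
    regfile rf;
    memset(&rf, 0, sizeof rf);
    rf.r[1] = arg;
    rf.r[2] = 10;
    libcall_ashr(&rf);
    rf.r[8] = 10;                   /* a1 = 10, after the call */
    return rf.r[8];
}
```

The second function corresponds to the -fno-tree-ter output in the thread, where addi r8, r0, 10 appears after calli __ashrsi3.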
Re: libgcc: strange optimization
Hi, To confirm that try -fno-tree-ter. lm32-gcc -O1 -fno-tree-ter -S -c test.c generates the following working assembly code: f2: addi sp, sp, -4 sw (sp+4), ra addi r2, r0, 10 calli __ashrsi3 addi r8, r0, 10 scall lw ra, (sp+4) addi sp, sp, 4 b ra -- Michael
Re: libgcc: strange optimization
Michael Walle writes: Hi, To confirm that try -fno-tree-ter. lm32-gcc -O1 -fno-tree-ter -S -c test.c generates the following working assembly code: f2: addi sp, sp, -4 sw (sp+4), ra addi r2, r0, 10 calli __ashrsi3 addi r8, r0, 10 scall lw ra, (sp+4) addi sp, sp, 4 b ra -fno-tree-ter also unbreaks the ARM test case in PR48863 comment #4.
Re: libgcc: strange optimization
On Tue, Aug 2, 2011 at 2:06 PM, Mikael Pettersson mi...@it.uu.se wrote: Michael Walle writes: Hi, To confirm that try -fno-tree-ter. lm32-gcc -O1 -fno-tree-ter -S -c test.c generates the following working assembly code: f2: addi sp, sp, -4 sw (sp+4), ra addi r2, r0, 10 calli __ashrsi3 addi r8, r0, 10 scall lw ra, (sp+4) addi sp, sp, 4 b ra -fno-tree-ter also unbreaks the ARM test case in PR48863 comment #4. It's of course only a workaround, not a real fix as nothing prevents other optimizers from performing the re-scheduling TER does. I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Or go one step further and deprecate local register variables alltogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). Richard.
Re: libgcc: strange optimization
Richard Guenther wrote: On Tue, Aug 2, 2011 at 2:06 PM, Mikael Pettersson mi...@it.uu.se wrote: Michael Walle writes: Hi, To confirm that try -fno-tree-ter. lm32-gcc -O1 -fno-tree-ter -S -c test.c generates the following working assembly code: f2: addi sp, sp, -4 sw (sp+4), ra addi r2, r0, 10 calli __ashrsi3 addi r8, r0, 10 scall lw ra, (sp+4) addi sp, sp, 4 b ra -fno-tree-ter also unbreaks the ARM test case in PR48863 comment #4. It's of course only a workaround, not a real fix as nothing prevents other optimizers from performing the re-scheduling TER does. I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Or go one step further and deprecate local register variables altogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). Strongly oppose. Local register variables are very useful; maybe not on a linux machine but on embedded systems there are situations you cannot do without. You ever counted the constraint alternatives that would be needed? You'd need different constraints for QI/HI/SI/DI for each register resulting in myriads of register classes increasing register allocation time and dumps would become impossible to read. I once tried it for a target...never again. Besides that, with local register vars the developer can write code that meets his requirements, whereas for constraints you will always have to change the compiler and make existing sources incompatible. If GCC provides such a feature it must work properly and not be sacrificed for this or that optimization. Correct code is always preferred over non-functional code. Johann Richard.
Re: libgcc: strange optimization
On Tue, 2 Aug 2011, Richard Guenther wrote: On Tue, Aug 2, 2011 at 2:06 PM, Mikael Pettersson mi...@it.uu.se wrote: Michael Walle writes: Hi, To confirm that try -fno-tree-ter. lm32-gcc -O1 -fno-tree-ter -S -c test.c generates the following working assembly code: f2: addi sp, sp, -4 sw (sp+4), ra addi r2, r0, 10 calli __ashrsi3 addi r8, r0, 10 scall lw ra, (sp+4) addi sp, sp, 4 b ra -fno-tree-ter also unbreaks the ARM test case in PR48863 comment #4. It's of course only a workaround, not a real fix as nothing prevents other optimizers from performing the re-scheduling TER does. I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. I'd be ok with that, FWIW; I see the problem with keeping the scheduling of operations in a working order (yuck) and I don't see how else to keep it working ...except perhaps make gcc flag functions with register asms as non-inlinable, maybe even flag down any of the dangerous re-scheduling? Maybe I can do that with some hand-holding? Or go one step further and deprecate local register variables alltogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). They do make sense when implementing e.g. system calls, and they're documented to work as discussed. (I almost regret making that happen, though.) Fortunately such functions are small, and not relatively much helped by inlining (it's a *syscall*; much more happening beyond the call than is affected by inlining some parameter initialization). Sure, new targets are much better off by implementing that through other means, but preferably intrinsic functions to asms. brgds, H-P
Re: libgcc: strange optimization
On Tue, Aug 2, 2011 at 2:53 PM, Hans-Peter Nilsson h...@bitrange.com wrote: On Tue, 2 Aug 2011, Richard Guenther wrote: On Tue, Aug 2, 2011 at 2:06 PM, Mikael Pettersson mi...@it.uu.se wrote: Michael Walle writes: Hi, To confirm that try -fno-tree-ter. lm32-gcc -O1 -fno-tree-ter -S -c test.c generates the following working assembly code: f2: addi sp, sp, -4 sw (sp+4), ra addi r2, r0, 10 calli __ashrsi3 addi r8, r0, 10 scall lw ra, (sp+4) addi sp, sp, 4 b ra -fno-tree-ter also unbreaks the ARM test case in PR48863 comment #4. It's of course only a workaround, not a real fix as nothing prevents other optimizers from performing the re-scheduling TER does. I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. I'd be ok with that, FWIW; I see the problem with keeping the scheduling of operations in a working order (yuck) and I don't see how else to keep it working ...except perhaps make gcc flag functions with register asms as non-inlinable, maybe even flag down any of the dangerous re-scheduling? But then can't people use a pure assembler stub instead? Without inlining there isn't much benefit left from writing void f1(int arg) { register int a1 asm("r8") = 10; register int a2 asm("r1") = arg; asm("scall" : : "r"(a1), "r"(a2)); } instead of f1: mov r8, 10 mov r1, rX scall ret in a .s file no? I doubt much prologue/epilogue is needed. Or even write void f1(int arg) { asm("mov r8, %0; mov r1 %1; scall;" : : "g"(a1), "g"(a2) : "r8", "r1"); } which should be inlinable again (yes, in inlined form not optimally register-allocated, but compared to the non-inline routine?). Richard. Maybe I can do that with some hand-holding? 
Or go one step further and deprecate local register variables alltogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). They do make sense when implementing e.g. system calls, and they're documented to work as discussed. (I almost regret making that happen, though.) Fortunately such functions are small, and not relatively much helped by inlining (it's a *syscall*; much more happening beyond the call than is affected by inlining some parameter initialization). Sure, new targets are much better off by implementing that through other means, but preferably intrinsic functions to asms. brgds, H-P
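Richard's second alternative in the message above, an asm that moves its inputs into the fixed registers itself and declares those registers as clobbers, might look like the sketch below. It is an illustration under assumptions: scall_explicit is a hypothetical name, the final add stands in for the real scall instruction, the operands use "r" rather than the "g" constraint from the message, and the register names assume an x86-64 LP64 build with GCC or Clang (other targets take the C fallback).

```c
#include <assert.h>

/* Hypothetical wrapper: returns nr + arg, computed in r8 and rdi. */
long scall_explicit(long nr, long arg)
{
#if defined(__GNUC__) && defined(__x86_64__) && defined(__LP64__)
    long ret;

    /* The moves into r8 and rdi are part of the asm text, so nothing can be
       scheduled between them and the final instruction. The clobber list
       keeps the compiler from placing any operand in those registers. */
    __asm__ __volatile__("movq %1, %%r8\n\t"
                         "movq %2, %%rdi\n\t"
                         "leaq (%%r8,%%rdi), %0"
                         : "=r"(ret)
                         : "r"(nr), "r"(arg)
                         : "r8", "rdi");
    return ret;
#else
    /* Portable fallback so the sketch builds on other targets. */
    assert(nr >= 0);
    return nr + arg;
#endif
}
```

The cost is the one Richard concedes: the inputs may already sit in other registers, so the two moves can be redundant compared with letting the register allocator place them directly.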
Re: libgcc: strange optimization
On Tue, 2 Aug 2011, Richard Guenther wrote: I'd be ok with that, FWIW; I see the problem with keeping the scheduling of operations in a working order (yuck) and I don't see how else to keep it working ...except perhaps make gcc flag functions with register asms as non-inlinable, maybe even flag down any of the dangerous re-scheduling? But then can't people use a pure assembler stub instead? I see your point, but you're thinking new code, I'm thinking let's keep existing code used by several targets and documented as working the last seven years working. Maybe breakable with a major release; in gcc-5. (Oh no, I see what's coming. :) brgds, H-P
Re: libgcc: strange optimization
Richard Guenther richard.guent...@gmail.com writes: Or go one step further and deprecate local register variables alltogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). No, local register variables are documented as working and many programs rely on them. They are a straightforward way to get an asm argument in a specific register, and I don't see any reason to break that. I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Let's just implement those requirements in the compiler itself. Ian
Re: libgcc: strange optimization
On Tue, Aug 2, 2011 at 3:23 PM, Ian Lance Taylor i...@google.com wrote: Richard Guenther richard.guent...@gmail.com writes: Or go one step further and deprecate local register variables alltogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). No, local register variables are documented as working and many programs rely on them. They are a straightforward way to get an asm argument in a specific register, and I don't see any reason to break that. Well, maybe they look like so. But in reality there is _no_ connection from the register setup to the actual asm. Which is the problem the compiler faces here (apart from the libcall issue). If there should be an implicit dependence of all asms to all local register var setters and users then this isn't implemented on gimple (or rather it works by chance there as we treat register vars as memory and do not disambiguate anything across asms (yet)). I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Let's just implement those requirements in the compiler itself. Doesn't work for existing code, no? And if thinking new code then I'd rather have explicit dependences (and a way to represent them). Thus, for example asm (scall : : asm(r0) (10), ...) thus, why force new constraints when we already can figure out local register vars by register name? Why not extend the constraint syntax somehow to allow specifying the same effect? Richard. Ian
Re: libgcc: strange optimization
Richard Guenther richard.guent...@gmail.com writes: On Tue, Aug 2, 2011 at 3:23 PM, Ian Lance Taylor i...@google.com wrote: Richard Guenther richard.guent...@gmail.com writes: Or go one step further and deprecate local register variables alltogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). No, local register variables are documented as working and many programs rely on them. They are a straightforward way to get an asm argument in a specific register, and I don't see any reason to break that. Well, maybe they look like so. But in reality there is _no_ connection from the register setup to the actual asm. Which is the problem the compiler faces here (apart from the libcall issue). If there should be an implicit dependence of all asms to all local register var setters and users then this isn't implemented on gimple (or rather it works by chance there as we treat register vars as memory and do not disambiguate anything across asms (yet)). I'm not sure why we need to do anything at the GIMPLE level other than disable some optimizations. There is a connection from the register variable to the asm--the asm refers to the variable. There is nothing specific about the register in there, but at the GIMPLE level there doesn't have to be. We should not break a useful existing feature because we find it inconvenient. Let's just disable some optimizations so that it continues to work. I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Let's just implement those requirements in the compiler itself. Doesn't work for existing code, no? Why not? And if thinking new code then I'd rather have explicit dependences (and a way to represent them). Thus, for example asm (scall : : asm(r0) (10), ...) 
thus, why force new constraints when we already can figure out local register vars by register name? Why not extend the constraint syntax somehow to allow specifying the same effect? I agree that it would be a good idea to permit asms to indicate the specific register the operand should go into. Ian
Re: libgcc: strange optimization
On 08/02/2011 05:22 AM, Richard Guenther wrote: -fno-tree-ter also unbreaks the ARM test case in PR48863 comment #4. It's of course only a workaround, not a real fix as nothing prevents other optimizers from performing the re-scheduling TER does. I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Or go one step further and deprecate local register variables alltogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). Neither of these is a viable option. What we might be able to do is throttle TER when the destination is a local register variable. This should unbreak the common case of local regs immediately surrounding an asm. r~
Re: libgcc: strange optimization
Richard Guenther wrote: I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Richard. That's completely counterproductive. If a developer invents asm or local register variables he has a very good reason for that choice like to meet hard (with hard as in HARD) real time constraints. Disabling inlining of a function that uses local register vars would make many uses of local register vars impossible because there would no longer be a way to write down the exact register usage footprint of a piece of asm. Typical use-cases are interfacing to a function that has a smaller register footprint than an ordinary black-box function or doing some arithmetic that needs special hard regs for which there are no fitting constraints. All this will be impossible if inlining is disabled for such functions; then it is no longer possible to describe such a low-overhead piece of code without calling a black box function, clobbering all call-clobbered registers, turning a possible tail call into a non-tail call etc. Embedded systems like ARM or PowerPC based get more and more important over the years and GCC should put more attention to them and their needs; not only to bolide servers/PCs. This includes systems with hard real time constraints. Johann
Re: libgcc: strange optimization
On Tue, Aug 2, 2011 at 6:02 PM, Richard Henderson r...@redhat.com wrote: On 08/02/2011 05:22 AM, Richard Guenther wrote: -fno-tree-ter also unbreaks the ARM test case in PR48863 comment #4. It's of course only a workaround, not a real fix as nothing prevents other optimizers from performing the re-scheduling TER does. I suggest to amend the documentation for local call-clobbered register variables to say that the only valid sequence using them is from a non-inlinable function that contains only direct initializations of the register variables from constants or parameters. Or go one step further and deprecate local register variables alltogether (they IMHO don't make much sense, and rather the targets should provide a way to properly constrain asm inputs and outputs). Neither of these is a viable option. What we might be able to do is throttle TER when the destination is a local register variable. This should unbreak the common case of local regs immediately surrounding an asm. Sure, similar to disabling TER for functions containing such vars. But it isn't a solution for the general issue that nothing prevents scheduling gimple statements between register variable def and use. Richard. r~
Re: libgcc: strange optimization
Richard Guenther richard.guent...@gmail.com writes: But then can't people use a pure assembler stub instead? Without inlining there isn't much benefit left from writing void f1(int arg) { register int a1 asm("r8") = 10; register int a2 asm("r1") = arg; asm("scall" : : "r"(a1), "r"(a2)); } instead of f1: mov r8, 10 mov r1, rX scall ret in a .s file no? I doubt much prologue/epilogue is needed. Or even write void f1(int arg) { asm("mov r8, %0; mov r1 %1; scall;" : : "g"(a1), "g"(a2) : "r8", "r1"); } Of course in practice people _do_ want to use it with f1 inlined, where using reg variables (or alternatively, some expanded constraint language for the asm parameters) can really get rid of tons of unnecessary asm moves, and they want the compiler to guard against conflicts. -Miles -- Whatever you do will be insignificant, but it is very important that you do it. Mahatma Gandhi
libgcc: strange optimization
Hi list,

consider the following test code:

  static void inline f1(int arg)
  {
    register int a1 asm("r8") = 10;
    register int a2 asm("r1") = arg;

    asm("scall" : : "r"(a1), "r"(a2));
  }

  void f2(int arg)
  {
    f1(arg >> 10);
  }

If you compile this code with 'lm32-gcc -O1 -S -c test.c' (see end of this email), the a1 = 10; assignment is optimized away.

According to my understanding the following happens:
 1) function inlining
 2) deferred argument evaluation
 3) because our target has no barrel shifter, (arg >> 10) is emitted as a function call to libgcc's __ashrsi3 (_in place_!)
 4) BAM! dead code elimination optimizes the r8 assignment away because calli may clobber r1-r10 (caller-saved registers on lm32).

If you use:

  void f2(int arg)
  {
    f1(__ashrsi3(arg, 10));
  }

everything works as expected: __ashrsi3 is evaluated before the body of f1.

According to Wikipedia [1], function calls are sequence points and all side effects of the arguments are completed before entering the function. So in my understanding the deferred argument evaluation is wrong if that operation is emitted as a call to a libgcc helper.

I tried that on other architectures too (microblaze and avr). All show the same behaviour: if an integer arithmetic opcode is translated to a call to libgcc, every assignment to a register which is clobbered by the call is optimized away.

The GCC manual mentions some caveats when using explicit register variables [2]:

  In the above example, beware that a register that is call-clobbered by the target ABI will be overwritten by any function call in the assignment, including library calls for arithmetic operators. Also a register may be clobbered when generating some operations, like variable shift, memory copy or memory move on x86. Assuming it is a call-clobbered register, this may happen to r0 above by the assignment to p2. If you have to use such a register, use temporary variables for expressions between the register assignment.

But I think this may not apply to the case above, where the arithmetic operator is an argument of the called function. That is, there is a sequence point and the statements must not be reordered.

Assembler output (lm32-gcc -O1 -S -c test.c):

  f2:
      addi  sp, sp, -4
      sw    (sp+4), ra
      addi  r2, r0, 10
      calli __ashrsi3
      scall
      lw    ra, (sp+4)
      addi  sp, sp, 4
      b     ra

Assembler output with no DCE (lm32-gcc -O1 -S -fno-dce -c test.c):

  f2:
      addi  sp, sp, -4
      sw    (sp+4), ra
      addi  r8, r0, 10
      addi  r2, r0, 10
      calli __ashrsi3
      scall
      lw    ra, (sp+4)
      addi  sp, sp, 4
      b     ra

[1] http://en.wikipedia.org/wiki/Sequence_point
[2] http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Example%20of%20asm%20with%20clobbered%20asm%20reg

-- 
Michael
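To make the ordering problem in the report concrete, here is a minimal host-side sketch in portable C. It does not use real register variables. Instead it models the register file as an array, and every name in it (regs, model_ashrsi3, model_scall, inlined_order, source_order, CLOBBER_MARK) is hypothetical and chosen only for illustration. The model overwrites r8 at run time, whereas the real compiler simply deletes the dead store; the value seen by scall is wrong in both cases.

```c
/* Toy model of the register file: regs[1]..regs[10] are treated as
   call-clobbered, matching the r1-r10 range named in the report. */
enum { NUM_REGS = 32, CLOBBER_MARK = -1 };
static int regs[NUM_REGS];

/* Stand-in for a libgcc helper such as __ashrsi3: computes the shift and,
   like any call, leaves junk in the call-clobbered registers. */
static int model_ashrsi3(int value, int count)
{
    int result = value >> count;
    for (int i = 1; i <= 10; i++)
        regs[i] = CLOBBER_MARK;
    return result;
}

/* Stand-in for "scall": reports what the system call would see in r8. */
static int model_scall(void)
{
    return regs[8];
}

/* Order observed after inlining: r8 is set first, then the shift is
   expanded in place as a call, which destroys r8 before the scall. */
static int inlined_order(int arg)
{
    regs[8] = 10;
    regs[1] = model_ashrsi3(arg, 10);
    return model_scall();
}

/* Order the source suggests: the argument is fully evaluated before the
   body of f1 touches any hard register. */
static int source_order(int arg)
{
    int tmp = model_ashrsi3(arg, 10);
    regs[8] = 10;
    regs[1] = tmp;
    return model_scall();
}
```

Calling inlined_order() yields the clobber marker in r8, while source_order() delivers the intended 10. This only illustrates why evaluation order matters here; it says nothing about which GCC pass is responsible.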
Re: libgcc: strange optimization
Michael Walle wrote:

> Hi list,
>
> consider the following test code:
>
>   static void inline f1(int arg)
>   {
>     register int a1 asm("r8") = 10;
>     register int a2 asm("r1") = arg;
>
>     asm("scall" : : "r"(a1), "r"(a2));
>   }
>
>   void f2(int arg)
>   {
>     f1(arg >> 10);
>   }
>
> If you compile this code with 'lm32-gcc -O1 -S -c test.c' (see end of this email), the a1 = 10; assignment is optimized away.

Your asm has no output operands and no side effects; with more aggressive optimization the whole asm would disappear. What you want is maybe something like

  asm volatile ("scall" : : "r"(a1), "r"(a2));

Johann
Re: libgcc: strange optimization
Hi,

That was quick :)

> Your asm has no output operands and no side effects; with more aggressive optimization the whole asm would disappear.

Sorry, that was just a small test file; the original code has output operands.

The new test code:

  static int inline f1(int arg)
  {
    register int ret asm("r1");
    register int a1 asm("r8") = 10;
    register int a2 asm("r1") = arg;

    asm volatile ("scall" : "=r"(ret) : "r"(a1), "r"(a2) : "memory");

    return ret;
  }

  int f2(int arg1, int arg2)
  {
    return f1(arg1 >> 10);
  }

translates to the same assembly:

  f2:
      addi  sp, sp, -4
      sw    (sp+4), ra
      addi  r2, r0, 10
      calli __ashrsi3
      scall
      lw    ra, (sp+4)
      addi  sp, sp, 4
      b     ra

PS. r1 is the return register in the target architecture ABI.

-- 
Michael
Re: libgcc: strange optimization
On 08/01/2011 01:30 PM, Michael Walle wrote:

> 1) function inlining
> 2) deferred argument evaluation
> 3) because our target has no barrel shifter, (arg >> 10) is emitted as a function call to libgcc's __ashrsi3 (_in place_!)
> 4) BAM! dead code elimination optimizes the r8 assignment away because calli may clobber r1-r10 (caller-saved registers on lm32).

I'm afraid the only solution I can think of is to force f1 out-of-line. That's the only safe way to make sure that arguments are completely evaluated before forcing them into hard register variables.

Alternatively, expose new constraints such that you don't need the hard register variables at all. E.g.

  asm("scall" : : "R08"(a1), "R01"(a2));

where Rxx is defined in constraints.md for every relevant register. That'll prevent a reference to the hard register until register allocation, at which point we'll have done the right thing with the shift.

r~
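For illustration, forcing f1 out-of-line could look roughly like the sketch below, which uses GCC's noinline function attribute. Two things are assumptions rather than part of the thread: that the lm32 port predefines the __lm32__ macro, and the host_scall() stand-in, which is hypothetical and exists only so the file also builds and runs on a non-lm32 host.

```c
/* Hypothetical stand-in used when not compiling for lm32, so the sketch
   can be built and exercised on a development host. */
#if !defined(__lm32__)
static int host_scall(int nr, int arg)
{
    return nr * 1000 + arg;
}
#endif

/* noinline keeps the hard register variables inside a real function, so
   the argument is fully evaluated by the caller before they are set. */
__attribute__((noinline))
static int f1(int arg)
{
#if defined(__lm32__)   /* assumed predefined macro of the lm32 port */
    register int ret asm("r1");
    register int a1 asm("r8") = 10;
    register int a2 asm("r1") = arg;

    asm volatile ("scall" : "=r"(ret) : "r"(a1), "r"(a2) : "memory");
    return ret;
#else
    return host_scall(10, arg);
#endif
}

int f2(int arg1, int arg2)
{
    (void)arg2;             /* unused, kept to match the test case */
    return f1(arg1 >> 10);  /* the shift happens before the call to f1 */
}
```

The cost is the one raised elsewhere in the thread: once f1 is a real call, most of the benefit of writing the system call as an inline wrapper is gone.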