Re: [RFC] Do not consider volatile asms as optimization barriers #1
Jakub Jelinek wrote:
> Richard Sandiford wrote:
>> OK, how about this?  It looks like the builtins.c and stmt.c stuff
>> wasn't merged until 4.9, and at this stage it seemed safer to just add
>> the same use/clobber sequence to both places.
> Please wait a little bit, the patch has been committed to the trunk
> only very recently, we want to see if it has any fallout.

It has been two weeks since Richard committed this to trunk.  Perhaps
it's ok to backport to the 4.8 branch now?

-Y
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Eric Botcazou <ebotca...@adacore.com> writes:
>> Thanks, and to Bernd for the review.  I went ahead and applied it to
>> trunk.
> Thanks.  We need something for the 4.8 branch as well, probably the
> builtins.c hunk and the reversion of the cse.c/cselib.c/dse.c changes
> to the 4.7 state.

OK, how about this?  It looks like the builtins.c and stmt.c stuff wasn't
merged until 4.9, and at this stage it seemed safer to just add the same
use/clobber sequence to both places.

Tested on x86_64-linux-gnu.  OK to install?

Thanks,
Richard

gcc/
	* builtins.c (expand_builtin_setjmp_receiver): Emit a use of the
	hard frame pointer.  Synchronize commentary with mainline.
	* cse.c (cse_insn): Only check for volatile asms.
	* cselib.c (cselib_process_insn): Likewise.
	* dse.c (scan_insn): Likewise.
	* stmt.c (expand_nl_goto_receiver): Emit a use and a clobber of
	the hard frame pointer.

Index: gcc/builtins.c
===================================================================
--- gcc/builtins.c	2014-03-12 18:24:02.919132339 +
+++ gcc/builtins.c	2014-03-12 18:24:17.679262346 +
@@ -905,9 +905,24 @@ expand_builtin_setjmp_receiver (rtx rece
   if (! HAVE_nonlocal_goto)
 #endif
     {
+      /* First adjust our frame pointer to its actual value.  It was
+	 previously set to the start of the virtual area corresponding to
+	 the stacked variables when we branched here and now needs to be
+	 adjusted to the actual hardware fp value.
+
+	 Assignments to virtual registers are converted by
+	 instantiate_virtual_regs into the corresponding assignment
+	 to the underlying register (fp in this case) that makes
+	 the original assignment true.
+	 So the following insn will actually be decrementing fp by
+	 STARTING_FRAME_OFFSET.  */
       emit_move_insn (virtual_stack_vars_rtx, hard_frame_pointer_rtx);
-      /* This might change the hard frame pointer in ways that aren't
-	 apparent to early optimization passes, so force a clobber.  */
+
+      /* Restoring the frame pointer also modifies the hard frame pointer.
+	 Mark it used (so that the previous assignment remains live once
+	 the frame pointer is eliminated) and clobbered (to represent the
+	 implicit update from the assignment).  */
+      emit_use (hard_frame_pointer_rtx);
       emit_clobber (hard_frame_pointer_rtx);
     }

@@ -948,8 +963,7 @@ expand_builtin_setjmp_receiver (rtx rece
   /* We must not allow the code we just generated to be reordered by
      scheduling.  Specifically, the update of the frame pointer must
-     happen immediately, not later.  Similarly, we must block
-     (frame-related) register values to be used across this code.  */
+     happen immediately, not later.  */
   emit_insn (gen_blockage ());
 }

Index: gcc/cse.c
===================================================================
--- gcc/cse.c	2014-03-12 18:24:02.919132339 +
+++ gcc/cse.c	2014-03-12 18:24:17.680262355 +
@@ -5659,9 +5659,10 @@ cse_insn (rtx insn)
 	  invalidate (XEXP (dest, 0), GET_MODE (dest));
       }

-  /* A volatile ASM or an UNSPEC_VOLATILE invalidates everything.  */
+  /* A volatile ASM invalidates everything.  */
   if (NONJUMP_INSN_P (insn)
-      && volatile_insn_p (PATTERN (insn)))
+      && GET_CODE (PATTERN (insn)) == ASM_OPERANDS
+      && MEM_VOLATILE_P (PATTERN (insn)))
     flush_hash_table ();

   /* Don't cse over a call to setjmp; on some machines (eg VAX)
Index: gcc/cselib.c
===================================================================
--- gcc/cselib.c	2014-03-12 18:24:02.919132339 +
+++ gcc/cselib.c	2014-03-12 18:24:17.681262364 +
@@ -2623,12 +2623,13 @@ cselib_process_insn (rtx insn)

   cselib_current_insn = insn;

-  /* Forget everything at a CODE_LABEL, a volatile insn, or a setjmp.  */
+  /* Forget everything at a CODE_LABEL, a volatile asm, or a setjmp.  */
   if ((LABEL_P (insn)
        || (CALL_P (insn)
	    && find_reg_note (insn, REG_SETJMP, NULL))
        || (NONJUMP_INSN_P (insn)
-	    && volatile_insn_p (PATTERN (insn))))
+	    && GET_CODE (PATTERN (insn)) == ASM_OPERANDS
+	    && MEM_VOLATILE_P (PATTERN (insn))))
       && !cselib_preserve_constants)
     {
       cselib_reset_table (next_uid);

Index: gcc/dse.c
===================================================================
--- gcc/dse.c	2014-03-12 18:24:02.919132339 +
+++ gcc/dse.c	2014-03-12 18:24:17.681262364 +
@@ -2518,7 +2518,8 @@ scan_insn (bb_info_t bb_info, rtx insn)
   /* Cselib clears the table for this case, so we have to essentially
      do the same.  */
   if (NONJUMP_INSN_P (insn)
-      && volatile_insn_p (PATTERN (insn)))
+      && GET_CODE (PATTERN (insn)) == ASM_OPERANDS
+      && MEM_VOLATILE_P (PATTERN (insn)))
     {
       add_wild_read (bb_info);
       insn_info->cannot_delete = true;

Index: gcc/stmt.c
Re: [RFC] Do not consider volatile asms as optimization barriers #1
On Thu, Mar 13, 2014 at 07:15:34AM +, Richard Sandiford wrote:
> Eric Botcazou <ebotca...@adacore.com> writes:
>>> Thanks, and to Bernd for the review.  I went ahead and applied it to
>>> trunk.
>> Thanks.  We need something for the 4.8 branch as well, probably the
>> builtins.c hunk and the reversion of the cse.c/cselib.c/dse.c changes
>> to the 4.7 state.
> OK, how about this?  It looks like the builtins.c and stmt.c stuff
> wasn't merged until 4.9, and at this stage it seemed safer to just add
> the same use/clobber sequence to both places.

Please wait a little bit, the patch has been committed to the trunk only
very recently, we want to see if it has any fallout.

	Jakub
Re: [RFC] Do not consider volatile asms as optimization barriers #1
> Thanks, and to Bernd for the review.  I went ahead and applied it to
> trunk.

Thanks.  We need something for the 4.8 branch as well, probably the
builtins.c hunk and the reversion of the cse.c/cselib.c/dse.c changes to
the 4.7 state.

-- 
Eric Botcazou
Re: [RFC] Do not consider volatile asms as optimization barriers #1
On Mon, 3 Mar 2014, Richard Sandiford wrote:
> AIUI:

Reading back the references doesn't yield any dissenting flash-backs,
FWIW.  So, a (use fp) then a (clobber fp)?  That was probably just too
weird for me to think of, much like a hypercorrect ending of the previous
clause. :)

Thanks for dealing with this, and for not making my initial nightmarish
interpretation of $SUBJECT come true: "Do not consider volatile asms as
anything we have to consider."  At least I hope so.  Dig up this horse in
6 months?

brgds, H-P
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Hans-Peter Nilsson <h...@bitrange.com> writes:
> On Mon, 3 Mar 2014, Richard Sandiford wrote:
>> AIUI:
>
> Reading back the references doesn't yield any dissenting flash-backs,
> FWIW.  So, a (use fp) then a (clobber fp)?  That was probably just too
> weird for me to think of, much like a hypercorrect ending of the
> previous clause. :)
>
> Thanks for dealing with this, and for not making my initial nightmarish
> interpretation of $SUBJECT come true: "Do not consider volatile asms as
> anything we have to consider."  At least I hope so.  Dig up this horse
> in 6 months?

Thanks, and to Bernd for the review.  I went ahead and applied it to
trunk.

Richard
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Richard wrote:
> The thread seems to have died down, so should I go ahead and install it?

Looks like all reviews were more or less positive...

-Y
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Bernd Schmidt <ber...@codesourcery.com> writes:
> On 03/03/2014 11:01 PM, Richard Sandiford wrote:
>> I'll run a full test overnight, but does this look like it might be a
>> way out, at least for 4.9?
>
> Pretty much agree with everything you've written in this thread.  I
> think this patch is fine.

Thanks.  The thread seems to have died down, so should I go ahead and
install it?  I didn't want to be too hasty since it's obviously a bit of
a controversial area.

Richard
Re: [RFC] Do not consider volatile asms as optimization barriers #1
On Mon, Mar 3, 2014 at 11:01 PM, Richard Sandiford
<rdsandif...@googlemail.com> wrote:
> AIUI:
>
> (1) The main point of
>     http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01172.html
>     was that we had:
>
>       emit_move_insn (virtual_stack_vars_rtx, hard_frame_pointer_rtx);
>       /* This might change the hard frame pointer in ways that aren't
>          apparent to early optimization passes, so force a clobber.  */
>       emit_clobber (hard_frame_pointer_rtx);
>
>     and, after elimination, the clobber killed the assignment to the
>     hard frame pointer.

Which it could. :-)

> (2) Assuming for simplicity that STARTING_FRAME_OFFSET == 0, we end up
>     with an assignment like:
>
>       (set frame_pointer_rtx hard_frame_pointer_rtx)
>
>     And the problem is that frame_pointer_rtx often isn't a real
>     register: it gets eliminated to hard_frame_pointer_rtx+X, where X
>     isn't known until RA time.  I.e. the assignment is really:
>
>       (set (plus hard_frame_pointer_rtx X) hard_frame_pointer_rtx)
>
>     which becomes:
>
>       (set hard_frame_pointer_rtx (plus hard_frame_pointer_rtx -X))
>
>     Before RA it isn't obvious that the assignment clobbers
>     hard_frame_pointer_rtx, so it would seem to most passes that
>     frame_pointer_rtx and hard_frame_pointer_rtx are equivalent after
>     the set.  That was why the clobber was there.
>
> (3) We chose to fix this by deleting the explicit clobber and relying
>     on the magicness of unspec_volatile to imply the clobber instead.
>
> But I don't think (3) is a good idea.  We should instead fix the
> dataflow so that it's accurate.  Long term it would be good to have a
> different representation for (2),

An UNSPEC that sets frame_pointer_rtx, uses and clobbers
hard_frame_pointer_rtx?  In fact the plain move_insn is a lie and the
clobber should have been at least in a parallel with the set of
frame_pointer_rtx ... ?

> but I don't have any bright ideas what that might be.  Until then I
> think we can model the dataflow accurately by having a use as well as
> a clobber of hard_frame_pointer_rtx.
>
> I went back to r192682:

Indeed.

> http://gcc.gnu.org/ml/gcc-testresults/2012-10/msg02350.html
>
> and saw the -m32 tests fail without the builtins.c part of the patch.
> They passed after it.  I then went to r200643:
>
> 2013-07-03  Hans-Peter Nilsson  <h...@bitrange.com>
>
>     PR middle-end/55030
>     * stmt.c (expand_nl_goto_receiver): Remove almost-copy of
>     expand_builtin_setjmp_receiver.
>     (expand_label): Adjust, call expand_builtin_setjmp_receiver
>     with NULL for the label parameter.
>     * builtins.c (expand_builtin_setjmp_receiver): Don't clobber
>     the frame-pointer.  Adjust comments.
>     [HAVE_builtin_setjmp_receiver]: Emit builtin_setjmp_receiver
>     only if LABEL is non-NULL.
>
> and applied the full patch.  The tests still passed, despite the
> removal of the volatile checks.  (To recap, the volatile checks touched
> here were the same ones touched by:
>
> http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00868.html
>
> Rather than reverting that part I'm removing the check entirely, since
> we seem to agree that the original asm handling wasn't necessary.)
>
> I'll run a full test overnight, but does this look like it might be a
> way out, at least for 4.9?

Shouldn't the use and clobber be part of the move and thus wrapped inside
a parallel?  Or am I misunderstanding how parallel works?  What keeps
those three insns adjacent otherwise?

Thanks,
Richard.

> Thanks,
> Richard
>
> gcc/
> 	* builtins.c (expand_builtin_setjmp_receiver): Use and clobber
> 	hard_frame_pointer_rtx.
> 	* cse.c (cse_insn): Remove volatile check.
> 	* cselib.c (cselib_process_insn): Likewise.
> 	* dse.c (scan_insn): Likewise.
>
> Index: gcc/builtins.c
> ===================================================================
> --- gcc/builtins.c	2014-03-03 21:47:59.749026019 +
> +++ gcc/builtins.c	2014-03-03 21:48:00.550030853 +
> @@ -910,18 +910,27 @@ expand_builtin_setjmp_receiver (rtx rece
>  #ifdef HAVE_nonlocal_goto
>    if (! HAVE_nonlocal_goto)
>  #endif
> -    /* First adjust our frame pointer to its actual value.  It was
> -       previously set to the start of the virtual area corresponding to
> -       the stacked variables when we branched here and now needs to be
> -       adjusted to the actual hardware fp value.
> -
> -       Assignments to virtual registers are converted by
> -       instantiate_virtual_regs into the corresponding assignment
> -       to the underlying register (fp in this case) that makes
> -       the original assignment true.
> -       So the following insn will actually be decrementing fp by
> -       STARTING_FRAME_OFFSET.  */
> -    emit_move_insn (virtual_stack_vars_rtx, hard_frame_pointer_rtx);
> +    {
> +      /* First adjust our frame pointer to its actual value.  It was
> +	 previously set to the start of the virtual area corresponding to
> +	 the stacked variables when we branched here and now needs
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Richard Biener <richard.guent...@gmail.com> writes:
> On Mon, Mar 3, 2014 at 11:01 PM, Richard Sandiford
> <rdsandif...@googlemail.com> wrote:
>> AIUI:
>>
>> (1) The main point of
>>     http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01172.html
>>     was that we had:
>>
>>       emit_move_insn (virtual_stack_vars_rtx, hard_frame_pointer_rtx);
>>       /* This might change the hard frame pointer in ways that aren't
>>          apparent to early optimization passes, so force a clobber.  */
>>       emit_clobber (hard_frame_pointer_rtx);
>>
>>     and, after elimination, the clobber killed the assignment to the
>>     hard frame pointer.
>
> Which it could. :-)
>
>> (2) Assuming for simplicity that STARTING_FRAME_OFFSET == 0, we end up
>>     with an assignment like:
>>
>>       (set frame_pointer_rtx hard_frame_pointer_rtx)
>>
>>     And the problem is that frame_pointer_rtx often isn't a real
>>     register: it gets eliminated to hard_frame_pointer_rtx+X, where X
>>     isn't known until RA time.  I.e. the assignment is really:
>>
>>       (set (plus hard_frame_pointer_rtx X) hard_frame_pointer_rtx)
>>
>>     which becomes:
>>
>>       (set hard_frame_pointer_rtx (plus hard_frame_pointer_rtx -X))
>>
>>     Before RA it isn't obvious that the assignment clobbers
>>     hard_frame_pointer_rtx, so it would seem to most passes that
>>     frame_pointer_rtx and hard_frame_pointer_rtx are equivalent after
>>     the set.  That was why the clobber was there.
>>
>> (3) We chose to fix this by deleting the explicit clobber and relying
>>     on the magicness of unspec_volatile to imply the clobber instead.
>>
>> But I don't think (3) is a good idea.  We should instead fix the
>> dataflow so that it's accurate.  Long term it would be good to have a
>> different representation for (2),
>
> An UNSPEC that sets frame_pointer_rtx, uses and clobbers
> hard_frame_pointer_rtx?  In fact the plain move_insn is a lie and the
> clobber should have been at least in a parallel with the set of
> frame_pointer_rtx ... ?

Yeah, the plain move is a lie, which is why ideally I'd like to change
it.  But the problem is that the elimination code specifically expects
this kind of pattern to be used.  It eliminates the target of the move to
reg+offset and treats the new insn as being an assignment to reg with
offset subtracted from the source.  It needs to be something simple like
a plain move or a:

  (set (reg) (plus (reg) (const_int)))

for subtracting offset to work correctly.  So changing the representation
of the move would mean changing the way that those eliminations are
handled too.  Not 4.9 material :-)

>> but I don't have any bright ideas what that might be.  Until then I
>> think we can model the dataflow accurately by having a use as well as
>> a clobber of hard_frame_pointer_rtx.
>>
>> I went back to r192682:
>
> Indeed.
>
>> http://gcc.gnu.org/ml/gcc-testresults/2012-10/msg02350.html
>>
>> and saw the -m32 tests fail without the builtins.c part of the patch.
>> They passed after it.  I then went to r200643:
>>
>> 2013-07-03  Hans-Peter Nilsson  <h...@bitrange.com>
>>
>>     PR middle-end/55030
>>     * stmt.c (expand_nl_goto_receiver): Remove almost-copy of
>>     expand_builtin_setjmp_receiver.
>>     (expand_label): Adjust, call expand_builtin_setjmp_receiver
>>     with NULL for the label parameter.
>>     * builtins.c (expand_builtin_setjmp_receiver): Don't clobber
>>     the frame-pointer.  Adjust comments.
>>     [HAVE_builtin_setjmp_receiver]: Emit builtin_setjmp_receiver
>>     only if LABEL is non-NULL.
>>
>> and applied the full patch.  The tests still passed, despite the
>> removal of the volatile checks.  (To recap, the volatile checks
>> touched here were the same ones touched by:
>>
>> http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00868.html
>>
>> Rather than reverting that part I'm removing the check entirely, since
>> we seem to agree that the original asm handling wasn't necessary.)
>>
>> I'll run a full test overnight, but does this look like it might be a
>> way out, at least for 4.9?
>
> Shouldn't the use and clobber be part of the move and thus wrapped
> inside a parallel?  Or am I misunderstanding how parallel works?  What
> keeps those three insns adjacent otherwise?

Strict adjacency doesn't really matter.  We just need to kill the
equivalence between the soft and hard frame pointers for anything after
the clobber.  I did wonder about using an empty asm that takes the old
hard frame pointer as input and produces a new hard frame pointer as
output, but normal asms aren't allowed to change the frame pointer.

Thanks,
Richard
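To make the dataflow argument above concrete, here is a sketch of the receiver sequence in RTL-template notation (my own summary of the thread, assuming STARTING_FRAME_OFFSET == X; not an excerpt from the patch):

```
;; What the patched expand_builtin_setjmp_receiver emits:
(set (reg frame_pointer) (reg hard_frame_pointer))
(use (reg hard_frame_pointer))      ;; keeps the set above live after elimination
(clobber (reg hard_frame_pointer))  ;; kills any fp/hard-fp equivalence

;; frame_pointer is not a real register: at RA time it is eliminated to
;; hard_frame_pointer + X, so the set is really
(set (plus (reg hard_frame_pointer) X) (reg hard_frame_pointer))
;; which the eliminator rewrites as an update of the hard frame pointer:
(set (reg hard_frame_pointer) (plus (reg hard_frame_pointer) -X))
```

The use prevents the (now self-referencing) set from being deleted as dead ahead of the clobber, while the clobber tells passes such as cse that the hard frame pointer's old value is no longer available.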
Re: [RFC] Do not consider volatile asms as optimization barriers #1
On 03/03/2014 11:01 PM, Richard Sandiford wrote:
> I'll run a full test overnight, but does this look like it might be a
> way out, at least for 4.9?

Pretty much agree with everything you've written in this thread.  I think
this patch is fine.

> gcc/
> 	* builtins.c (expand_builtin_setjmp_receiver): Use and clobber
> 	hard_frame_pointer_rtx.
> 	* cse.c (cse_insn): Remove volatile check.
> 	* cselib.c (cselib_process_insn): Likewise.
> 	* dse.c (scan_insn): Likewise.

Bernd
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Eric Botcazou <ebotca...@adacore.com> writes:
>> Thanks for doing this.  FWIW I agree it's probably the best stop-gap
>> fix.  But the implication seems to be that unspec_volatile and
>> volatile asms are volatile in different ways.  IMO they're volatile in
>> the same way and the problems for volatile asms apply to
>> unspec_volatile too.
>
> I disagree, we need a simple way for the RTL middle-end as well as the
> back-ends to block most optimizations across a specific point (e.g. a
> non-local label as in HP's fix) and UNSPEC_VOLATILE is the best
> candidate, at least in the short term.

I don't agree it's the best candidate if...

>> E.g. although cse.c will flush the table for unspec_volatile, it isn't
>> the case that unspec_volatile forces a containing function to save all
>> call-saved registers.  That would be excessive for a plain blockage
>> instruction.  So again we seem to be assuming one thing in places like
>> cse.c and another in the register allocator.  Code that uses the DF
>> framework will also assume that registers are not implicitly clobbered
>> by an unspec_volatile: [...]  Also, ira-lives.c (which tracks the
>> liveness of both pseudo and hard registers) doesn't mention volatile
>> at all.
>
> Yes, the definition of a blockage instruction is somewhat vague and I
> agree that it shouldn't cause registers to be spilled.  But it needs to
> block most, if not all, optimizations.

...it's so loosely defined.  If non-local labels are the specific
problem, I think it'd be better to limit the flush to that.

I'm back to throwing examples around, sorry, but take the MIPS testcase:

  volatile int x = 1;

  void foo (void)
  {
    x = 1;
    __builtin_mips_set_fcsr (0);
    x = 2;
  }

where __builtin_mips_set_fcsr is a handy way of getting unspec_volatile.
(I'm not interested in what the function does here.)  Even at -O2, the
cse.c code successfully prevents %hi(x) from being shared, as you'd
expect:

        li      $3,1            # 0x1
        lui     $2,%hi(x)
        sw      $3,%lo(x)($2)
        move    $2,$0
        ctc1    $2,$31
        li      $3,2            # 0x2
        lui     $2,%hi(x)
        sw      $3,%lo(x)($2)
        j       $31
        nop

But put it in a loop:

  void frob (void)
  {
    for (;;)
      {
        x = 1;
        __builtin_mips_set_fcsr (0);
        x = 2;
      }
  }

and we get the rather bizarre code:

        lui     $2,%hi(x)
        li      $6,1            # 0x1
        move    $5,$0
        move    $4,$2
        li      $3,2            # 0x2
        .align  3
.L3:
        sw      $6,%lo(x)($2)
        ctc1    $5,$31
        sw      $3,%lo(x)($4)
        j       .L3
        lui     $2,%hi(x)

Here the _second_ %hi(x), the 1 and the 2 have been hoisted but the first
%hi(x) is reloaded each time.  So what's the correct behaviour here?
Should the hoisting of the second %hi(x) have been disabled because the
loop contains an unspec_volatile?  What about the 1 (from the first
store) and the 2?  If instead it was:

  for (i = 0; i < 100; i++)

would converting to a hardware do-loop be acceptable?

>> So most passes assume that no pseudos or hard registers will be
>> implicitly clobbered by unspec_volatile, just like for a volatile asm.
>> And IMO that's right.  I think the rule should be the same for
>> volatile asms and unspec_volatiles, and the same for registers as it
>> already is for memory: if the instruction clobbers something, it
>> should say so explicitly.
>
> IMO that would buy us nothing and, on the contrary, would add
> complexity where there currently isn't.  We really need a simple
> blockage instruction.

>> Volatile itself should:
>>
>> (a) prevent deletion or duplication of the operation
>>
>> (b) prevent reordering wrt other volatiles
>>
>> (c) prevent the operation from being considered equivalent to any
>>     other operation (even if it's structurally identical and has the
>>     same inputs)
>>
>> but nothing beyond that.
>
> Maybe UNSPEC_VOLATILE is a misnomer then and we should allow volatile
> UNSPECs along the above lines.

That'd be fine with me, especially since with the patch it sounds like
using volatile asm would produce better code than a built-in function
that expands to an unspec_volatile, whereas IMO the opposite should
always be true.  But I still think we need a similar list for what
unspec_volatile means now, if we decide to keep something with the
current meaning.

Thanks,
Richard
Re: [RFC] Do not consider volatile asms as optimization barriers #1
On Mon, Mar 3, 2014 at 9:01 AM, Richard Sandiford
<rdsandif...@googlemail.com> wrote:
> Eric Botcazou <ebotca...@adacore.com> writes:
>>> Thanks for doing this.  FWIW I agree it's probably the best stop-gap
>>> fix.  But the implication seems to be that unspec_volatile and
>>> volatile asms are volatile in different ways.  IMO they're volatile
>>> in the same way and the problems for volatile asms apply to
>>> unspec_volatile too.
>>
>> I disagree, we need a simple way for the RTL middle-end as well as the
>> back-ends to block most optimizations across a specific point (e.g. a
>> non-local label as in HP's fix) and UNSPEC_VOLATILE is the best
>> candidate, at least in the short term.
>
> I don't agree it's the best candidate if...
>
>>> E.g. although cse.c will flush the table for unspec_volatile, it
>>> isn't the case that unspec_volatile forces a containing function to
>>> save all call-saved registers.  That would be excessive for a plain
>>> blockage instruction.  So again we seem to be assuming one thing in
>>> places like cse.c and another in the register allocator.  Code that
>>> uses the DF framework will also assume that registers are not
>>> implicitly clobbered by an unspec_volatile: [...]  Also, ira-lives.c
>>> (which tracks the liveness of both pseudo and hard registers) doesn't
>>> mention volatile at all.
>>
>> Yes, the definition of a blockage instruction is somewhat vague and I
>> agree that it shouldn't cause registers to be spilled.  But it needs
>> to block most, if not all, optimizations.
>
> ...it's so loosely defined.  If non-local labels are the specific
> problem, I think it'd be better to limit the flush to that.

non-local labels should block most optimizations by the fact they are a
receiver of control flow and thus should have an abnormal edge coming
into them.  If that's not the case (no abnormal edge) then that's the bug
to fix.

Otherwise I agree with Richard.  Please sit down and _exactly_ define
what 'volatile' in an asm provides for guarantees compared to
non-volatile asms.  Likewise do so for volatile UNSPECs.  A volatile
shouldn't be a cheap way out of properly enumerating all uses, defs and
clobbers of a stmt.  If volatile is used to tell the insn has additional
uses/defs or clobbers to those explicitly given, the only reason that may
be valid is because we cannot explicitly enumerate those.  But we should
fix that instead (for example with the special register idea or by adding
a middle-end wide special blockage that you can use/def/clobber).

To better assess the problem at hand, can you enumerate the cases where
you need that special easy blockage instruction?  With testcases please?

Note that on GIMPLE even volatiles are not strictly ordered if they don't
have a dependence that orders them (that doesn't mean that any existing
transform deliberately re-orders them, but as shown with the loop example
below such re-ordering can happen as a side-effect of a valid transform).

> I'm back to throwing examples around, sorry, but take the MIPS
> testcase:
>
>   volatile int x = 1;
>
>   void foo (void)
>   {
>     x = 1;
>     __builtin_mips_set_fcsr (0);
>     x = 2;
>   }
>
> where __builtin_mips_set_fcsr is a handy way of getting
> unspec_volatile.  (I'm not interested in what the function does here.)
> Even at -O2, the cse.c code successfully prevents %hi(x) from being
> shared, as you'd expect:
>
>         li      $3,1            # 0x1
>         lui     $2,%hi(x)
>         sw      $3,%lo(x)($2)
>         move    $2,$0
>         ctc1    $2,$31
>         li      $3,2            # 0x2
>         lui     $2,%hi(x)
>         sw      $3,%lo(x)($2)
>         j       $31
>         nop
>
> But put it in a loop:
>
>   void frob (void)
>   {
>     for (;;)
>       {
>         x = 1;
>         __builtin_mips_set_fcsr (0);
>         x = 2;
>       }
>   }
>
> and we get the rather bizarre code:
>
>         lui     $2,%hi(x)
>         li      $6,1            # 0x1
>         move    $5,$0
>         move    $4,$2
>         li      $3,2            # 0x2
>         .align  3
> .L3:
>         sw      $6,%lo(x)($2)
>         ctc1    $5,$31
>         sw      $3,%lo(x)($4)
>         j       .L3
>         lui     $2,%hi(x)
>
> Here the _second_ %hi(x), the 1 and the 2 have been hoisted but the
> first %hi(x) is reloaded each time.  So what's the correct behaviour
> here?  Should the hoisting of the second %hi(x) have been disabled
> because the loop contains an unspec_volatile?  What about the 1 (from
> the first store) and the 2?
>
> If instead it was:
>
>   for (i = 0; i < 100; i++)
>
> would converting to a hardware do-loop be acceptable?
>
>>> So most passes assume that no pseudos or hard registers will be
>>> implicitly clobbered by unspec_volatile, just like for a volatile
>>> asm.  And IMO that's right.  I think the rule should be the same for
>>> volatile asms and unspec_volatiles, and the same for registers as it
>>> already is for memory: if the instruction clobbers something, it
>>> should say so explicitly.
>>
>> IMO that would buy us nothing and, on the contrary, would add
Re: [RFC] Do not consider volatile asms as optimization barriers #1
> non-local labels should block most optimizations by the fact they are
> a receiver of control flow and thus should have an abnormal edge coming
> into them.  If that's not the case (no abnormal edge) then that's the
> bug to fix.

It's (of course) more complicated, you need to look at HP's fix and
testcase to see why we need a full optimization barrier.  See also the
prologue and epilogue of many architectures which also need a blockage
when they are establishing or destroying the frame.

> Otherwise I agree with Richard.  Please sit down and _exactly_ define
> what 'volatile' in an asm provides for guarantees compared to
> non-volatile asms.  Likewise do so for volatile UNSPECs.

Too late, we apparently all agree about what volatile asms and future
volatile UNSPECs mean. :-)  The remaining point is UNSPEC_VOLATILE, but
the discussion can be deferred until the next stage 1.

> A volatile shouldn't be a cheap way out of properly enumerating all
> uses, defs and clobbers of a stmt.  If volatile is used to tell the
> insn has additional uses/defs or clobbers to those explicitly given,
> the only reason that may be valid is because we cannot explicitly
> enumerate those.  But we should fix that instead (for example with the
> special register idea or by adding a middle-end wide special blockage
> that you can use/def/clobber).

For the time being this special blockage is UNSPEC_VOLATILE for RTL.

-- 
Eric Botcazou
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Eric Botcazou <ebotca...@adacore.com> writes:
>> non-local labels should block most optimizations by the fact they are
>> a receiver of control flow and thus should have an abnormal edge
>> coming into them.  If that's not the case (no abnormal edge) then
>> that's the bug to fix.
>
> It's (of course) more complicated, you need to look at HP's fix and
> testcase to see why we need a full optimization barrier.  See also the
> prologue and epilogue of many architectures which also need a blockage
> when they are establishing or destroying the frame.

But the prologue/epilogue case often doesn't need to be a full blockage.
We could move a load-immediate instruction -- or even an access to
known-global memory -- before the allocation or after the deallocation.
This can actually be important on architectures that use load-multiple to
restore the return register and want the prefetcher to see the target
address as early as possible.  So I think the prologue and epilogue is
one case where we really do want to spell out what's clobbered by the
allocation and deallocation.

Thanks,
Richard
Re: [RFC] Do not consider volatile asms as optimization barriers #1
> ...it's so loosely defined.  If non-local labels are the specific
> problem, I think it'd be better to limit the flush to that.

No, there was "e.g." written, so non-local labels are not the only
problem.

> I'm back to throwing examples around, sorry, but take the MIPS
> testcase:
>
>   volatile int x = 1;
>
>   void foo (void)
>   {
>     x = 1;
>     __builtin_mips_set_fcsr (0);
>     x = 2;
>   }
>
> where __builtin_mips_set_fcsr is a handy way of getting
> unspec_volatile.  (I'm not interested in what the function does here.)
> Even at -O2, the cse.c code successfully prevents %hi(x) from being
> shared, as you'd expect:
>
>         li      $3,1            # 0x1
>         lui     $2,%hi(x)
>         sw      $3,%lo(x)($2)
>         move    $2,$0
>         ctc1    $2,$31
>         li      $3,2            # 0x2
>         lui     $2,%hi(x)
>         sw      $3,%lo(x)($2)
>         j       $31
>         nop
>
> But put it in a loop:
>
>   void frob (void)
>   {
>     for (;;)
>       {
>         x = 1;
>         __builtin_mips_set_fcsr (0);
>         x = 2;
>       }
>   }
>
> and we get the rather bizarre code:
>
>         lui     $2,%hi(x)
>         li      $6,1            # 0x1
>         move    $5,$0
>         move    $4,$2
>         li      $3,2            # 0x2
>         .align  3
> .L3:
>         sw      $6,%lo(x)($2)
>         ctc1    $5,$31
>         sw      $3,%lo(x)($4)
>         j       .L3
>         lui     $2,%hi(x)
>
> Here the _second_ %hi(x), the 1 and the 2 have been hoisted but the
> first %hi(x) is reloaded each time.  So what's the correct behaviour
> here?  Should the hoisting of the second %hi(x) have been disabled
> because the loop contains an unspec_volatile?  What about the 1 (from
> the first store) and the 2?

Well, I personally wouldn't spend much time on the code generated in a
loop containing an UNSPEC_VOLATILE.  If an instruction or a builtin is
supposed to be performance-sensitive, then don't use an UNSPEC_VOLATILE
by all means and properly model it instead!

-- 
Eric Botcazou
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Eric Botcazou <ebotca...@adacore.com> writes:
>> ...it's so loosely defined.  If non-local labels are the specific
>> problem, I think it'd be better to limit the flush to that.
>
> No, there was "e.g." written, so non-local labels are not the only
> problem.

What are the others though?  As discussed in the other subthread, I don't
think prologue and epilogue barriers are quite the same.

>> I'm back to throwing examples around, sorry, but take the MIPS
>> testcase:
>>
>>   volatile int x = 1;
>>
>>   void foo (void)
>>   {
>>     x = 1;
>>     __builtin_mips_set_fcsr (0);
>>     x = 2;
>>   }
>>
>> where __builtin_mips_set_fcsr is a handy way of getting
>> unspec_volatile.  (I'm not interested in what the function does here.)
>> Even at -O2, the cse.c code successfully prevents %hi(x) from being
>> shared, as you'd expect:
>>
>>         li      $3,1            # 0x1
>>         lui     $2,%hi(x)
>>         sw      $3,%lo(x)($2)
>>         move    $2,$0
>>         ctc1    $2,$31
>>         li      $3,2            # 0x2
>>         lui     $2,%hi(x)
>>         sw      $3,%lo(x)($2)
>>         j       $31
>>         nop
>>
>> But put it in a loop:
>>
>>   void frob (void)
>>   {
>>     for (;;)
>>       {
>>         x = 1;
>>         __builtin_mips_set_fcsr (0);
>>         x = 2;
>>       }
>>   }
>>
>> and we get the rather bizarre code:
>>
>>         lui     $2,%hi(x)
>>         li      $6,1            # 0x1
>>         move    $5,$0
>>         move    $4,$2
>>         li      $3,2            # 0x2
>>         .align  3
>> .L3:
>>         sw      $6,%lo(x)($2)
>>         ctc1    $5,$31
>>         sw      $3,%lo(x)($4)
>>         j       .L3
>>         lui     $2,%hi(x)
>>
>> Here the _second_ %hi(x), the 1 and the 2 have been hoisted but the
>> first %hi(x) is reloaded each time.  So what's the correct behaviour
>> here?  Should the hoisting of the second %hi(x) have been disabled
>> because the loop contains an unspec_volatile?  What about the 1 (from
>> the first store) and the 2?
>
> Well, I personally wouldn't spend much time on the code generated in a
> loop containing an UNSPEC_VOLATILE.  If an instruction or a builtin is
> supposed to be performance-sensitive, then don't use an UNSPEC_VOLATILE
> by all means and properly model it instead!

That doesn't really answer the question though.  What's the correct
behaviour for an unspec volatile in a loop?  I don't think it's what we
did in the example above, since it doesn't seem self-consistent.  And
"not spending too much time" is again a bit vague in terms of saying
what's right and what's wrong.  My point is that if the construct is
well-defined enough to handle the important things we want it to handle,
the answer should be known to somebody, even if it isn't to me. :-)

Thanks,
Richard
Re: [RFC] Do not consider volatile asms as optimization barriers #1
On Mar 1, 2014, at 6:36 AM, Eric Botcazou ebotca...@adacore.com wrote:
> It introduces a new, temporary predicate really_volatile_insn_p

"really" is a really horrible name.  Hint: if a CS domain-specific wikipedia page described what you were doing, what would the page be called?  "really" is unlikely to be in it.
Re: [RFC] Do not consider volatile asms as optimization barriers #1
> That doesn't really answer the question though.  What's the correct
> behaviour for an unspec volatile in a loop?  I don't think it's what we
> did in the example above, since it doesn't seem self-consistent.  And
> not spending too much time is again a bit vague in terms of saying
> what's right and what's wrong.

"Not spending too much time" is a polite way to say I don't really care. :-)  If you do, feel free to post a formal definition, an implementation plan and maybe a patch at some point.

-- 
Eric Botcazou
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Richard Sandiford rdsandif...@googlemail.com writes:
> I'll run a full test overnight, but does this look like it might be a
> way out, at least for 4.9?

FWIW, it passed testing on x86_64-linux-gnu ({,-m32}, all,ada).  Here it is again with an updated cselib.c comment.  OK to install?

Thanks,
Richard

gcc/
	* builtins.c (expand_builtin_setjmp_receiver): Use and clobber
	hard_frame_pointer_rtx.
	* cse.c (cse_insn): Remove volatile check.
	* cselib.c (cselib_process_insn): Likewise.
	* dse.c (scan_insn): Likewise.

Index: gcc/builtins.c
===================================================================
--- gcc/builtins.c	2014-03-03 21:47:59.749026019 +
+++ gcc/builtins.c	2014-03-03 21:48:00.550030853 +
@@ -910,18 +910,27 @@ expand_builtin_setjmp_receiver (rtx rece
 #ifdef HAVE_nonlocal_goto
   if (! HAVE_nonlocal_goto)
 #endif
-    /* First adjust our frame pointer to its actual value.  It was
-       previously set to the start of the virtual area corresponding to
-       the stacked variables when we branched here and now needs to be
-       adjusted to the actual hardware fp value.
-
-       Assignments to virtual registers are converted by
-       instantiate_virtual_regs into the corresponding assignment
-       to the underlying register (fp in this case) that makes
-       the original assignment true.
-       So the following insn will actually be decrementing fp by
-       STARTING_FRAME_OFFSET.  */
-    emit_move_insn (virtual_stack_vars_rtx, hard_frame_pointer_rtx);
+    {
+      /* First adjust our frame pointer to its actual value.  It was
+	 previously set to the start of the virtual area corresponding to
+	 the stacked variables when we branched here and now needs to be
+	 adjusted to the actual hardware fp value.
+
+	 Assignments to virtual registers are converted by
+	 instantiate_virtual_regs into the corresponding assignment
+	 to the underlying register (fp in this case) that makes
+	 the original assignment true.
+	 So the following insn will actually be decrementing fp by
+	 STARTING_FRAME_OFFSET.  */
+      emit_move_insn (virtual_stack_vars_rtx, hard_frame_pointer_rtx);
+
+      /* Restoring the frame pointer also modifies the hard frame pointer.
+	 Mark it used (so that the previous assignment remains live once
+	 the frame pointer is eliminated) and clobbered (to represent the
+	 implicit update from the assignment).  */
+      emit_use (hard_frame_pointer_rtx);
+      emit_clobber (hard_frame_pointer_rtx);
+    }
 
 #if !HARD_FRAME_POINTER_IS_ARG_POINTER
   if (fixed_regs[ARG_POINTER_REGNUM])
@@ -965,8 +974,7 @@ expand_builtin_setjmp_receiver (rtx rece
 
   /* We must not allow the code we just generated to be reordered by
      scheduling.  Specifically, the update of the frame pointer must
-     happen immediately, not later.  Similarly, we must block
-     (frame-related) register values to be used across this code.  */
+     happen immediately, not later.  */
   emit_insn (gen_blockage ());
 }

Index: gcc/cse.c
===================================================================
--- gcc/cse.c	2014-03-03 21:47:59.869026741 +
+++ gcc/cse.c	2014-03-03 21:48:00.625031305 +
@@ -5664,11 +5664,6 @@ cse_insn (rtx insn)
 	  invalidate (XEXP (dest, 0), GET_MODE (dest));
       }
 
-  /* A volatile ASM or an UNSPEC_VOLATILE invalidates everything.  */
-  if (NONJUMP_INSN_P (insn)
-      && volatile_insn_p (PATTERN (insn)))
-    flush_hash_table ();
-
   /* Don't cse over a call to setjmp; on some machines (eg VAX)
      the regs restored by the longjmp come from a later time
      than the setjmp.  */

Index: gcc/cselib.c
===================================================================
--- gcc/cselib.c	2014-03-03 21:47:59.870026748 +
+++ gcc/cselib.c	2014-03-03 22:09:24.211994918 +
@@ -2626,12 +2626,10 @@ cselib_process_insn (rtx insn)
 
   cselib_current_insn = insn;
 
-  /* Forget everything at a CODE_LABEL, a volatile insn, or a setjmp.  */
+  /* Forget everything at a CODE_LABEL or a setjmp.  */
   if ((LABEL_P (insn)
        || (CALL_P (insn)
-	   && find_reg_note (insn, REG_SETJMP, NULL))
-       || (NONJUMP_INSN_P (insn)
-	   && volatile_insn_p (PATTERN (insn))))
+	   && find_reg_note (insn, REG_SETJMP, NULL)))
      && !cselib_preserve_constants)
    {
      cselib_reset_table (next_uid);

Index: gcc/dse.c
===================================================================
--- gcc/dse.c	2014-03-03 21:47:59.871026754 +
+++ gcc/dse.c	2014-03-03 21:48:00.627031317 +
@@ -2470,16 +2470,6 @@ scan_insn (bb_info_t bb_info, rtx insn)
       return;
     }
 
-  /* Cselib clears the table for this case, so we have to essentially
-     do the same.  */
-  if (NONJUMP_INSN_P (insn)
-      && volatile_insn_p (PATTERN (insn)))
-    {
-      add_wild_read (bb_info);
-      insn_info->cannot_delete = true;
-
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Eric Botcazou ebotca...@adacore.com writes:
>> That doesn't really answer the question though.  What's the correct
>> behaviour for an unspec volatile in a loop?  I don't think it's what
>> we did in the example above, since it doesn't seem self-consistent.
>> And not spending too much time is again a bit vague in terms of
>> saying what's right and what's wrong.
> "Not spending too much time" is a polite way to say I don't really
> care. :-)  If you do, feel free to post a formal definition, an
> implementation plan and maybe a patch at some point.

Well, I still reckon unspec_volatile and volatile asm should be volatile in the same way.  There should be no implicit magic clobbers.

Thanks,
Richard
Re: [RFC] Do not consider volatile asms as optimization barriers #1
> Thanks for doing this.  FWIW I agree it's probably the best stop-gap
> fix.  But the implication seems to be that unspec_volatile and
> volatile asms are volatile in different ways.  IMO they're volatile in
> the same way and the problems for volatile asms apply to
> unspec_volatile too.

I disagree, we need a simple way for the RTL middle-end as well as the back-ends to block most optimizations across a specific point (e.g. a non-local label as in HP's fix) and UNSPEC_VOLATILE is the best candidate, at least in the short term.

> E.g. although cse.c will flush the table for unspec_volatile, it isn't
> the case that unspec_volatile forces a containing function to save all
> call-saved registers.  That would be excessive for a plain blockage
> instruction.  So again we seem to be assuming one thing in places like
> cse.c and another in the register allocator.
>
> Code that uses the DF framework will also assume that registers are
> not implicitly clobbered by an unspec_volatile: [...]
>
> Also, ira-lives.c (which tracks the liveness of both pseudo and hard
> registers) doesn't mention volatile at all.

Yes, the definition of a blockage instruction is somewhat vague and I agree that it shouldn't cause registers to be spilled.  But it needs to block most, if not all, optimizations.

> So most passes assume that no pseudos or hard registers will be
> implicitly clobbered by unspec_volatile, just like for a volatile asm.
> And IMO that's right.  I think the rule should be the same for
> volatile asms and unspec_volatiles, and the same for registers as it
> already is for memory: if the instruction clobbers something, it
> should say so explicitly.

IMO that would buy us nothing and, on the contrary, would add complexity where there currently isn't.  We really need a simple blockage instruction.
> Volatile itself should:
>
>   (a) prevent deletion or duplication of the operation
>   (b) prevent reordering wrt other volatiles
>   (c) prevent the operation from being considered equivalent to any
>       other operation (even if it's structurally identical and has the
>       same inputs)
>
> but nothing beyond that.

Maybe UNSPEC_VOLATILE is a misnomer then and we should allow volatile UNSPECs along the above lines.

-- 
Eric Botcazou
[RFC] Do not consider volatile asms as optimization barriers #1
There seems to be a sufficiently large consensus that volatile asms should not be treated as full optimization barriers by the compiler.  This patch is a first step towards implementing this conclusion and aims only at addressing the code quality regressions introduced by http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193802 on the 4.8 branch and mainline for volatile asms.

It introduces a new, temporary predicate really_volatile_insn_p and invokes it from the 3 places in cse.c, cselib.c and dse.c which were touched above.  But this comes with a side effect: the blockage standard pattern needs to be adjusted to always use an UNSPEC_VOLATILE.  That's already the case for all the architectures that define it, but 21 architectures (aarch64 arc avr bfin cr16 cris epiphany h8300 m32c mcore mmix mn10300 moxie msp430 nds32 picochip rl78 rx score v850 xtensa) don't define it and therefore need to be adjusted.

Tested on x86_64-suse-linux and by building cc1 and compiling a relevant testcase for the 21 aforementioned architectures.

2014-03-01  Eric Botcazou  ebotca...@adacore.com

	* doc/md.texi (blockage): Do not allow volatile asms.
	* rtl.def (UNSPEC_VOLATILE): Adjust description.
	* rtl.h (really_volatile_insn_p): Declare.
	* rtlanal.c (volatile_insn_1): New predicate copied from...
	(volatile_insn_p): ...this.  Call it.
	(really_volatile_insn_p): New predicate.
	* cse.c (cse_insn): Call really_volatile_insn_p.
	* cselib.c (cselib_process_insn): Likewise.
	* dse.c (scan_insn): Likewise.
	* emit-rtl.c (gen_blockage): Delete.
	* emit-rtl.h (gen_blockage): Likewise.
	* config/aarch64/aarch64.md (UNSPECV_BLOCKAGE): New enum value.
	(blockage): New insn.
	* config/arc/arc.md (VUNSPEC_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/avr/avr.md (UNSPECV_BLOCKAGE): New enum value.
	(blockage): New insn.
	* config/bfin/bfin.md (UNSPEC_VOLATILE_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/cr16/cr16.md (UNSPECV_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/cris/cris.md (CRIS_UNSPECV_BLOCKAGE): New enum value.
	(blockage): New insn.
	* config/epiphany/epiphany.md (UNSPECV_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/h8300/h8300.md (UNSPEC_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/m32c/m32c.md (UNS_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/mcore/mcore.md (UNSPECV_BLOCKAGE, UNSPECV_CONSTANT,
	UNSPECV_ALIGN, UNSPECV_TABLE): New enum values.
	(blockage): New insn.
	(consttable_4): Use UNSPECV_CONSTANT.
	(align_4): Use UNSPECV_ALIGN.
	(consttable_end): Use UNSPECV_TABLE.
	* config/mmix/mmix.md (UNSPECV_BLOCKAGE, UNSPECV_SYNC,
	UNSPECV_NONLOCAL): New enum values.
	(blockage): New insn.
	(nonlocal_goto_receiver): Use UNSPECV_NONLOCAL.
	(nonlocal_goto_receiver_expanded): Likewise.
	(sync_icache): Use UNSPECV_SYNC.
	* config/mn10300/mn10300.md (UNSPECV_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/moxie/moxie.md (UNSPECV_BLOCKAGE): New enum value.
	(blockage): New insn.
	* config/msp430/msp430.md (UNS_BLOCKAGE): New enum value.
	(blockage): New insn.
	* config/nds32/constants.md (UNSPEC_VOLATILE_BLOCKAGE): New enum
	value.
	* config/nds32/nds32.md (blockage): New insn.
	* config/nds32/nds32.c (nds32_valid_stack_push_pop): Return false
	if the insn is of the wrong kind.
	* config/picochip/picochip.md (UNSPEC_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/rl78/rl78.md (UNS_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/rx/rx.md (UNSPEC_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/score/score.md (BLOCKAGE): New constant.
	(blockage): New insn.
	* config/v850/v850.md (UNSPECV_BLOCKAGE): New constant.
	(blockage): New insn.
	* config/xtensa/xtensa.md (UNSPECV_BLOCKAGE): New constant.
	(blockage): New insn.

-- 
Eric Botcazou

Index: doc/md.texi
===================================================================
--- doc/md.texi	(revision 208241)
+++ doc/md.texi	(working copy)
@@ -6228,7 +6228,7 @@ the values of operands 1 and 2.
 This pattern defines a pseudo insn that prevents the instruction
 scheduler and other passes from moving instructions and using register
 equivalences across the boundary defined by the blockage insn.
-This needs to be an UNSPEC_VOLATILE pattern or a volatile ASM.
+This needs to be an UNSPEC_VOLATILE pattern.
 
 @cindex @code{memory_barrier} instruction pattern
 @item @samp{memory_barrier}

Index: rtlanal.c
===================================================================
--- rtlanal.c	(revision 208241)
+++ rtlanal.c
Re: [RFC] Do not consider volatile asms as optimization barriers #1
Eric Botcazou ebotca...@adacore.com writes:
> There seems to be a sufficiently large consensus that volatile asms
> should not be treated as full optimization barriers by the compiler.
> This patch is a first step towards implementing this conclusion and
> aims only at addressing the code quality regressions introduced by
> http://gcc.gnu.org/viewcvs/gcc?view=revision&revision=193802 on the
> 4.8 branch and mainline for volatile asms.
>
> It introduces a new, temporary predicate really_volatile_insn_p and
> invokes it from the 3 places in cse.c, cselib.c and dse.c which were
> touched above.  But this comes with a side effect: the blockage
> standard pattern needs to be adjusted to always use an
> UNSPEC_VOLATILE.  That's already the case for all the architectures
> that define it, but 21 architectures (aarch64 arc avr bfin cr16 cris
> epiphany h8300 m32c mcore mmix mn10300 moxie msp430 nds32 picochip
> rl78 rx score v850 xtensa) don't define it and therefore need to be
> adjusted.
>
> Tested on x86_64-suse-linux and by building cc1 and compiling a
> relevant testcase for the 21 aforementioned architectures.

Thanks for doing this.  FWIW I agree it's probably the best stop-gap fix.  But the implication seems to be that unspec_volatile and volatile asms are volatile in different ways.  IMO they're volatile in the same way and the problems for volatile asms apply to unspec_volatile too.

E.g. although cse.c will flush the table for unspec_volatile, it isn't the case that unspec_volatile forces a containing function to save all call-saved registers.  That would be excessive for a plain blockage instruction.  So again we seem to be assuming one thing in places like cse.c and another in the register allocator.

Code that uses the DF framework will also assume that registers are not implicitly clobbered by an unspec_volatile:

      case ASM_OPERANDS:
      case UNSPEC_VOLATILE:
      case TRAP_IF:
      case ASM_INPUT:
	{
	  /* Traditional and volatile asm instructions must be considered
	     to use and clobber all hard registers, all pseudo-registers
	     and all of memory.  So must TRAP_IF and UNSPEC_VOLATILE
	     operations.

	     Consider for instance a volatile asm that changes the fpu
	     rounding mode.  An insn should not be moved across this even
	     if it only uses pseudo-regs because it might give an
	     incorrectly rounded result.

	     However, flow.c's liveness computation did *not* do this,
	     giving the reasoning as ?!? Unfortunately, marking all hard
	     registers as live causes massive problems for the register
	     allocator and marking all pseudos as live creates mountains
	     of uninitialized variable warnings.

	     In order to maintain the status quo with regard to liveness
	     and uses, we do what flow.c did and just mark any regs we
	     can find in ASM_OPERANDS as used.  In global asm insns are
	     scanned and regs_asm_clobbered is filled out.

	     For all ASM_OPERANDS, we must traverse the vector of input
	     operands.  We can not just fall through here since then we
	     would be confused by the ASM_INPUT rtx inside ASM_OPERANDS,
	     which do not indicate traditional asms unlike their normal
	     usage.  */
	  if (code == ASM_OPERANDS)
	    {
	      int j;

	      for (j = 0; j < ASM_OPERANDS_INPUT_LENGTH (x); j++)
		df_uses_record (collection_rec, &ASM_OPERANDS_INPUT (x, j),
				DF_REF_REG_USE, bb, insn_info, flags);
	      return;
	    }
	  break;
	}

Also, ira-lives.c (which tracks the liveness of both pseudo and hard registers) doesn't mention volatile at all.

So most passes assume that no pseudos or hard registers will be implicitly clobbered by unspec_volatile, just like for a volatile asm.  And IMO that's right.  I think the rule should be the same for volatile asms and unspec_volatiles, and the same for registers as it already is for memory: if the instruction clobbers something, it should say so explicitly.
Volatile itself should:

  (a) prevent deletion or duplication of the operation
  (b) prevent reordering wrt other volatiles
  (c) prevent the operation from being considered equivalent to any
      other operation (even if it's structurally identical and has the
      same inputs)

but nothing beyond that.

So in terms of where we go from here, I'd hope we'd add something like fake hard registers to track any processor mode of interest, such as the FP rounding mode once that is modelled properly.  Then anything that relies on a specific mode would use that register and anything that changes the mode would set the register, meaning we have a proper dependency chain.  I think we should do the same for whatever unspec_volatiles caused the RTL CSE & co. to be so conservative.  Trying to leave it implicit seems too error-prone, because then you have to make sure that every rtl pass agrees