Re: RFC: IRA patch to reduce lifetimes
On Mon, May 21, 2012 at 9:33 AM, H.J. Lu hjl.to...@gmail.com wrote:
> On Wed, Apr 11, 2012 at 7:35 AM, Bernd Schmidt ber...@codesourcery.com wrote:
>> On 12/23/2011 05:31 PM, Vladimir Makarov wrote:
>>> On 12/21/2011 09:09 AM, Bernd Schmidt wrote:
>>>> This patch was an experiment to see if we can get the same
>>>> improvement with modifications to IRA, making it more tolerant to
>>>> over-aggressive scheduling. The idea is that if an instruction sets
>>>> a register A, and all its inputs are live and unmodified for the
>>>> lifetime of A, then moving the instruction downwards towards its
>>>> first use is going to be beneficial from a register pressure point
>>>> of view.
>>>>
>>>> That alone, however, turns out to be too aggressive; performance
>>>> drops, presumably because we undo too many scheduling decisions.
>>>> So, the patch detects such situations, and splits the pseudo; a new
>>>> pseudo is introduced in the original setting instruction, and a
>>>> copy is added before the first use. If the new pseudo does not get
>>>> a hard register, it is removed again and instead the setting
>>>> instruction is moved to the point of the copy.
>>>>
>>>> This gets up to 6.5% on 456.hmmer on the mips target I was working
>>>> on; an embedded benchmark suite also seems to have a (small)
>>>> geomean improvement. On x86_64, I've tested spec2k, where specint
>>>> is unchanged and specfp has a tiny performance regression. All
>>>> these tests were done with a gcc-4.6 based tree.
>>>>
>>>> Thoughts? Currently the patch feels somewhat bolted on to the side
>>>> of IRA, maybe there's a nicer way to achieve this?
>>>
>>> I think that is an excellent idea. I used an analogous approach for
>>> splitting pseudos in IRA on loop bounds, even if the pseudo gets a
>>> hard register inside and outside the loops. The copies are removed
>>> if the live ranges were not spilled in reload.
>>>
>>> I have no problem with this patch. It is just a small change in IRA.
>>
>> Sounds like you're happier with the patch than I am, so who am I to
>> argue. Here's an updated version against current trunk, with some cc0
>> bugfixes that I've since discovered to be necessary.
>>
>> Bootstrapped and tested (but not benchmarked again) on i686-linux.
>> Ok?
>>
>> Bernd
>
> This caused:
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53411

This also caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53495

--
H.J.
Re: RFC: IRA patch to reduce lifetimes
On Wed, Apr 11, 2012 at 7:35 AM, Bernd Schmidt ber...@codesourcery.com wrote:
> On 12/23/2011 05:31 PM, Vladimir Makarov wrote:
>> On 12/21/2011 09:09 AM, Bernd Schmidt wrote:
>>> This patch was an experiment to see if we can get the same
>>> improvement with modifications to IRA, making it more tolerant to
>>> over-aggressive scheduling. The idea is that if an instruction sets
>>> a register A, and all its inputs are live and unmodified for the
>>> lifetime of A, then moving the instruction downwards towards its
>>> first use is going to be beneficial from a register pressure point
>>> of view.
>>>
>>> That alone, however, turns out to be too aggressive; performance
>>> drops, presumably because we undo too many scheduling decisions.
>>> So, the patch detects such situations, and splits the pseudo; a new
>>> pseudo is introduced in the original setting instruction, and a
>>> copy is added before the first use. If the new pseudo does not get
>>> a hard register, it is removed again and instead the setting
>>> instruction is moved to the point of the copy.
>>>
>>> This gets up to 6.5% on 456.hmmer on the mips target I was working
>>> on; an embedded benchmark suite also seems to have a (small)
>>> geomean improvement. On x86_64, I've tested spec2k, where specint
>>> is unchanged and specfp has a tiny performance regression. All
>>> these tests were done with a gcc-4.6 based tree.
>>>
>>> Thoughts? Currently the patch feels somewhat bolted on to the side
>>> of IRA, maybe there's a nicer way to achieve this?
>>
>> I think that is an excellent idea. I used an analogous approach for
>> splitting pseudos in IRA on loop bounds, even if the pseudo gets a
>> hard register inside and outside the loops. The copies are removed
>> if the live ranges were not spilled in reload.
>>
>> I have no problem with this patch. It is just a small change in IRA.
>
> Sounds like you're happier with the patch than I am, so who am I to
> argue. Here's an updated version against current trunk, with some cc0
> bugfixes that I've since discovered to be necessary.
>
> Bootstrapped and tested (but not benchmarked again) on i686-linux.
> Ok?
>
> Bernd

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=53411

--
H.J.
Re: RFC: IRA patch to reduce lifetimes
On 12/23/2011 05:31 PM, Vladimir Makarov wrote:
> On 12/21/2011 09:09 AM, Bernd Schmidt wrote:
>> This patch was an experiment to see if we can get the same
>> improvement with modifications to IRA, making it more tolerant to
>> over-aggressive scheduling. The idea is that if an instruction sets
>> a register A, and all its inputs are live and unmodified for the
>> lifetime of A, then moving the instruction downwards towards its
>> first use is going to be beneficial from a register pressure point
>> of view.
>>
>> That alone, however, turns out to be too aggressive; performance
>> drops, presumably because we undo too many scheduling decisions.
>> So, the patch detects such situations, and splits the pseudo; a new
>> pseudo is introduced in the original setting instruction, and a
>> copy is added before the first use. If the new pseudo does not get
>> a hard register, it is removed again and instead the setting
>> instruction is moved to the point of the copy.
>>
>> This gets up to 6.5% on 456.hmmer on the mips target I was working
>> on; an embedded benchmark suite also seems to have a (small)
>> geomean improvement. On x86_64, I've tested spec2k, where specint
>> is unchanged and specfp has a tiny performance regression. All
>> these tests were done with a gcc-4.6 based tree.
>>
>> Thoughts? Currently the patch feels somewhat bolted on to the side
>> of IRA, maybe there's a nicer way to achieve this?
>
> I think that is an excellent idea. I used an analogous approach for
> splitting pseudos in IRA on loop bounds, even if the pseudo gets a
> hard register inside and outside the loops. The copies are removed
> if the live ranges were not spilled in reload.
>
> I have no problem with this patch. It is just a small change in IRA.

Sounds like you're happier with the patch than I am, so who am I to
argue. Here's an updated version against current trunk, with some cc0
bugfixes that I've since discovered to be necessary.

Bootstrapped and tested (but not benchmarked again) on i686-linux. Ok?

Bernd

* dbgcnt.def (ira_move): New counter.
* ira-int.h (ira_create_new_reg): Declare function.
  (first_moveable_pseudo, last_moveable_pseudo): Declare variables.
* ira-emit.c (ira_create_new_reg): Renamed from create_new_reg and
  no longer static. All callers changed.
* ira.c: Include dbgcnt.h.
  (rtx_moveable_p, insn_dominated_by_p, find_moveable_pseudos,
  move_unallocated_pseudos): New static functions.
  (first_moveable_pseudo, last_moveable_pseudo): New global variables.
  (pseudo_replaced_reg, pseudo_move_insn): New static variables.
  (ira): Call find_moveable_pseudos and move_unallocated_pseudos.
* ira-costs.c (find_costs_and_classes): Assign a memory cost of zero
  to the pseudos generated in find_moveable_pseudos.
* Makefile.in (ira.o): Add $(DBGCNT_H).

Index: gcc/dbgcnt.def
===================================================================
--- gcc/dbgcnt.def (revision 186270)
+++ gcc/dbgcnt.def (working copy)
@@ -184,3 +184,4 @@ DEBUG_COUNTER (sms_sched_loop)
 DEBUG_COUNTER (store_motion)
 DEBUG_COUNTER (split_for_sched2)
 DEBUG_COUNTER (tail_call)
+DEBUG_COUNTER (ira_move)
Index: gcc/ira-int.h
===================================================================
--- gcc/ira-int.h (revision 186270)
+++ gcc/ira-int.h (working copy)
@@ -1416,3 +1416,6 @@ ira_allocate_and_set_or_copy_costs (int
         reg_costs[i] = val;
     }
 }
+
+extern rtx ira_create_new_reg (rtx);
+extern int first_moveable_pseudo, last_moveable_pseudo;
Index: gcc/ira-emit.c
===================================================================
--- gcc/ira-emit.c (revision 186270)
+++ gcc/ira-emit.c (working copy)
@@ -330,8 +330,8 @@ add_to_edge_list (edge e, move_t move, b
 
 /* Create and return new pseudo-register with the same attributes as
    ORIGINAL_REG.  */
-static rtx
-create_new_reg (rtx original_reg)
+rtx
+ira_create_new_reg (rtx original_reg)
 {
   rtx new_reg;
 
@@ -625,7 +625,7 @@ change_loop (ira_loop_tree_node_t node)
                 fprintf (ira_dump_file, "%i vs parent %i:",
                          ALLOCNO_HARD_REGNO (allocno),
                          ALLOCNO_HARD_REGNO (parent_allocno));
-              set_allocno_reg (allocno, create_new_reg (original_reg));
+              set_allocno_reg (allocno, ira_create_new_reg (original_reg));
             }
         }
     }
@@ -646,7 +646,7 @@ change_loop (ira_loop_tree_node_t node)
           if (! used_p)
             continue;
           bitmap_set_bit (renamed_regno_bitmap, regno);
-          set_allocno_reg (allocno, create_new_reg (allocno_emit_reg (allocno)));
+          set_allocno_reg (allocno, ira_create_new_reg (allocno_emit_reg (allocno)));
         }
     }
 
@@ -852,7 +852,7 @@ modify_move_list (move_t list)
               ALLOCNO_ASSIGNED_P (new_allocno) = true;
               ALLOCNO_HARD_REGNO (new_allocno) = -1;
               ALLOCNO_EMIT_DATA (new_allocno)->reg
-                = create_new_reg (allocno_emit_reg (set_move->to));
Re: RFC: IRA patch to reduce lifetimes
On 04/11/2012 10:35 AM, Bernd Schmidt wrote:
> On 12/23/2011 05:31 PM, Vladimir Makarov wrote:
>> On 12/21/2011 09:09 AM, Bernd Schmidt wrote:
>>> This patch was an experiment to see if we can get the same
>>> improvement with modifications to IRA, making it more tolerant to
>>> over-aggressive scheduling. The idea is that if an instruction sets
>>> a register A, and all its inputs are live and unmodified for the
>>> lifetime of A, then moving the instruction downwards towards its
>>> first use is going to be beneficial from a register pressure point
>>> of view.
>>>
>>> That alone, however, turns out to be too aggressive; performance
>>> drops, presumably because we undo too many scheduling decisions.
>>> So, the patch detects such situations, and splits the pseudo; a new
>>> pseudo is introduced in the original setting instruction, and a
>>> copy is added before the first use. If the new pseudo does not get
>>> a hard register, it is removed again and instead the setting
>>> instruction is moved to the point of the copy.
>>>
>>> This gets up to 6.5% on 456.hmmer on the mips target I was working
>>> on; an embedded benchmark suite also seems to have a (small)
>>> geomean improvement. On x86_64, I've tested spec2k, where specint
>>> is unchanged and specfp has a tiny performance regression. All
>>> these tests were done with a gcc-4.6 based tree.
>>>
>>> Thoughts? Currently the patch feels somewhat bolted on to the side
>>> of IRA, maybe there's a nicer way to achieve this?
>>
>> I think that is an excellent idea. I used an analogous approach for
>> splitting pseudos in IRA on loop bounds, even if the pseudo gets a
>> hard register inside and outside the loops. The copies are removed
>> if the live ranges were not spilled in reload.
>>
>> I have no problem with this patch. It is just a small change in IRA.
>
> Sounds like you're happier with the patch than I am, so who am I to
> argue. Here's an updated version against current trunk, with some cc0
> bugfixes that I've since discovered to be necessary.
>
> Bootstrapped and tested (but not benchmarked again) on i686-linux.
> Ok?

It is ok. At least it will be useful for gcc4.8.

But I am not sure about the longevity of this code. Since my last
email, a lot has changed in the LRA project (which I hope will be ready
for gcc4.9). I've implemented live range splitting which works
analogously: some pseudo ranges are split, and if a split range does
not change the assignment, the pseudo live range split is undone. The
difference is that your approach uses a global view (global RA), while
mine is done locally. So it needs more investigation to see how
different the results are. It seems to me that they will complement
each other. I'll probably investigate this when/if LRA is merged.

In any case, thanks, Bernd. It is ok to commit this patch.
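
The split-then-undo scheme discussed in this thread can be sketched as a small decision function. This is a toy model in plain C, not code from the patch: `struct split_candidate`, `resolve_split`, and the `hard_regno` array are hypothetical names invented for this illustration. The only convention borrowed from the patch context is that a hard register number of -1 means the pseudo was not allocated.

```c
#include <assert.h>
#include <stddef.h>

/* Possible results after register allocation has run (illustrative).  */
enum split_outcome { KEEP_COPY, MOVE_SETTER };

/* Toy description of one split pseudo; all fields are assumptions
   made for this sketch, not GCC data structures.  */
struct split_candidate {
  int orig_pseudo;  /* pseudo originally set by the instruction */
  int new_pseudo;   /* fresh pseudo introduced in the setting insn */
  int set_insn;     /* position of the setting instruction */
  int copy_insn;    /* position of the copy placed before the first use */
};

/* hard_regno[p] is the hard register assigned to pseudo p, or -1 if
   the allocator did not give it one.  */
enum split_outcome
resolve_split (struct split_candidate *c, const int *hard_regno)
{
  assert (c != NULL && hard_regno != NULL);

  /* The new pseudo was allocated: keep the split and its copy, so the
     scheduler's placement of the setting instruction is preserved.  */
  if (hard_regno[c->new_pseudo] >= 0)
    return KEEP_COPY;

  /* The new pseudo was not allocated: drop it again and sink the
     setting instruction to where the copy was, so that it sets the
     original pseudo directly.  */
  c->set_insn = c->copy_insn;
  c->new_pseudo = c->orig_pseudo;
  return MOVE_SETTER;
}
```

In this model the scheduling decision is only undone for pseudos that would otherwise have been left without a hard register, which matches the motivation given for splitting rather than always moving the instruction.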
Re: RFC: IRA patch to reduce lifetimes
On 12/21/2011 09:09 AM, Bernd Schmidt wrote:
> For a customer I've looked into improving code for 456.hmmer on a
> mips64 target. The benchmark responds to -fsched-pressure, which
> reduces lifetimes of a few registers.
>
> This patch was an experiment to see if we can get the same
> improvement with modifications to IRA, making it more tolerant to
> over-aggressive scheduling. The idea is that if an instruction sets
> a register A, and all its inputs are live and unmodified for the
> lifetime of A, then moving the instruction downwards towards its
> first use is going to be beneficial from a register pressure point
> of view.
>
> That alone, however, turns out to be too aggressive; performance
> drops, presumably because we undo too many scheduling decisions.
> So, the patch detects such situations, and splits the pseudo; a new
> pseudo is introduced in the original setting instruction, and a
> copy is added before the first use. If the new pseudo does not get
> a hard register, it is removed again and instead the setting
> instruction is moved to the point of the copy.
>
> This gets up to 6.5% on 456.hmmer on the mips target I was working
> on; an embedded benchmark suite also seems to have a (small)
> geomean improvement. On x86_64, I've tested spec2k, where specint
> is unchanged and specfp has a tiny performance regression. All
> these tests were done with a gcc-4.6 based tree.
>
> Thoughts? Currently the patch feels somewhat bolted on to the side
> of IRA, maybe there's a nicer way to achieve this?

I think that is an excellent idea. I used an analogous approach for
splitting pseudos in IRA on loop bounds, even if the pseudo gets a hard
register inside and outside the loops. The copies are removed if the
live ranges were not spilled in reload.

I have no problem with this patch. It is just a small change in IRA.
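
The register-pressure argument in the quoted description can be illustrated with a toy model. The sketch below is plain C, not GCC code: `struct live_range` and `max_pressure` are hypothetical names invented for this illustration, and it assumes a straight-line instruction sequence in which a value is live from its defining instruction through its last use.

```c
#include <assert.h>
#include <stddef.h>

/* Toy live range: positions are indices into a straight-line
   instruction sequence (an assumption of this sketch).  */
struct live_range {
  int def;       /* index of the defining instruction */
  int last_use;  /* index of the last instruction that reads the value */
};

/* Return the largest number of values that are live at the same
   position in [0, n_insns).  */
int
max_pressure (const struct live_range *r, size_t n, int n_insns)
{
  assert (r != NULL && n_insns >= 0);

  int max = 0;
  for (int p = 0; p < n_insns; p++)
    {
      int live = 0;
      for (size_t i = 0; i < n; i++)
        if (r[i].def <= p && p <= r[i].last_use)
          live++;
      if (live > max)
        max = live;
    }
  return max;
}
```

As a usage example, take two inputs live over positions 0..9 and 1..9, a register A defined at 2 with its only use at 8, and two temporaries live over 3..5 and 4..6. With A defined at position 2, `max_pressure` reports 5; after sinking the definition of A to position 7, just before its use, it reports 4. The inputs were already live across the whole range, so moving the instruction does not lengthen any other lifetime.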