Re: [Revert][AArch64] PR 63521 Define REG_ALLOC_ORDER/HONOR_REG_ALLOC_ORDER
On 08/08/2016 01:04 PM, Jiong Wang wrote:
> [...] There is a very tiny performance regression on SPEC2K6INT
> 464.h264ref.  Checking the codegen, there is some bad instruction
> scheduling; it looks to me to be caused by REG_ALLOC_ORDER not being
> used consistently inside IRA, so that parts of the code use the new
> allocation order while other parts still use the original order.
>
> I see that in ira.c:setup_class_hard_regs we check whether
> REG_ALLOC_ORDER is defined:
>
>   #ifdef REG_ALLOC_ORDER
>     hard_regno = reg_alloc_order[i];
>   #else
>     hard_regno = i;
>   #endif
>
> but in ira-color.c:assign_hard_reg there is no similar check:
>
>   /* We don't care about giving callee saved registers to allocnos no
>      living through calls because call clobbered registers are
>      allocated first (it is usual practice to put them first in
>      REG_ALLOC_ORDER).  */
>   mode = ALLOCNO_MODE (a);
>   for (i = 0; i < class_size; i++)
>     {
>       hard_regno = ira_class_hard_regs[aclass][i];
>
> We might want to use reg_alloc_order to map the above "i" if
> REG_ALLOC_ORDER is defined?  Vlad, any comments?

The order is already present in ira_class_hard_regs, so there is no need
to use it in ira-color.c::assign_hard_reg.
Re: [Revert][AArch64] PR 63521 Define REG_ALLOC_ORDER/HONOR_REG_ALLOC_ORDER
Jiong Wang writes:
> Andrew Pinski writes:
>> On Mon, Jul 27, 2015 at 3:36 AM, James Greenhalgh wrote:
>>> On Mon, Jul 27, 2015 at 10:52:58AM +0100, pins...@gmail.com wrote:
>>>> On Jul 27, 2015, at 2:26 AM, Jiong Wang wrote:
>>>>> Andrew Pinski writes:
>>>>>> On Fri, Jul 24, 2015 at 2:07 AM, Jiong Wang wrote:
>>>>>>> James Greenhalgh writes:
>>>>>>>> On Wed, May 20, 2015 at 01:35:41PM +0100, Jiong Wang wrote:
>>>>>>>>> Current IRA still uses both target macros in a few places.
>>>>>>>>>
>>>>>>>>> Tell IRA to use the order we defined rather than its own cost
>>>>>>>>> calculation.  Allocate caller-saved first, then callee-saved.
>>>>>>>>>
>>>>>>>>> This is especially useful for LR/x30, as it is free to allocate
>>>>>>>>> and is pure caller-saved when used in a leaf function.
>>>>>>>>>
>>>>>>>>> Haven't noticed significant impact on benchmarks, but by
>>>>>>>>> grepping some keywords like "Spilling", "Push.*spill" etc. in
>>>>>>>>> the IRA RTL dump, the number is smaller.
>>>>>>>>>
>>>>>>>>> OK for trunk?
>>>>>>>>
>>>>>>>> OK, sorry for the delay.
>>>>>>>>
>>>>>>>> It might be mail client mangling, but please check that the
>>>>>>>> trailing slashes line up in the version that gets committed.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> James
>>>>>>>>
>>>>>>>>> 2015-05-19  Jiong Wang
>>>>>>>>>
>>>>>>>>> gcc/
>>>>>>>>>   PR 63521
>>>>>>>>>   * config/aarch64/aarch64.h (REG_ALLOC_ORDER): Define.
>>>>>>>>>   (HONOR_REG_ALLOC_ORDER): Define.
>>>>>>>
>>>>>>> Patch reverted.
>>>>>>
>>>>>> I did not see a reason why this patch was reverted.  Maybe I am
>>>>>> missing an email or something.
>>>>>
>>>>> There are several execution regressions in the gcc testsuite.  As
>>>>> far as I can see this patch exposed hidden bugs in those testcases,
>>>>> but there might be one other issue, so to be conservative I
>>>>> temporarily reverted this patch.
>>>>
>>>> If you are talking about:
>>>>   gcc.target/aarch64/aapcs64/func-ret-2.c execution
>>>> etc., these test cases are too dependent on the original register
>>>> allocation order and really can be safely ignored.  Really these
>>>> three tests should be moved or written in a more sane way.
>>>
>>> Yup, completely agreed - but the testcases do throw up something
>>> interesting.  If we are allocating registers to hold 128-bit values,
>>> and we pick x7 as highest preference, we implicitly allocate x8 along
>>> with it.  I think we probably see the same thing if the first thing
>>> we do in a function is a structure copy through a back-end expanded
>>> movmem, which will likely begin with a 128-bit LDP using x7, x8.
>>>
>>> If the argument for this patch is that we prefer to allocate x7-x0
>>> first, followed by x8, then we've potentially made a sub-optimal
>>> decision; our allocation order for 128-bit values is x7, x8, x5, x6,
>>> etc.
>>>
>>> My hunch is that we *might* get better code generation in this corner
>>> case out of some permutation of the allocation order for argument
>>> registers.  I'm thinking something along the lines of:
>>>
>>>   {x6, x5, x4, x7, x3, x2, x1, x0, x8, ... }
>>>
>>> I asked Jiong to take a look at that, and I agree with his decision
>>> to reduce the churn on trunk and just revert the patch until we've
>>> come to a conclusion based on some evidence - rather than just my
>>> hunch!  I agree that it would be harmless on trunk from a testing
>>> point of view, but I think Jiong is right to revert the patch until
>>> we better understand the code-generation implications.
>>>
>>> Of course, it might be that I am completely wrong!  If you've already
>>> taken a look at using a register allocation order like the example I
>>> gave and have something to share, I'd be happy to read your advice!
>>
>> Any news on this patch?  It has been a year since it was reverted for
>> a bad test that was failing.
>
> Hi Andrew,
>
> Yeah, those tests actually are expected to fail once the register
> allocation order changes; it's clearly documented in the comments in
> gcc.target/aarch64/aapcs64/abitest-2.h:
>
>   /* ...
>      Note that for a value that is returned in the caller-allocated
>      memory block, we get the address from the saved x8 register.  x8
>      is saved just after the callee returns; we assume that x8 has not
>      been clobbered by then, although there is no requirement for the
>      callee to preserve the value stored in x8.  Luckily, all test
>      cases here are simple enough that x8 doesn't normally get
>      clobbered (although not guaranteed).  */
>
> I had a local fix which used the redundant value returned in x0 to
> repair the clobbered value in x8, as they will be identical for a
> structure type return; however, that trick can't be played any more,
> as we recently defined TARGET_OMIT_STRUCT_RETURN_REG to true, which
> will remove that redundant x8 to x0 copy.
Re: [Revert][AArch64] PR 63521 Define REG_ALLOC_ORDER/HONOR_REG_ALLOC_ORDER
Andrew Pinski writes:
> On Mon, Jul 27, 2015 at 3:36 AM, James Greenhalgh wrote:
>> On Mon, Jul 27, 2015 at 10:52:58AM +0100, pins...@gmail.com wrote:
>>> On Jul 27, 2015, at 2:26 AM, Jiong Wang wrote:
>>>> Andrew Pinski writes:
>>>>> On Fri, Jul 24, 2015 at 2:07 AM, Jiong Wang wrote:
>>>>>> James Greenhalgh writes:
>>>>>>> On Wed, May 20, 2015 at 01:35:41PM +0100, Jiong Wang wrote:
>>>>>>>> Current IRA still uses both target macros in a few places.
>>>>>>>>
>>>>>>>> Tell IRA to use the order we defined rather than its own cost
>>>>>>>> calculation.  Allocate caller-saved first, then callee-saved.
>>>>>>>>
>>>>>>>> This is especially useful for LR/x30, as it is free to allocate
>>>>>>>> and is pure caller-saved when used in a leaf function.
>>>>>>>>
>>>>>>>> Haven't noticed significant impact on benchmarks, but by
>>>>>>>> grepping some keywords like "Spilling", "Push.*spill" etc. in
>>>>>>>> the IRA RTL dump, the number is smaller.
>>>>>>>>
>>>>>>>> OK for trunk?
>>>>>>>
>>>>>>> OK, sorry for the delay.
>>>>>>>
>>>>>>> It might be mail client mangling, but please check that the
>>>>>>> trailing slashes line up in the version that gets committed.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>>> 2015-05-19  Jiong Wang
>>>>>>>>
>>>>>>>> gcc/
>>>>>>>>   PR 63521
>>>>>>>>   * config/aarch64/aarch64.h (REG_ALLOC_ORDER): Define.
>>>>>>>>   (HONOR_REG_ALLOC_ORDER): Define.
>>>>>>
>>>>>> Patch reverted.
>>>>>
>>>>> I did not see a reason why this patch was reverted.  Maybe I am
>>>>> missing an email or something.
>>>>
>>>> There are several execution regressions in the gcc testsuite.  As
>>>> far as I can see this patch exposed hidden bugs in those testcases,
>>>> but there might be one other issue, so to be conservative I
>>>> temporarily reverted this patch.
>>>
>>> If you are talking about:
>>>   gcc.target/aarch64/aapcs64/func-ret-2.c execution
>>> etc., these test cases are too dependent on the original register
>>> allocation order and really can be safely ignored.  Really these
>>> three tests should be moved or written in a more sane way.
>>
>> Yup, completely agreed - but the testcases do throw up something
>> interesting.  If we are allocating registers to hold 128-bit values,
>> and we pick x7 as highest preference, we implicitly allocate x8 along
>> with it.  I think we probably see the same thing if the first thing we
>> do in a function is a structure copy through a back-end expanded
>> movmem, which will likely begin with a 128-bit LDP using x7, x8.
>>
>> If the argument for this patch is that we prefer to allocate x7-x0
>> first, followed by x8, then we've potentially made a sub-optimal
>> decision; our allocation order for 128-bit values is x7, x8, x5, x6,
>> etc.
>>
>> My hunch is that we *might* get better code generation in this corner
>> case out of some permutation of the allocation order for argument
>> registers.  I'm thinking something along the lines of:
>>
>>   {x6, x5, x4, x7, x3, x2, x1, x0, x8, ... }
>>
>> I asked Jiong to take a look at that, and I agree with his decision to
>> reduce the churn on trunk and just revert the patch until we've come
>> to a conclusion based on some evidence - rather than just my hunch!  I
>> agree that it would be harmless on trunk from a testing point of view,
>> but I think Jiong is right to revert the patch until we better
>> understand the code-generation implications.
>>
>> Of course, it might be that I am completely wrong!  If you've already
>> taken a look at using a register allocation order like the example I
>> gave and have something to share, I'd be happy to read your advice!
>
> Any news on this patch?  It has been a year since it was reverted for a
> bad test that was failing.

Hi Andrew,

Yeah, those tests actually are expected to fail once the register
allocation order changes; it's clearly documented in the comments in
gcc.target/aarch64/aapcs64/abitest-2.h:

  /* ...
     Note that for a value that is returned in the caller-allocated
     memory block, we get the address from the saved x8 register.  x8 is
     saved just after the callee returns; we assume that x8 has not been
     clobbered by then, although there is no requirement for the callee
     to preserve the value stored in x8.  Luckily, all test cases here
     are simple enough that x8 doesn't normally get clobbered (although
     not guaranteed).  */

I had a local fix which used the redundant value returned in x0 to
repair the clobbered value in x8, as they will be identical for a
structure type return; however, that trick can't be played any more, as
we recently defined TARGET_OMIT_STRUCT_RETURN_REG to true, which will
remove that redundant x8 to x0 copy.

Anyway, I will come back with some benchmark results of this patch on
top of latest trunk after the weekend run, and also answers [...]
Re: [Revert][AArch64] PR 63521 Define REG_ALLOC_ORDER/HONOR_REG_ALLOC_ORDER
On Mon, Jul 27, 2015 at 3:36 AM, James Greenhalgh wrote:
> On Mon, Jul 27, 2015 at 10:52:58AM +0100, pins...@gmail.com wrote:
>> On Jul 27, 2015, at 2:26 AM, Jiong Wang wrote:
>>> Andrew Pinski writes:
>>>> On Fri, Jul 24, 2015 at 2:07 AM, Jiong Wang wrote:
>>>>> James Greenhalgh writes:
>>>>>> On Wed, May 20, 2015 at 01:35:41PM +0100, Jiong Wang wrote:
>>>>>>> Current IRA still uses both target macros in a few places.
>>>>>>>
>>>>>>> Tell IRA to use the order we defined rather than its own cost
>>>>>>> calculation.  Allocate caller-saved first, then callee-saved.
>>>>>>>
>>>>>>> This is especially useful for LR/x30, as it is free to allocate
>>>>>>> and is pure caller-saved when used in a leaf function.
>>>>>>>
>>>>>>> Haven't noticed significant impact on benchmarks, but by grepping
>>>>>>> some keywords like "Spilling", "Push.*spill" etc. in the IRA RTL
>>>>>>> dump, the number is smaller.
>>>>>>>
>>>>>>> OK for trunk?
>>>>>>
>>>>>> OK, sorry for the delay.
>>>>>>
>>>>>> It might be mail client mangling, but please check that the
>>>>>> trailing slashes line up in the version that gets committed.
>>>>>>
>>>>>> Thanks,
>>>>>> James
>>>>>>
>>>>>>> 2015-05-19  Jiong Wang
>>>>>>>
>>>>>>> gcc/
>>>>>>>   PR 63521
>>>>>>>   * config/aarch64/aarch64.h (REG_ALLOC_ORDER): Define.
>>>>>>>   (HONOR_REG_ALLOC_ORDER): Define.
>>>>>
>>>>> Patch reverted.
>>>>
>>>> I did not see a reason why this patch was reverted.  Maybe I am
>>>> missing an email or something.
>>>
>>> There are several execution regressions in the gcc testsuite.  As far
>>> as I can see this patch exposed hidden bugs in those testcases, but
>>> there might be one other issue, so to be conservative I temporarily
>>> reverted this patch.
>>
>> If you are talking about:
>>   gcc.target/aarch64/aapcs64/func-ret-2.c execution
>> etc., these test cases are too dependent on the original register
>> allocation order and really can be safely ignored.  Really these three
>> tests should be moved or written in a more sane way.
>
> Yup, completely agreed - but the testcases do throw up something
> interesting.  If we are allocating registers to hold 128-bit values,
> and we pick x7 as highest preference, we implicitly allocate x8 along
> with it.  I think we probably see the same thing if the first thing we
> do in a function is a structure copy through a back-end expanded
> movmem, which will likely begin with a 128-bit LDP using x7, x8.
>
> If the argument for this patch is that we prefer to allocate x7-x0
> first, followed by x8, then we've potentially made a sub-optimal
> decision; our allocation order for 128-bit values is x7, x8, x5, x6,
> etc.
>
> My hunch is that we *might* get better code generation in this corner
> case out of some permutation of the allocation order for argument
> registers.  I'm thinking something along the lines of:
>
>   {x6, x5, x4, x7, x3, x2, x1, x0, x8, ... }
>
> I asked Jiong to take a look at that, and I agree with his decision to
> reduce the churn on trunk and just revert the patch until we've come to
> a conclusion based on some evidence - rather than just my hunch!  I
> agree that it would be harmless on trunk from a testing point of view,
> but I think Jiong is right to revert the patch until we better
> understand the code-generation implications.
>
> Of course, it might be that I am completely wrong!  If you've already
> taken a look at using a register allocation order like the example I
> gave and have something to share, I'd be happy to read your advice!
>
> Thanks,
> James

Any news on this patch?  It has been a year since it was reverted for a
bad test that was failing.

Thanks,
Andrew
Re: [Revert][AArch64] PR 63521 Define REG_ALLOC_ORDER/HONOR_REG_ALLOC_ORDER
On Mon, Jul 27, 2015 at 10:52:58AM +0100, pins...@gmail.com wrote:
> On Jul 27, 2015, at 2:26 AM, Jiong Wang <jiong.w...@arm.com> wrote:
>> Andrew Pinski writes:
>>> On Fri, Jul 24, 2015 at 2:07 AM, Jiong Wang <jiong.w...@arm.com>
>>> wrote:
>>>> James Greenhalgh writes:
>>>>> On Wed, May 20, 2015 at 01:35:41PM +0100, Jiong Wang wrote:
>>>>>> Current IRA still uses both target macros in a few places.
>>>>>>
>>>>>> Tell IRA to use the order we defined rather than its own cost
>>>>>> calculation.  Allocate caller-saved first, then callee-saved.
>>>>>>
>>>>>> This is especially useful for LR/x30, as it is free to allocate
>>>>>> and is pure caller-saved when used in a leaf function.
>>>>>>
>>>>>> Haven't noticed significant impact on benchmarks, but by grepping
>>>>>> some keywords like "Spilling", "Push.*spill" etc. in the IRA RTL
>>>>>> dump, the number is smaller.
>>>>>>
>>>>>> OK for trunk?
>>>>>
>>>>> OK, sorry for the delay.
>>>>>
>>>>> It might be mail client mangling, but please check that the
>>>>> trailing slashes line up in the version that gets committed.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>>> 2015-05-19  Jiong Wang  <jiong.w...@arm.com>
>>>>>>
>>>>>> gcc/
>>>>>>   PR 63521
>>>>>>   * config/aarch64/aarch64.h (REG_ALLOC_ORDER): Define.
>>>>>>   (HONOR_REG_ALLOC_ORDER): Define.
>>>>
>>>> Patch reverted.
>>>
>>> I did not see a reason why this patch was reverted.  Maybe I am
>>> missing an email or something.
>>
>> There are several execution regressions in the gcc testsuite.  As far
>> as I can see this patch exposed hidden bugs in those testcases, but
>> there might be one other issue, so to be conservative I temporarily
>> reverted this patch.
>
> If you are talking about:
>   gcc.target/aarch64/aapcs64/func-ret-2.c execution
> etc., these test cases are too dependent on the original register
> allocation order and really can be safely ignored.  Really these three
> tests should be moved or written in a more sane way.

Yup, completely agreed - but the testcases do throw up something
interesting.  If we are allocating registers to hold 128-bit values, and
we pick x7 as highest preference, we implicitly allocate x8 along with
it.  I think we probably see the same thing if the first thing we do in
a function is a structure copy through a back-end expanded movmem, which
will likely begin with a 128-bit LDP using x7, x8.

If the argument for this patch is that we prefer to allocate x7-x0
first, followed by x8, then we've potentially made a sub-optimal
decision; our allocation order for 128-bit values is x7, x8, x5, x6,
etc.

My hunch is that we *might* get better code generation in this corner
case out of some permutation of the allocation order for argument
registers.  I'm thinking something along the lines of:

  {x6, x5, x4, x7, x3, x2, x1, x0, x8, ... }

I asked Jiong to take a look at that, and I agree with his decision to
reduce the churn on trunk and just revert the patch until we've come to
a conclusion based on some evidence - rather than just my hunch!  I
agree that it would be harmless on trunk from a testing point of view,
but I think Jiong is right to revert the patch until we better
understand the code-generation implications.

Of course, it might be that I am completely wrong!  If you've already
taken a look at using a register allocation order like the example I
gave and have something to share, I'd be happy to read your advice!

Thanks,
James
Re: [Revert][AArch64] PR 63521 Define REG_ALLOC_ORDER/HONOR_REG_ALLOC_ORDER
Andrew Pinski writes:
> On Fri, Jul 24, 2015 at 2:07 AM, Jiong Wang <jiong.w...@arm.com> wrote:
>> James Greenhalgh writes:
>>> On Wed, May 20, 2015 at 01:35:41PM +0100, Jiong Wang wrote:
>>>> Current IRA still uses both target macros in a few places.
>>>>
>>>> Tell IRA to use the order we defined rather than its own cost
>>>> calculation.  Allocate caller-saved first, then callee-saved.
>>>>
>>>> This is especially useful for LR/x30, as it is free to allocate and
>>>> is pure caller-saved when used in a leaf function.
>>>>
>>>> Haven't noticed significant impact on benchmarks, but by grepping
>>>> some keywords like "Spilling", "Push.*spill" etc. in the IRA RTL
>>>> dump, the number is smaller.
>>>>
>>>> OK for trunk?
>>>
>>> OK, sorry for the delay.
>>>
>>> It might be mail client mangling, but please check that the trailing
>>> slashes line up in the version that gets committed.
>>>
>>> Thanks,
>>> James
>>>
>>>> 2015-05-19  Jiong Wang  <jiong.w...@arm.com>
>>>>
>>>> gcc/
>>>>   PR 63521
>>>>   * config/aarch64/aarch64.h (REG_ALLOC_ORDER): Define.
>>>>   (HONOR_REG_ALLOC_ORDER): Define.
>>
>> Patch reverted.
>
> I did not see a reason why this patch was reverted.  Maybe I am missing
> an email or something.

There are several execution regressions in the gcc testsuite.  As far as
I can see this patch exposed hidden bugs in those testcases, but there
might be one other issue, so to be conservative I temporarily reverted
this patch.

> Thanks,
> Andrew

--
Regards,
Jiong
Re: [Revert][AArch64] PR 63521 Define REG_ALLOC_ORDER/HONOR_REG_ALLOC_ORDER
On Jul 27, 2015, at 2:26 AM, Jiong Wang <jiong.w...@arm.com> wrote:
> Andrew Pinski writes:
>> On Fri, Jul 24, 2015 at 2:07 AM, Jiong Wang <jiong.w...@arm.com>
>> wrote:
>>> James Greenhalgh writes:
>>>> On Wed, May 20, 2015 at 01:35:41PM +0100, Jiong Wang wrote:
>>>>> Current IRA still uses both target macros in a few places.
>>>>>
>>>>> Tell IRA to use the order we defined rather than its own cost
>>>>> calculation.  Allocate caller-saved first, then callee-saved.
>>>>>
>>>>> This is especially useful for LR/x30, as it is free to allocate and
>>>>> is pure caller-saved when used in a leaf function.
>>>>>
>>>>> Haven't noticed significant impact on benchmarks, but by grepping
>>>>> some keywords like "Spilling", "Push.*spill" etc. in the IRA RTL
>>>>> dump, the number is smaller.
>>>>>
>>>>> OK for trunk?
>>>>
>>>> OK, sorry for the delay.
>>>>
>>>> It might be mail client mangling, but please check that the trailing
>>>> slashes line up in the version that gets committed.
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>>> 2015-05-19  Jiong Wang  <jiong.w...@arm.com>
>>>>>
>>>>> gcc/
>>>>>   PR 63521
>>>>>   * config/aarch64/aarch64.h (REG_ALLOC_ORDER): Define.
>>>>>   (HONOR_REG_ALLOC_ORDER): Define.
>>>
>>> Patch reverted.
>>
>> I did not see a reason why this patch was reverted.  Maybe I am
>> missing an email or something.
>>
>> Thanks,
>> Andrew
>
> There are several execution regressions in the gcc testsuite.  As far
> as I can see this patch exposed hidden bugs in those testcases, but
> there might be one other issue, so to be conservative I temporarily
> reverted this patch.
>
> --
> Regards,
> Jiong

If you are talking about:
  gcc.target/aarch64/aapcs64/func-ret-2.c execution
etc., these test cases are too dependent on the original register
allocation order and really can be safely ignored.  Really these three
tests should be moved or written in a more sane way.

Thanks,
Andrew
Re: [Revert][AArch64] PR 63521 Define REG_ALLOC_ORDER/HONOR_REG_ALLOC_ORDER
On Fri, Jul 24, 2015 at 2:07 AM, Jiong Wang <jiong.w...@arm.com> wrote:
> James Greenhalgh writes:
>> On Wed, May 20, 2015 at 01:35:41PM +0100, Jiong Wang wrote:
>>> Current IRA still uses both target macros in a few places.
>>>
>>> Tell IRA to use the order we defined rather than its own cost
>>> calculation.  Allocate caller-saved first, then callee-saved.
>>>
>>> This is especially useful for LR/x30, as it is free to allocate and
>>> is pure caller-saved when used in a leaf function.
>>>
>>> Haven't noticed significant impact on benchmarks, but by grepping
>>> some keywords like "Spilling", "Push.*spill" etc. in the IRA RTL
>>> dump, the number is smaller.
>>>
>>> OK for trunk?
>>
>> OK, sorry for the delay.
>>
>> It might be mail client mangling, but please check that the trailing
>> slashes line up in the version that gets committed.
>>
>> Thanks,
>> James
>>
>>> 2015-05-19  Jiong Wang  <jiong.w...@arm.com>
>>>
>>> gcc/
>>>   PR 63521
>>>   * config/aarch64/aarch64.h (REG_ALLOC_ORDER): Define.
>>>   (HONOR_REG_ALLOC_ORDER): Define.
>
> Patch reverted.

I did not see a reason why this patch was reverted.  Maybe I am missing
an email or something.

Thanks,
Andrew