Re: Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 20/01/15 13:26, Maxim Kuvyrkov wrote: On Jan 20, 2015, at 1:24 PM, Richard Earnshaw rearn...@arm.com wrote: ... In general, how should someone tuning the compiler for this parameter select a value that isn't one of (-1, m_i_q_d+1)? From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere. If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day. You only mention 3 values: what was the fourth? Typo. No fourth. It might be better then to define a set of values that represent each of these cases and only allow the tuning parameters to select one of those. The init code then uses that set to select how to set up the various parameters to meet those goals. So something like ARM_SCHED_AUTOPREF_OFF ARM_SCHED_AUTOPREF_RANK ARM_SCHED_AUTOPREF_FULL A patch is attached. I bootstrapped it on arm-linux-gnueabihf. OK to apply? bootstrap failure on chromebook, reproduced on two chromebook. see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65020 please. -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 19/01/15 18:14, Maxim Kuvyrkov wrote: On Jan 19, 2015, at 6:05 PM, Richard Earnshaw rearn...@arm.com wrote: On 16/01/15 15:06, Maxim Kuvyrkov wrote: @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune = true, true, /* Prefer 32-bit encodings. */ true,/* Prefer Neon for stringops. */ 8, /* Maximum insns to inline memset. */ - ARM_FUSE_NOTHING /* Fuseable pairs of instructions. */ + ARM_FUSE_NOTHING,/* Fuseable pairs of instructions. */ + max_insn_queue_index + 1 /* Sched L2 autopref depth. */ }; Hmm, two issues here: 1) This requires a static constructor for the tuning table entry (since the value of max_insn_queue_index has to be looked up at run time. Are you sure? I didn't check the object files, but, since max_insn_queue_index is a const int, I would expect a relocation that would be resolved at link time, not a constructor. I did miss this during the original review, sorry about that. In theory yes, but in practice apparently not. Probably something LTO might be able to handle, but in the absence of defaults to LTO I don't see how we can guarantee something like this. I do see a static constructor being put in for this on a cross compiler build. 00d79338 _Z41__static_initialization_and_destruction_0ii: d79338: 55 push %rbp d79339: 48 89 e5mov%rsp,%rbp d7933c: 89 7d fcmov%edi,-0x4(%rbp) d7933f: 89 75 f8mov%esi,-0x8(%rbp) d79342: 83 7d fc 01 cmpl $0x1,-0x4(%rbp) d79346: 75 27 jned7936f _Z41__static_initialization_and_destruction_0ii+0x37 d79348: 81 7d f8 ff ff 00 00cmpl $0x,-0x8(%rbp) d7934f: 75 1e jned7936f _Z41__static_initialization_and_destruction_0ii+0x37 d79351: 8b 05 f1 43 48 00 mov0x4843f1(%rip),%eax # 11fd748 max_insn_queue_index d79357: 83 c0 01add$0x1,%eax d7935a: 89 05 b4 27 a9 00 mov%eax,0xa927b4(%rip) # 180bb14 _ZL19arm_cortex_a15_tune+0x54 d79360: 8b 05 e2 43 48 00 mov0x4843e2(%rip),%eax # 11fd748 max_insn_queue_index d79366: 83 c0 01add$0x1,%eax d79369: 89 05 05 28 a9 00 mov%eax,0xa92805(%rip) # 180bb74 _ZL19arm_cortex_a57_tune+0x54 d7936f: 5d pop%rbp d79370: c3 retq Is it a problem to have a static constructor for the tables? 2) Doesn't this mean that the depth of searching will depend on properties of the automata rather than some machine specific values (so that potentially adding or removing unrelated scheduler rules could change the behaviour of the compiler)? No. The extra queue entries that will appear from extending an unrelated automaton will be empty, so the search will check them, but won't find anything. In general, how should someone tuning the compiler for this parameter select a value that isn't one of (-1, m_i_q_d+1)? From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere. If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day. There doesn't seem to be anything documented in the coding conventions for this specifically. In the absence of any such documentation, I'd rather not grow static constructors for something like this. Can we just shrink the acceptable values to 3 and just call it a day here ? regards Ramana -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 19/01/15 18:14, Maxim Kuvyrkov wrote: On Jan 19, 2015, at 6:05 PM, Richard Earnshaw rearn...@arm.com wrote: On 16/01/15 15:06, Maxim Kuvyrkov wrote: @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune = true, true, /* Prefer 32-bit encodings. */ true, /* Prefer Neon for stringops. */ 8,/* Maximum insns to inline memset. */ - ARM_FUSE_NOTHING /* Fuseable pairs of instructions. */ + ARM_FUSE_NOTHING,/* Fuseable pairs of instructions. */ + max_insn_queue_index + 1 /* Sched L2 autopref depth. */ }; Hmm, two issues here: 1) This requires a static constructor for the tuning table entry (since the value of max_insn_queue_index has to be looked up at run time. Are you sure? I didn't check the object files, but, since max_insn_queue_index is a const int, I would expect a relocation that would be resolved at link time, not a constructor. Yes, I'm sure. Relocations can only resolve addresses of objects, not their contents. LTO might eliminate the need for the reloc, but otherwise the compiler will never see the definition and will need to create a static constructor. Is it a problem to have a static constructor for the tables? Needing constructors means that the compiler can't put the object into read-only sections of the image. It's not a huge problem, but if there are ways by which they can be avoided, that's likely to be preferable; there's a small run-time overhead to running them. 2) Doesn't this mean that the depth of searching will depend on properties of the automata rather than some machine specific values (so that potentially adding or removing unrelated scheduler rules could change the behaviour of the compiler)? No. The extra queue entries that will appear from extending an unrelated automaton will be empty, so the search will check them, but won't find anything. OK, so there's just a minor performance cost of checking values that never hit. In general, how should someone tuning the compiler for this parameter select a value that isn't one of (-1, m_i_q_d+1)? From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere. If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day. You only mention 3 values: what was the fourth? It might be better then to define a set of values that represent each of these cases and only allow the tuning parameters to select one of those. The init code then uses that set to select how to set up the various parameters to meet those goals. So something like ARM_SCHED_AUTOPREF_OFF ARM_SCHED_AUTOPREF_RANK ARM_SCHED_AUTOPREF_FULL R. -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On Jan 20, 2015, at 1:24 PM, Richard Earnshaw rearn...@arm.com wrote: ... In general, how should someone tuning the compiler for this parameter select a value that isn't one of (-1, m_i_q_d+1)? From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere. If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day. You only mention 3 values: what was the fourth? Typo. No fourth. It might be better then to define a set of values that represent each of these cases and only allow the tuning parameters to select one of those. The init code then uses that set to select how to set up the various parameters to meet those goals. So something like ARM_SCHED_AUTOPREF_OFF ARM_SCHED_AUTOPREF_RANK ARM_SCHED_AUTOPREF_FULL A patch is attached. I bootstrapped it on arm-linux-gnueabihf. OK to apply? -- Maxim Kuvyrkov www.linaro.org 0001-Use-enum-for-sched_autopref-tune-settings.patch Description: Binary data
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 20/01/15 13:26, Maxim Kuvyrkov wrote: On Jan 20, 2015, at 1:24 PM, Richard Earnshaw rearn...@arm.com wrote: ... In general, how should someone tuning the compiler for this parameter select a value that isn't one of (-1, m_i_q_d+1)? From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere. If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day. You only mention 3 values: what was the fourth? Typo. No fourth. It might be better then to define a set of values that represent each of these cases and only allow the tuning parameters to select one of those. The init code then uses that set to select how to set up the various parameters to meet those goals. So something like ARM_SCHED_AUTOPREF_OFF ARM_SCHED_AUTOPREF_RANK ARM_SCHED_AUTOPREF_FULL A patch is attached. I bootstrapped it on arm-linux-gnueabihf. OK to apply? OK. Thanks. R. -- Maxim Kuvyrkov www.linaro.org 0001-Use-enum-for-sched_autopref-tune-settings.patch From 9d9ee7c33210960970d0d78ccc7a16a58b392f85 Mon Sep 17 00:00:00 2001 From: Maxim Kuvyrkov maxim.kuvyr...@linaro.org Date: Tue, 20 Jan 2015 12:30:37 + Subject: [PATCH 1/3] Use enum for sched_autopref tune settings * config/arm/arm-protos.h (enum arm_sched_autopref): New constants. (struct tune_params): Use the enum. * arm.c (arm_*_tune): Update. (arm_option_override): Update. --- gcc/config/arm/arm-protos.h |9 +++- gcc/config/arm/arm.c| 51 +-- 2 files changed, 38 insertions(+), 22 deletions(-) diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h index 3db7e16..307babb 100644 --- a/gcc/config/arm/arm-protos.h +++ b/gcc/config/arm/arm-protos.h @@ -257,6 +257,13 @@ struct cpu_vec_costs { struct cpu_cost_table; +enum arm_sched_autopref + { +ARM_SCHED_AUTOPREF_OFF, +ARM_SCHED_AUTOPREF_RANK, +ARM_SCHED_AUTOPREF_FULL + }; + struct tune_params { bool (*rtx_costs) (rtx, RTX_CODE, RTX_CODE, int *, bool); @@ -292,7 +299,7 @@ struct tune_params /* Bitfield encoding the fuseable pairs of instructions. */ unsigned int fuseable_ops; /* Depth of scheduling queue to check for L2 autoprefetcher. */ - int sched_autopref_queue_depth; + enum arm_sched_autopref sched_autopref; }; extern const struct tune_params *current_tune; diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c index fddd770..34672ce 100644 --- a/gcc/config/arm/arm.c +++ b/gcc/config/arm/arm.c @@ -1697,7 +1697,7 @@ const struct tune_params arm_slowmul_tune = false, /* Prefer Neon for stringops. */ 8, /* Maximum insns to inline memset. */ ARM_FUSE_NOTHING, /* Fuseable pairs of instructions. */ - -1 /* Sched L2 autopref depth. */ + ARM_SCHED_AUTOPREF_OFF /* Sched L2 autopref. */ }; const struct tune_params arm_fastmul_tune = @@ -1718,7 +1718,7 @@ const struct tune_params arm_fastmul_tune = false, /* Prefer Neon for stringops. */ 8, /* Maximum insns to inline memset. */ ARM_FUSE_NOTHING, /* Fuseable pairs of instructions. */ - -1 /* Sched L2 autopref depth. */ + ARM_SCHED_AUTOPREF_OFF /* Sched L2 autopref. */ }; /* StrongARM has early execution of branches, so a sequence that is worth @@ -1742,7 +1742,7 @@ const struct tune_params arm_strongarm_tune = false, /* Prefer Neon for stringops. */ 8, /* Maximum insns to inline memset. */ ARM_FUSE_NOTHING, /* Fuseable pairs of instructions. */ - -1 /* Sched L2 autopref depth. */ + ARM_SCHED_AUTOPREF_OFF /* Sched L2 autopref. */ }; const struct tune_params arm_xscale_tune = @@ -1763,7 +1763,7 @@ const struct tune_params arm_xscale_tune = false, /* Prefer Neon for stringops. */ 8, /* Maximum insns to inline memset. */ ARM_FUSE_NOTHING, /* Fuseable pairs of instructions. */ - -1 /* Sched L2 autopref depth. */ + ARM_SCHED_AUTOPREF_OFF /* Sched L2 autopref. */ }; const struct tune_params arm_9e_tune = @@ -1784,7 +1784,7 @@ const struct tune_params arm_9e_tune = false,
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 16/01/15 15:06, Maxim Kuvyrkov wrote: @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune = true, true, /* Prefer 32-bit encodings. */ true, /* Prefer Neon for stringops. */ 8, /* Maximum insns to inline memset. */ - ARM_FUSE_NOTHING /* Fuseable pairs of instructions. */ + ARM_FUSE_NOTHING, /* Fuseable pairs of instructions. */ + max_insn_queue_index + 1 /* Sched L2 autopref depth. */ }; Hmm, two issues here: 1) This requires a static constructor for the tuning table entry (since the value of max_insn_queue_index has to be looked up at run time. 2) Doesn't this mean that the depth of searching will depend on properties of the automata rather than some machine specific values (so that potentially adding or removing unrelated scheduler rules could change the behaviour of the compiler)? In general, how should someone tuning the compiler for this parameter select a value that isn't one of (-1, m_i_q_d+1)? R.
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On Jan 19, 2015, at 6:05 PM, Richard Earnshaw rearn...@arm.com wrote: On 16/01/15 15:06, Maxim Kuvyrkov wrote: @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune = true, true, /* Prefer 32-bit encodings. */ true, /* Prefer Neon for stringops. */ 8, /* Maximum insns to inline memset. */ - ARM_FUSE_NOTHING /* Fuseable pairs of instructions. */ + ARM_FUSE_NOTHING, /* Fuseable pairs of instructions. */ + max_insn_queue_index + 1 /* Sched L2 autopref depth. */ }; Hmm, two issues here: 1) This requires a static constructor for the tuning table entry (since the value of max_insn_queue_index has to be looked up at run time. Are you sure? I didn't check the object files, but, since max_insn_queue_index is a const int, I would expect a relocation that would be resolved at link time, not a constructor. Is it a problem to have a static constructor for the tables? 2) Doesn't this mean that the depth of searching will depend on properties of the automata rather than some machine specific values (so that potentially adding or removing unrelated scheduler rules could change the behaviour of the compiler)? No. The extra queue entries that will appear from extending an unrelated automaton will be empty, so the search will check them, but won't find anything. In general, how should someone tuning the compiler for this parameter select a value that isn't one of (-1, m_i_q_d+1)? From my experiments it seems there are 4 reasonable values for the parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) turned on everywhere. If there is a static constructor generated for tune tables and it is a problem to have it -- I can shrink acceptable values to these 3 and call it a day. -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan ramana.radhakrish...@arm.com wrote: Hi Ramana, Hi Vladimir, I still don't have SPEC2000/SPEC2006 benchmark numbers for this patch. Since stage3 is about to finish, I'm going to commit the target independent part of the patch now (it was approved by Vladimir a while ago). I'll commit ARM part of the patch once Ramana is happy with it. I've addressed all review comments and updated patch is attached. Any objections? On 14/11/14 15:12, Maxim Kuvyrkov wrote: On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote: On 10/20/14 22:06, Maxim Kuvyrkov wrote: Hi, Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. What were the results of that benchmarking? IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that. Sorry, my bad there). I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers. The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve. I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees you gave me links to against the equivalent version on trunk. The results with SPEC2k on A15 were in the noise with the default value for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting on results for values 0, 1 and 3 and hopefully something will come back soon for SPEC2k. After going back to the original benchmark that exposed the autoprefetcher issue (STREAMS) the required value of the parameter to fix the regression turned out to be 5. I have also discovered that Cortex-A57 also benefits from this optimization. Running STREAMS benchmark on Juno in AArch32 mode showed increased performance by 10-15%, which is not as big as Cortex-A15's 250-300%, but still significant. In this updated patch the parameter is set to maximum -- look through all scheduling queue of instructions in search of memory ops that are relevant to L2 autoprefetcher. In practice the setting to maximum will not change much, since there are rarely many instructions in the queue. @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void) return issue_rate 1 ? issue_rate : 0; } +/* Enable modeling of Cortex-A15 L2 auto-prefetcher. */ +static int +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index) +{ + switch (arm_tune) +{ +case cortexa15: + return autopref_multipass_dfa_lookahead_guard (insn, ready_index); + +default: + return 0; +} +} + It would be better to have this as a flag in the tuning tables rather than hardcoding for a core here. The backend has been moving in that direction for all core centric information and it is preferable that be continued. So this logic here should just be if (current_tune-multipass_lookahead) return autopref_multipass_lookahead_guard (insn, ready_index); else return 0; Done. The autoprefetcher is enabled in Cortex-A15 and Cortex-A57 tuning tables. -- Maxim Kuvyrkov www.linaro.org 0001-Model-cache-auto-prefetcher-in-scheduler.patch Description: Binary data
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On Fri, Jan 16, 2015 at 3:06 PM, Maxim Kuvyrkov maxim.kuvyr...@linaro.org wrote: On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan ramana.radhakrish...@arm.com wrote: Hi Ramana, Hi Vladimir, I still don't have SPEC2000/SPEC2006 benchmark numbers for this patch. Since stage3 is about to finish, I'm going to commit the target independent part of the patch now (it was approved by Vladimir a while ago). I'll commit ARM part of the patch once Ramana is happy with it. I've addressed all review comments and updated patch is attached. Any objections? On 14/11/14 15:12, Maxim Kuvyrkov wrote: On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote: On 10/20/14 22:06, Maxim Kuvyrkov wrote: Hi, Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. What were the results of that benchmarking? IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that. Sorry, my bad there). I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers. The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve. I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees you gave me links to against the equivalent version on trunk. The results with SPEC2k on A15 were in the noise with the default value for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting on results for values 0, 1 and 3 and hopefully something will come back soon for SPEC2k. After going back to the original benchmark that exposed the autoprefetcher issue (STREAMS) the required value of the parameter to fix the regression turned out to be 5. I have also discovered that Cortex-A57 also benefits from this optimization. Running STREAMS benchmark on Juno in AArch32 mode showed increased performance by 10-15%, which is not as big as Cortex-A15's 250-300%, but still significant. In this updated patch the parameter is set to maximum -- look through all scheduling queue of instructions in search of memory ops that are relevant to L2 autoprefetcher. In practice the setting to maximum will not change much, since there are rarely many instructions in the queue. @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void) return issue_rate 1 ? issue_rate : 0; } +/* Enable modeling of Cortex-A15 L2 auto-prefetcher. */ +static int +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index) +{ + switch (arm_tune) +{ +case cortexa15: + return autopref_multipass_dfa_lookahead_guard (insn, ready_index); + +default: + return 0; +} +} + It would be better to have this as a flag in the tuning tables rather than hardcoding for a core here. The backend has been moving in that direction for all core centric information and it is preferable that be continued. So this logic here should just be if (current_tune-multipass_lookahead) return autopref_multipass_lookahead_guard (insn, ready_index); else return 0; Done. The autoprefetcher is enabled in Cortex-A15 and Cortex-A57 tuning tables. The ARM changes look sane. As long as you've tested this with a bootstrap, I'm ok with it - we probably need a similar change for AArch64 and Cortex-A57 but I think we should do that for the next stage1. Thanks for persevering with this. Ramana -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 14/11/14 15:12, Maxim Kuvyrkov wrote: On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote: On 10/20/14 22:06, Maxim Kuvyrkov wrote: Hi, Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. What were the results of that benchmarking? IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that. Sorry, my bad there). I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers. The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve. I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees you gave me links to against the equivalent version on trunk. The results with SPEC2k on A15 were in the noise with the default value for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting on results for values 0, 1 and 3 and hopefully something will come back soon for SPEC2k. @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void) return issue_rate 1 ? issue_rate : 0; } +/* Enable modeling of Cortex-A15 L2 auto-prefetcher. */ +static int +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index) +{ + switch (arm_tune) +{ +case cortexa15: + return autopref_multipass_dfa_lookahead_guard (insn, ready_index); + +default: + return 0; +} +} + It would be better to have this as a flag in the tuning tables rather than hardcoding for a core here. The backend has been moving in that direction for all core centric information and it is preferable that be continued. So this logic here should just be if (current_tune-multipass_lookahead) return autopref_multipass_lookahead_guard (insn, ready_index); else return 0; regards Ramana ... Can this be built on top of Bin's work for insn fusion? There's a lot of commonality in the structure of the insns you care about. He's already got a nice little priority function that I think you could utilize to to ensure the insns with smaller offsets fire first. I would argue that macro-fusion should have been implemented the way autopref_model is -- via targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook. To implement the autopref model I cleaned up and generalized existing infrastructure (max_issue and dfa_lookahead_guard hook) instead of adding yet another decision-making primitive to the scheduler. My biggest concern would be sched2 coming along and undoing that work since you're not going to fuse those into move-multiple types of instructions. The autoprefetcher will be active only during sched2. It is disabled during sched1 by the fact that max_issue is not used when scheduling for register pressure. Thanks, -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan ramana.radhakrish...@arm.com wrote: On 14/11/14 15:12, Maxim Kuvyrkov wrote: On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote: On 10/20/14 22:06, Maxim Kuvyrkov wrote: Hi, Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. What were the results of that benchmarking? IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that. Sorry, my bad there). I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers. The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve. I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees you gave me links to against the equivalent version on trunk. The results with SPEC2k on A15 were in the noise with the default value for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting on results for values 0, 1 and 3 and hopefully something will come back soon for SPEC2k. I think there is a good chance that this optimization will not improve SPEC benchmarks, my main concern is to make sure SPECs don't regress. If SPECs are neutral and the patch gives good improvement (2.5 times faster) on another benchmark (STREAM), is this a good enough argument for commit? @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void) return issue_rate 1 ? issue_rate : 0; } +/* Enable modeling of Cortex-A15 L2 auto-prefetcher. */ +static int +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index) +{ + switch (arm_tune) +{ +case cortexa15: + return autopref_multipass_dfa_lookahead_guard (insn, ready_index); + +default: + return 0; +} +} + It would be better to have this as a flag in the tuning tables rather than hardcoding for a core here. The backend has been moving in that direction for all core centric information and it is preferable that be continued. So this logic here should just be if (current_tune-multipass_lookahead) return autopref_multipass_lookahead_guard (insn, ready_index); else return 0; OK, I'll convert this to use tuning tables. Thank you, -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On Nov 14, 2014, at 4:57 AM, Vladimir Makarov vmaka...@redhat.com wrote: On 2014-10-21 12:06 AM, Maxim Kuvyrkov wrote: ... I'd prefer symbolic constants for dont_delay. Also the address can contains other parts, e.g. index for some targets. It is not necessary to change the code but a comment would be nice that right now it is oriented for machine with base+disp only addressing. Although it is probably matter of taste. So you are free to commit it without any change. I'll add an enum with symbolic constants for dont_delay and a comment about handled memory address types. Thanks for the review! -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote: On 10/20/14 22:06, Maxim Kuvyrkov wrote: Hi, Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. What were the results of that benchmarking? IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that. Sorry, my bad there). I don't have the benchmarking results yet, and I was hoping for ARM to help with getting the numbers. The arm maintainers still need to OK the arm-specific portion of the patch, which, I imagine, will happen only of benchmark scores improve. ... Can this be built on top of Bin's work for insn fusion? There's a lot of commonality in the structure of the insns you care about. He's already got a nice little priority function that I think you could utilize to to ensure the insns with smaller offsets fire first. I would argue that macro-fusion should have been implemented the way autopref_model is -- via targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook. To implement the autopref model I cleaned up and generalized existing infrastructure (max_issue and dfa_lookahead_guard hook) instead of adding yet another decision-making primitive to the scheduler. My biggest concern would be sched2 coming along and undoing that work since you're not going to fuse those into move-multiple types of instructions. The autoprefetcher will be active only during sched2. It is disabled during sched1 by the fact that max_issue is not used when scheduling for register pressure. Thanks, -- Maxim Kuvyrkov www.linaro.org
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 11/14/14 08:12, Maxim Kuvyrkov wrote: ... Can this be built on top of Bin's work for insn fusion? There's a lot of commonality in the structure of the insns you care about. He's already got a nice little priority function that I think you could utilize to to ensure the insns with smaller offsets fire first. I would argue that macro-fusion should have been implemented the way autopref_model is -- via targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook. To implement the autopref model I cleaned up and generalized existing infrastructure (max_issue and dfa_lookahead_guard hook) instead of adding yet another decision-making primitive to the scheduler. Fair enough. We can put unifying the two implementations on the queue for next stage 1. My biggest concern would be sched2 coming along and undoing that work since you're not going to fuse those into move-multiple types of instructions. The autoprefetcher will be active only during sched2. It is disabled during sched1 by the fact that max_issue is not used when scheduling for register pressure. Sorry, should have been clearer, that comment was in reference to using the fusion stuff to handle prefetching. It was not in reference to your change. Vlad's approval and my questions crossed last night, I don't have any objections to this going in, though I do want us to look at unifying the memory pairing and prefetching stuff during the next stage1. Jeff
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 2014-10-21 12:06 AM, Maxim Kuvyrkov wrote: Hi, This patch adds auto-prefetcher modeling to GCC scheduler. The auto-prefetcher model is currently enabled only for ARM Cortex-A15, since this is the only CPU that I know of to have the hardware auto-prefetcher unit. The documentation on the auto-prefetcher is very sparse, and all I have are my empirical studies and a short note in Cortex-A15 manual (search for L2 cache auto-prefether). This patch, therefore, implements a very abstract model that makes scheduler prefer mem_op (base+8); mem_op (base+12) over mem_op (base+12); mem_op (base+8). In other words, memory operations are tried to be issued in order of increasing memory offsets. The auto-prefetcher model implementation is based on max_issue mutlipass lookahead scheduling, and its guard hook. The guard hook examines contents of the ready list and the queue, and, if it finds instructions with lower memory offsets, marks instructions with higher memory offset as unavailable for immediate scheduling. This patch has been in works since beginning of the year, and many of my previous scheduler cleanup patches were to prepare the infrastructure for this feature. Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. At the moment the parameter is set to 2, which means that the autopref model will look through ready list and 1-stall queue in search of relevant instructions. Values of -1 (disable autopref), 0 (use autopref only in rank_for_schedule), 1 (look through ready list), 2 (look through ready list and 1-stall queue), and 3 (look through ready list and 2-stall queue) should be considered and benchmarked. Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueaihf and aarch64-linux-gnu. OK to apply, provided no performance or correctness regressions? [ChangeLog is part of the git patch] I'd prefer symbolic constants for dont_delay. Also the address can contains other parts, e.g. index for some targets. It is not necessary to change the code but a comment would be nice that right now it is oriented for machine with base+disp only addressing. Although it is probably matter of taste. So you are free to commit it without any change. Thanks, Maxim.
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On 10/20/14 22:06, Maxim Kuvyrkov wrote: Hi, Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. What were the results of that benchmarking? IIRC I tabled reviewing this work waiting for those results (and I probably should have let you know that. Sorry, my bad there). 0007-Model-cache-auto-prefetcher-in-scheduler.patch From 629c2cc593b49b8596b00e3e3d6293aa3514 Mon Sep 17 00:00:00 2001 From: Maxim Kuvyrkovmaxim.kuvyr...@linaro.org Date: Mon, 20 Oct 2014 23:13:23 +0100 Subject: [PATCH 7/8] Model cache auto-prefetcher in scheduler * config/arm/arm.c (sched-int.h): Include header. (arm_first_cycle_multipass_dfa_lookahead_guard,) (TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD): Define hook. Enable auto-prefetcher model for Cortex-A15. (arm_option_override): Set autoprefetcher parameter. * config/arm/t-arm (arm.o): Update. * haifa-sched.c (update_insn_after_change): Update. (rank_for_schedule): Use auto-prefetcher model, if requested. (autopref_multipass_init): New static function. (autopref_rank_for_schedule): New rank_for_schedule heuristic. (autopref_multipass_dfa_lookahead_guard_started_dump_p): New static variable for debug dumps. (autopref_multipass_dfa_lookahead_guard_1): New static helper function. (autopref_multipass_dfa_lookahead_guard): New global function that implements TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD hook. (init_h_i_d): Update. * params.def (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH): New tuning knob. * sched-int.h (autopref_multipass_data_): Structure for auto-prefetcher data. (autopref_multipass_data_def, autopref_multipass_data_t): New typedefs. (struct _haifa_insn_data:autopref_multipass_data): New field. (INSN_AUTOPREF_MULTIPASS_DATA): New access macro. (autopref_multipass_dfa_lookahead_guard): Declare. Can this be built on top of Bin's work for insn fusion? There's a lot of commonality in the structure of the insns you care about. He's already got a nice little priority function that I think you could utilize to to ensure the insns with smaller offsets fire first. My biggest concern would be sched2 coming along and undoing that work since you're not going to fuse those into move-multiple types of instructions. Jeff
Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler
On Oct 21, 2014, at 8:06 AM, Maxim Kuvyrkov maxim.kuvyr...@linaro.org wrote: Hi, This patch adds auto-prefetcher modeling to GCC scheduler. The auto-prefetcher model is currently enabled only for ARM Cortex-A15, since this is the only CPU that I know of to have the hardware auto-prefetcher unit. The documentation on the auto-prefetcher is very sparse, and all I have are my empirical studies and a short note in Cortex-A15 manual (search for L2 cache auto-prefether). This patch, therefore, implements a very abstract model that makes scheduler prefer mem_op (base+8); mem_op (base+12) over mem_op (base+12); mem_op (base+8). In other words, memory operations are tried to be issued in order of increasing memory offsets. The auto-prefetcher model implementation is based on max_issue mutlipass lookahead scheduling, and its guard hook. The guard hook examines contents of the ready list and the queue, and, if it finds instructions with lower memory offsets, marks instructions with higher memory offset as unavailable for immediate scheduling. This patch has been in works since beginning of the year, and many of my previous scheduler cleanup patches were to prepare the infrastructure for this feature. Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. At the moment the parameter is set to 2, which means that the autopref model will look through ready list and 1-stall queue in search of relevant instructions. Values of -1 (disable autopref), 0 (use autopref only in rank_for_schedule), 1 (look through ready list), 2 (look through ready list and 1-stall queue), and 3 (look through ready list and 2-stall queue) should be considered and benchmarked. Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueaihf and aarch64-linux-gnu. OK to apply, provided no performance or correctness regressions? [ChangeLog is part of the git patch] Ping? All prerequisite patches for this one are now approved and [mostly] checked in. This is the last outstanding item from my patch series to improve scheduling. Thank you, -- Maxim Kuvyrkov www.linaro.org
[PATCH 7/8] Model cache auto-prefetcher in scheduler
Hi, This patch adds auto-prefetcher modeling to GCC scheduler. The auto-prefetcher model is currently enabled only for ARM Cortex-A15, since this is the only CPU that I know of to have the hardware auto-prefetcher unit. The documentation on the auto-prefetcher is very sparse, and all I have are my empirical studies and a short note in Cortex-A15 manual (search for L2 cache auto-prefether). This patch, therefore, implements a very abstract model that makes scheduler prefer mem_op (base+8); mem_op (base+12) over mem_op (base+12); mem_op (base+8). In other words, memory operations are tried to be issued in order of increasing memory offsets. The auto-prefetcher model implementation is based on max_issue mutlipass lookahead scheduling, and its guard hook. The guard hook examines contents of the ready list and the queue, and, if it finds instructions with lower memory offsets, marks instructions with higher memory offset as unavailable for immediate scheduling. This patch has been in works since beginning of the year, and many of my previous scheduler cleanup patches were to prepare the infrastructure for this feature. Ramana, this change requires benchmarking, which I can't easily do at the moment. I would appreciate any benchmarking results that you can share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be tuned/confirmed for Cortex-A15. At the moment the parameter is set to 2, which means that the autopref model will look through ready list and 1-stall queue in search of relevant instructions. Values of -1 (disable autopref), 0 (use autopref only in rank_for_schedule), 1 (look through ready list), 2 (look through ready list and 1-stall queue), and 3 (look through ready list and 2-stall queue) should be considered and benchmarked. Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueaihf and aarch64-linux-gnu. OK to apply, provided no performance or correctness regressions? [ChangeLog is part of the git patch] Thank you, -- Maxim Kuvyrkov www.linaro.org 0007-Model-cache-auto-prefetcher-in-scheduler.patch Description: Binary data