Re: Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-02-11 Thread Jiong Wang


On 20/01/15 13:26, Maxim Kuvyrkov wrote:

On Jan 20, 2015, at 1:24 PM, Richard Earnshaw rearn...@arm.com wrote:
...

In general, how should someone tuning the compiler for this parameter
select a value that isn't one of (-1, m_i_q_d+1)?

 From my experiments it seems there are 4 reasonable values for the parameter: 
(-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) 
turned on everywhere.  If there is a static constructor generated for tune 
tables and it is a problem to have it -- I can shrink acceptable values to 
these 3 and call it a day.


You only mention 3 values: what was the fourth?

Typo.  No fourth.


It might be better then
to define a set of values that represent each of these cases and only
allow the tuning parameters to select one of those.  The init code then
uses that set to select how to set up the various parameters to meet
those goals.

So something like

ARM_SCHED_AUTOPREF_OFF
ARM_SCHED_AUTOPREF_RANK
ARM_SCHED_AUTOPREF_FULL

A patch is attached.  I bootstrapped it on arm-linux-gnueabihf.  OK to apply?

Bootstrap failure on a Chromebook, reproduced on two Chromebooks.

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65020 please.




--
Maxim Kuvyrkov
www.linaro.org






Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-01-20 Thread Ramana Radhakrishnan



On 19/01/15 18:14, Maxim Kuvyrkov wrote:

On Jan 19, 2015, at 6:05 PM, Richard Earnshaw rearn...@arm.com wrote:


On 16/01/15 15:06, Maxim Kuvyrkov wrote:

@@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
   true, true,                          /* Prefer 32-bit encodings.  */
   true,                                /* Prefer Neon for stringops.  */
   8,                                   /* Maximum insns to inline memset.  */
-  ARM_FUSE_NOTHING                     /* Fuseable pairs of instructions.  */
+  ARM_FUSE_NOTHING,                    /* Fuseable pairs of instructions.  */
+  max_insn_queue_index + 1             /* Sched L2 autopref depth.  */
 };



Hmm, two issues here:
1) This requires a static constructor for the tuning table entry (since
the value of max_insn_queue_index has to be looked up at run time).


Are you sure?  I didn't check the object files, but, since max_insn_queue_index is a 
const int, I would expect a relocation that would be resolved at link time, 
not a constructor.


I did miss this during the original review, sorry about that.

In theory yes, but in practice apparently not. Probably something LTO 
might be able to handle, but in the absence of LTO being the default I don't 
see how we can guarantee something like this.


I do see a static constructor being put in for this on a cross compiler 
build.



00d79338 <_Z41__static_initialization_and_destruction_0ii>:
  d79338:  55                    push   %rbp
  d79339:  48 89 e5              mov    %rsp,%rbp
  d7933c:  89 7d fc              mov    %edi,-0x4(%rbp)
  d7933f:  89 75 f8              mov    %esi,-0x8(%rbp)
  d79342:  83 7d fc 01           cmpl   $0x1,-0x4(%rbp)
  d79346:  75 27                 jne    d7936f <_Z41__static_initialization_and_destruction_0ii+0x37>
  d79348:  81 7d f8 ff ff 00 00  cmpl   $0xffff,-0x8(%rbp)
  d7934f:  75 1e                 jne    d7936f <_Z41__static_initialization_and_destruction_0ii+0x37>
  d79351:  8b 05 f1 43 48 00     mov    0x4843f1(%rip),%eax        # 11fd748 <max_insn_queue_index>
  d79357:  83 c0 01              add    $0x1,%eax
  d7935a:  89 05 b4 27 a9 00     mov    %eax,0xa927b4(%rip)        # 180bb14 <_ZL19arm_cortex_a15_tune+0x54>
  d79360:  8b 05 e2 43 48 00     mov    0x4843e2(%rip),%eax        # 11fd748 <max_insn_queue_index>
  d79366:  83 c0 01              add    $0x1,%eax
  d79369:  89 05 05 28 a9 00     mov    %eax,0xa92805(%rip)        # 180bb74 <_ZL19arm_cortex_a57_tune+0x54>
  d7936f:  5d                    pop    %rbp
  d79370:  c3                    retq



Is it a problem to have a static constructor for the tables?



2) Doesn't this mean that the depth of searching will depend on
properties of the automata rather than some machine specific values (so
that potentially adding or removing unrelated scheduler rules could
change the behaviour of the compiler)?


No.  The extra queue entries that will appear from extending an unrelated 
automaton will be empty, so the search will check them, but won't find anything.



In general, how should someone tuning the compiler for this parameter
select a value that isn't one of (-1, m_i_q_d+1)?


 From my experiments it seems there are 4 reasonable values for the parameter: 
(-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) 
turned on everywhere.  If there is a static constructor generated for tune 
tables and it is a problem to have it -- I can shrink acceptable values to 
these 3 and call it a day.


There doesn't seem to be anything documented in the coding conventions 
for this specifically. In the absence of any such documentation, I'd 
rather not grow static constructors for something like this.


Can we just shrink the acceptable values to 3 and call it a day here?


regards
Ramana



--
Maxim Kuvyrkov
www.linaro.org






Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-01-20 Thread Richard Earnshaw
On 19/01/15 18:14, Maxim Kuvyrkov wrote:
 On Jan 19, 2015, at 6:05 PM, Richard Earnshaw rearn...@arm.com wrote:
 
 On 16/01/15 15:06, Maxim Kuvyrkov wrote:
 @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
    true, true,                          /* Prefer 32-bit encodings.  */
    true,                                /* Prefer Neon for stringops.  */
    8,                                   /* Maximum insns to inline memset.  */
 -  ARM_FUSE_NOTHING                     /* Fuseable pairs of instructions.  */
 +  ARM_FUSE_NOTHING,                    /* Fuseable pairs of instructions.  */
 +  max_insn_queue_index + 1             /* Sched L2 autopref depth.  */
  };


 Hmm, two issues here:
 1) This requires a static constructor for the tuning table entry (since
 the value of max_insn_queue_index has to be looked up at run time).
 
 Are you sure?  I didn't check the object files, but, since 
 max_insn_queue_index is a const int, I would expect a relocation that would 
 be resolved at link time, not a constructor.
 

Yes, I'm sure.  Relocations can only resolve addresses of objects, not
their contents.  LTO might eliminate the need for the reloc, but
otherwise the compiler will never see the definition and will need to
create a static constructor.

 Is it a problem to have a static constructor for the tables?

Needing constructors means that the compiler can't put the object into
read-only sections of the image.  It's not a huge problem, but if there
are ways by which they can be avoided, that's likely to be preferable;
there's a small run-time overhead to running them.
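
To illustrate the above, here is a minimal C++ sketch (invented names, not the
GCC sources) of why an initializer that reads an extern const int defined in
another translation unit forces a dynamic initializer, i.e. a static
constructor, and keeps the object out of a read-only section:

// sketch.cc -- illustrative only, compiled as C++ like the GCC sources.
extern const int max_insn_queue_index;   // value lives in another TU

struct tune_sketch
{
  int sched_autopref_queue_depth;
};

// Constant initializer: the object can be placed in a read-only section.
static const tune_sketch slowmul_sketch = { -1 };

// Not a constant expression in this TU, so the compiler emits a dynamic
// initializer (a static constructor) that runs at program start-up, and
// the object must live in writable data.
static const tune_sketch a15_sketch = { max_insn_queue_index + 1 };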

 

 2) Doesn't this mean that the depth of searching will depend on
 properties of the automata rather than some machine specific values (so
 that potentially adding or removing unrelated scheduler rules could
 change the behaviour of the compiler)?
 
 No.  The extra queue entries that will appear from extending an unrelated 
 automaton will be empty, so the search will check them, but won't find 
 anything.
 

OK, so there's just a minor performance cost of checking values that
never hit.



 In general, how should someone tuning the compiler for this parameter
 select a value that isn't one of (-1, m_i_q_d+1)?
 
 From my experiments it seems there are 4 reasonable values for the parameter: 
 (-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) 
 turned on everywhere.  If there is a static constructor generated for tune 
 tables and it is a problem to have it -- I can shrink acceptable values to 
 these 3 and call it a day.
 

You only mention 3 values: what was the fourth?  It might be better then
to define a set of values that represent each of these cases and only
allow the tuning parameters to select one of those.  The init code then
uses that set to select how to set up the various parameters to meet
those goals.

So something like

ARM_SCHED_AUTOPREF_OFF
ARM_SCHED_AUTOPREF_RANK
ARM_SCHED_AUTOPREF_FULL
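
A rough sketch of how the init code could then map such an enum back onto the
scheduler parameter (the helper name below is hypothetical; only the enum
values and max_insn_queue_index come from this thread):

extern const int max_insn_queue_index;

enum arm_sched_autopref
{
  ARM_SCHED_AUTOPREF_OFF,
  ARM_SCHED_AUTOPREF_RANK,
  ARM_SCHED_AUTOPREF_FULL
};

/* Hypothetical helper, e.g. called from arm_option_override.  */
static int
sched_autopref_queue_depth_from_tune (enum arm_sched_autopref setting)
{
  switch (setting)
    {
    case ARM_SCHED_AUTOPREF_OFF:
      return -1;                        /* model disabled */
    case ARM_SCHED_AUTOPREF_RANK:
      return 0;                         /* only used in rank_for_schedule */
    case ARM_SCHED_AUTOPREF_FULL:
    default:
      return max_insn_queue_index + 1;  /* look through the whole queue */
    }
}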


R.
 --
 Maxim Kuvyrkov
 www.linaro.org
 
 
 
 




Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-01-20 Thread Maxim Kuvyrkov
On Jan 20, 2015, at 1:24 PM, Richard Earnshaw rearn...@arm.com wrote:
...
 In general, how should someone tuning the compiler for this parameter
 select a value that isn't one of (-1, m_i_q_d+1)?
 
 From my experiments it seems there are 4 reasonable values for the 
 parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, 
 (m_i_q_d+1) turned on everywhere.  If there is a static constructor 
 generated for tune tables and it is a problem to have it -- I can shrink 
 acceptable values to these 3 and call it a day.
 
 
 You only mention 3 values: what was the fourth?  

Typo.  No fourth.

 It might be better then
 to define a set of values that represent each of these cases and only
 allow the tuning parameters to select one of those.  The init code then
 uses that set to select how to set up the various parameters to meet
 those goals.
 
 So something like
 
 ARM_SCHED_AUTOPREF_OFF
 ARM_SCHED_AUTOPREF_RANK
 ARM_SCHED_AUTOPREF_FULL

A patch is attached.  I bootstrapped it on arm-linux-gnueabihf.  OK to apply?

--
Maxim Kuvyrkov
www.linaro.org



0001-Use-enum-for-sched_autopref-tune-settings.patch
Description: Binary data


Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-01-20 Thread Richard Earnshaw
On 20/01/15 13:26, Maxim Kuvyrkov wrote:
 On Jan 20, 2015, at 1:24 PM, Richard Earnshaw rearn...@arm.com wrote:
 ...
 In general, how should someone tuning the compiler for this parameter
 select a value that isn't one of (-1, m_i_q_d+1)?

 From my experiments it seems there are 4 reasonable values for the 
 parameter: (-1) autopref turned off, (0) turned on in rank_for_schedule, 
 (m_i_q_d+1) turned on everywhere.  If there is a static constructor 
 generated for tune tables and it is a problem to have it -- I can shrink 
 acceptable values to these 3 and call it a day.


 You only mention 3 values: what was the fourth?  
 
 Typo.  No fourth.
 
 It might be better then
 to define a set of values that represent each of these cases and only
 allow the tuning parameters to select one of those.  The init code then
 uses that set to select how to set up the various parameters to meet
 those goals.

 So something like

 ARM_SCHED_AUTOPREF_OFF
 ARM_SCHED_AUTOPREF_RANK
 ARM_SCHED_AUTOPREF_FULL
 
 A patch is attached.  I bootstrapped it on arm-linux-gnueabihf.  OK to apply?
 

OK.  Thanks.

R.

 --
 Maxim Kuvyrkov
 www.linaro.org
 
 
 0001-Use-enum-for-sched_autopref-tune-settings.patch
 
 
 From 9d9ee7c33210960970d0d78ccc7a16a58b392f85 Mon Sep 17 00:00:00 2001
 From: Maxim Kuvyrkov maxim.kuvyr...@linaro.org
 Date: Tue, 20 Jan 2015 12:30:37 +
 Subject: [PATCH 1/3] Use enum for sched_autopref tune settings
 
   * config/arm/arm-protos.h (enum arm_sched_autopref): New constants.
   (struct tune_params): Use the enum.
   * arm.c (arm_*_tune): Update.
   (arm_option_override): Update.
 ---
  gcc/config/arm/arm-protos.h |    9 +++-
  gcc/config/arm/arm.c        |   51 +--
  2 files changed, 38 insertions(+), 22 deletions(-)
 
 diff --git a/gcc/config/arm/arm-protos.h b/gcc/config/arm/arm-protos.h
 index 3db7e16..307babb 100644
 --- a/gcc/config/arm/arm-protos.h
 +++ b/gcc/config/arm/arm-protos.h
 @@ -257,6 +257,13 @@ struct cpu_vec_costs {
  
  struct cpu_cost_table;
  
 +enum arm_sched_autopref
 +  {
 +ARM_SCHED_AUTOPREF_OFF,
 +ARM_SCHED_AUTOPREF_RANK,
 +ARM_SCHED_AUTOPREF_FULL
 +  };
 +
  struct tune_params
  {
bool (*rtx_costs) (rtx, RTX_CODE, RTX_CODE, int *, bool);
 @@ -292,7 +299,7 @@ struct tune_params
/* Bitfield encoding the fuseable pairs of instructions.  */
unsigned int fuseable_ops;
/* Depth of scheduling queue to check for L2 autoprefetcher.  */
 -  int sched_autopref_queue_depth;
 +  enum arm_sched_autopref sched_autopref;
  };
  
  extern const struct tune_params *current_tune;
 diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
 index fddd770..34672ce 100644
 --- a/gcc/config/arm/arm.c
 +++ b/gcc/config/arm/arm.c
 @@ -1697,7 +1697,7 @@ const struct tune_params arm_slowmul_tune =
    false,                               /* Prefer Neon for stringops.  */
    8,                                   /* Maximum insns to inline memset.  */
    ARM_FUSE_NOTHING,                    /* Fuseable pairs of instructions.  */
 -  -1                                   /* Sched L2 autopref depth.  */
 +  ARM_SCHED_AUTOPREF_OFF               /* Sched L2 autopref.  */
  };
 
  const struct tune_params arm_fastmul_tune =
 @@ -1718,7 +1718,7 @@ const struct tune_params arm_fastmul_tune =
    false,                               /* Prefer Neon for stringops.  */
    8,                                   /* Maximum insns to inline memset.  */
    ARM_FUSE_NOTHING,                    /* Fuseable pairs of instructions.  */
 -  -1                                   /* Sched L2 autopref depth.  */
 +  ARM_SCHED_AUTOPREF_OFF               /* Sched L2 autopref.  */
  };
 
  /* StrongARM has early execution of branches, so a sequence that is worth
 @@ -1742,7 +1742,7 @@ const struct tune_params arm_strongarm_tune =
    false,                               /* Prefer Neon for stringops.  */
    8,                                   /* Maximum insns to inline memset.  */
    ARM_FUSE_NOTHING,                    /* Fuseable pairs of instructions.  */
 -  -1                                   /* Sched L2 autopref depth.  */
 +  ARM_SCHED_AUTOPREF_OFF               /* Sched L2 autopref.  */
  };
 
  const struct tune_params arm_xscale_tune =
 @@ -1763,7 +1763,7 @@ const struct tune_params arm_xscale_tune =
    false,                               /* Prefer Neon for stringops.  */
    8,                                   /* Maximum insns to inline memset.  */
    ARM_FUSE_NOTHING,                    /* Fuseable pairs of instructions.  */
 -  -1                                   /* Sched L2 autopref depth.  */
 +  ARM_SCHED_AUTOPREF_OFF               /* Sched L2 autopref.  */
  };
 
  const struct tune_params arm_9e_tune =
 @@ -1784,7 +1784,7 @@ const struct tune_params arm_9e_tune =
    false,

Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-01-19 Thread Richard Earnshaw
On 16/01/15 15:06, Maxim Kuvyrkov wrote:
 @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
    true, true,                          /* Prefer 32-bit encodings.  */
    true,                                /* Prefer Neon for stringops.  */
    8,                                   /* Maximum insns to inline memset.  */
 -  ARM_FUSE_NOTHING                     /* Fuseable pairs of instructions.  */
 +  ARM_FUSE_NOTHING,                    /* Fuseable pairs of instructions.  */
 +  max_insn_queue_index + 1             /* Sched L2 autopref depth.  */
  };


Hmm, two issues here:
1) This requires a static constructor for the tuning table entry (since
the value of max_insn_queue_index has to be looked up at run time).

2) Doesn't this mean that the depth of searching will depend on
properties of the automata rather than some machine specific values (so
that potentially adding or removing unrelated scheduler rules could
change the behaviour of the compiler)?

In general, how should someone tuning the compiler for this parameter
select a value that isn't one of (-1, m_i_q_d+1)?

R.



Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-01-19 Thread Maxim Kuvyrkov
On Jan 19, 2015, at 6:05 PM, Richard Earnshaw rearn...@arm.com wrote:

 On 16/01/15 15:06, Maxim Kuvyrkov wrote:
 @@ -1874,7 +1889,8 @@ const struct tune_params arm_cortex_a15_tune =
    true, true,                          /* Prefer 32-bit encodings.  */
    true,                                /* Prefer Neon for stringops.  */
    8,                                   /* Maximum insns to inline memset.  */
 -  ARM_FUSE_NOTHING                     /* Fuseable pairs of instructions.  */
 +  ARM_FUSE_NOTHING,                    /* Fuseable pairs of instructions.  */
 +  max_insn_queue_index + 1             /* Sched L2 autopref depth.  */
  };
 
 
 Hmm, two issues here:
 1) This requires a static constructor for the tuning table entry (since
 the value of max_insn_queue_index has to be looked up at run time).

Are you sure?  I didn't check the object files, but, since max_insn_queue_index 
is a const int, I would expect a relocation that would be resolved at link 
time, not a constructor.

Is it a problem to have a static constructor for the tables?

 
 2) Doesn't this mean that the depth of searching will depend on
 properties of the automata rather than some machine specific values (so
 that potentially adding or removing unrelated scheduler rules could
 change the behaviour of the compiler)?

No.  The extra queue entries that will appear from extending an unrelated 
automaton will be empty, so the search will check them, but won't find anything.

 
 In general, how should someone tuning the compiler for this parameter
 select a value that isn't one of (-1, m_i_q_d+1)?

From my experiments it seems there are 4 reasonable values for the parameter: 
(-1) autopref turned off, (0) turned on in rank_for_schedule, (m_i_q_d+1) 
turned on everywhere.  If there is a static constructor generated for tune 
tables and it is a problem to have it -- I can shrink acceptable values to 
these 3 and call it a day.

--
Maxim Kuvyrkov
www.linaro.org





Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-01-16 Thread Maxim Kuvyrkov
On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan 
ramana.radhakrish...@arm.com wrote:

 

Hi Ramana,
Hi Vladimir,

I still don't have SPEC2000/SPEC2006 benchmark numbers for this patch.  Since 
stage3 is about to finish, I'm going to commit the target independent part of 
the patch now (it was approved by Vladimir a while ago).

I'll commit ARM part of the patch once Ramana is happy with it.

I've addressed all review comments and updated patch is attached.

Any objections?

 On 14/11/14 15:12, Maxim Kuvyrkov wrote:
 On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote:
 
 On 10/20/14 22:06, Maxim Kuvyrkov wrote:
 Hi,
 Ramana, this change requires benchmarking, which I can't easily do
 at
 the moment. I would appreciate any benchmarking results that you can
 share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
 needs to be tuned/confirmed for Cortex-A15.
 What were the results of that benchmarking?  IIRC I tabled reviewing this 
 work waiting for those results (and I probably should have let you know 
 that.  Sorry,  my bad there).
 
 I don't have the benchmarking results yet, and I was hoping for ARM to help 
 with getting the numbers.  The arm maintainers still need to OK the 
 arm-specific portion of the patch, which, I imagine, will happen only if 
 benchmark scores improve.
 
 
 I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees 
 you gave me links to against the equivalent version on trunk.
 
 The results with SPEC2k on A15 were in the noise with the default value for 
 PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting 
 on results for values 0, 1 and 3 and hopefully something will come back soon 
 for SPEC2k.

After going back to the original benchmark that exposed the autoprefetcher 
issue (STREAMS) the required value of the parameter to fix the regression 
turned out to be 5.

I have also discovered that Cortex-A57 benefits from this optimization.  
Running STREAMS benchmark on Juno in AArch32 mode showed increased performance 
by 10-15%, which is not as big as Cortex-A15's 250-300%, but still significant.

In this updated patch the parameter is set to the maximum -- look through the 
whole scheduling queue of instructions in search of memory ops that are relevant 
to the L2 autoprefetcher.  In practice the setting to maximum will not change much, 
since there are rarely many instructions in the queue.

 
 
 
 @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
   return issue_rate > 1 ? issue_rate : 0;
 }
 
 +/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
 +static int
 +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
 +{
 +  switch (arm_tune)
 +{
 +case cortexa15:
 +  return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
 +
 +default:
 +  return 0;
 +}
 +}
 +
 
 
 It would be better to have this as a flag in the tuning tables rather than 
 hardcoding for a core here. The backend has been moving in that direction for 
 all core-centric information and it is preferable that this be continued.
 
 So this logic here should just be
 
 if (current_tune->multipass_lookahead)
  return autopref_multipass_lookahead_guard (insn, ready_index);
 else
  return 0;
 

Done.  The autoprefetcher is enabled in Cortex-A15 and Cortex-A57 tuning tables.

--
Maxim Kuvyrkov
www.linaro.org




0001-Model-cache-auto-prefetcher-in-scheduler.patch
Description: Binary data


Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2015-01-16 Thread Ramana Radhakrishnan
On Fri, Jan 16, 2015 at 3:06 PM, Maxim Kuvyrkov
maxim.kuvyr...@linaro.org wrote:
 On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan 
 ramana.radhakrish...@arm.com wrote:



 Hi Ramana,
 Hi Vladimir,

 I still don't have SPEC2000/SPEC2006 benchmark numbers for this patch.  Since 
 stage3 is about to finish, I'm going to commit the target independent part of 
 the patch now (it was approved by Vladimir a while ago).

 I'll commit ARM part of the patch once Ramana is happy with it.

 I've addressed all review comments and updated patch is attached.

 Any objections?

 On 14/11/14 15:12, Maxim Kuvyrkov wrote:
 On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote:

 On 10/20/14 22:06, Maxim Kuvyrkov wrote:
 Hi,
 Ramana, this change requires benchmarking, which I can't easily do
 at
 the moment. I would appreciate any benchmarking results that you can
 share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
 needs to be tuned/confirmed for Cortex-A15.
 What were the results of that benchmarking?  IIRC I tabled reviewing this 
 work waiting for those results (and I probably should have let you know 
 that.  Sorry,  my bad there).

 I don't have the benchmarking results yet, and I was hoping for ARM to help 
 with getting the numbers.  The arm maintainers still need to OK the 
 arm-specific portion of the patch, which, I imagine, will happen only if 
 benchmark scores improve.


 I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees 
 you gave me links to against the equivalent version on trunk.

 The results with SPEC2k on A15 were in the noise with the default value for 
 PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still 
 waiting on results for values 0, 1 and 3 and hopefully something will come 
 back soon for SPEC2k.

 After going back to the original benchmark that exposed the autoprefetcher 
 issue (STREAMS) the required value of the parameter to fix the regression 
 turned out to be 5.

 I have also discovered that Cortex-A57 benefits from this optimization.  
 Running STREAMS benchmark on Juno in AArch32 mode showed increased 
 performance by 10-15%, which is not as big as Cortex-A15's 250-300%, but 
 still significant.

 In this updated patch the parameter is set to the maximum -- look through the 
 whole scheduling queue of instructions in search of memory ops that are relevant 
 to the L2 autoprefetcher.  In practice the setting to maximum will not change much, 
 since there are rarely many instructions in the queue.




 @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
   return issue_rate > 1 ? issue_rate : 0;
 }

 +/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
 +static int
 +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
 +{
 +  switch (arm_tune)
 +{
 +case cortexa15:
 +  return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
 +
 +default:
 +  return 0;
 +}
 +}
 +


 It would be better to have this as a flag in the tuning tables rather than 
 hardcoding for a core here. The backend has been moving in that direction 
 for all core-centric information and it is preferable that this be continued.

 So this logic here should just be

 if (current_tune->multipass_lookahead)
  return autopref_multipass_lookahead_guard (insn, ready_index);
 else
  return 0;


 Done.  The autoprefetcher is enabled in Cortex-A15 and Cortex-A57 tuning 
 tables.


The ARM changes look sane. As long as you've tested this with a
bootstrap, I'm ok with it - we probably need a similar change for
AArch64 and Cortex-A57 but I think we should do that for the next
stage1. Thanks for persevering with this.


Ramana


 --
 Maxim Kuvyrkov
 www.linaro.org




Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-11-19 Thread Ramana Radhakrishnan



On 14/11/14 15:12, Maxim Kuvyrkov wrote:

On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote:


On 10/20/14 22:06, Maxim Kuvyrkov wrote:

Hi,
Ramana, this change requires benchmarking, which I can't easily do
at

the moment. I would appreciate any benchmarking results that you can
share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
needs to be tuned/confirmed for Cortex-A15.
What were the results of that benchmarking?  IIRC I tabled reviewing this work 
waiting for those results (and I probably should have let you know that.  
Sorry,  my bad there).


I don't have the benchmarking results yet, and I was hoping for ARM to help 
with getting the numbers.  The arm maintainers still need to OK the 
arm-specific portion of the patch, which, I imagine, will happen only if 
benchmark scores improve.



I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the 
trees you gave me links to against the equivalent version on trunk.


The results with SPEC2k on A15 were in the noise with the default value 
for PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm 
still waiting on results for values 0, 1 and 3 and hopefully something 
will come back soon for SPEC2k.





@@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
   return issue_rate > 1 ? issue_rate : 0;
 }

+/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
+static int
+arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
+{
+  switch (arm_tune)
+{
+case cortexa15:
+  return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
+
+default:
+  return 0;
+}
+}
+



It would be better to have this as a flag in the tuning tables rather 
than hardcoding for a core here. The backend has been moving in that 
direction for all core-centric information and it is preferable that 
this be continued.


So this logic here should just be

if (current_tune->multipass_lookahead)
  return autopref_multipass_lookahead_guard (insn, ready_index);
else
  return 0;




regards
Ramana






...

Can this be built on top of Bin's work for insn fusion?  There's a lot of 
commonality in the structure of the insns you care about.  He's already got a 
nice little priority function that I think you could utilize to ensure the 
insns with smaller offsets fire first.


I would argue that macro-fusion should have been implemented the way 
autopref_model is -- via 
targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook.  To implement the 
autopref model I cleaned up and generalized existing infrastructure (max_issue 
and dfa_lookahead_guard hook) instead of adding yet another decision-making 
primitive to the scheduler.




My biggest concern would be sched2 coming along and undoing that work since 
you're not going to fuse those into move-multiple types of instructions.


The autoprefetcher will be active only during sched2.  It is disabled during 
sched1 by the fact that max_issue is not used when scheduling for register 
pressure.

Thanks,

--
Maxim Kuvyrkov
www.linaro.org




Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-11-19 Thread Maxim Kuvyrkov
On Nov 19, 2014, at 12:27 PM, Ramana Radhakrishnan 
ramana.radhakrish...@arm.com wrote:

 
 
 On 14/11/14 15:12, Maxim Kuvyrkov wrote:
 On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote:
 
 On 10/20/14 22:06, Maxim Kuvyrkov wrote:
 Hi,
 Ramana, this change requires benchmarking, which I can't easily do
 at
 the moment. I would appreciate any benchmarking results that you can
 share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
 needs to be tuned/confirmed for Cortex-A15.
 What were the results of that benchmarking?  IIRC I tabled reviewing this 
 work waiting for those results (and I probably should have let you know 
 that.  Sorry,  my bad there).
 
 I don't have the benchmarking results yet, and I was hoping for ARM to help 
 with getting the numbers.  The arm maintainers still need to OK the 
 arm-specific portion of the patch, which, I imagine, will happen only if 
 benchmark scores improve.
 
 
 I tried benchmarking 78f367cfcfdc9f0a422a362cd85ecc122834d96f from the trees 
 you gave me links to against the equivalent version on trunk.
 
 The results with SPEC2k on A15 were in the noise with the default value for 
 PARAM_SCHED_AUTOPREF_QUEUE_DEPTH which is 2 in the backend. I'm still waiting 
 on results for values 0, 1 and 3 and hopefully something will come back soon 
 for SPEC2k.

I think there is a good chance that this optimization will not improve SPEC 
benchmarks; my main concern is to make sure SPECs don't regress.  If SPECs are 
neutral and the patch gives good improvement (2.5 times faster) on another 
benchmark (STREAM), is this a good enough argument for commit?

 
 
 
 @@ -29903,6 +29915,20 @@ arm_first_cycle_multipass_dfa_lookahead (void)
   return issue_rate > 1 ? issue_rate : 0;
 }
 
 +/* Enable modeling of Cortex-A15 L2 auto-prefetcher.  */
 +static int
 +arm_first_cycle_multipass_dfa_lookahead_guard (rtx insn, int ready_index)
 +{
 +  switch (arm_tune)
 +{
 +case cortexa15:
 +  return autopref_multipass_dfa_lookahead_guard (insn, ready_index);
 +
 +default:
 +  return 0;
 +}
 +}
 +
 
 
 It would be better to have this as a flag in the tuning tables rather than 
 hardcoding for a core here. The backend has been moving in that direction for 
 all core-centric information and it is preferable that this be continued.
 
 So this logic here should just be
 
 if (current_tune->multipass_lookahead)
  return autopref_multipass_lookahead_guard (insn, ready_index);
 else
  return 0;
 

OK, I'll convert this to use tuning tables.

Thank you,

--
Maxim Kuvyrkov
www.linaro.org


Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-11-14 Thread Maxim Kuvyrkov
On Nov 14, 2014, at 4:57 AM, Vladimir Makarov vmaka...@redhat.com wrote:

 On 2014-10-21 12:06 AM, Maxim Kuvyrkov wrote:
 
...
 I'd prefer symbolic constants for dont_delay.  Also the address can contains 
 other parts, e.g. index for some targets.  It is not necessary to change the 
 code but a comment would be nice that right now it is oriented for machine 
 with base+disp only addressing.
 
 Although it is probably matter of taste.  So you are free to commit it 
 without any change.

I'll add an enum with symbolic constants for dont_delay and a comment about 
handled memory address types.
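
For illustration only, such symbolic constants and the comment might look
roughly like this (the names are hypothetical, not necessarily what was
committed):

/* Per-insn classification for the auto-prefetch model.  Note: only
   addresses of the form (base + constant displacement) are analysed;
   accesses that use an index register carry no constant offset to order
   by and are treated as irrelevant.  */
enum autopref_delay_status
{
  AUTOPREF_NOT_ANALYSED,    /* insn has not been looked at yet */
  AUTOPREF_IRRELEVANT,      /* not a base+disp memory access */
  AUTOPREF_RELEVANT         /* base+disp access; participates in ordering */
};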

Thanks for the review!

--
Maxim Kuvyrkov
www.linaro.org



Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-11-14 Thread Maxim Kuvyrkov
On Nov 14, 2014, at 8:38 AM, Jeff Law l...@redhat.com wrote:

 On 10/20/14 22:06, Maxim Kuvyrkov wrote:
 Hi,
 Ramana, this change requires benchmarking, which I can't easily do
 at
 the moment. I would appreciate any benchmarking results that you can
 share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
 needs to be tuned/confirmed for Cortex-A15.
 What were the results of that benchmarking?  IIRC I tabled reviewing this 
 work waiting for those results (and I probably should have let you know that. 
  Sorry,  my bad there).

I don't have the benchmarking results yet, and I was hoping for ARM to help 
with getting the numbers.  The arm maintainers still need to OK the 
arm-specific portion of the patch, which, I imagine, will happen only if 
benchmark scores improve.

...
 Can this be built on top of Bin's work for insn fusion?  There's a lot of 
 commonality in the structure of the insns you care about.  He's already got a 
 nice little priority function that I think you could utilize to ensure the 
 insns with smaller offsets fire first.

I would argue that macro-fusion should have been implemented the way 
autopref_model is -- via 
targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook.  To implement the 
autopref model I cleaned up and generalized existing infrastructure (max_issue 
and dfa_lookahead_guard hook) instead of adding yet another decision-making 
primitive to the scheduler.

 
 
 My biggest concern would be sched2 coming along and undoing that work since 
 you're not going to fuse those into move-multiple types of instructions.

The autoprefetcher will be active only during sched2.  It is disabled during 
sched1 by the fact that max_issue is not used when scheduling for register 
pressure.

Thanks,

--
Maxim Kuvyrkov
www.linaro.org



Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-11-14 Thread Jeff Law

On 11/14/14 08:12, Maxim Kuvyrkov wrote:

...

Can this be built on top of Bin's work for insn fusion?  There's a
lot of commonality in the structure of the insns you care about.
He's already got a nice little priority function that I think you
could utilize to to ensure the insns with smaller offsets fire
first.


I would argue that macro-fusion should have been implemented the way
autopref_model is -- via
targetm.sched.first_cycle_multipass_dfa_lookahead_guard hook.  To
implement the autopref model I cleaned up and generalized existing
infrastructure (max_issue and dfa_lookahead_guard hook) instead of
adding yet another decision-making primitive to the scheduler.
Fair enough.  We can put unifying the two implementations on the queue 
for next stage 1.







My biggest concern would be sched2 coming along and undoing that
work since you're not going to fuse those into move-multiple types
of instructions.


The autoprefetcher will be active only during sched2.  It is disabled
during sched1 by the fact that max_issue is not used when scheduling
for register pressure.
Sorry, should have been clearer, that comment was in reference to using 
the fusion stuff to handle prefetching.  It was not in reference to your 
change.


Vlad's approval and my questions crossed last night, I don't have any 
objections to this going in, though I do want us to look at unifying the 
memory pairing and prefetching stuff during the next stage1.


Jeff



Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-11-13 Thread Vladimir Makarov

On 2014-10-21 12:06 AM, Maxim Kuvyrkov wrote:

Hi,

This patch adds auto-prefetcher modeling to GCC scheduler.  The auto-prefetcher 
model is currently enabled only for ARM Cortex-A15, since this is the only CPU 
that I know of to have the hardware auto-prefetcher unit.

The documentation on the auto-prefetcher is very sparse, and all I have are my empirical studies and a short 
note in Cortex-A15 manual (search for L2 cache auto-prefetcher).  This patch, therefore, 
implements a very abstract model that makes the scheduler prefer mem_op (base+8); mem_op (base+12) 
over mem_op (base+12); mem_op (base+8).  In other words, the scheduler tries to issue memory 
operations in order of increasing memory offsets.

The auto-prefetcher model implementation is based on max_issue multipass lookahead 
scheduling, and its guard hook.  The guard hook examines contents of the 
ready list and the queue, and, if it finds instructions with lower memory offsets, marks 
instructions with higher memory offset as unavailable for immediate scheduling.

This patch has been in works since beginning of the year, and many of my 
previous scheduler cleanup patches were to prepare the infrastructure for this 
feature.

Ramana, this change requires benchmarking, which I can't easily do at the 
moment.  I would appreciate any benchmarking results that you can share.  In 
particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be 
tuned/confirmed for Cortex-A15.

At the moment the parameter is set to 2, which means that the autopref model 
will look through ready list and 1-stall queue in search of relevant instructions.  
Values of -1 (disable autopref), 0 (use autopref only in rank_for_schedule), 1 (look 
through ready list), 2 (look through ready list and 1-stall queue), and 3 (look through 
ready list and 2-stall queue) should be considered and benchmarked.

Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueabihf and 
aarch64-linux-gnu.  OK to apply, provided no performance or correctness 
regressions?

[ChangeLog is part of the git patch]




I'd prefer symbolic constants for dont_delay.  Also the address can 
contains other parts, e.g. index for some targets.  It is not necessary 
to change the code but a comment would be nice that right now it is 
oriented for machine with base+disp only addressing.


Although it is probably matter of taste.  So you are free to commit it 
without any change.


  Thanks, Maxim.



Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-11-13 Thread Jeff Law

On 10/20/14 22:06, Maxim Kuvyrkov wrote:

Hi,
Ramana, this change requires benchmarking, which I can't easily do
at

the moment. I would appreciate any benchmarking results that you can
share. In particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH
needs to be tuned/confirmed for Cortex-A15.
What were the results of that benchmarking?  IIRC I tabled reviewing 
this work waiting for those results (and I probably should have let you 
know that.  Sorry,  my bad there).





0007-Model-cache-auto-prefetcher-in-scheduler.patch


 From 629c2cc593b49b8596b00e3e3d6293aa3514 Mon Sep 17 00:00:00 2001
From: Maxim Kuvyrkov maxim.kuvyr...@linaro.org
Date: Mon, 20 Oct 2014 23:13:23 +0100
Subject: [PATCH 7/8] Model cache auto-prefetcher in scheduler

* config/arm/arm.c (sched-int.h): Include header.
(arm_first_cycle_multipass_dfa_lookahead_guard,)
(TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD): Define hook.
Enable auto-prefetcher model for Cortex-A15.
(arm_option_override): Set autoprefetcher parameter.
* config/arm/t-arm (arm.o): Update.
* haifa-sched.c (update_insn_after_change): Update.
(rank_for_schedule): Use auto-prefetcher model, if requested.
(autopref_multipass_init): New static function.
(autopref_rank_for_schedule): New rank_for_schedule heuristic.
(autopref_multipass_dfa_lookahead_guard_started_dump_p): New static
variable for debug dumps.
(autopref_multipass_dfa_lookahead_guard_1): New static helper function.
(autopref_multipass_dfa_lookahead_guard): New global function that
implements TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD hook.
(init_h_i_d): Update.
* params.def (PARAM_SCHED_AUTOPREF_QUEUE_DEPTH): New tuning knob.
* sched-int.h (autopref_multipass_data_): Structure for auto-prefetcher
data.
(autopref_multipass_data_def, autopref_multipass_data_t): New typedefs.
(struct _haifa_insn_data:autopref_multipass_data): New field.
(INSN_AUTOPREF_MULTIPASS_DATA): New access macro.
(autopref_multipass_dfa_lookahead_guard): Declare.
Can this be built on top of Bin's work for insn fusion?  There's a lot 
of commonality in the structure of the insns you care about.  He's 
already got a nice little priority function that I think you could 
utilize to ensure the insns with smaller offsets fire first.



My biggest concern would be sched2 coming along and undoing that work 
since you're not going to fuse those into move-multiple types of 
instructions.


Jeff


Re: [PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-11-10 Thread Maxim Kuvyrkov
On Oct 21, 2014, at 8:06 AM, Maxim Kuvyrkov maxim.kuvyr...@linaro.org wrote:

 Hi,
 
 This patch adds auto-prefetcher modeling to GCC scheduler.  The 
 auto-prefetcher model is currently enabled only for ARM Cortex-A15, since 
 this is the only CPU that I know of to have the hardware auto-prefetcher unit.
 
 The documentation on the auto-prefetcher is very sparse, and all I have are 
 my empirical studies and a short note in Cortex-A15 manual (search for L2 
 cache auto-prefetcher).  This patch, therefore, implements a very abstract 
 model that makes the scheduler prefer mem_op (base+8); mem_op (base+12) over 
 mem_op (base+12); mem_op (base+8).  In other words, the scheduler tries to 
 issue memory operations in order of increasing memory offsets.
 
 The auto-prefetcher model implementation is based on max_issue multipass 
 lookahead scheduling, and its guard hook.  The guard hook examines contents 
 of the ready list and the queue, and, if it finds instructions with lower 
 memory offsets, marks instructions with higher memory offset as unavailable 
 for immediate scheduling.
 
 This patch has been in works since beginning of the year, and many of my 
 previous scheduler cleanup patches were to prepare the infrastructure for 
 this feature. 
 
 Ramana, this change requires benchmarking, which I can't easily do at the 
 moment.  I would appreciate any benchmarking results that you can share.  In 
 particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be 
 tuned/confirmed for Cortex-A15.
 
 At the moment the parameter is set to 2, which means that the autopref 
 model will look through ready list and 1-stall queue in search of relevant 
 instructions.  Values of -1 (disable autopref), 0 (use autopref only in 
 rank_for_schedule), 1 (look through ready list), 2 (look through ready list 
 and 1-stall queue), and 3 (look through ready list and 2-stall queue) should 
 be considered and benchmarked.
 
 Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueabihf and 
 aarch64-linux-gnu.  OK to apply, provided no performance or correctness 
 regressions?
 
 [ChangeLog is part of the git patch]

Ping?

All prerequisite patches for this one are now approved and [mostly] checked in. 
 This is the last outstanding item from my patch series to improve scheduling.

Thank you,

--
Maxim Kuvyrkov
www.linaro.org



[PATCH 7/8] Model cache auto-prefetcher in scheduler

2014-10-20 Thread Maxim Kuvyrkov
Hi,

This patch adds auto-prefetcher modeling to GCC scheduler.  The auto-prefetcher 
model is currently enabled only for ARM Cortex-A15, since this is the only CPU 
that I know of to have the hardware auto-prefetcher unit.

The documentation on the auto-prefetcher is very sparse, and all I have are my 
empirical studies and a short note in Cortex-A15 manual (search for L2 cache 
auto-prefetcher).  This patch, therefore, implements a very abstract model that 
makes the scheduler prefer mem_op (base+8); mem_op (base+12) over mem_op 
(base+12); mem_op (base+8).  In other words, the scheduler tries to issue memory 
operations in order of increasing memory offsets.
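
As a concrete illustration (example code, not from the patch), for a function
like the one below the model steers the second scheduling pass towards issuing
the four loads in ascending-offset order from the same base register, which is
the access pattern the L2 auto-prefetcher is assumed to recognise:

/* With the model active, the scheduler prefers to issue the loads as
   a[0], a[1], a[2], a[3] (offsets 0, 4, 8, 12 from the same base)
   rather than in a shuffled order such as a[3], a[0], a[2], a[1].  */
int
sum4 (const int *a)
{
  return a[0] + a[1] + a[2] + a[3];
}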

The auto-prefetcher model implementation is based on max_issue multipass 
lookahead scheduling, and its guard hook.  The guard hook examines contents 
of the ready list and the queue, and, if it finds instructions with lower 
memory offsets, marks instructions with higher memory offset as unavailable for 
immediate scheduling.
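
A toy sketch of the decision the guard makes (illustrative only; the real hook
works on RTL insns and the scheduler's ready list and queue):

#include <vector>

struct mem_access { int base_reg; int offset; };

/* Return true if CAND may be issued now; reject it while another pending
   access to the same base register has a smaller offset, so that lower
   offsets are issued first.  */
static bool
ok_to_issue_now (const mem_access &cand,
                 const std::vector<mem_access> &pending)
{
  for (const mem_access &p : pending)
    if (p.base_reg == cand.base_reg && p.offset < cand.offset)
      return false;
  return true;
}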

This patch has been in works since beginning of the year, and many of my 
previous scheduler cleanup patches were to prepare the infrastructure for this 
feature. 

Ramana, this change requires benchmarking, which I can't easily do at the 
moment.  I would appreciate any benchmarking results that you can share.  In 
particular, the value of PARAM_SCHED_AUTOPREF_QUEUE_DEPTH needs to be 
tuned/confirmed for Cortex-A15.

At the moment the parameter is set to 2, which means that the autopref model 
will look through ready list and 1-stall queue in search of relevant 
instructions.  Values of -1 (disable autopref), 0 (use autopref only in 
rank_for_schedule), 1 (look through ready list), 2 (look through ready list and 
1-stall queue), and 3 (look through ready list and 2-stall queue) should be 
considered and benchmarked.

Bootstrapped on x86_64-linux-gnu and regtested on arm-linux-gnueabihf and 
aarch64-linux-gnu.  OK to apply, provided no performance or correctness 
regressions?

[ChangeLog is part of the git patch]

Thank you,

--
Maxim Kuvyrkov
www.linaro.org




0007-Model-cache-auto-prefetcher-in-scheduler.patch
Description: Binary data