Re: [PATCH][AArch64] Accept more addressing modes for PRFM

2017-05-04 Thread Kyrill Tkachov


On 15/02/17 15:30, Richard Earnshaw (lists) wrote:

On 15/02/17 15:03, Kyrill Tkachov wrote:

Hi Richard,

On 15/02/17 15:00, Richard Earnshaw (lists) wrote:

On 03/02/17 17:12, Kyrill Tkachov wrote:

Hi all,

While evaluating Maxim's SW prefetch patches [1] I noticed that the
aarch64 prefetch pattern is
overly restrictive in its address operand. It only accepts simple
register addressing modes.
In fact, the PRFM instruction accepts almost all modes that a normal
64-bit LDR supports.
The restriction in the pattern leads to explicit address calculation
code to be emitted which we could avoid.

This patch relaxes the restrictions on the prefetch define_insn. It
creates a predicate and constraint that
allow the full addressing modes that PRFM allows. Thus for the testcase
in the patch (adapted from one of the existing
__builtin_prefetch tests in the testsuite) we can generate a:
    prfm    PLDL1STRM, [x1, 8]

instead of the current
    prfm    PLDL1STRM, [x1]
with an explicit increment of x1 by 8 in a separate instruction.

I've removed the %a output modifier in the output template and wrapped
the address operand into a DImode MEM before
passing it down to aarch64_print_operand.

This is because operand 0 is an address operand rather than a memory
operand and thus doesn't have a mode associated
with it.  When processing the 'a' output modifier the code in final.c
will call TARGET_PRINT_OPERAND_ADDRESS with a VOIDmode
argument.  This will ICE on aarch64 because we need a mode for the
memory in order for aarch64_classify_address to work
correctly.  Rather than overriding the VOIDmode in
aarch64_print_operand_address I decided to instead create the DImode
MEM in the "prefetch" output template and treat it as a normal 64-bit
memory address, which at the point of assembly output
is what it is anyway.

With this patch I see a reduction in instruction count in the SPEC2006
benchmarks when SW prefetching is enabled on top
of Maxim's patchset because fewer address calculation instructions are
emitted due to the use of the more expressive
addressing modes. It also fixes a performance regression that I observed
in 410.bwaves from Maxim's patches on Cortex-A72.
I'll be running a full set of benchmarks to evaluate this further, but I
think this is the right thing to do.

Bootstrapped and tested on aarch64-none-linux-gnu.

Maxim, do you want to try this on top of your patches on your hardware
to see if it helps with the regressions you mentioned?

Thanks,
Kyrill


[1] https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02284.html

2016-02-03  Kyrylo Tkachov  

  * config/aarch64/aarch64.md (prefetch): Adjust predicate and
  constraint on operand 0 to allow more general addressing modes.
  Adjust output template.
  * config/aarch64/aarch64.c (aarch64_address_valid_for_prefetch_p):
  New function.
  * config/aarch64/aarch64-protos.h
  (aarch64_address_valid_for_prefetch_p): Declare prototype.
  * config/aarch64/constraints.md (Dp): New address constraint.
  * config/aarch64/predicates.md (aarch64_prefetch_operand): New
  predicate.

2016-02-03  Kyrylo Tkachov  

  * gcc.target/aarch64/prfm_imm_offset_1.c: New test.

aarch64-prfm-imm.patch


Hmm, I'm not sure about this.  rtl.texi says that a prefetch code
contains an address, not a MEM.  So it's theoretically possible for
generic code to want to look inside the first operand and find an
address directly.  This change would break that assumption.

With this change the prefetch operand is still an address, not a MEM,
during all the optimisation passes.
It's wrapped in a MEM only during the ultimate printing of the assembly
string during 'final'.


Ah!  I'd missed that.

This is OK for stage1.


I've bootstrapped and tested the patch against current trunk and committed
it as r247603.

Thanks,
Kyrill


R.


Kyrill


R.



Re: [PATCH][AArch64] Accept more addressing modes for PRFM

2017-02-15 Thread Richard Earnshaw (lists)
On 15/02/17 15:03, Kyrill Tkachov wrote:
> Hi Richard,
> 
> On 15/02/17 15:00, Richard Earnshaw (lists) wrote:
>> On 03/02/17 17:12, Kyrill Tkachov wrote:
>>> Hi all,
>>>
>>> While evaluating Maxim's SW prefetch patches [1] I noticed that the
>>> aarch64 prefetch pattern is
>>> overly restrictive in its address operand. It only accepts simple
>>> register addressing modes.
>>> In fact, the PRFM instruction accepts almost all modes that a normal
>>> 64-bit LDR supports.
>>> The restriction in the pattern leads to explicit address calculation
>>> code to be emitted which we could avoid.
>>>
>>> This patch relaxes the restrictions on the prefetch define_insn. It
>>> creates a predicate and constraint that
>>> allow the full addressing modes that PRFM allows. Thus for the testcase
>>> in the patch (adapted from one of the existing
>>> __builtin_prefetch tests in the testsuite) we can generate a:
>>>     prfm    PLDL1STRM, [x1, 8]
>>>
>>> instead of the current
>>>     prfm    PLDL1STRM, [x1]
>>> with an explicit increment of x1 by 8 in a separate instruction.
>>>
>>> I've removed the %a output modifier in the output template and wrapped
>>> the address operand into a DImode MEM before
>>> passing it down to aarch64_print_operand.
>>>
>>> This is because operand 0 is an address operand rather than a memory
>>> operand and thus doesn't have a mode associated
>>> with it.  When processing the 'a' output modifier the code in final.c
>>> will call TARGET_PRINT_OPERAND_ADDRESS with a VOIDmode
>>> argument.  This will ICE on aarch64 because we need a mode for the
>>> memory in order for aarch64_classify_address to work
>>> correctly.  Rather than overriding the VOIDmode in
>>> aarch64_print_operand_address I decided to instead create the DImode
>>> MEM in the "prefetch" output template and treat it as a normal 64-bit
>>> memory address, which at the point of assembly output
>>> is what it is anyway.
>>>
>>> With this patch I see a reduction in instruction count in the SPEC2006
>>> benchmarks when SW prefetching is enabled on top
>>> of Maxim's patchset because fewer address calculation instructions are
>>> emitted due to the use of the more expressive
>>> addressing modes. It also fixes a performance regression that I observed
>>> in 410.bwaves from Maxim's patches on Cortex-A72.
>>> I'll be running a full set of benchmarks to evaluate this further, but I
>>> think this is the right thing to do.
>>>
>>> Bootstrapped and tested on aarch64-none-linux-gnu.
>>>
>>> Maxim, do you want to try this on top of your patches on your hardware
>>> to see if it helps with the regressions you mentioned?
>>>
>>> Thanks,
>>> Kyrill
>>>
>>>
>>> [1] https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02284.html
>>>
>>> 2016-02-03  Kyrylo Tkachov  
>>>
>>>  * config/aarch64/aarch64.md (prefetch): Adjust predicate and
>>>  constraint on operand 0 to allow more general addressing modes.
>>>  Adjust output template.
>>>  * config/aarch64/aarch64.c (aarch64_address_valid_for_prefetch_p):
>>>  New function.
>>>  * config/aarch64/aarch64-protos.h
>>>  (aarch64_address_valid_for_prefetch_p): Declare prototype.
>>>  * config/aarch64/constraints.md (Dp): New address constraint.
>>>  * config/aarch64/predicates.md (aarch64_prefetch_operand): New
>>>  predicate.
>>>
>>> 2016-02-03  Kyrylo Tkachov  
>>>
>>>  * gcc.target/aarch64/prfm_imm_offset_1.c: New test.
>>>
>>> aarch64-prfm-imm.patch
>>>
>> Hmm, I'm not sure about this.  rtl.texi says that a prefetch code
>> contains an address, not a MEM.  So it's theoretically possible for
>> generic code to want to look inside the first operand and find an
>> address directly.  This change would break that assumption.
> 
> With this change the prefetch operand is still an address, not a MEM,
> during all the optimisation passes.
> It's wrapped in a MEM only during the ultimate printing of the assembly
> string during 'final'.
> 

Ah!  I'd missed that.

This is OK for stage1.

R.

> Kyrill
> 
>> R.
>>

Re: [PATCH][AArch64] Accept more addressing modes for PRFM

2017-02-15 Thread Kyrill Tkachov

Hi Richard,

On 15/02/17 15:00, Richard Earnshaw (lists) wrote:

On 03/02/17 17:12, Kyrill Tkachov wrote:

Hi all,

While evaluating Maxim's SW prefetch patches [1] I noticed that the
aarch64 prefetch pattern is
overly restrictive in its address operand. It only accepts simple
register addressing modes.
In fact, the PRFM instruction accepts almost all modes that a normal
64-bit LDR supports.
The restriction in the pattern leads to explicit address calculation
code to be emitted which we could avoid.

This patch relaxes the restrictions on the prefetch define_insn. It
creates a predicate and constraint that
allow the full addressing modes that PRFM allows. Thus for the testcase
in the patch (adapted from one of the existing
__builtin_prefetch tests in the testsuite) we can generate a:
    prfm    PLDL1STRM, [x1, 8]

instead of the current
    prfm    PLDL1STRM, [x1]
with an explicit increment of x1 by 8 in a separate instruction.

I've removed the %a output modifier in the output template and wrapped
the address operand into a DImode MEM before
passing it down to aarch64_print_operand.

This is because operand 0 is an address operand rather than a memory
operand and thus doesn't have a mode associated
with it.  When processing the 'a' output modifier the code in final.c
will call TARGET_PRINT_OPERAND_ADDRESS with a VOIDmode
argument.  This will ICE on aarch64 because we need a mode for the
memory in order for aarch64_classify_address to work
correctly.  Rather than overriding the VOIDmode in
aarch64_print_operand_address I decided to instead create the DImode
MEM in the "prefetch" output template and treat it as a normal 64-bit
memory address, which at the point of assembly output
is what it is anyway.

With this patch I see a reduction in instruction count in the SPEC2006
benchmarks when SW prefetching is enabled on top
of Maxim's patchset because fewer address calculation instructions are
emitted due to the use of the more expressive
addressing modes. It also fixes a performance regression that I observed
in 410.bwaves from Maxim's patches on Cortex-A72.
I'll be running a full set of benchmarks to evaluate this further, but I
think this is the right thing to do.

Bootstrapped and tested on aarch64-none-linux-gnu.

Maxim, do you want to try this on top of your patches on your hardware
to see if it helps with the regressions you mentioned?

Thanks,
Kyrill


[1] https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02284.html

2016-02-03  Kyrylo Tkachov  

 * config/aarch64/aarch64.md (prefetch): Adjust predicate and
 constraint on operand 0 to allow more general addressing modes.
 Adjust output template.
 * config/aarch64/aarch64.c (aarch64_address_valid_for_prefetch_p):
 New function.
 * config/aarch64/aarch64-protos.h
 (aarch64_address_valid_for_prefetch_p): Declare prototype.
 * config/aarch64/constraints.md (Dp): New address constraint.
 * config/aarch64/predicates.md (aarch64_prefetch_operand): New
 predicate.

2016-02-03  Kyrylo Tkachov  

 * gcc.target/aarch64/prfm_imm_offset_1.c: New test.

aarch64-prfm-imm.patch


Hmm, I'm not sure about this.  rtl.texi says that a prefetch code
contains an address, not a MEM.  So it's theoretically possible for
generic code to want to look inside the first operand and find an
address directly.  This change would break that assumption.


With this change the prefetch operand is still an address, not a MEM,
during all the optimisation passes.
It's wrapped in a MEM only during the ultimate printing of the assembly
string during 'final'.

Kyrill


R.



Re: [PATCH][AArch64] Accept more addressing modes for PRFM

2017-02-15 Thread Richard Earnshaw (lists)
On 03/02/17 17:12, Kyrill Tkachov wrote:
> Hi all,
> 
> While evaluating Maxim's SW prefetch patches [1] I noticed that the
> aarch64 prefetch pattern is
> overly restrictive in its address operand. It only accepts simple
> register addressing modes.
> In fact, the PRFM instruction accepts almost all modes that a normal
> 64-bit LDR supports.
> The restriction in the pattern leads to explicit address calculation
> code to be emitted which we could avoid.
> 
> This patch relaxes the restrictions on the prefetch define_insn. It
> creates a predicate and constraint that
> allow the full addressing modes that PRFM allows. Thus for the testcase
> in the patch (adapted from one of the existing
> __builtin_prefetch tests in the testsuite) we can generate a:
>     prfm    PLDL1STRM, [x1, 8]
> 
> instead of the current
>     prfm    PLDL1STRM, [x1]
> with an explicit increment of x1 by 8 in a separate instruction.
> 
> I've removed the %a output modifier in the output template and wrapped
> the address operand into a DImode MEM before
> passing it down to aarch64_print_operand.
> 
> This is because operand 0 is an address operand rather than a memory
> operand and thus doesn't have a mode associated
> with it.  When processing the 'a' output modifier the code in final.c
> will call TARGET_PRINT_OPERAND_ADDRESS with a VOIDmode
> argument.  This will ICE on aarch64 because we need a mode for the
> memory in order for aarch64_classify_address to work
> correctly.  Rather than overriding the VOIDmode in
> aarch64_print_operand_address I decided to instead create the DImode
> MEM in the "prefetch" output template and treat it as a normal 64-bit
> memory address, which at the point of assembly output
> is what it is anyway.
> 
> With this patch I see a reduction in instruction count in the SPEC2006
> benchmarks when SW prefetching is enabled on top
> of Maxim's patchset because fewer address calculation instructions are
> emitted due to the use of the more expressive
> addressing modes. It also fixes a performance regression that I observed
> in 410.bwaves from Maxim's patches on Cortex-A72.
> I'll be running a full set of benchmarks to evaluate this further, but I
> think this is the right thing to do.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> 
> Maxim, do you want to try this on top of your patches on your hardware
> to see if it helps with the regressions you mentioned?
> 
> Thanks,
> Kyrill
> 
> 
> [1] https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02284.html
> 
> 2016-02-03  Kyrylo Tkachov  
> 
> * config/aarch64/aarch64.md (prefetch): Adjust predicate and
> constraint on operand 0 to allow more general addressing modes.
> Adjust output template.
> * config/aarch64/aarch64.c (aarch64_address_valid_for_prefetch_p):
> New function.
> * config/aarch64/aarch64-protos.h
> (aarch64_address_valid_for_prefetch_p): Declare prototype.
> * config/aarch64/constraints.md (Dp): New address constraint.
> * config/aarch64/predicates.md (aarch64_prefetch_operand): New
> predicate.
> 
> 2016-02-03  Kyrylo Tkachov  
> 
> * gcc.target/aarch64/prfm_imm_offset_1.c: New test.
> 
> aarch64-prfm-imm.patch
> 

Hmm, I'm not sure about this.  rtl.texi says that a prefetch code
contains an address, not a MEM.  So it's theoretically possible for
generic code to want to look inside the first operand and find an
address directly.  This change would break that assumption.

R.

> 

Re: [PATCH][AArch64] Accept more addressing modes for PRFM

2017-02-04 Thread Maxim Kuvyrkov
> On Feb 3, 2017, at 8:12 PM, Kyrill Tkachov wrote:
> 
> Hi all,
> 
> While evaluating Maxim's SW prefetch patches [1] I noticed that the aarch64 
> prefetch pattern is
> overly restrictive in its address operand. It only accepts simple register 
> addressing modes.
> In fact, the PRFM instruction accepts almost all modes that a normal 64-bit 
> LDR supports.
> The restriction in the pattern leads to explicit address calculation code to 
> be emitted which we could avoid.

Thanks for this fix, I'll test it on my hardware.

I've reviewed your patch and it looks OK to me.

> 
> This patch relaxes the restrictions on the prefetch define_insn. It creates a 
> predicate and constraint that
> allow the full addressing modes that PRFM allows. Thus for the testcase in 
> the patch (adapted from one of the existing
> __builtin_prefetch tests in the testsuite) we can generate a:
>     prfm    PLDL1STRM, [x1, 8]
> 
> instead of the current
>     prfm    PLDL1STRM, [x1]
> with an explicit increment of x1 by 8 in a separate instruction.
> 
> I've removed the %a output modifier in the output template and wrapped the 
> address operand into a DImode MEM before
> passing it down to aarch64_print_operand.
> 
> This is because operand 0 is an address operand rather than a memory operand 
> and thus doesn't have a mode associated
> with it.  When processing the 'a' output modifier the code in final.c will 
> call TARGET_PRINT_OPERAND_ADDRESS with a VOIDmode
> argument.  This will ICE on aarch64 because we need a mode for the memory in 
> order for aarch64_classify_address to work
> correctly.  Rather than overriding the VOIDmode in 
> aarch64_print_operand_address I decided to instead create the DImode
> MEM in the "prefetch" output template and treat it as a normal 64-bit memory 
> address, which at the point of assembly output
> is what it is anyway.

I agree that it is cleaner to wrap the prefetch operand in a DImode MEM just
before printing it out as assembly.  There is little to be gained from relaxing
the asserts in aarch64_print_operand_address.

> 
> With this patch I see a reduction in instruction count in the SPEC2006 
> benchmarks when SW prefetching is enabled on top
> of Maxim's patchset because fewer address calculation instructions are 
> emitted due to the use of the more expressive
> addressing modes. It also fixes a performance regression that I observed in 
> 410.bwaves from Maxim's patches on Cortex-A72.
> I'll be running a full set of benchmarks to evaluate this further, but I 
> think this is the right thing to do.
> 
> Bootstrapped and tested on aarch64-none-linux-gnu.
> 
> Maxim, do you want to try this on top of your patches on your hardware to see 
> if it helps with the regressions you mentioned?

Sure.

--
Maxim Kuvyrkov
www.linaro.org




[PATCH][AArch64] Accept more addressing modes for PRFM

2017-02-03 Thread Kyrill Tkachov

Hi all,

While evaluating Maxim's SW prefetch patches [1] I noticed that the aarch64
prefetch pattern is overly restrictive in its address operand: it only accepts
simple register addressing modes.
In fact, the PRFM instruction accepts almost all the addressing modes that a
normal 64-bit LDR supports.
The restriction in the pattern causes explicit address calculation code to be
emitted, which we could avoid.

This patch relaxes the restrictions on the prefetch define_insn. It creates a
predicate and a constraint that allow the full set of addressing modes that
PRFM allows. Thus for the testcase in the patch (adapted from one of the
existing __builtin_prefetch tests in the testsuite) we can generate:
    prfm    PLDL1STRM, [x1, 8]

instead of the current
    prfm    PLDL1STRM, [x1]
with an explicit increment of x1 by 8 in a separate instruction.
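
For illustration, here is a minimal sketch of the kind of source that exercises
this. It is not the testcase from the patch: the function name and the loop are
hypothetical, and whether the compiler actually folds the 8-byte offset into
the PRFM depends on the optimisation level and the surrounding code.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: sums N longs while
   prefetching the element one slot ahead (8 bytes on LP64 targets
   such as aarch64-linux).  */
long
sum_with_prefetch (const long *p, size_t n)
{
  long total = 0;

  for (size_t i = 0; i < n; i++)
    {
      /* Second argument 0 = prefetch for read, third argument 0 = no
         temporal locality.  The address is only a hint and is never
         dereferenced, so pointing one past the end is harmless.  */
      __builtin_prefetch (p + i + 1, 0, 0);
      total += p[i];
    }

  return total;
}
```

With rw = 0 and locality = 0 the aarch64 backend is expected to pick the
PLDL1STRM variant, matching the assembly shown above.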

I've removed the %a output modifier in the output template and wrapped the
address operand into a DImode MEM before passing it down to
aarch64_print_operand.

This is because operand 0 is an address operand rather than a memory operand
and thus doesn't have a mode associated with it.  When processing the 'a'
output modifier the code in final.c will call TARGET_PRINT_OPERAND_ADDRESS
with a VOIDmode argument.  This will ICE on aarch64 because we need a mode for
the memory in order for aarch64_classify_address to work correctly.  Rather
than overriding the VOIDmode in aarch64_print_operand_address I decided to
instead create the DImode MEM in the "prefetch" output template and treat it
as a normal 64-bit memory address, which at the point of assembly output is
what it is anyway.
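
The dependence on a mode can be made concrete with a toy model. This is not
GCC code: the toy_* names are hypothetical stand-ins for
aarch64_classify_address and the new predicate, and the immediate ranges (a
signed 9-bit unscaled offset, or an unsigned 12-bit offset scaled by the access
size) are stated here as an assumption about the 64-bit LDR/PRFM encodings
rather than taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical address kinds, loosely mirroring base+immediate,
   base+register and writeback addressing.  */
enum toy_addr_kind { TOY_REG_IMM, TOY_REG_REG, TOY_REG_WB };

struct toy_addr
{
  enum toy_addr_kind kind;
  long offset;   /* Only meaningful for TOY_REG_IMM.  */
};

/* Whether OFFSET is encodable depends on SIZE, the access size in
   bytes.  This is the information a mode provides.  */
static bool
toy_offset_ok (long offset, int size)
{
  bool unscaled = offset >= -256 && offset <= 255;
  bool scaled = offset >= 0 && offset % size == 0 && offset / size <= 4095;
  return unscaled || scaled;
}

/* Model of the predicate: classify as an 8-byte (DImode-like) access,
   then reject writeback forms.  */
bool
toy_valid_for_prefetch (const struct toy_addr *a)
{
  if (a->kind == TOY_REG_WB)
    return false;
  if (a->kind == TOY_REG_IMM)
    return toy_offset_ok (a->offset, 8);
  return true;
}
```

The access size passed to toy_offset_ok plays the role of the mode: without it
there is no way to decide whether an offset such as 8 is encodable, which is
the situation a VOIDmode address puts aarch64_classify_address in.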

With this patch I see a reduction in instruction count in the SPEC2006
benchmarks when SW prefetching is enabled on top of Maxim's patchset, because
fewer address calculation instructions are emitted due to the use of the more
expressive addressing modes. It also fixes a performance regression that I
observed in 410.bwaves from Maxim's patches on Cortex-A72.
I'll be running a full set of benchmarks to evaluate this further, but I think
this is the right thing to do.

Bootstrapped and tested on aarch64-none-linux-gnu.

Maxim, do you want to try this on top of your patches on your hardware to see
if it helps with the regressions you mentioned?

Thanks,
Kyrill


[1] https://gcc.gnu.org/ml/gcc-patches/2017-01/msg02284.html

2016-02-03  Kyrylo Tkachov

    * config/aarch64/aarch64.md (prefetch): Adjust predicate and
    constraint on operand 0 to allow more general addressing modes.
    Adjust output template.
    * config/aarch64/aarch64.c (aarch64_address_valid_for_prefetch_p):
    New function.
    * config/aarch64/aarch64-protos.h
    (aarch64_address_valid_for_prefetch_p): Declare prototype.
    * config/aarch64/constraints.md (Dp): New address constraint.
    * config/aarch64/predicates.md (aarch64_prefetch_operand): New
    predicate.

2016-02-03  Kyrylo Tkachov

    * gcc.target/aarch64/prfm_imm_offset_1.c: New test.

commit a324e2f2ea243fe42f23a026ecbe1435876e2c8b
Author: Kyrylo Tkachov 
Date:   Thu Feb 2 14:46:11 2017 +

[AArch64] Accept more addressing modes for PRFM

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index babc327..61706de 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -300,6 +300,7 @@ extern struct tune_params aarch64_tune_params;
 
 HOST_WIDE_INT aarch64_initial_elimination_offset (unsigned, unsigned);
 int aarch64_get_condition_code (rtx);
+bool aarch64_address_valid_for_prefetch_p (rtx, bool);
 bool aarch64_bitmask_imm (HOST_WIDE_INT val, machine_mode);
 unsigned HOST_WIDE_INT aarch64_and_split_imm1 (HOST_WIDE_INT val_in);
 unsigned HOST_WIDE_INT aarch64_and_split_imm2 (HOST_WIDE_INT val_in);
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index acc093a..c05eff3 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -4549,6 +4549,24 @@ aarch64_classify_address (struct aarch64_address_info *info,
 }
 }
 
+/* Return true if the address X is valid for a PRFM instruction.
+   STRICT_P is true if we should do strict checking with
+   aarch64_classify_address.  */
+
+bool
+aarch64_address_valid_for_prefetch_p (rtx x, bool strict_p)
+{
+  struct aarch64_address_info addr;
+
+  /* PRFM accepts the same addresses as DImode...  */
+  bool res = aarch64_classify_address (&addr, x, DImode, MEM, strict_p);
+  if (!res)
+    return false;
+
+  /* ... except writeback forms.  */
+  return addr.type != ADDRESS_REG_WB;
+}
+
 bool
 aarch64_symbolic_address_p (rtx x)
 {
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index b9e8edf..c6201a5 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -518,27 +518,31 @@ (define_insn "nop"
 )
 
 (define_insn "prefetch"
-  [(prefetch (match_operand:DI 0 "register_operand" "r")
+  [(prefetch