Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2019-09-17 Thread Richard Henderson
On 9/17/19 6:55 AM, Wilco Dijkstra wrote:
> Hi Kyrill,
> 
>>> When you select a CPU the goal is that we optimize and schedule for that
>>> specific microarchitecture. That implies using atomics that work best for
>>> that core rather than outlining them.
>>
>> I think we want to go ahead with this framework to enable the portable 
>> deployment of LSE atomics.
>>
>> More CPU-specific fine-tuning can come later separately.
> 
> I'm not talking about CPU-specific fine-tuning, but ensuring we don't penalize
> performance when a user selects the specific CPU their application will run 
> on.
> And in that case outlining is unnecessary.

From aarch64_override_options:

Given both -march=foo -mcpu=bar, then the architecture will be foo and -mcpu
will be treated as -mtune=bar, but will not use any insn not in foo.

Given only -mcpu=foo, then the architecture will be the one supported by foo.

So if foo supports LSE, then we will not outline the functions, no matter how
we arrive at foo.
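
The resolution rule described above can be sketched in C. This is only an
illustration of the reasoning, not the code in aarch64_override_options: the
tables, the sketch_* names, and the treatment of unknown names as base
ARMv8.0 are all assumptions made for the example, and it presumes that
outlining is enabled whenever LSE is not known to be available.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: a target name and whether it includes LSE.
   This is illustration data, not GCC's internal tables.  */
struct sketch_target { const char *name; bool has_lse; };

static const struct sketch_target sketch_archs[] = {
  { "armv8-a", false },
  { "armv8-a+lse", true },
  { "armv8.1-a", true },
};

static const struct sketch_target sketch_cpus[] = {
  { "cortex-a53", false },
  { "cortex-a55", true },
  { "cortex-a76", true },
};

/* Look NAME up in TABLE; unknown names are assumed to be base ARMv8.0.  */
static bool
sketch_lookup (const struct sketch_target *table, size_t n, const char *name)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (table[i].name, name) == 0)
      return table[i].has_lse;
  return false;
}

/* Return true if atomics would be outlined.  MARCH and MCPU may be NULL
   when the corresponding option was not given.  */
bool
sketch_outlines_atomics (const char *march, const char *mcpu)
{
  bool has_lse = false;

  if (march != NULL)
    /* -march wins; -mcpu then only acts like -mtune.  */
    has_lse = sketch_lookup (sketch_archs,
                             sizeof sketch_archs / sizeof sketch_archs[0],
                             march);
  else if (mcpu != NULL)
    /* Only -mcpu: the architecture is the one the cpu supports.  */
    has_lse = sketch_lookup (sketch_cpus,
                             sizeof sketch_cpus / sizeof sketch_cpus[0],
                             mcpu);

  return !has_lse;
}
```

Under these assumptions, sketch_outlines_atomics (NULL, "cortex-a76") is
false, while sketch_outlines_atomics ("armv8-a", "cortex-a76") is true,
because the explicit architecture excludes LSE and the cpu only tunes.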


r~


Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2019-09-17 Thread Wilco Dijkstra
Hi Kyrill,

>> When you select a CPU the goal is that we optimize and schedule for that
>> specific microarchitecture. That implies using atomics that work best for
>> that core rather than outlining them.
>
> I think we want to go ahead with this framework to enable the portable 
> deployment of LSE atomics.
>
> More CPU-specific fine-tuning can come later separately.

I'm not talking about CPU-specific fine-tuning, but ensuring we don't penalize
performance when a user selects the specific CPU their application will run on.
And in that case outlining is unnecessary.

Cheers,
Wilco

Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2019-09-17 Thread Kyrill Tkachov



On 9/16/19 12:58 PM, Wilco Dijkstra wrote:
> Hi Richard,
>
>>> So what is the behaviour when you explicitly select a specific CPU?
>>
>> Selecting a specific cpu selects the specific architecture that the cpu
>> supports, does it not?  Thus the architecture example above still applies.
>>
>> Unless I don't understand what distinction that you're making?
>
> When you select a CPU the goal is that we optimize and schedule for that
> specific microarchitecture. That implies using atomics that work best for
> that core rather than outlining them.

I think we want to go ahead with this framework to enable the portable
deployment of LSE atomics.

More CPU-specific fine-tuning can come later separately.

Thanks,
Kyrill

>>> I'd say that by the time GCC10 is released and used in distros, systems
>>> without LSE atomics would be practically non-existent. So we should favour
>>> LSE atomics by default.
>>
>> I suppose.  Does it not continue to be true that an a53 is more impacted
>> by the branch prediction than an a76?
>
> That's hard to say for sure - the cost of taken branches (3 in just a few
> instructions for the outlined atomics) might well affect big/wide cores
> more. Also note Cortex-A55 (successor of Cortex-A53) has LSE atomics.
>
> Wilco


Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2019-09-16 Thread Wilco Dijkstra
Hi Richard,

>> So what is the behaviour when you explicitly select a specific CPU?
>
> Selecting a specific cpu selects the specific architecture that the cpu
> supports, does it not?  Thus the architecture example above still applies.
>
> Unless I don't understand what distinction that you're making?

When you select a CPU the goal is that we optimize and schedule for that
specific microarchitecture. That implies using atomics that work best for
that core rather than outlining them.

>> I'd say that by the time GCC10 is released and used in distros, systems
>> without LSE atomics would be practically non-existent. So we should favour
>> LSE atomics by default.
>
> I suppose.  Does it not continue to be true that an a53 is more impacted
> by the branch prediction than an a76?

That's hard to say for sure - the cost of taken branches (3 in just a few
instructions for the outlined atomics) might well affect big/wide cores more.
Also note Cortex-A55 (successor of Cortex-A53) has LSE atomics.

Wilco

Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2019-09-14 Thread Richard Henderson
On 9/5/19 10:35 AM, Wilco Dijkstra wrote:
> Agreed. I've got a couple of general comments:
> 
> * The option name -matomic-ool sounds too abbreviated. I think eg.
> -moutline-atomics is more descriptive and user friendlier.

Changed.

> * Similarly the exported __aa64_have_atomics variable could be named
>   __aarch64_have_lse_atomics so it's clear that it is about LSE atomics.

Changed.

> +@item -matomic-ool
> +@itemx -mno-atomic-ool
> +Enable or disable calls to out-of-line helpers to implement atomic operations.
> +These helpers will, at runtime, determine if ARMv8.1-Atomics instructions
> +should be used; if not, they will use the load/store-exclusive instructions
> +that are present in the base ARMv8.0 ISA.
> +
> +This option is only applicable when compiling for the base ARMv8.0
> +instruction set.  If using a later revision, e.g. @option{-march=armv8.1-a}
> +or @option{-march=armv8-a+lse}, the ARMv8.1-Atomics instructions will be
> +used directly. 
> 
> So what is the behaviour when you explicitly select a specific CPU?

Selecting a specific cpu selects the specific architecture that the cpu
supports, does it not?  Thus the architecture example above still applies.

Unless I don't understand what distinction that you're making?
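
The runtime check that the quoted documentation describes can be sketched
roughly as follows. This is not the code from lse-init.c or lse.S: the
sketch_* names are hypothetical, the flag merely stands in for the exported
__aarch64_have_lse_atomics variable, and both paths use GCC's __atomic
builtins as placeholders for the real LSE and load/store-exclusive
sequences. It also assumes Linux, where getauxval(AT_HWCAP) reports
HWCAP_ATOMICS; on any other host the flag simply stays false.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
# include <sys/auxv.h>
# ifndef HWCAP_ATOMICS
#  define HWCAP_ATOMICS (1 << 8)   /* Linux arm64 hwcap bit for LSE */
# endif
#endif

/* Hypothetical stand-in for __aarch64_have_lse_atomics.  */
static bool sketch_have_lse_atomics;

/* Runs before main, so the flag is fixed once constructors have run.  */
__attribute__ ((constructor))
static void
sketch_init_have_lse (void)
{
#if defined(__linux__) && defined(__aarch64__)
  sketch_have_lse_atomics = (getauxval (AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
  sketch_have_lse_atomics = false;   /* assumption for non-AArch64 hosts */
#endif
}

/* Hypothetical out-of-line helper: 32-bit fetch-and-add, returning the
   previous value.  */
uint32_t
sketch_ldadd4 (uint32_t value, uint32_t *ptr)
{
  if (sketch_have_lse_atomics)
    /* A real helper would execute a single LSE instruction here.  */
    return __atomic_fetch_add (ptr, value, __ATOMIC_SEQ_CST);

  /* Fallback: a retry loop standing in for a load/store-exclusive loop.  */
  uint32_t old = __atomic_load_n (ptr, __ATOMIC_RELAXED);
  while (!__atomic_compare_exchange_n (ptr, &old, old + value, true,
                                       __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
    ;   /* OLD was refreshed with the current value; try again.  */
  return old;
}
```

Both paths return the same result; which one runs is decided once at
startup, which is why the check can live inside a shared helper instead of
being chosen at compile time.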

> +/* Branch to LABEL if LSE is enabled.
> +   The branch should be easily predicted, in that it will, after constructors,
> +   always branch the same way.  The expectation is that systems that implement
> +   ARMv8.1-Atomics are "beefier" than those that omit the extension.
> +   By arranging for the fall-through path to use load-store-exclusive insns,
> +   we aid the branch predictor of the smallest cpus.  */ 
> 
> I'd say that by the time GCC10 is released and used in distros, systems without
> LSE atomics would be practically non-existent. So we should favour LSE atomics
> by default.

I suppose.  Does it not continue to be true that an a53 is more impacted by the
branch prediction than an a76?
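
For readers following the layout question, here is a rough C rendering of
the arrangement the quoted comment describes: test the flag, branch away
for LSE, and fall through to the exclusive-style loop. The sketch_* names
are hypothetical and the real helpers are written in assembly.
__builtin_expect is only a hint, so the compiler is free to lay the code
out differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag; in the real scheme it is set once at startup.  */
static bool sketch_lse_enabled;

/* Hypothetical 64-bit swap helper, returning the previous value.  */
uint64_t
sketch_swp8 (uint64_t value, uint64_t *ptr)
{
  /* "Branch to LABEL if LSE is enabled": hinted as not taken, so the
     fall-through code below is the non-LSE path.  */
  if (__builtin_expect (sketch_lse_enabled, 0))
    goto use_lse;

  {
    /* Fall-through: retry loop standing in for load/store-exclusive.  */
    uint64_t old = __atomic_load_n (ptr, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n (ptr, &old, value, true,
                                         __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
      ;
    return old;
  }

use_lse:
  /* A real helper would execute a single LSE swap instruction here.  */
  return __atomic_exchange_n (ptr, value, __ATOMIC_ACQ_REL);
}
```

Favouring LSE by default, as suggested in the quoted text, would amount to
swapping which path falls through; in this sketch that is the second
argument of __builtin_expect and the order of the two blocks.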


r~


Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2019-09-05 Thread Wilco Dijkstra
Hi Richard,

> What I have not done, but is now a possibility, is to use a custom
> calling convention for the out-of-line routines.  I now only clobber
> 2 (or 3, for TImode) temp regs and set a return value.

This would be a great feature to have since it reduces the overhead of
outlining considerably.

> I think this patch series would be great to have for GCC 10!

Agreed. I've got a couple of general comments:

* The option name -matomic-ool sounds too abbreviated. I think eg.
-moutline-atomics is more descriptive and user friendlier.

* Similarly the exported __aa64_have_atomics variable could be named
  __aarch64_have_lse_atomics so it's clear that it is about LSE atomics.

+@item -matomic-ool
+@itemx -mno-atomic-ool
+Enable or disable calls to out-of-line helpers to implement atomic operations.
+These helpers will, at runtime, determine if ARMv8.1-Atomics instructions
+should be used; if not, they will use the load/store-exclusive instructions
+that are present in the base ARMv8.0 ISA.
+
+This option is only applicable when compiling for the base ARMv8.0
+instruction set.  If using a later revision, e.g. @option{-march=armv8.1-a}
+or @option{-march=armv8-a+lse}, the ARMv8.1-Atomics instructions will be
+used directly. 

So what is the behaviour when you explicitly select a specific CPU?

+/* Branch to LABEL if LSE is enabled.
+   The branch should be easily predicted, in that it will, after constructors,
+   always branch the same way.  The expectation is that systems that implement
+   ARMv8.1-Atomics are "beefier" than those that omit the extension.
+   By arranging for the fall-through path to use load-store-exclusive insns,
+   we aid the branch predictor of the smallest cpus.  */ 

I'd say that by the time GCC10 is released and used in distros, systems without
LSE atomics would be practically non-existent. So we should favour LSE atomics
by default.

Cheers,
Wilco


Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2019-09-05 Thread Kyrill Tkachov

Hi Richard,

On 11/1/18 9:46 PM, Richard Henderson wrote:
> From: Richard Henderson
>
> Changes since v2:
>   * Committed half of the patch set.
>   * Split inline TImode support from out-of-line patches.
>   * Removed the ST out-of-line functions, to match inline.
>   * Moved the out-of-line functions to assembly.
>
> What I have not done, but is now a possibility, is to use a custom
> calling convention for the out-of-line routines.  I now only clobber
> 2 (or 3, for TImode) temp regs and set a return value.

I think this patch series would be great to have for GCC 10!

I've rebased them on current trunk and fixed up a couple of minor
conflicts in my local tree.

After that, I've encountered a couple of issues with building a compiler
with these patches.

I'll respond to the individual patches that I think cause the trouble.

Thanks,
Kyrill

> r~
>
> Richard Henderson (6):
>   aarch64: Extend %R for integer registers
>   aarch64: Implement TImode compare-and-swap
>   aarch64: Tidy aarch64_split_compare_and_swap
>   aarch64: Add out-of-line functions for LSE atomics
>   aarch64: Implement -matomic-ool
>   Enable -matomic-ool by default
>
>  gcc/config/aarch64/aarch64-protos.h          |  13 +
>  gcc/common/config/aarch64/aarch64-common.c   |   6 +-
>  gcc/config/aarch64/aarch64.c                 | 211
>  .../atomic-comp-swap-release-acquire.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-acq_rel.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-acquire.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-char.c      |   2 +-
>  .../gcc.target/aarch64/atomic-op-consume.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-imm.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-int.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-long.c      |   2 +-
>  .../gcc.target/aarch64/atomic-op-relaxed.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-release.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-seq_cst.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-short.c     |   2 +-
>  .../aarch64/atomic_cmp_exchange_zero_reg_1.c |   2 +-
>  .../atomic_cmp_exchange_zero_strong_1.c      |   2 +-
>  .../gcc.target/aarch64/sync-comp-swap.c      |   2 +-
>  .../gcc.target/aarch64/sync-op-acquire.c     |   2 +-
>  .../gcc.target/aarch64/sync-op-full.c        |   2 +-
>  libgcc/config/aarch64/lse-init.c             |  45
>  gcc/config/aarch64/aarch64.opt               |   4 +
>  gcc/config/aarch64/atomics.md                | 185 +-
>  gcc/config/aarch64/iterators.md              |   3 +
>  gcc/doc/invoke.texi                          |  14 +-
>  libgcc/config.host                           |   4 +
>  libgcc/config/aarch64/lse.S                  | 238 ++
>  libgcc/config/aarch64/t-lse                  |  44
>  28 files changed, 717 insertions(+), 84 deletions(-)
>  create mode 100644 libgcc/config/aarch64/lse-init.c
>  create mode 100644 libgcc/config/aarch64/lse.S
>  create mode 100644 libgcc/config/aarch64/t-lse
>
> --
> 2.17.2



Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2018-11-11 Thread Richard Henderson
Ping.

On 11/1/18 10:46 PM, Richard Henderson wrote:
> From: Richard Henderson 
> 
> Changes since v2:
>   * Committed half of the patch set.
>   * Split inline TImode support from out-of-line patches.
>   * Removed the ST out-of-line functions, to match inline.
>   * Moved the out-of-line functions to assembly.
> 
> What I have not done, but is now a possibility, is to use a custom
> calling convention for the out-of-line routines.  I now only clobber
> 2 (or 3, for TImode) temp regs and set a return value.
> 
> 
> r~
>   
> 
> Richard Henderson (6):
>   aarch64: Extend %R for integer registers
>   aarch64: Implement TImode compare-and-swap
>   aarch64: Tidy aarch64_split_compare_and_swap
>   aarch64: Add out-of-line functions for LSE atomics
>   aarch64: Implement -matomic-ool
>   Enable -matomic-ool by default
> 
>  gcc/config/aarch64/aarch64-protos.h          |  13 +
>  gcc/common/config/aarch64/aarch64-common.c   |   6 +-
>  gcc/config/aarch64/aarch64.c                 | 211
>  .../atomic-comp-swap-release-acquire.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-acq_rel.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-acquire.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-char.c      |   2 +-
>  .../gcc.target/aarch64/atomic-op-consume.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-imm.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-int.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-long.c      |   2 +-
>  .../gcc.target/aarch64/atomic-op-relaxed.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-release.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-seq_cst.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-short.c     |   2 +-
>  .../aarch64/atomic_cmp_exchange_zero_reg_1.c |   2 +-
>  .../atomic_cmp_exchange_zero_strong_1.c      |   2 +-
>  .../gcc.target/aarch64/sync-comp-swap.c      |   2 +-
>  .../gcc.target/aarch64/sync-op-acquire.c     |   2 +-
>  .../gcc.target/aarch64/sync-op-full.c        |   2 +-
>  libgcc/config/aarch64/lse-init.c             |  45
>  gcc/config/aarch64/aarch64.opt               |   4 +
>  gcc/config/aarch64/atomics.md                | 185 +-
>  gcc/config/aarch64/iterators.md              |   3 +
>  gcc/doc/invoke.texi                          |  14 +-
>  libgcc/config.host                           |   4 +
>  libgcc/config/aarch64/lse.S                  | 238 ++
>  libgcc/config/aarch64/t-lse                  |  44
>  28 files changed, 717 insertions(+), 84 deletions(-)
>  create mode 100644 libgcc/config/aarch64/lse-init.c
>  create mode 100644 libgcc/config/aarch64/lse.S
>  create mode 100644 libgcc/config/aarch64/t-lse
> 



[PATCH, AArch64, v3 0/6] LSE atomics out-of-line

2018-11-01 Thread Richard Henderson
From: Richard Henderson 

Changes since v2:
  * Committed half of the patch set.
  * Split inline TImode support from out-of-line patches.
  * Removed the ST out-of-line functions, to match inline.
  * Moved the out-of-line functions to assembly.

What I have not done, but is now a possibility, is to use a custom
calling convention for the out-of-line routines.  I now only clobber
2 (or 3, for TImode) temp regs and set a return value.


r~
  

Richard Henderson (6):
  aarch64: Extend %R for integer registers
  aarch64: Implement TImode compare-and-swap
  aarch64: Tidy aarch64_split_compare_and_swap
  aarch64: Add out-of-line functions for LSE atomics
  aarch64: Implement -matomic-ool
  Enable -matomic-ool by default

 gcc/config/aarch64/aarch64-protos.h          |  13 +
 gcc/common/config/aarch64/aarch64-common.c   |   6 +-
 gcc/config/aarch64/aarch64.c                 | 211
 .../atomic-comp-swap-release-acquire.c       |   2 +-
 .../gcc.target/aarch64/atomic-op-acq_rel.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-acquire.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-char.c      |   2 +-
 .../gcc.target/aarch64/atomic-op-consume.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-imm.c       |   2 +-
 .../gcc.target/aarch64/atomic-op-int.c       |   2 +-
 .../gcc.target/aarch64/atomic-op-long.c      |   2 +-
 .../gcc.target/aarch64/atomic-op-relaxed.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-release.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-seq_cst.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-short.c     |   2 +-
 .../aarch64/atomic_cmp_exchange_zero_reg_1.c |   2 +-
 .../atomic_cmp_exchange_zero_strong_1.c      |   2 +-
 .../gcc.target/aarch64/sync-comp-swap.c      |   2 +-
 .../gcc.target/aarch64/sync-op-acquire.c     |   2 +-
 .../gcc.target/aarch64/sync-op-full.c        |   2 +-
 libgcc/config/aarch64/lse-init.c             |  45
 gcc/config/aarch64/aarch64.opt               |   4 +
 gcc/config/aarch64/atomics.md                | 185 +-
 gcc/config/aarch64/iterators.md              |   3 +
 gcc/doc/invoke.texi                          |  14 +-
 libgcc/config.host                           |   4 +
 libgcc/config/aarch64/lse.S                  | 238 ++
 libgcc/config/aarch64/t-lse                  |  44
 28 files changed, 717 insertions(+), 84 deletions(-)
 create mode 100644 libgcc/config/aarch64/lse-init.c
 create mode 100644 libgcc/config/aarch64/lse.S
 create mode 100644 libgcc/config/aarch64/t-lse

-- 
2.17.2