Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line
On 9/17/19 6:55 AM, Wilco Dijkstra wrote:
> Hi Kyrill,
>
>>> When you select a CPU the goal is that we optimize and schedule for that
>>> specific microarchitecture. That implies using atomics that work best for
>>> that core rather than outlining them.
>>
>> I think we want to go ahead with this framework to enable the portable
>> deployment of LSE atomics.
>>
>> More CPU-specific fine-tuning can come later separately.
>
> I'm not talking about CPU-specific fine-tuning, but ensuring we don't penalize
> performance when a user selects the specific CPU their application will run
> on.
> And in that case outlining is unnecessary.

From aarch64_override_options:

Given both -march=foo -mcpu=bar, then the architecture will be foo and
-mcpu will be treated as -mtune=bar, but will not use any insn not in foo.

Given only -mcpu=foo, then the architecture will be the one supported by foo.

So if foo supports LSE, then we will not outline the functions, no matter how
we arrive at foo.

r~
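The rule described above can be sketched in C. This is only a model of the prose, not the code in aarch64_override_options: the struct target_model type and the effective_arch and should_outline_atomics functions are hypothetical names introduced for illustration, and treating "no option given" as base ARMv8.0 is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of an -march or -mcpu argument: a name plus
   whether the architecture it implies includes LSE.  */
struct target_model {
    const char *name;
    bool has_lse;
};

/* Pick the architecture that bounds instruction selection.
   An explicit -march wins; -mcpu then behaves like -mtune only.
   With only -mcpu, the architecture is the one that cpu supports.
   Returns NULL when neither option was given.  */
static const struct target_model *
effective_arch(const struct target_model *march,
               const struct target_model *mcpu)
{
    if (march != NULL)
        return march;
    return mcpu;
}

/* Outline atomics only when the effective architecture lacks LSE.
   Assumption: no option at all means base ARMv8.0, hence outlining.  */
static bool
should_outline_atomics(const struct target_model *march,
                       const struct target_model *mcpu)
{
    const struct target_model *arch = effective_arch(march, mcpu);
    return arch == NULL || !arch->has_lse;
}
```

Under this model, selecting only a cpu whose architecture has LSE yields no outlining, while combining a base ARMv8.0 -march with the same cpu keeps the outlined helpers, since the cpu then only affects tuning.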
Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line
Hi Kyrill,

>> When you select a CPU the goal is that we optimize and schedule for that
>> specific microarchitecture. That implies using atomics that work best for
>> that core rather than outlining them.
>
> I think we want to go ahead with this framework to enable the portable
> deployment of LSE atomics.
>
> More CPU-specific fine-tuning can come later separately.

I'm not talking about CPU-specific fine-tuning, but ensuring we don't penalize
performance when a user selects the specific CPU their application will run
on. And in that case outlining is unnecessary.

Cheers,
Wilco
Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line
On 9/16/19 12:58 PM, Wilco Dijkstra wrote:
> Hi Richard,
>
>>> So what is the behaviour when you explicitly select a specific CPU?
>>
>> Selecting a specific cpu selects the specific architecture that the cpu
>> supports, does it not? Thus the architecture example above still applies.
>>
>> Unless I don't understand what distinction that you're making?
>
> When you select a CPU the goal is that we optimize and schedule for that
> specific microarchitecture. That implies using atomics that work best for
> that core rather than outlining them.

I think we want to go ahead with this framework to enable the portable
deployment of LSE atomics.

More CPU-specific fine-tuning can come later separately.

Thanks,
Kyrill

>>> I'd say that by the time GCC10 is released and used in distros, systems
>>> without LSE atomics would be practically non-existent. So we should favour
>>> LSE atomics by default.
>>
>> I suppose. Does it not continue to be true that an a53 is more impacted by
>> the branch prediction than an a76?
>
> That's hard to say for sure - the cost of taken branches (3 in just a few
> instructions for the outlined atomics) might well affect big/wide cores more.
>
> Also note Cortex-A55 (successor of Cortex-A53) has LSE atomics.
>
> Wilco
Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line
Hi Richard,

>> So what is the behaviour when you explicitly select a specific CPU?
>
> Selecting a specific cpu selects the specific architecture that the cpu
> supports, does it not? Thus the architecture example above still applies.
>
> Unless I don't understand what distinction that you're making?

When you select a CPU the goal is that we optimize and schedule for that
specific microarchitecture. That implies using atomics that work best for
that core rather than outlining them.

>> I'd say that by the time GCC10 is released and used in distros, systems
>> without LSE atomics would be practically non-existent. So we should favour
>> LSE atomics by default.
>
> I suppose. Does it not continue to be true that an a53 is more impacted by
> the branch prediction than an a76?

That's hard to say for sure - the cost of taken branches (3 in just a few
instructions for the outlined atomics) might well affect big/wide cores more.

Also note Cortex-A55 (successor of Cortex-A53) has LSE atomics.

Wilco
Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line
On 9/5/19 10:35 AM, Wilco Dijkstra wrote:
> Agreed. I've got a couple of general comments:
>
> * The option name -matomic-ool sounds too abbreviated. I think eg.
> -moutline-atomics is more descriptive and user friendlier.

Changed.

> * Similarly the exported __aa64_have_atomics variable could be named
> __aarch64_have_lse_atomics so it's clear that it is about LSE atomics.

Changed.

> +@item -matomic-ool
> +@itemx -mno-atomic-ool
> +Enable or disable calls to out-of-line helpers to implement atomic operations.
> +These helpers will, at runtime, determine if ARMv8.1-Atomics instructions
> +should be used; if not, they will use the load/store-exclusive instructions
> +that are present in the base ARMv8.0 ISA.
> +
> +This option is only applicable when compiling for the base ARMv8.0
> +instruction set. If using a later revision, e.g. @option{-march=armv8.1-a}
> +or @option{-march=armv8-a+lse}, the ARMv8.1-Atomics instructions will be
> +used directly.
>
> So what is the behaviour when you explicitly select a specific CPU?

Selecting a specific cpu selects the specific architecture that the cpu
supports, does it not? Thus the architecture example above still applies.

Unless I don't understand what distinction that you're making?

> +/* Branch to LABEL if LSE is enabled.
> + The branch should be easily predicted, in that it will, after constructors,
> + always branch the same way. The expectation is that systems that implement
> + ARMv8.1-Atomics are "beefier" than those that omit the extension.
> + By arranging for the fall-through path to use load-store-exclusive insns,
> + we aid the branch predictor of the smallest cpus. */
>
> I'd say that by the time GCC10 is released and used in distros, systems
> without LSE atomics would be practically non-existent. So we should favour
> LSE atomics by default.

I suppose. Does it not continue to be true that an a53 is more impacted by the
branch prediction than an a76?

r~
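For readers unfamiliar with how a flag such as __aarch64_have_lse_atomics might get its value "after constructors", here is a minimal sketch. It assumes Linux and the getauxval interface with the HWCAP_ATOMICS bit; it is not the contents of lse-init.c from the patch set. The names model_have_lse_atomics, detect_lse_atomics and init_model_have_lse_atomics are hypothetical, and on any other platform the sketch simply reports that LSE is unavailable.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP, HWCAP_ATOMICS */
#endif

/* Hypothetical stand-in for the exported flag discussed in the thread.  */
static bool model_have_lse_atomics;

/* Ask the kernel whether the CPU implements the ARMv8.1 LSE atomics.
   Assumption: anything that is not AArch64 Linux reports false.  */
static bool detect_lse_atomics(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ATOMICS)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    return false;
#endif
}

/* Runs before main, so the flag is settled by the time user code runs
   and every later test of it branches the same way.  */
__attribute__((constructor))
static void init_model_have_lse_atomics(void)
{
    model_have_lse_atomics = detect_lse_atomics();
}
```

Because the flag is written once and never changes afterwards, the branch on it in each helper should be trivially predictable, which is the property the quoted comment relies on.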
Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line
Hi Richard,

> What I have not done, but is now a possibility, is to use a custom
> calling convention for the out-of-line routines. I now only clobber
> 2 (or 3, for TImode) temp regs and set a return value.

This would be a great feature to have since it reduces the overhead of
outlining considerably.

> I think this patch series would be great to have for GCC 10!

Agreed. I've got a couple of general comments:

* The option name -matomic-ool sounds too abbreviated. I think eg.
-moutline-atomics is more descriptive and user friendlier.

* Similarly the exported __aa64_have_atomics variable could be named
__aarch64_have_lse_atomics so it's clear that it is about LSE atomics.

+@item -matomic-ool
+@itemx -mno-atomic-ool
+Enable or disable calls to out-of-line helpers to implement atomic operations.
+These helpers will, at runtime, determine if ARMv8.1-Atomics instructions
+should be used; if not, they will use the load/store-exclusive instructions
+that are present in the base ARMv8.0 ISA.
+
+This option is only applicable when compiling for the base ARMv8.0
+instruction set. If using a later revision, e.g. @option{-march=armv8.1-a}
+or @option{-march=armv8-a+lse}, the ARMv8.1-Atomics instructions will be
+used directly.

So what is the behaviour when you explicitly select a specific CPU?

+/* Branch to LABEL if LSE is enabled.
+ The branch should be easily predicted, in that it will, after constructors,
+ always branch the same way. The expectation is that systems that implement
+ ARMv8.1-Atomics are "beefier" than those that omit the extension.
+ By arranging for the fall-through path to use load-store-exclusive insns,
+ we aid the branch predictor of the smallest cpus. */

I'd say that by the time GCC10 is released and used in distros, systems
without LSE atomics would be practically non-existent. So we should favour
LSE atomics by default.

Cheers,
Wilco
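The shape of an out-of-line helper, and the fall-through question raised above, can be modelled in portable C. This is a sketch only: the real helpers in the series are written in assembly, and model_have_lse and model_fetch_add are hypothetical names. The single-instruction path is represented by the __atomic_fetch_add builtin and the load/store-exclusive path by a compare-and-swap retry loop, which is an analogy rather than the actual instruction sequence.

```c
#include <stdbool.h>

/* Hypothetical flag, assumed to be set once at startup.  */
bool model_have_lse = false;

/* Model of one out-of-line helper: relaxed fetch-and-add that returns
   the previous value.  */
unsigned int model_fetch_add(unsigned int *ptr, unsigned int val)
{
    /* Expecting 0 here asks the compiler to lay out the retry loop as
       the likely (fall-through) path, mirroring the comment in the
       patch.  Favouring LSE by default, as suggested above, would
       correspond to expecting 1 instead.  */
    if (__builtin_expect(model_have_lse, 0)) {
        /* Stand-in for a single LSE instruction.  */
        return __atomic_fetch_add(ptr, val, __ATOMIC_RELAXED);
    }

    /* Stand-in for a load-exclusive/store-exclusive loop: reload and
       retry until the update is applied without interference.  */
    unsigned int old = __atomic_load_n(ptr, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(ptr, &old, old + val,
                                        true /* weak */,
                                        __ATOMIC_RELAXED,
                                        __ATOMIC_RELAXED)) {
        /* On failure, old has been refreshed with the current value.  */
    }
    return old;
}
```

Both paths return the same result for the same inputs; only the instruction sequence differs, which is why the choice of fall-through path is purely a performance question.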
Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line
Hi Richard,

On 11/1/18 9:46 PM, Richard Henderson wrote:
> From: Richard Henderson
>
> Changes since v2:
> * Committed half of the patch set.
> * Split inline TImode support from out-of-line patches.
> * Removed the ST out-of-line functions, to match inline.
> * Moved the out-of-line functions to assembly.
>
> What I have not done, but is now a possibility, is to use a custom
> calling convention for the out-of-line routines. I now only clobber
> 2 (or 3, for TImode) temp regs and set a return value.

I think this patch series would be great to have for GCC 10!

I've rebased them on current trunk and fixed up a couple of minor conflicts
in my local tree.

After that, I've encountered a couple of issues with building a compiler with
these patches. I'll respond to the individual patches that I think cause the
trouble.

Thanks,
Kyrill

> r~
>
> Richard Henderson (6):
>   aarch64: Extend %R for integer registers
>   aarch64: Implement TImode compare-and-swap
>   aarch64: Tidy aarch64_split_compare_and_swap
>   aarch64: Add out-of-line functions for LSE atomics
>   aarch64: Implement -matomic-ool
>   Enable -matomic-ool by default
>
>  gcc/config/aarch64/aarch64-protos.h          |  13 +
>  gcc/common/config/aarch64/aarch64-common.c   |   6 +-
>  gcc/config/aarch64/aarch64.c                 | 211
>  .../atomic-comp-swap-release-acquire.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-acq_rel.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-acquire.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-char.c      |   2 +-
>  .../gcc.target/aarch64/atomic-op-consume.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-imm.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-int.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-long.c      |   2 +-
>  .../gcc.target/aarch64/atomic-op-relaxed.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-release.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-seq_cst.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-short.c     |   2 +-
>  .../aarch64/atomic_cmp_exchange_zero_reg_1.c |   2 +-
>  .../atomic_cmp_exchange_zero_strong_1.c      |   2 +-
>  .../gcc.target/aarch64/sync-comp-swap.c      |   2 +-
>  .../gcc.target/aarch64/sync-op-acquire.c     |   2 +-
>  .../gcc.target/aarch64/sync-op-full.c        |   2 +-
>  libgcc/config/aarch64/lse-init.c             |  45
>  gcc/config/aarch64/aarch64.opt               |   4 +
>  gcc/config/aarch64/atomics.md                | 185 +-
>  gcc/config/aarch64/iterators.md              |   3 +
>  gcc/doc/invoke.texi                          |  14 +-
>  libgcc/config.host                           |   4 +
>  libgcc/config/aarch64/lse.S                  | 238 ++
>  libgcc/config/aarch64/t-lse                  |  44
>  28 files changed, 717 insertions(+), 84 deletions(-)
>  create mode 100644 libgcc/config/aarch64/lse-init.c
>  create mode 100644 libgcc/config/aarch64/lse.S
>  create mode 100644 libgcc/config/aarch64/t-lse
>
> --
> 2.17.2
Re: [PATCH, AArch64, v3 0/6] LSE atomics out-of-line
Ping.

On 11/1/18 10:46 PM, Richard Henderson wrote:
> From: Richard Henderson
>
> Changes since v2:
> * Committed half of the patch set.
> * Split inline TImode support from out-of-line patches.
> * Removed the ST out-of-line functions, to match inline.
> * Moved the out-of-line functions to assembly.
>
> What I have not done, but is now a possibility, is to use a custom
> calling convention for the out-of-line routines. I now only clobber
> 2 (or 3, for TImode) temp regs and set a return value.
>
>
> r~
>
>
> Richard Henderson (6):
>   aarch64: Extend %R for integer registers
>   aarch64: Implement TImode compare-and-swap
>   aarch64: Tidy aarch64_split_compare_and_swap
>   aarch64: Add out-of-line functions for LSE atomics
>   aarch64: Implement -matomic-ool
>   Enable -matomic-ool by default
>
>  gcc/config/aarch64/aarch64-protos.h          |  13 +
>  gcc/common/config/aarch64/aarch64-common.c   |   6 +-
>  gcc/config/aarch64/aarch64.c                 | 211
>  .../atomic-comp-swap-release-acquire.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-acq_rel.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-acquire.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-char.c      |   2 +-
>  .../gcc.target/aarch64/atomic-op-consume.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-imm.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-int.c       |   2 +-
>  .../gcc.target/aarch64/atomic-op-long.c      |   2 +-
>  .../gcc.target/aarch64/atomic-op-relaxed.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-release.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-seq_cst.c   |   2 +-
>  .../gcc.target/aarch64/atomic-op-short.c     |   2 +-
>  .../aarch64/atomic_cmp_exchange_zero_reg_1.c |   2 +-
>  .../atomic_cmp_exchange_zero_strong_1.c      |   2 +-
>  .../gcc.target/aarch64/sync-comp-swap.c      |   2 +-
>  .../gcc.target/aarch64/sync-op-acquire.c     |   2 +-
>  .../gcc.target/aarch64/sync-op-full.c        |   2 +-
>  libgcc/config/aarch64/lse-init.c             |  45
>  gcc/config/aarch64/aarch64.opt               |   4 +
>  gcc/config/aarch64/atomics.md                | 185 +-
>  gcc/config/aarch64/iterators.md              |   3 +
>  gcc/doc/invoke.texi                          |  14 +-
>  libgcc/config.host                           |   4 +
>  libgcc/config/aarch64/lse.S                  | 238 ++
>  libgcc/config/aarch64/t-lse                  |  44
>  28 files changed, 717 insertions(+), 84 deletions(-)
>  create mode 100644 libgcc/config/aarch64/lse-init.c
>  create mode 100644 libgcc/config/aarch64/lse.S
>  create mode 100644 libgcc/config/aarch64/t-lse
>
[PATCH, AArch64, v3 0/6] LSE atomics out-of-line
From: Richard Henderson

Changes since v2:
* Committed half of the patch set.
* Split inline TImode support from out-of-line patches.
* Removed the ST out-of-line functions, to match inline.
* Moved the out-of-line functions to assembly.

What I have not done, but is now a possibility, is to use a custom
calling convention for the out-of-line routines. I now only clobber
2 (or 3, for TImode) temp regs and set a return value.


r~


Richard Henderson (6):
  aarch64: Extend %R for integer registers
  aarch64: Implement TImode compare-and-swap
  aarch64: Tidy aarch64_split_compare_and_swap
  aarch64: Add out-of-line functions for LSE atomics
  aarch64: Implement -matomic-ool
  Enable -matomic-ool by default

 gcc/config/aarch64/aarch64-protos.h          |  13 +
 gcc/common/config/aarch64/aarch64-common.c   |   6 +-
 gcc/config/aarch64/aarch64.c                 | 211
 .../atomic-comp-swap-release-acquire.c       |   2 +-
 .../gcc.target/aarch64/atomic-op-acq_rel.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-acquire.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-char.c      |   2 +-
 .../gcc.target/aarch64/atomic-op-consume.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-imm.c       |   2 +-
 .../gcc.target/aarch64/atomic-op-int.c       |   2 +-
 .../gcc.target/aarch64/atomic-op-long.c      |   2 +-
 .../gcc.target/aarch64/atomic-op-relaxed.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-release.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-seq_cst.c   |   2 +-
 .../gcc.target/aarch64/atomic-op-short.c     |   2 +-
 .../aarch64/atomic_cmp_exchange_zero_reg_1.c |   2 +-
 .../atomic_cmp_exchange_zero_strong_1.c      |   2 +-
 .../gcc.target/aarch64/sync-comp-swap.c      |   2 +-
 .../gcc.target/aarch64/sync-op-acquire.c     |   2 +-
 .../gcc.target/aarch64/sync-op-full.c        |   2 +-
 libgcc/config/aarch64/lse-init.c             |  45
 gcc/config/aarch64/aarch64.opt               |   4 +
 gcc/config/aarch64/atomics.md                | 185 +-
 gcc/config/aarch64/iterators.md              |   3 +
 gcc/doc/invoke.texi                          |  14 +-
 libgcc/config.host                           |   4 +
 libgcc/config/aarch64/lse.S                  | 238 ++
 libgcc/config/aarch64/t-lse                  |  44
 28 files changed, 717 insertions(+), 84 deletions(-)
 create mode 100644 libgcc/config/aarch64/lse-init.c
 create mode 100644 libgcc/config/aarch64/lse.S
 create mode 100644 libgcc/config/aarch64/t-lse

--
2.17.2