[Bug c++/110619] Dangling pointer returned from constexpr function converts in nullptr

2023-08-06 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110619

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #7 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #2)
> >but it is not nullptr.
> 
> Or is it just undefined so it could be considered a nullptr ...


Implementation-defined behaviour, according to answers on
https://stackoverflow.com/questions/76843246/why-does-the-address-of-an-out-of-scope-variable-equal-zero-with-constexpr


https://eel.is/c++draft/basic.compound#def:value,invalid_pointer

https://eel.is/c++draft/basic.stc.general#4

>  Indirection through an invalid pointer value and passing an invalid pointer 
> value to a deallocation function have undefined behavior.
> **Any other use of an invalid pointer value has implementation-defined 
> behavior.**

So this wasn't a bug, but the new behaviour is also allowed.

This commit could be reverted or kept, depending on maintainability and/or
quality-of-life for users of GCC.  Having it pick the other
implementation-defined behaviour from clang (GCC's previous behaviour) is maybe
a *good* thing, to help programmers catch dependence on an invalid pointer
being either null or non-null if they try their code with both compilers.
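For reference, a minimal sketch of the kind of code in question (my own
illustration, not copied from the bug's testcase):

constexpr const int *get() {
    int local = 0;
    return &local;        // invalid (dangling) pointer value once the call returns
}

int main() {
    const int *p = get();
    return p == nullptr;  // implementation-defined: newer GCC may fold the dangling
                          // pointer to null, while clang keeps a non-null stale address
}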

[Bug middle-end/108441] [12 Regression] Maybe missed optimization: loading an 16-bit integer value from .rodata instead of an immediate store

2023-01-18 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441

--- Comment #4 from Peter Cordes  ---
This is already fixed in current trunk; sorry I forgot to check that before
recommending to report this store-coalescing bug.

# https://godbolt.org/z/j3MdWrcWM
# GCC nightly -O3   (tune=generic)  and GCC11
store:
movl    $16, %eax
movw    %ax, ldap(%rip)
ret

In case anyone's wondering why GCC doesn't just emit  movw $16, foo(%rip)  directly:
it's avoiding LCP stalls on Intel P6-family CPUs from the 16-bit immediate.

For MOV specifically, that only happens on P6-family (Nehalem and earlier), not
Sandybridge-family, so it's getting close to time to drop it from
-mtune=generic.  (-mtune= bdver* or znver* don't do it, so there is a tuning
setting controlling it)

GCC *only* seems to know about MOV, so ironically with -march=skylake for
example, we avoid a non-existent LCP stall for mov to memory, but GCC compiles
x += 1234  into code that will LCP stall:  addw $1234, x(%rip).
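For the record, a sketch of the source pattern that still stalls (assuming x is a
16-bit global, as described above):

unsigned short x;
void bump(void) {
    x += 1234;   /* gcc -O3 -march=skylake: addw $1234, x(%rip) */
}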

-march=alderlake disables this tuning workaround, using movw $imm, mem.  (The
Silvermont-family E-cores in Alder Lake don't have this problem either, so
that's correct.  Agner Fog's guide didn't mention any changes in LCP stalls for
Alder Lake.)

Avoiding LCP stalls is somewhat less important on CPUs with a uop cache, since
it only happens during legacy decode.  However, various things can force code to
run from legacy decode even inside a loop, such as Skylake's JCC-erratum
microcode mitigation when users don't assemble with the GAS option that works
around it (which GCC doesn't pass by default with -march=skylake).

If there isn't already a bug open about tuning choices mismatching hardware, I
can repost this as a new bug if you'd like.


Related:
https://stackoverflow.com/questions/75154687/is-this-a-missed-optimization-in-gcc-loading-an-16-bit-integer-value-from-roda

and
https://stackoverflow.com/questions/70719114/why-does-the-short-16-bit-variable-mov-a-value-to-a-register-and-store-that-u

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #27 from Peter Cordes  ---
(In reply to Alexander Monakov from comment #26)
> Sure, the right course of action seems to be to simply document that atomic
> types and built-ins are meant to be used on "common" (writeback) memory

Agreed.  Where in the manual should this go?  Maybe a new subsection of the
chapter about __atomic builtins where we document per-ISA requirements for them
to actually work?

e.g. x86 memory-type stuff, and that ARM assumes all cores are in the same
inner-shareable cache-coherency domain, thus barriers are  dmb ish  rather than
dmb sy,  and so on.
I guess we might want to avoid documenting the actual asm implementation
strategies in the main manual, because that would imply it's supported to make
assumptions based on that.

Putting it near the __atomic docs might make it easier for readers to notice
that the list of requirements exists, vs. scattering them into different pages
for different ISAs.  And we don't currently have any section in the manual
about per-ISA quirks or requirements, just about command-line options,
builtins, and attributes that are per-ISA, so there's no existing page where
this could get tacked on.

This would also be a place where we can document that __atomic ops are
address-free when they're lock-free, and thus usable on shared memory between
processes.  ISO C++ says that *should* be the case for std::atomic, but
doesn't standardize the existence of multiple processes.
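A minimal sketch (not from this bug) of the address-free property in action: two
processes share an anonymous mapping and use C11 atomics (which lower to the same
__atomic builtins) on it, which only works reliably because the lock-free
implementation doesn't depend on any per-process state.

#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    _Atomic int *counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (fork() == 0) {                /* child increments the shared counter */
        atomic_fetch_add(counter, 1);
        _exit(0);
    }
    wait(NULL);
    return atomic_load(counter);      /* parent observes the child's store */
}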

To avoid undue worry, documentation about this should probably start by saying
that normal programs (running under mainstream OSes) don't have to worry about
it or do anything special.

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

--- Comment #25 from Peter Cordes  ---
(In reply to Alexander Monakov from comment #24)
> 
> I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
> OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device
> memory (and then CPU vendor cannot guarantee that 128b access won't tear
> because it might depend on downstream devices).


Even atomic_int doesn't work properly if you deref a pointer to WC memory.  WC
doesn't have the same ordering guarantees, so it would break acquire/release
semantics.
So we already don't support WC for this.

We do at least de-facto support atomics on UC memory because the ordering
guarantees are a superset of cacheable memory, and 8-byte atomicity for aligned
load/store is guaranteed even for non-cacheable memory types since P5 Pentium
(and on AMD).  (And lock cmpxchg16b is always atomic even on UC memory.)

But you're right that only Intel guarantees that 16-byte VMOVDQA loads/stores
would be atomic on UC memory.  So this change could break that very unwise
corner-case on AMD, which only guarantees it for cacheable loads/stores, and on
Zhaoxin, which only guarantees it for WB.

But was anyone previously using 16-byte atomics on UC device memory?  Do we
actually care about supporting that?  I'd guess no and no, so it's just a
matter of documenting that somewhere.

Since GCC7 we've reported 16-byte atomics as being non-lock-free, so I *hope*
people weren't using __atomic_store_n on device memory.  The underlying
implementation was never guaranteed.

[Bug target/104688] gcc and libatomic can use SSE for 128-bit atomic loads on Intel and AMD CPUs with AVX

2022-11-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #23 from Peter Cordes  ---
(In reply to Xi Ruoyao from comment #20)
> "On Zhaoxin CPUs with AVX, the VMOVDQA instruction is atomic if the accessed
> memory is Write Back, but it's not guaranteed for other memory types."

VMOVDQA is still fine, I think WB is the only memory type that's relevant for
atomics, at least on the mainstream OSes we compile for.  It's not normally
possible for user-space to allocate memory of other types.  Kernels normally
use WB memory for their shared data, too.

You're correct that WT and WP are the other two cacheable memory types, and
Zhaoxin's statement doesn't explicitly guarantee atomicity for those, unlike
Intel and AMD.

But at least on Linux, I don't think there's a way for user-space to even ask
for a page of WT or WP memory (or UC or WC).  Only WB memory is easily
available without hacking the kernel.  As far as I know, this is true on other
existing OSes.

WT = write-through: read caching, no write-allocate.  Write hits update the
line and memory.
WP = write-protect: read caching, no write-allocate.  Writes go around the
cache, evicting even on hit.
(https://stackoverflow.com/questions/65953033/whats-the-usecase-of-write-protected-pat-memory-type
quotes the Intel definitions.)

Until recently, the main work on formalizing the x86 TSO memory model had only
looked at WB memory.
A 2022 paper looked at WT, UC, and WC memory types:
https://dl.acm.org/doi/pdf/10.1145/3498683 - "Extending Intel-x86 Consistency and
Persistency: Formalising the Semantics of Intel-x86 Memory Types and Non-temporal Stores"
(The intro part describing memory types is quite readable, in plain English not
full of formal symbols.  They only mention WP once, but tested some litmus
tests with readers and writers using any combination of the other memory
types.)


Some commenters on my answer on when WT is ever used or useful confirmed that
mainstream OSes don't give easy access to it.
https://stackoverflow.com/questions/61129142/when-use-write-through-cache-policy-for-pages/61130838#61130838
* Linux has never merged a patch to let user-space allocate WT pages.
* The Windows kernel reportedly doesn't have a mechanism to keep track of pages
that should be WT or WP, so you won't find any.

I don't know whether any *BSD makes it plausible for user-space to point an _Atomic
int * at a page of WT or WP memory.  I'd guess not.

I don't know if there's anywhere we can document that _Atomic objects need to
be in memory that's allocated in a "normal" way.  Probably hard to word without
accidentally disallowing something that's fine.

[Bug tree-optimization/106138] Inefficient code generation: logical AND of disjoint booleans from equal and bitwise AND not optimized to constant false

2022-06-30 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106138

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #3 from Peter Cordes  ---
Ideally, bitwise & of booleans should also be handled, not just &&.
A testcase (https://godbolt.org/z/qvosv8q7c) makes it easy to check both.

//#define LOGIC_AND 
_Bool f2(char x)
{
    _Bool b1 = x == 2;
    _Bool b2 = x & 1;

#ifdef LOGIC_AND
  return b1 && b2;
#else
  return b1 & b2;
#endif
}

(Clang optimized it to return false for the && version, but not bitwise.  GCC
currently doesn't optimize either way.)
This was originally posted on Stack Overflow
(https://stackoverflow.com/q/72802469/224132), BTW.

[Bug target/105929] New: [AArch64] armv8.4-a allows atomic stp. 64-bit constants can use 2 32-bit halves with _Atomic or volatile

2022-06-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105929

Bug ID: 105929
   Summary: [AArch64] armv8.4-a allows atomic stp. 64-bit
constants can use 2 32-bit halves with _Atomic or
volatile
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: arm64-*-*

void foo(unsigned long *p) {
    *p = 0xdeadbeefdeadbeef;
}
// compiles nicely:  https://godbolt.org/z/8zf8ns14K
mov w1, 48879
movk    w1, 0xdead, lsl 16
stp w1, w1, [x0]
ret

But even with -Os -march=armv8.4-a   the following doesn't:
void foo_atomic(_Atomic unsigned long *p) {
    __atomic_store_n(p, 0xdeadbeefdeadbeef, __ATOMIC_RELAXED);
}

mov x1, 48879
movk    x1, 0xdead, lsl 16
movk    x1, 0xbeef, lsl 32
movk    x1, 0xdead, lsl 48
stlr    x1, [x0]
ret

ARMv8.4-a and later guarantees atomicity for aligned ldp/stp, according to
ARM's architecture reference manual: ARM DDI 0487H.a - ID020222, so we could
use the same asm as the non-atomic version.

> If FEAT_LSE2 is implemented, LDP, LDNP, and STP instructions that access 
> fewer than 16 bytes are single-copy atomic when all of the following 
> conditions are true:
> • All bytes being accessed are within a 16-byte quantity aligned to 16 bytes.
> • Accesses are to Inner Write-Back, Outer Write-Back Normal cacheable memory

(FEAT_LSE2 is the same CPU feature that gives 128-bit atomicity for aligned
ldp/stp x,x,mem)

Prior to that, apparently it wasn't guaranteed that stp of 32-bit halves merged
into a single 64-bit store. So without -march=armv8.4-a it wasn't a missed
optimization to construct the constant in a single register for _Atomic or
volatile.

But with ARMv8.4, we should use MOV/MOVK + STP.

Since there doesn't seem to be a release-store version of STP, 64-bit release
and seq_cst stores should still generate the full constant in a register,
instead of using STP + barriers.


(Without ARMv8.4-a, or with a memory-order other than relaxed, see PR105928 for
generating 64-bit constants in 3 instructions instead of 4, at least for -Os,
with add x0, x0, x0, lsl 32)

[Bug target/105928] New: [AArch64] 64-bit constants with same high/low halves can use ADD lsl 32 (-Os at least)

2022-06-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928

Bug ID: 105928
   Summary: [AArch64] 64-bit constants with same high/low halves
can use ADD lsl 32 (-Os at least)
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: arm64-*-*

void foo(unsigned long *p) {
    *p = 0xdeadbeefdeadbeef;
}

cleverly compiles to https://godbolt.org/z/b3oqao5Kz

mov w1, 48879
movk    w1, 0xdead, lsl 16
stp w1, w1, [x0]
ret

But producing the value in a register uses more than 3 instructions:

unsigned long constant(){
    return 0xdeadbeefdeadbeef;
}

mov x0, 48879
movk    x0, 0xdead, lsl 16
movk    x0, 0xbeef, lsl 32
movk    x0, 0xdead, lsl 48
ret

At least with -Os, and maybe at -O2 or -O3 if it's efficient, we could be doing
a shifted ADD or ORR to broadcast a zero-extended 32-bit value to 64-bit.

mov x0, 48879
movk    x0, 0xdead, lsl 16
add x0, x0, x0, lsl 32

Some CPUs may fuse sequences of movk, and shifted operands for ALU ops may take
extra time in some CPUs, so this might not actually be optimal for performance,
but it is smaller for -Os and -Oz.
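The identity behind that sequence, written as C (illustration only; the add can
never carry because the low 32 bits of  x << 32  are zero):

#include <stdint.h>

uint64_t broadcast_low32(uint32_t lo) {
    uint64_t x = lo;          /* mov + movk build this 32-bit value */
    return x + (x << 32);     /* add x0, x0, x0, lsl 32 */
}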

We should also be using that trick for stores to _Atomic or volatile long*,
where we currently do MOV + 3x MOVK, then an STR, with ARMv8.4-a which
guarantees atomicity.


---

ARMv8.4-a and later guarantees atomicity for ldp/stp within an aligned 16-byte
chunk, so we should use MOV/MOVK / STP there even for volatile or
__ATOMIC_RELAXED.  But presumably that's a different part of GCC's internals,
so I'll report that separately.

[Bug tree-optimization/105904] New: Predicated mov r0, #1 with opposite conditions could be hoisted, between 1 and 1<

2022-06-09 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105904

Bug ID: 105904
   Summary: Predicated mov r0, #1 with opposite conditions could
be hoisted, between 1 and 1<<...

#include <bit>   // using the libstdc++ header
unsigned roundup(unsigned x){
    return std::bit_ceil(x);
}

https://godbolt.org/z/Px1fvWaex

GCC's version is somewhat clunky, including MOV r0, #1 in either "side":

roundup(unsigned int):
        cmp     r0, #1
        itttt   hi
        addhi   r3, r0, #-1
        movhi   r0, #1          @@ here
        clzhi   r3, r3
        rsbhi   r3, r3, #32
        ite     hi
        lslhi   r0, r0, r3
        movls   r0, #1          @@ here
        bx      lr

Even without spotting the other optimizations that clang finds, we can combine
to a single unconditional MOV r0, #1.  But only if we avoid setting flags, so
it requires a 4-byte encoding, not MOVS.  Still, it's one fewer instruction to
execute.

This is not totally trivial: it requires seeing that we can move it across the
conditional LSL.  So it's really a matter of folding the 1s between the 1<<... shift
and the early-out return 1.

[Bug tree-optimization/105596] Loop counter widened to 128-bit unnecessarily

2022-05-13 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

--- Comment #1 from Peter Cordes  ---
https://godbolt.org/z/aoG55T5Yq
gcc -O3 -m32 has the same problem with  unsigned long long total  and unsigned
i.

Pretty much identical instruction sequences in the loop for all 3 versions,
doing add/adc to increment i, for example.  (Plus a bit of spilling). 
fact_gcc_handhold still compiles without the unnecessary widening.

Perhaps this should be retitled to say the counter is widened to a "2-register type".

IDK how often this occurs in real-world loops with 64-bit and 32-bit integers on
32-bit machines, but that combination is probably a bigger concern in terms of
clock cycles wasted worldwide.

[Bug tree-optimization/105596] New: Loop counter widened to 128-bit unnecessarily

2022-05-13 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596

Bug ID: 105596
   Summary: Loop counter widened to 128-bit unnecessarily
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

For  total *= i  with a u128 total and a u32 loop counter, GCC pessimizes by
widening i and doing a full 128x128 => 128-bit multiply, and having to do a
128-bit increment and compare.

uint64_t i to make it a full register width doesn't help.

unsigned __int128 fact(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=2 ; i < n ; i++)
        total *= i;
    return total;
}
// 0! = 0  isn't mathematically correct, but that's not the point

https://godbolt.org/z/W4MW9b6T3  (gcc trunk 13.0.0 20220508 (experimental) and
clang 14, which makes efficient asm for all of these.)

# gcc -O3
fact:
        movl    %edi, %r9d
        xorl    %r11d, %r11d
        movq    %r9, %r10       # total = n  zext into  R11:R10
        cmpl    $2, %edi
        jbe     .L7             # if n<=2 return r11:r10
        movl    $2, %esi        # i = 2  in  RDI:RSI
        xorl    %edi, %edi
.L9:                            # do{
        movq    %r11, %rcx
        movq    %rdi, %rdx
        movq    %r10, %rax
        movq    %r9, %r8        # copy original n to destroy later
        imulq   %r10, %rdx      # 128x128 multiply with 2x imul, 1x widening mul
        imulq   %rsi, %rcx
        addq    %rdx, %rcx
        mulq    %rsi
        movq    %rdx, %r11      # update total in r11:r10
        movq    %rax, %r10
        addq    %rcx, %r11      # last partial product

        addq    $1, %rsi        # i++ as a 128-bit integer
        adcq    $0, %rdi
        xorq    %rsi, %r8       # r8 = n^i
        movq    %rdi, %rcx      # useless copy, we're already destroying r8
        orq     %r8, %rcx       # hi(i^n) | lo(i^n)
        jne     .L9             # }while(i != n);
.L7:
        movq    %r10, %rax
        movq    %r11, %rdx
        ret

So as well as creating extra work to do, it's not even doing it very
efficiently, with multiple unnecessary mov instructions.

This doesn't seem to be x86-64 specific.  It also compiles similarly for
AArch64 and MIPS64.  For some ISAs, I'm not sure if potentially-infinite loops
are making a difference, e.g. PowerPC is hard for me to read.  RV64 has three
multiply instructions in both versions.

I haven't tested a 32-bit equivalent with uint64_t total and uint32_t i.
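That untested 32-bit analogue would be something like this (a sketch, to be built
with -m32):

#include <stdint.h>

uint64_t fact32(uint32_t n) {
    uint64_t total = n;
    for (uint32_t i = 2; i < n; i++)
        total *= i;
    return total;
}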


This anti-optimization goes back to GCC4.6.  With GCC4.5 and earlier, the above
C compiles to a tight loop with the expected mul reg + imul reg,reg and 1
register loop counter: https://godbolt.org/z/6KheaqTx4  (using __uint128_t,
since unsigned __int128 wasn't supported on GCC4.4 or 4.1)

GCC 4.1 does an inefficient multiply, but one of the chunks is a freshly
xor-zeroed register.  It's still just incrementing and comparing a 32-bit loop
counter, but widening it for a 128x128-bit multiply recipe.  GCC4.4 optimizes
away the parts that are useless for the high 64 bits of (u128)i being zero.


-

A different version compiles efficiently with GCC6 and earlier, only becoming
slow like the above with GCC7 and later.

unsigned __int128 fact_downcount(unsigned n){
    unsigned __int128 total = n;
    for (unsigned i=n-1 ; i > 1 ; i--)
        total *= i;
    return total;  // 0! = 0 isn't mathematically correct
}


-

When the loop condition is possibly always-true, GCC can't prove the loop is
non-infinite, and as usual can't widen the loop counter.  In this case, that's
a good thing:

unsigned __int128 fact_gcc_handhold(unsigned n){
    unsigned __int128 total = 1;       // loop does do final n
    for (unsigned i=2 ; i <= n ; i++)  // potentially infinite loop defeats this pessimization
        total *= i;
    return total;  // fun fact:  0! = 1  is mathematically correct
}


fact_gcc_handhold:
        cmpl    $1, %edi
        jbe     .L4
        movl    $2, %ecx        # i = 2  in ECX
        movl    $1, %eax        # total = 1  in RDX:RAX
        xorl    %edx, %edx
.L3:                            # do{
        movl    %ecx, %esi      # copy i instead of just incrementing it later :/

        movq    %rdx, %r8       # save high half of total
        addl    $1, %ecx        # i++
        imulq   %rsi, %r8       # lo x hi cross product
        mulq    %rsi            # lo x lo widening
        addq    %r8, %rdx       # 128x64-bit multiply

        cmpl    %ecx, %edi
        jnb     .L3             # }while(i < n)
        ret

Allocating total in RDX:RAX is nice, putting the lo part where we need it for
mulq anyway.

[Bug target/65146] alignment of _Atomic structure member is not correct

2022-04-27 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146

--- Comment #25 from Peter Cordes  ---
(In reply to CVS Commits from comment #24)
> The master branch has been updated by Jakub Jelinek :
> 
> https://gcc.gnu.org/g:04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6
> 
> commit r11-2909-g04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6
> Author: Jakub Jelinek 
> Date:   Thu Aug 27 18:44:40 2020 +0200
> 
> ia32: Fix alignment of _Atomic fields [PR65146]
> 
> For _Atomic fields, lowering the alignment of long long or double etc.
> fields on ia32 is undesirable, because then one really can't perform
> atomic
> operations on those using cmpxchg8b.


Just for the record, the description of this bugfix incorrectly mentioned
cmpxchg8b being a problem.  lock cmpxchg8b is *always* atomic, even if that
means the CPU has to take a bus lock (disastrously expensive affecting all
cores system-wide) instead of just delaying MESI response for one line
exclusively owned in this core's private cache (aka cache lock).

The correctness problem is __atomic_load_n / __atomic_store_n compiling to
actual 8-byte pure loads / pure stores using SSE2 movq, SSE1 movlps, or x87
fild/fistp (bouncing through the stack), such as

  movq  %xmm0, (%eax)

That's where correctness depends on Intel and AMD's atomicity guarantees which
are conditional on alignment.

(And if AVX is supported, same deal for 16-byte load/store.  Although we can
and should use movaps for that, which bakes alignment checking into the
instruction.  Intel did recently document that CPUs with AVX guarantee
atomicity of 16-byte aligned loads/stores, retroactive to all CPUs with AVX. 
It's about time, but yay.)
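A sketch of the situation the fix addresses (my example, not from the commit): with
-m32, keeping an _Atomic 8-byte field naturally aligned is what makes a pure 8-byte
load/store (e.g. SSE2 movq) a valid lock-free implementation.

#include <stdatomic.h>
#include <stdint.h>

struct S {
    char c;
    _Atomic int64_t x;   /* stays 8-byte aligned even on ia32 after the fix */
};

int64_t load_x(struct S *s) {
    /* can compile to a single 8-byte load instead of lock cmpxchg8b */
    return atomic_load_explicit(&s->x, memory_order_relaxed);
}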

> Not sure about iamcu_alignment change, I know next to nothing about IA
> MCU,
> but unless it doesn't have cmpxchg8b instruction, it would surprise me
> if we
> don't want to do it as well.


I had to google iamcu.  Apparently it's Pentium-like, but only has soft-FP (so
I assume no MMX or SSE as well as no x87).

If that leaves it no way to do 8-byte load/store except (lock) cmpxchg8b, that
may mean there's no need for alignment, unless cache-line-split lock is still a
performance issue.  If it's guaranteed unicore as well, we can even omit the
lock prefix and cmpxchg8b will still be an atomic RMW (or load or store) wrt.
interrupts.  (And being unicore would likely mean much less system-wide
overhead for a split lock.)

[Bug target/82261] x86: missing peephole for SHLD / SHRD

2022-04-09 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261

--- Comment #4 from Peter Cordes  ---
GCC will emit SHLD / SHRD as part of shifting an integer that's two registers
wide.
Hironori Bono proposed the following functions as a workaround for this missed
optimization (https://stackoverflow.com/a/71805063/224132)

#include <stdint.h>

#ifdef __SIZEOF_INT128__
uint64_t shldq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)(((((unsigned __int128)high << 64) | (unsigned __int128)low)
                     << (count & 63)) >> 64);
}

uint64_t shrdq_x64(uint64_t low, uint64_t high, uint64_t count) {
  return (uint64_t)((((unsigned __int128)high << 64) | (unsigned __int128)low)
                    >> (count & 63));
}
#endif

uint32_t shld_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)(((((uint64_t)high << 32) | (uint64_t)low) << (count & 31))
                    >> 32);
}

uint32_t shrd_x86(uint32_t low, uint32_t high, uint32_t count) {
  return (uint32_t)((((uint64_t)high << 32) | (uint64_t)low) >> (count & 31));
}

---

The uint64_t functions (using __int128) compile cleanly in 64-bit mode
(https://godbolt.org/z/1j94Gcb4o) using 64-bit operand-size shld/shrd

but the uint32_t functions compile to a total mess in 32-bit mode (GCC11.2 -O3
-m32 -mregparm=3) before eventually using shld, including a totally insane 
or  dh, 0

GCC trunk with -O3 -mregparm=3 compiles them cleanly, but without regparm it's
still a slightly different mess.

Ironically, the uint32_t functions compile to quite a few instructions in
64-bit mode, actually doing the operations as written with shifts and ORs, and
having to manually mask the shift count to &31 because it uses a 64-bit
operand-size shift which masks with &63.  32-bit operand-size SHLD would be a
win here, at least for -mtune=intel or a specific Intel uarch.

I haven't looked at whether they still compile ok after inlining into
surrounding code, or whether operations would tend to combine with other things
in preference to becoming an SHLD.

[Bug target/105066] GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

--- Comment #5 from Peter Cordes  ---
> pextrw requires sse4.1 for mem operands.

You're right! I didn't double-check the asm manual for PEXTRW when writing up
the initial report, and had never realized that PINSRW wasn't symmetric with
it.  I was really surprised to see that in
https://www.felixcloutier.com/x86/pextrw

So we do need to care about tuning for _mm_storeu_si16(p, v) without SSE4.1
(without the option of PEXTRW to memory).  PEXTRW to an integer register is
obviously bad; we should be doing

movd  %xmm0, %eax
mov   %ax, (%rdi)

instead of an inefficient  pextrw $0, %xmm0, %eax ; movw-store

Reported as PR105079, since the cause of the load missed-opt was GCC thinking
the instruction wasn't available, rather than a wrong tuning choice like this
is.

[Bug target/105079] New: _mm_storeu_si16 inefficiently uses pextrw to an integer reg (without SSE4.1)

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105079

Bug ID: 105079
   Summary: _mm_storeu_si16 inefficiently uses pextrw to an
integer reg (without SSE4.1)
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

With PR105066 fixed, we do _mm_loadu_si16 with pinsrw from memory, because
that's available with just SSE2.  (And the cause wasn't tuning choices, it was
a typo in what insns GCC thought were available.)  Related: PR105072 re:
folding such 16-bit loads into memory source operands for PMOVZX/SXBQ.

But the famously non-orthogonal SSE2 only includes pextrw $imm, %xmm, reg.  Not
reg/mem until SSE4.1 (with a longer opcode for no apparent reason, instead of
just allowing mem addressing modes for the existing one.  But same mnemonic so
the assembler takes care of it.  https://www.felixcloutier.com/x86/pextrw)

So we do need to care about tuning for _mm_storeu_si16(p, v) without the option
of PEXTRW to memory.  Currently we do this, which is obviously bad:

pextrw  $0, %xmm0, %eax  # 2 uops
movw    %ax, (%rdi)

we should be doing this

movd    %xmm0, %eax  # 1 uop
mov     %ax, (%rdi)

https://godbolt.org/z/Ee3Ez174M
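Presumably the source behind that asm is just this (a sketch; the Godbolt link is
authoritative):

#include <immintrin.h>

void store16(void *p, __m128i v) {
    _mm_storeu_si16(p, v);
}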

This is especially true if we don't need the integer value zero-extended into
EAX.

If we *did* also want the value zero-extended in an integer register, the extra
uop in PEXTRW (in addition to the port 0 uop like MOVD) is a port-5 shuffle to
extract an arbitrary 16-bit element, whereas a separate integer  movzwl %cx, %eax
could run on any integer ALU port.  (Including port 6 on HSW/SKL, which doesn't
compete with any vector ALUs).

Mov-elimination for movzwl doesn't work on any current CPUs, only movzbl on
Intel, and movl / movq on both Intel and AMD.  So currently there's no benefit
to picking a different register like %ecx, instead of just using movzwl %ax,
%eax

When we both store and use the integer value:

int store16_and_use(void *p, __m128i v){
    _mm_storeu_si16( p, v );
    return 123 + *(unsigned short*)p;
}

https://godbolt.org/z/zq6TMo1oE current trunk GCC does this, which is not bad:

# -O3 with or without -msse4.1
pextrw  $0, %xmm0, %eax
movw    %ax, (%rdi)
addl    $123, %eax
ret

Clang13 uses MOVD + MOVZX like I was suggesting, even though it costs more code
size.  That's not necessarily better:

movd    %xmm0, %eax
movw    %ax, (%rdi)
movzwl  %ax, %eax
addl    $123, %eax
retq

In this case it's not obviously wrong to use PEXTRW to an integer reg, but it's
also fine to do it clang's way.  So however that corner case shakes out in the
process of fixing the main bug (using movd / movw without SSE4.1 when we don't
reload) is fine.

If SSE4.1 is available, the no-reload case should probably use PEXTRW to memory
instead of movd + movw.  On some CPUs, the ALU op that's part of PEXTRW has
more choice of ALU port than xmm->gp_int operations.

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2022-03-28 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508

--- Comment #17 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #16)
> >According to Intel (
> > https://software.intel.com/sites/landingpage/IntrinsicsGuide), there are no
> > alignment requirements for _mm_load_sd, _mm_store_sd and _mm_loaddup_pd. For
> > example, from _mm_load_sd:
> 
> I disagree with saying there is no alignment requirement.
> 
> The alignment requirement comes from the type of the argument (double
> const*). [...]
> Pointers themselves have an alignment requirement not just at the time of
> the load/store of them.

The intrinsics are badly designed to take pointer args with types other than
void*, despite how they're expected to work.  This is something we just need to
accept.  Starting with AVX-512, any new intrinsics take void*, but they haven't
redefined the old ones.

_mm_loadu_si128 takes a __m128i*, same as _mm_load_si128.  alignof(__m128i) ==
16, so _mm_loadu_si128 must not simply dereference it, that's what
_mm_load_si128 does.

Intel's intrinsics API requires you to do unaligned 16-byte loads by creating a
misaligned pointer and passing it to a loadu intrinsic.  (This in turn requires
that implementations supporting these intrinsics define the behaviour of
creating such a pointer without deref; in ISO C that alone would be UB.)
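A sketch of the usage pattern the API requires (my illustration): the caller forms a
possibly-misaligned __m128i* and hands it to the loadu intrinsic without ever
dereferencing it directly.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

__m128i load_any(const uint8_t *buf, size_t byte_offset) {
    /* (const __m128i *)(buf + byte_offset) may be misaligned; only the
       intrinsic itself touches memory */
    return _mm_loadu_si128((const __m128i *)(buf + byte_offset));
}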

This additional unaligned-pointer behaviour that implementations must define
(at least for __m128i* and float/double*) is something I wrote about in an SO
answer:
https://stackoverflow.com/questions/52112605/is-reinterpret-casting-between-hardware-simd-vector-pointer-and-the-correspond


_mm_loadu_ps (like _mm_load_ps) takes a float*, but its entire purpose is to
not require alignment.

_mm512_loadu_ps takes a void* arg, so we can infer that earlier FP load
intrinsics really are intended to work on data with any alignment, not just
with the alignment of a float.

They're unlike a normal deref of a float* in aliasing rules, although that's
separate from creating a misaligned float* in code outside the intrinsic.  A
hypothetical low-performance portable emulation of intrinsics that ended up
dereferencing that float* arg directly would be broken for strict-aliasing as
well.

The requirement to define the behaviour of having a misaligned float* can be
blamed on Intel in 1999 (when SSE1 was new).  Later extensions like AVX
_mm256_loadu_ps just followed the same pattern of taking float* until they
finally used void* for intrinsics introduced with or after AVX-512.

The introduction of _mm_loadu_si32 and si16 is another step in the right
direction, recognizing that _mm_cvtsi32_si128( *int_ptr ) isn't strict-aliasing
safe.  When those were new, it might have been around the time Intel started
exploring replacing ICC with the LLVM-based ICX.

Anyway, the requirement to support misaligned vector and float/double pointers
implies that _mm_load_ss/sd taking float*/double* doesn't imply alignof(float)
or alignof(double).

>  So either the intrinsics definition needs to be changed to be
> correct or GCC is correct.

That's an option; I'd love it if all the load/store intrinsics were changed
across all compilers to take void*.  It's ugly and a pain to type  
_mm_loadu_si128( (const __m128i*)ptr )
as well as creating cognitive dissonance because alignof(__m128i) == 16.

I'm not sure if it could break anything to change the intrinsics to take void*
even for older ones; possibly only C++ overload resolution for insane code that
defines a _mm_loadu_ps( other_type * ) and relies on float* args picking the
intrinsic.

If we changed just GCC, without getting buy-in from other compilers, taking
void* would let people's code compile on GCC without casts from stuff like
int*, when it wouldn't compile on other compilers.

That could be considered a bad thing if people test their code with GCC and are
surprised to get reports of failure from people using compilers that follow
Intel's documentation for the intrinsic function arg types. 
(https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html).  It
would basically be a case of being overly permissive for the feature / API that
people are trying to write portable code against.

[Bug sanitizer/84508] Load of misaligned address using _mm_load_sd

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #14 from Peter Cordes  ---
This bug is mis-categorized; it's not a sanitizer bug, it's a bug in the
implementation of _mm_load_ss / _mm_load_sd.

It currently derefs the  `float const*` arg directly, which is not
strict-aliasing or alignment safe.  alignof(float) is 4, but Intel's
documentation for this API still says "mem_addr does not need to be aligned on
any particular boundary."

_mm_load_ss (float const *__P)
{
  return _mm_set_ss (*__P);
}


As discussed on PR99754, _mm_loadu_si32(const void*) *is* strict-aliasing and
alignment safe.  But it was only added recently, and GCC11's implementation of it
is buggy (shuffling the element to the wrong place).  Before that, one safe way
to do a 32-bit SIMD load was with _mm_load_ss and _mm_castps_si128.  Or it was
supposed to be safe, but isn't!!

Clang uses a packed may_alias struct containing a float to get a safe load
done.  Another way would be casting the pointer to

typedef float aliasing_unaligned_f32 __attribute__((aligned(1),may_alias));

This is similar to what we do with __m32_u for use in aliasing-safe integer
load/store, except we define that as int with
vector_size(4),may_alias,aligned(1) for some reason.  Perhaps influenced by
__m64_u which is a vector of 2 ints.
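For example, _mm_load_ss could be made alignment- and aliasing-safe along these
lines (a sketch, not the actual header code):

#include <xmmintrin.h>

typedef float aliasing_unaligned_f32 __attribute__((aligned(1), may_alias));

static inline __m128 load_ss_safe(const void *p) {
    return _mm_set_ss(*(const aliasing_unaligned_f32 *)p);
}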

MSVC is like gcc -fno-strict-aliasing, so however it handles intrinsics,
they're always aliasing-safe.

I'm not 100% sure about what ICC formally guarantees, but in practice it
doesn't move aliasing short*  stores across a _mm_load_ss( (float*)pshort )
load.
https://godbolt.org/z/6s76v71xz  I didn't test with _mm_store_ss aliasing with
short loads, only vice versa.

So GCC is the odd one out, out of the major 4 compilers that support Intel's
intrinsics API.  All our narrow load/store intrinsics should be strict-aliasing
and alignment safe, regardless of what pointer type they accept.

Intel's early design of taking float* and double* instead of void* could be
considered poor design.  Their naming with just load/store instead of
_mm_loadu_ss / storeu is also poor design, clearly motivated by the asm
differences rather than an actual intrinsic API difference.

In x86 asm, loads/stores narrower than 16 bytes never require alignment (unless
the AC bit is set in EFLAGS).  Assuming Intel modeled their intrinsics API
after their asm, then it makes sense to have load and loadu for ps and si128,
but only load/store with an implied lack of alignment for intrinsics that wrap
instructions like movlps / movhps / movss / movsd, and movd / movq, which do
narrower memory accesses.

That of course *doesn't* make sense in C terms, where it's always potentially a
problem to dereference misaligned pointers to narrow objects, even when
compiling for x86-64:
https://stackoverflow.com/questions/47510783/why-does-unaligned-access-to-mmaped-memory-sometimes-segfault-on-amd64
has an example and links some others, showing that compilers *don't* define the
behaviour of deref of misaligned pointers.

I'm pretty certain that Intel always intended their narrow load/store
intrinsics to not have any alignment requirements, like the asm instructions
that wrap them, but weren't thinking in C terms when naming them.  And they were
sloppy in their choices of which ones to provide until decades later, since it
seems they thought that _mm_cvtsi32_si128(*x) was sufficient for a movd load.
(That's only the case on a compiler without strict-aliasing or alignment
requirements, since the deref happens on the user's plain int*.)

Anyway, hopefully this refutes the argument that _mm_load_sd should be aligned
because of the name, and clarifies what Intel might have been thinking when
naming these.

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754

--- Comment #6 from Peter Cordes  ---
Looks good to me, thanks for taking care of this quickly, hopefully we can get
this backported to the GCC11 series to limit the damage for people using these
newish intrinsics.  I'd love to recommend them for general use, except for this
GCC problem where some distros have already shipped GCC versions that compile
without error but in a 100% broken way.

Portable ways to do narrow alignment/aliasing-safe SIMD loads were sorely
lacking; there aren't good effective workarounds for this, especially for
16-bit loads.  (I still don't know how to portably / safely write code that
will compile to a memory-source PMOVZXBQ across all compilers; Intel's
intrinsics API is rather lacking in some areas and relies on compilers folding
loads into memory source operands.)


> So, isn't that a bug in the intrinsic guide instead?

Yes, __m128i _mm_loadu_si16 only really makes sense with SSE2 for PINSRW.  Even
movzx into an integer reg and then MOVD xmm, eax requires SSE2.  With only SSE1
you'd have to movzx / dword store to stack / MOVSS reload.

SSE1 makes *some* sense for _mm_loadu_si32 since it can be implemented with a
single MOVSS if MOVD isn't available.

But we already have SSE1 __m128 _mm_load_ss(const float *) for that.

Except GCC's implementation of _mm_load_ss isn't alignment and strict-aliasing
safe; it derefs the actual float *__P as _mm_set_ss (*__P).  Which I think is a
bug, although I'm not clear what semantics Intel intended for that intrinsic. 
Clang implements it as alignment/aliasing safe with a packed may_alias struct
containing a float.  MSVC always behaves like -fno-strict-aliasing, and I
*think* ICC does, too.

Perhaps best to follow the crowd and make all narrow load/store intrinsics
alignment and aliasing safe, unless that causes code-gen regressions; users can
write _mm_set_ss( *ptr ) themselves if they want to tell the compiler that it's
a normal C float object.

Was going to report this, but PR84508 is still open and already covers the
relevant ss and sd intrinsics.  That points out that Intel specifically
documents it as not requiring alignment, not mentioning aliasing.



Speaking of bouncing through a GP-integer reg, GCC unfortunately does that; it
seems to incorrectly think PINSRW xmm, mem, 0 requires -msse4.1, unlike with a
GP register source.  Reported as PR105066 along with related missed
optimizations about folding into a memory source operand for pmovzx/sx.

But that's unrelated to correctness; this bug can be closed unless we're
keeping it open until it's fixed in the GCC11 current stable series.

[Bug target/105066] New: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not SSE2? _mm_loadu_si16 bounces through integer reg

2022-03-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066

Bug ID: 105066
   Summary: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not
SSE2?  _mm_loadu_si16 bounces through integer reg
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

PR99754 fixed the wrong-code for _mm_loadu_si16, but the resulting asm is not
efficient without -msse4.1 (which most -march= settings imply).  It seems GCC
thinks that pinsrw / pextrw with a memory operand requires SSE4.1, like
pinsr/extr for b/d/q operand-sizes.  But actually the 16-bit insr/extr forms only
need SSE2.

(We're also not efficiently folding it into a memory source operand for
PMOVZXBQ, see below)

https://godbolt.org/z/dYchb6hec shows GCC trunk 12.0.1 20220321

__m128i load16(void *p){
    return _mm_loadu_si16( p );
}


load16(void*): # no options, or -march=core2 or -mssse3
movzwl  (%rdi), %eax
pxor    %xmm1, %xmm1
pinsrw  $0, %eax, %xmm1   # should be MOVD %eax, or PINSRW mem
movdqa  %xmm1, %xmm0
ret

vs. 

load16(void*):  # -msse4.1
pxor    %xmm1, %xmm1
pinsrw  $0, (%rdi), %xmm1
movdqa  %xmm1, %xmm0
ret


The second version is actually 100% fine with SSE2:
https://www.felixcloutier.com/x86/pinsrw shows that there's only a single
opcode for PINSRW xmm, r32/m16, imm8 and it requires SSE2; reg vs. mem source
is just a matter of the modr/m byte.

The same problem exists for _mm_storeu_si16 not using pextrw to memory (which
is also SSE2), instead bouncing through EAX.  (Insanely still PEXTRW instead of
MOVD).



There is a choice of strategy here, but pinsrw/extrw between eax and xmm0 is
clearly sub-optimal everywhere.  Once we factor out the dumb register
allocation that wastes a movdqa, the interesting options are:

movzwl  (%rdi), %eax  # 1 uop on everything
movd    %eax, %xmm0   # 1 uop on everything

vs.

pxor    %xmm0, %xmm0        # 1 uop for the front-end, eliminated on Intel
pinsrw  $0, (%rdi), %xmm0   # 2 uops  (load + shuffle/merge)


Similarly for extract,

pextrw  $0, %xmm0, (%rdi)   # 2 uops on most

vs.

movd    %xmm0, %eax # 1 uop, only 1/clock even on Ice Lake
movw    %ax, (%rdi) # 1 uop

On Bulldozer-family, bouncing through an integer reg adds a lot of latency vs.
loading straight into the SIMD unit.  (2 integer cores share a SIMD/FP unit, so
movd between XMM and GP-integer is higher latency than most.)  So that would
definitely favour pinsrw/pextrw with memory.

On Ice Lake, pextrw to mem is 2/clock throughput: the SIMD shuffle can run on
p1/p5.  But MOVD r,v is still p0 only, and MOVD v,r is still p5 only.  So that
also favours pinsrw/pextrw with memory, despite the extra front-end uop for
pxor-zeroing the destination on load.

Of course, if _mm_storeu_si16 is used on a temporary that's later reloaded,
being able to optimize to a movd (and optionally movzx) is very good.  Similar
for _mm_loadu_si16 on a value we have in an integer reg, especially if we know
it's already zero-extended to 32-bit for just a movd, we'd like to be able to
do that.

---

It's also essential that these loads fold efficiently into memory source
operands for PMOVZX; pmovzxbq is one of the major use-cases for a 16-bit load.

That may be a separate bug, IDK

https://godbolt.org/z/3a9T55n3q shows _mm_cvtepu8_epi32(_mm_loadu_si32(p)) does
fold a 32-bit memory source operand nicely to pmovzxbd (%rdi), %xmm0 which can
micro-fuse into a single uop on Intel CPUs (for the 128-bit destination
version, not YMM), but disaster with 16-bit loads:

__m128i pmovzxbq(void *p){
    return _mm_cvtepu8_epi64(_mm_loadu_si16(p));
}

pmovzxbq(void*):  # -O3 -msse4.1 -mtune=haswell
pxor     %xmm0, %xmm0       # 1 uop
pinsrw   $0, (%rdi), %xmm0  # 2 uops, one for shuffle port
pmovzxbq %xmm0, %xmm0       # 1 uop for the same shuffle port
ret

(_mm_cvtepu8_epi64 requires SSE4.1 so there's no interaction with the
-mno-sse4.1 implementation of the load.)

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754

--- Comment #3 from Peter Cordes  ---
Wait a minute, the current implementation of _mm_loadu_si32 isn't
strict-aliasing or alignment safe!!!   That defeats the purpose for its
existence as something to use instead of _mm_cvtsi32_si128( *(int*)p );

The current code contains a deref of a plain (int*).
It should be using something like

typedef int unaligned_aliasing_int __attribute__((aligned(1),may_alias));
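e.g. a sketch of what an aliasing/alignment-safe implementation could look like
(illustrative only):

#include <emmintrin.h>

typedef int unaligned_aliasing_int __attribute__((aligned(1), may_alias));

static inline __m128i loadu_si32_safe(const void *p) {
    return _mm_cvtsi32_si128(*(const unaligned_aliasing_int *)p);
}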

[Bug target/99754] [sse2] new _mm_loadu_si16 and _mm_loadu_si32 implemented incorrectly

2022-03-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
Can we get this patch applied soon?  There aren't any other
strict-aliasing-safe movd load intrinsics, but this one won't be portably
usable while there are buggy GCC versions around.

Until then, code should probably use something like


inline __m128i movd(void *p){
    return _mm_castps_si128(_mm_load_ss((const float*)p));
}

(Which, believe it or not, is strict-aliasing safe even on integer data.  At
least it should be; last I tested, it was across compilers, except maybe on ICC.
I'd have to double-check there.)

[Bug target/104773] New: compare with 1 not merged with subtract 1

2022-03-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773

Bug ID: 104773
   Summary: compare with 1 not merged with subtract 1
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*, arm-*-*

std::bit_ceil(x) involves if(x == 0 || x == 1) return 1;
and 1u << (32-clz(x-1)).

The compare of course compiles to an unsigned <= 1, which can be done with a
sub instead of cmp, producing the value we need as an input for the
leading-zero count.  But GCC does *not* do this.  (Neither does clang for
x86-64.)  I trimmed down the libstdc++ <bit> code into something I could
compile even when Godbolt doesn't have working headers for some ISAs:
https://godbolt.org/z/3EE7W5bna

// cut down from libstdc++ for normal integer cases; compiles the same
  template<typename _Tp>
    constexpr _Tp
    bit_ceil(_Tp __x) noexcept
    {
      constexpr auto _Nd = std::numeric_limits<_Tp>::digits;
      if (__x == 0 || __x == 1)
        return 1;
      auto __shift_exponent = _Nd - __builtin_clz((_Tp)(__x - 1u));
      // using __promoted_type = decltype(__x << 1); ... // removed check for x<

[Bug libstdc++/97759] Could std::has_single_bit be faster?

2022-03-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #14 from Peter Cordes  ---
Agreed with the idea of doing the  popcount(x) == 1  peephole
replacement in the compiler, not forcing the header to figure out whether we
have efficient popcount or not.
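The source pattern in question (sketch):

#include <stdbool.h>

bool is_single_bit(unsigned x) {
    /* the peephole would turn this into blsr + flag tests instead of popcnt */
    return __builtin_popcount(x) == 1;
}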

If we have BMI1, BLSR (Bit Lowest-Set Reset) sets CF=1 if the input was
zero, and ZF=1 if the output is zero.  Unfortunately none of the standard
jcc/setcc conditions check ZF=1 & CF=0, even with CMC to invert CF first.  
(If Intel had designed it to produce ZF and CF inverted, it would be
non-intuitive for all other uses but would have allowed blsr / jcc to implement
if(has_single_bit(x)).)

CF=1, ZF=0  impossible: input was zero, output was non-zero
CF=1, ZF=1  input was zero
CF=0, ZF=0  input had multiple bits set
CF=0, ZF=1  input had a single bit set.

If we're going to branch on it anyway after inlining, a branchy strategy is
probably good:

singlebit_bmi2_branchy:
   xor    %eax, %eax
   blsr   %edi, %edi    # ZF=1 means we cleared the last bit, or the input was zero
   jc     .Linput_zero  # input was zero, return 0 regardless of ZF
   setz   %al
 .Linput_zero:
   ret


And when we want a boolean in a register, a combination of setz and cmovc can
materialize one.  With clever choice of registers, we can even avoid giving
setcc a false dependency on a register that isn't already part of its dep chain

singlebit_bmi2_cmov:
   blsr    %edi, %eax
   setz    %al         # false dep, but it's ready if FLAGS are ready because we wrote it with BLSR
   cmovc   %edi, %eax  # return 1 only if ZF=1 (setz produces 1) and CF=0 (cmovc doesn't overwrite it with the input 0)
   ret

With xor-zeroing first, we could produce the boolean zero-extended to 32-bit,
instead of here where only the low 8 bits are actually 0 / 1.  (Which is fine
for returning a bool in all the mainstream calling conventions)

(This is the same kind of idea as ARM64 sub/tst / ccmp / cset, where ccmp can
conditionally update flags.)

An evil variation on this uses setnz / dec to invert ZF without affecting CF,
allowing JA:

   blsr   %edi,%eax
   setnz  %al             # AL = !ZF
   dec    %al             # 1-1 -> ZF=1,  0-1 -> ZF=0.  ZF=!ZF without affecting CF
   # seta %al             # set on CF=0 and ZF=0
   ja     was_single_bit  # only actually useful for branching after inlining

dec/ja can't macro-fuse into a single uop, but on Skylake and later Intel it
doesn't cost any extra partial-FLAGS merging uops, because JA simply has both
parts of FLAGS as separate inputs.  (This is why CMOVA / CMOVBE are still 2
uops on Skylake, unlike all other forms: they need 2 integer inputs and 2 parts
of FLAGS, while others need either CF or SPAZO not both.  Interestingly, Zen1/2
have that effect but not Zen3)

I don't know how AMD handles dec / ja partial-flags shenanigans.  Intel Haswell
would I think have a flags-merging uop; older Intel doesn't support BMI1 so
P6-family is irrelevant.  https://stackoverflow.com/a/49868149/224132

I haven't benchmarked them because they have different use-cases (materializing
a boolean vs. creating a FLAGS condition to branch on, being branchless
itself), so any single benchmark would make one of them look good.  If your
data almost never (or always) has an all-zero input, the JC in the first
version will predict well.  After inlining, if the caller branches on the bool
result, you might want to just branch on both conditions separately.

I don't think this setnz/dec/ja version is ever useful.  Unlike branching
separately on ZF and CF, it's not bad if both 0 and multi-bit inputs are common
while single-bit inputs are rare.  But blsr/setz/cmovc + test/jnz is only 4
uops, same as this on Skylake. (test can macro-fuse with jnz).

The uops are all dependent on each other, so it also has the same latency (to
detect a branch miss) as popcnt / macro-fused cmp/je which is 2 uops.  The only
thing this has going for it is avoiding a port-1-only uop, I think.

It's also possible to blsr / lahf / and  ah, (1<<6) | (1<<0) / cmp  ah, 1<<6   
to directly check that ZF=1 and CF=0.  I doubt that's useful.  Or hmm, can we
branch directly on PF after AND with that 2-bit mask?  CF=1 ZF=0 is impossible,
so the only other odd-parity case is CF=0 ZF=1.  AMD and Intel can macro-fuse
test/jp.

   blsr  %edi, %eax
   lahf
   test  $(1<<6) | (1<<0), %ah   # check ZF and CF
   jpo   was_single_bit          # ZF != CF means CF=0, ZF=1 because the other way is impossible

Also possible of course is the straightforward 2x setcc and AND to materialize
a boolean in the bottom byte of EAX.  Good ILP, only 3 cycle latency from input
to result on Intel, but that's the same as setz/cmovc which is fewer uops and
can avoid the false dependency.

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

--- Comment #11 from Peter Cordes  ---
Also, horizontal byte sums are generally best done with  VPSADBW against a zero
vector, even if that means some fiddling to flip to unsigned first and then
undo the bias.

simde_vaddlv_s8:
 vpxor    xmm0, xmm0, .LC0[rip]  # set1_epi8(0x80) flip to unsigned 0..255 range
 vpxor    xmm1, xmm1
 vpsadbw  xmm0, xmm0, xmm1       # horizontal byte sum within each 64-bit half
 vmovd    eax, xmm0              # we only wanted the low half anyway
 sub      eax, 8 * 128           # subtract the bias we added earlier by flipping sign bits
 ret
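The same idea with intrinsics, for reference (a sketch assuming SSE2; the function
name is made up, and only the low 8 bytes of the input matter, as in vaddlv_s8):

#include <emmintrin.h>
#include <stdint.h>

int16_t vaddlv_s8_sse2(__m128i v) {
    __m128i flipped = _mm_xor_si128(v, _mm_set1_epi8((char)0x80)); /* to unsigned */
    __m128i sums    = _mm_sad_epu8(flipped, _mm_setzero_si128());  /* per-64-bit byte sums */
    return (int16_t)(_mm_cvtsi128_si32(sums) - 8 * 128);           /* undo the +128 bias */
}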

This is so much shorter we'd still be ahead if we generated the vector constant
on the fly instead of loading it.  (3 instructions: vpcmpeqd same,same / vpabsb
/ vpslld by 7.  Or pcmpeqd / psllw 8 / packsswb same,same to saturate to -128)

If we had wanted a 128-bit (16 byte) vector sum, we'd need

  ...
  vpsadbw ...

  vpshufd  xmm1, xmm0, 0xfe # shuffle upper 64 bits to the bottom
  vpaddd   xmm0, xmm0, xmm1
  vmovd    eax, xmm0
  sub  eax, 16 * 128

Works efficiently with only SSE2.  Actually with AVX2, we should unpack the top
half with VUNPCKHQDQ to save a byte (no immediate operand), since we don't need
PSHUFD copy-and-shuffle.

Or movd / pextrw / scalar add but that's more uops: pextrw is 2 on its own.

[Bug tree-optimization/102494] Failure to optimize vector reduction properly especially when using OpenMP

2021-10-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #10 from Peter Cordes  ---
Current trunk with -fopenmp is still not good https://godbolt.org/z/b3jjhcvTa 
Still doing two separate sign extensions and two stores / wider reload (store
forwarding stall):

-O3 -march=skylake -fopenmp
simde_vaddlv_s8:
push    rbp
vpmovsxbw   xmm2, xmm0
vpsrlq  xmm0, xmm0, 32
mov rbp, rsp
vpmovsxbw   xmm3, xmm0
and rsp, -32
vmovq   QWORD PTR [rsp-16], xmm2
vmovq   QWORD PTR [rsp-8], xmm3
vmovdqa xmm4, XMMWORD PTR [rsp-16]
   ... then asm using byte-shifts

Including stuff like
   movdqa  xmm1, xmm0
   psrldq  xmm1, 4

instead of pshufd, which is an option because high garbage can be ignored.

And ARM64 goes scalar.



Current trunk *without* -fopenmp produces decent asm
https://godbolt.org/z/h1KEKPTW9

For ARM64 we've been making good asm since GCC 10.x (vs. scalar in 9.3)
simde_vaddlv_s8:
sxtl    v0.8h, v0.8b
addv    h0, v0.8h
umov    w0, v0.h[0]
ret

x86-64 gcc  -O3 -march=skylake
simde_vaddlv_s8:
vpmovsxbw   xmm1, xmm0
vpsrlq  xmm0, xmm0, 32
vpmovsxbw   xmm0, xmm0
vpaddw  xmm0, xmm1, xmm0
vpsrlq  xmm1, xmm0, 32
vpaddw  xmm0, xmm0, xmm1
vpsrlq  xmm1, xmm0, 16
vpaddw  xmm0, xmm0, xmm1
vpextrw eax, xmm0, 0
ret


That's pretty good, but  VMOVD eax, xmm0  would be more efficient than  VPEXTRW
when we don't need to avoid high garbage (because it's a return value in this
case).  VPEXTRW zero-extends into RAX, so it's not directly helpful if we need
to sign-extend to 32 or 64-bit for some reason; we'd still need a scalar movsx.

Or with BMI2, go scalar before the last shift / VPADDW step, e.g.
  ...
  vmovd  eax, xmm0
  rorx   edx, eax, 16
  addeax, edx

[Bug tree-optimization/80570] auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2021-09-26 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570

--- Comment #3 from Peter Cordes  ---
(In reply to Andrew Pinski from comment #2)
> Even on aarch64:
> 
> .L2:
> ldr q0, [x1], 16
> sxtl    v1.2d, v0.2s
> sxtl2   v0.2d, v0.4s
> scvtf   v1.2d, v1.2d
> scvtf   v0.2d, v0.2d
> stp q1, q0, [x0]
>
> But the above is decent really.

More than decent; that's what we *should* be doing, I think.

AArch64 has versions of most instructions that read the top of a vector, unlike
x86-64 where VPMOVZX / SX can only read from the bottom half.  That's the key
difference, and what makes this strategy good on ARM, bad on x86-64.

(On 32-bit ARM, you load a q register, then read the two halves separately as
64-bit d<0..31> registers.  AArch64 changed that so there are 32x 128-bit
vector regs, and no partial regs aliasing the high half.  But they provide OP,
OP2 versions of some instructions that widen or do similar things, with the "2"
version accessing a high half.  Presumably part of the motivation is to make it
easier to port ARM NEON code that depended on accessing halves of a 128-bit q
vector using its d regs.  But it's a generally reasonable design and could also
be motivated by seeing how inconvenient things get in SSE and AVX for
pmovsx/zx.) 

 Anyway, AArch64 SIMD is specifically designed to make it fully efficient to do
wide loads and then unpack both halves, like is possible in ARM, but not
x86-64.  

It's also using a store (of a pair of regs) that's twice the width of the load.
 But even if it was using a max-width load of a pair of 128-bit vectors (and
having to store two pairs) that would be good, just effectively unrolling.  But
GCC sees it as one load and two separate stores, that it just happens to be
able to combine as a pair.
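
For reference, the kind of source loop this PR is about would be something like
the following (my reconstruction, not the PR's exact testcase):

    void int_to_double(double *restrict dst, const int *restrict src, long n)
    {
        for (long i = 0; i < n; i++)
            dst[i] = src[i];    /* int -> double conversion, auto-vectorized */
    }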

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2021-09-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103

--- Comment #9 from Peter Cordes  ---
Thanks for implementing my idea :)

(In reply to Hongtao.liu from comment #6)
> For elements located above 128bits, it seems always better(?) to use
> valign{d,q}

TL:DR:
 I think we should still use vextracti* / vextractf* when that can get the job
done in a single instruction, especially when the VEX-encoded vextracti/f128
can save a byte of code size for v[4].
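
For concreteness, the two candidate sequences for an element like v[4] of a
__m256i look roughly like this in intrinsics (a sketch; the function names are
mine, not from the patch):

    #include <immintrin.h>
    int extract_elem4_vextract(__m256i v) {      /* vextracti128 $1 (VEX) + vmovd */
        return _mm_cvtsi128_si32(_mm256_extracti128_si256(v, 1));
    }
    int extract_elem4_valignd(__m256i v) {       /* valignd $4 (EVEX) + vmovd */
        __m256i rot = _mm256_alignr_epi32(v, v, 4);
        return _mm_cvtsi128_si32(_mm256_castsi256_si128(rot));
    }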

Extracts are simpler shuffles that might have better throughput on some future
CPUs, especially the upcoming Zen4, so even without code-size savings we should
use them when possible.  Tiger Lake has a 256-bit shuffle unit on port 1 that
supports some common shuffles (like vpshufb); a future Intel might add
256->128-bit extracts to that.

It might also save a tiny bit of power, allowing on-average higher turbo
clocks.

---

On current CPUs with AVX-512, valignd is about equal to a single vextract, and
better than multiple instructions.  It doesn't really have downsides on current
Intel, since I think Intel has continued to not have int/FP bypass delays for
shuffles.

We don't know yet what AMD's Zen4 implementation of AVX-512 will look like.  If
it's like Zen1 was AVX2 (i.e. if it decodes 512-bit instructions other than
insert/extract into at least 2x 256-bit uops) a lane-crossing shuffle like
valignd probably costs more than 2 uops.  (vpermq is more than 2 uops on
Piledriver/Zen1).  But a 128-bit extract will probably cost just one uop.  (And
especially an extract of the high 256 might be very cheap and low latency, like
vextracti128 on Zen1, so we might prefer vextracti64x4 for v[8].)

So this change is good, but using a vextracti64x2 or vextracti64x4 could be a
useful peephole optimization when byte_offset % 16 == 0.  Or of course
vextracti128 when possible (x/ymm0..15, not 16..31 which are only accessible
with an EVEX-encoded instruction).

vextractf-whatever allows an FP shuffle on FP data in case some future CPU
cares about that for shuffles.

An extract is a simpler shuffle that might have better throughput on some
future CPU even with full-width execution units.  Some future Intel CPU might
add support for vextract uops to the extra shuffle unit on port 1.  (Which is
available when no 512-bit uops are in flight.)  Currently (Ice Lake / Tiger
Lake) it can only run some common shuffles like vpshufb ymm, but not including
any vextract or valign.  Of course port 1 vector ALUs are shut down when
512-bit uops are in flight, but could be relevant for __m256 vectors on these
hypothetical future CPUs.

When we can get the job done with a single vextract-something, we should use
that instead of valignd.  Otherwise use valignd.

We already check the index for low-128 special cases to use vunpckhqdq vs.
vpshufd (or vpsrldq) or similar FP shuffles.

-

On current Intel, with clean YMM/ZMM uppers (known by the CPU hardware to be
zero), an extract that only writes a 128-bit register will keep them clean
(even if it reads a ZMM), not needing a VZEROUPPER.  VZEROUPPER is only
needed for dirty y/zmm0..15, not for dirty zmm16..31, so a function like

float foo(float *p) {
  some vector stuff that can use high zmm regs;
  return scalar that happens to be from the middle of a vector;
}

could vextract into XMM0, but would need vzeroupper if it used valignd into
ZMM0.

(Also related
https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc
re reading a ZMM at all and turbo clock).

---

Having known zeros outside the low 128 bits (from writing an xmm instead of
rotating a zmm) is unlikely to matter, although for FP stuff copying fewer
elements that might be subnormal could happen to be an advantage, maybe saving
an FP assist for denormal.  We're unlikely to be able to take advantage of it
to save instructions/uops (like OR instead of blend).  But it's not worse to
use a single extract instruction instead of a single valignd.

[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code

2021-09-04 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

--- Comment #37 from Peter Cordes  ---
Correction, PR82666 is that the cmov on the critical path happens even at -O2
(with GCC7 and later).  Not just with -O3 -fno-tree-vectorize.

Anyway, that's related, but probably separate from choosing to do if-conversion
or not after inlining.

[Bug target/56309] conditional moves instead of compare and branch result in almost 2x slower code

2021-09-04 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #36 from Peter Cordes  ---
Related:  a similar case of cmov being a worse choice, for a threshold
condition with an array input that happens to already be sorted:

https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2
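
(The loop in that question is roughly the following; quoting from memory, so
treat the details as an assumption:)

    long long sum_above_threshold(const int *data, unsigned n)
    {
        long long sum = 0;
        for (unsigned c = 0; c < n; ++c)
            if (data[c] >= 128)   /* input happens to be sorted, so this predicts ~perfectly */
                sum += data[c];
        return sum;
    }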

GCC with -fprofile-generate / -fprofile-use does correctly decide to use
branches.

GCC7 and later (including current trunk) with -O3 -fno-tree-vectorize
de-optimizes by putting the CMOV on the critical path, instead of as part of
creating a zero/non-zero input for the ADD. PR82666.  If you do allow full -O3,
then vectorization is effective, though.

[Bug target/15533] Missed move to partial register

2021-08-22 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #5 from Peter Cordes  ---
The new asm is less bad, but still not good.  PR53133 is closed, but this code-gen
is a new instance of partial-register writing with xor al,al.  Also related:
PR82940 re: identifying bitfield insert patterns in the middle-end; hopefully
Andrew Pinski's planned set of patches to improve that can help back-ends do a
better job?

If we're going to read a 32-bit reg after writing an 8-bit reg (causing a
partial-register stall on Nehalem and earlier), we should be doing

  mov  a, %al   # merge into the low byte of RAX
  ret
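
(For reference, the kind of operation involved is roughly the following; my
reconstruction, not necessarily the PR's exact testcase:)

    unsigned char a;                      /* global byte to merge in */
    unsigned int insert_low_byte(unsigned int x)
    {
        return (x & ~0xffu) | a;          /* wants: load 'a' into the low byte of x */
    }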

Haswell and newer Intel don't rename the low byte partial register separately
from the full register, so they behave like AMD and other non-P6 /
non-Sandybridge CPU: dependency on the full register.  That's good for this
code; in this case the merging is necessary and we don't want the CPU to guess
that it won't be needed later.  The load+ALU-merge uops can micro-fuse into a
single uop for the front end.

 xor %al,%al still has a false dependency on the old value of RAX because it's
not a zeroing idiom; IIRC in my testing it's at least as good to do  mov $0,
%al.  Both instructions are 2 bytes long.

*
https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers
 survey of the ways partial regs are handled on Intel P6 family vs. Intel
Sandybridge vs. Haswell and later vs. non-Intel and Intel Silvermont etc.
*
https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to
- details of my testing on Haswell / Skylake.



*If* we still care about  -mtune=nehalem  and other increasingly less relevant
CPUs, we should be avoiding a partial register stall for those tuning options
with something like

   movzbl   a, %edx
   and  $-256, %eax
   or   %edx, %eax

i.e. what we're already doing, but spend a 5-byte AND-immediate instead of a
2-byte xor %al,%al or mov $0, %al

(That's what clang always does, so it's missing the code-size optimization.
https://godbolt.org/z/jsE57EKcb shows a similar case of return (a&0xFF00u)
| (b&0xFFu); with two register args)

-

The penalty on Pentium-M through Nehalem is to stall for 2-3 cycles while a
merging uop is inserted.  The penalty on earlier P6 (PPro / Pentium III) is to
stall for 5-6 cycles until the partial-register write retires.

The penalty on Sandybridge (and maybe Ivy Bridge if it renames AL) is no stall,
just insert a merging uop.

On later Intel, and AMD, and Silvermont-family Intel, writing AL has a
dependency on the old RAX; it's a merge on the spot.

BTW, modern Intel does still rename AH separately, and merging does require the
front-end to issue a merging uop in a cycle by itself.  So writing AH instead
of AL would be different.

[Bug middle-end/82940] Suboptimal code for (a & 0x7f) | (b & 0x80) on powerpc

2021-08-22 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82940

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #6 from Peter Cordes  ---
For a simpler test case, GCC 4.8.5 did redundantly mask before using
bitfield-insert, but GCC 9.2.1 doesn't.


unsigned merge2(unsigned a, unsigned b){
    return (a&0xFF00u) | (b&0xFFu);
}

https://godbolt.org/z/froExaPxe
# PowerPC (32-bit) GCC 4.8.5
rlwinm 4,4,0,0xff # b &= 0xFF is totally redundant
rlwimi 3,4,0,24,31
blr

# power64 GCC 9.2.1 (ATI13.0)
rlwimi 3,4,0,255# bit-blend according to mask, rotate count=0
rldicl 3,3,0,32 # Is this zero-extension to 64-bit redundant?
blr

But ppc64 GCC does zero-extension of the result from 32 to 64-bit, which is
probably not needed unless the calling convention has different requirements
for return values than for incoming args.  (I don't know PPC well enough.)

So for at least some cases, modern GCC does ok.

Also, when the blend isn't split at a byte boundary, even GCC4.8.5 manages to
avoid redundant masking before the bitfield-insert.

unsigned merge2(unsigned a, unsigned b){
    return (a & 0xFF80u) | (b & 0x7Fu);
}

rlwimi 3,4,0,25,31   # GCC4.8.5, 32-bit so no zero-extension
blr

[Bug tree-optimization/100922] CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

--- Comment #2 from Peter Cordes  ---
Possibly also related:

With different surrounding code, this loop can compile to asm which has two
useless movz / mov register copies in the loop at -O2 
(https://godbolt.org/z/PTcqzM6q7).  (To set up for entry into the next loop in
over-complicated ways, and doing this in the loop is unnecessary.)


  while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
    str++;
  }


.L19:
movzbl  1(%rdi), %edx
addq    $1, %rdi
movzbl  %dl, %ecx
movl    %edx, %eax
cmpb    $0, -120(%rsp,%rcx)
je      .L19

from source

void remove_chars(char *restrict str, const char *restrict remove)
{
  char lut[256] = {0};
  do {
    lut[(unsigned char)*remove] = -1;
  }while(*remove++);

/***   Over complicated asm in this loop */
  while( lut[(unsigned char)*str] == 0 ){  // also catches terminating 0
    str++;
  }
  // str points at first char to *not* keep (or the terminating 0)
  const char *in = str;
  char *out = str;
  while (*in)
  {
    char mask = lut[(unsigned char)*in];
    unsigned char cin = *in, cout = *out;
    *out = mask ? cout : cin;
    out += mask + 1;
    in++;
  }
  *out = *in;
}

[Bug tree-optimization/100922] New: CSE leads to fully redundant (back to back) zero-extending loads of the same thing in a loop, or a register copy

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922

Bug ID: 100922
   Summary: CSE leads to fully redundant (back to back)
zero-extending loads of the same thing in a loop, or a
register copy
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

Created attachment 50948
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50948&action=edit
redundant_zero_extend.c

It's rarely a good idea to load the same thing twice; generally better to copy
a register.  Or to read the same register twice when a copy isn't needed.  So
the following asm should never happen, but it does with current trunk, and
similar with GCC as old as 4.5

movzbl  (%rax), %edx
movzbl  (%rax), %ecx    # no branch target between these instructions

or 

ldrb    w4, [x2]
ldrb    w3, [x2], 1     # post-indexed *x2++

(Happens at -O3.  With -O2 we have a redundant register copy, so either way
still a wasted instruction.  And there are other differences earlier in the
function with -O2 vs. -O3.)

https://godbolt.org/z/jT7WaWeK8 - minimal test case. x86-64 and AArch64 trunk
show basically identical code structure.  x86-64 gcc (Compiler-Explorer-Build)
12.0.0 20210603 and aarch64-unknown-linux-gnu-gcc (GCC) 12.0.0 20210524

void remove_chars_inplace(char *str, const unsigned char keep_lut[256])
{
  while(keep_lut[(unsigned char)*str]){ // can be an if() and still repro
    str++;                              // keep_lut[0] is false
  }

  char *out = str;
  unsigned char c;   /* must be unsigned char for correctness. */
  do {
    c = *str++;
    unsigned char inc = keep_lut[c];  // unsigned long doesn't help
    *out = c;
    out += inc;   // inc=0 or 1 to let next char overwrite or not
  } while(c);
}

x86-64 asm:

remove_chars_inplace:
jmp     .L8
.L3:                            # top of search loop for first char to remove
addq    $1, %rdi
.L8:                            # loop entry point
movzbl  (%rdi), %eax
cmpb    $0, (%rsi,%rax)         # un-laminates and doesn't macro-fuse ...
jne     .L3

cmpb    $0, (%rdi)              # 2nd loop body can be skipped if *str == 0
                                # should be test %al,%al - this char was already loaded.
leaq    1(%rdi), %rax           # even -march=znver2 fails to move this earlier or later
                                # to allow cmp/je fusion.  (Intel won't macro-fuse cmp imm,mem / jcc)
je      .L1

.L5:                            # TOP OF 2ND LOOP
movzbl  (%rax), %edx
movzbl  (%rax), %ecx            # redundant load of *str
addq    $1, %rax
movzbl  (%rsi,%rdx), %edx       # inc = lut[c]
movb    %cl, (%rdi)
addq    %rdx, %rdi              # out += inc
testb   %cl, %cl
jne     .L5                     # }while(c != 0)
.L1:
ret

IDK if it's interesting or not that the   cmpb $0, (%rdi)  is also a redundant
load.  The first loop left *str, i.e. (%rdi), in EAX.  Putting the LEA between
cmp and je (even with -march=znver2) is a separate missed optimization. 
(unless that's working around Intel's JCC erratum)

With only -O2 instead of -O3, we get better asm in that part: it takes
advantage of having the char in AL, and jumps into the middle of the next loop
after xor-zeroing the `inc` variable.


Replacing    c = *str++;   with
  c = *str;
  str++;
results in a wasted register copy with trunk, instead of a 2nd load (on x86-64
and arm64).  Still a missed opt, but less bad.  GCC7 and earlier still do an
extra load with either way of writing that.

Removing the first loop, or making its loop condition something like  *str &&
keep_lut[*str],  removes the problem entirely.  The CSE possibility is gone. 
(Same even if we use lut[*(unsigned char*)str] - type-pun the pointer to
unsigned char instead of casting the signed char value to unsigned char, on x86
where char is signed, but not on arm64 where char is unsigned.)

---

I didn't find any clear duplicates; the following are barely worth mentioning:
*  pr94442 looks like extra spilling, not just redundant loading.
*  pr97366 is due to vectors of different types, probably.
*  pr64319 needs runtime aliasing detection to avoid, unlike this.

The AArch64 version of this does seem to demo pr71942 (a useless  and x4, x2,
255 on an LDRB result) when you get it to copy a register instead of doing a
2nd load.

[Bug rtl-optimization/88770] Redundant load opt. or CSE pessimizes code

2021-06-05 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88770

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #2 from Peter Cordes  ---
Note that mov r64, imm64 is a 10-byte instruction, and can be slow to read from
the uop-cache on Sandybridge-family.

The crap involving OR is clearly sub-optimal, but *if* you already have two
spare call-preserved registers across this call, the following is actually
smaller code-size:

movabs  rdi, 21474836483
mov     rbp, rdi
movabs  rsi, 39743127552
mov     rbx, rsi
call    test
mov     rdi, rbp
mov     rsi, rbx
call    test


This is more total uops for the back-end though (movabs is still single-uop,
but takes 2 entries in the uop cache on Sandybridge-family;
https://agner.org/optimize/).  So saving x86 machine-code size this way does
limit the ability of out-of-order exec to see farther, if the front-end isn't
the bottleneck.  And it's highly unlikely to be worth saving/restoring two regs
to enable this.  (Or to push rdi / push rsi before call, then pop after!)

Setting up the wrong value and then fixing it twice with OR is obviously
terrible and never has any advantage, but the general idea to CSE large
constants isn't totally crazy.  (But it's profitable only in such limited cases
that it might not be worth looking for, especially if it's only helpful at -Os)
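
For reference, the shape of source that hits this, reconstructed from the
constants in the asm above (the PR's actual testcase may differ):

    void test(unsigned long long a, unsigned long long b);

    void caller(void)
    {
        test(21474836483ULL, 39743127552ULL);   /* same two 64-bit constants ... */
        test(21474836483ULL, 39743127552ULL);   /* ... used again, so they could be CSEd */
    }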

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2021-06-03 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636

Peter Cordes  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Peter Cordes  ---
This seems to be fixed for ZMM vectors in GCC8. 
https://gcc.godbolt.org/z/7351be1v4

Seems to have never been a problem for __m256, at least not for 
__m256 zero256(){ return _mm256_setzero_ps(); }
IDK what I was looking at when I originally reported; maybe just clang which
*did* used to prefer YMM-zeroing.
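
The ZMM case the title covers, and that looks fixed in the godbolt link above,
is the analogous (a sketch):

    __m512 zero512(void) { return _mm512_setzero_ps(); }   // should zero via a 128-bit vxorps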

Some later comments suggested movdqa vs. pxor zeroing choices (and mov vs. xor
for integer), but the bug title is just AVX / AVX-512 xor-zeroing, and that
seems to be fixed.  So I think this should be closed.

[Bug tree-optimization/42587] bswap not recognized for memory

2021-05-08 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42587

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #12 from Peter Cordes  ---
(In reply to Andi Kleen from comment #11)
> Only when the first test case is fixed too

https://godbolt.org/z/7M8cx3vT1  GCC8.1 -O3 for x86-64

push    rbx
mov     ebx, edi
call    acpi_ut_track_stack_ptr
mov     eax, ebx
pop     rbx
bswap   eax
ret


The code in the initial report optimizes to bswap with GCC8.1 and later.
Is that the test case you meant?  GCC8.1 was released on May 2, 2018, well
before your Nov comment, so maybe you meant something else.
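
(For reference, a manual byte-swap of the shape involved; my reconstruction,
the PR's exact code differs:)

    extern void acpi_ut_track_stack_ptr(void);

    unsigned int to_be32(unsigned int x)      /* reconstruction, not the PR's exact code */
    {
        acpi_ut_track_stack_ptr();
        return (x >> 24) | ((x >> 8) & 0xff00u)
             | ((x << 8) & 0xff0000u) | (x << 24);   /* shift/mask bswap, now recognized */
    }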

[Bug middle-end/98801] Request for a conditional move built-in function

2021-01-25 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801

Peter Cordes  changed:

   What|Removed |Added

 CC||peter at cordes dot ca

--- Comment #5 from Peter Cordes  ---
(In reply to Richard Biener from comment #4)
> Slight complication arises because people will want to have cmoves with a
> memory destination.

Do we even want to provide this?  Most ISAs can't branchlessly conditionally
store, except via an RMW (which wouldn't be thread-safe for the no-store case
if not atomic) or something really clunky.  (Like x86  rep stos  with count=0
or 1.)

ARM predicated instructions allow branchless load or store that doesn't disturb
the memory operand (and won't even fault on a bad address).

I guess another option to emulate it could be to make a dummy local and use cmov
to select the store address (dummy vs. the real one).  But that's something users can build in
the source using a non-memory conditional-select builtin that exposes the much
more widely available ALU conditional-select functionality like x86 CMOV,
AArch64 CSEL, MIPS MOVN, etc.
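
In source, that dummy-address trick would look something like this (a sketch;
nothing guarantees the compiler actually uses CMOV for the select):

    void maybe_store(int *p, int val, int cond)
    {
        int dummy;
        int *dst = cond ? p : &dummy;   /* hopefully a cmov, selecting the store address */
        *dst = val;                     /* the store itself is unconditional */
    }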


> That won't solve the eventual request to have cmov _from_ memory ... (if we
> leave all of the memory combining to RTL people will again complain that
> it's subject to compilers discretion).

It might be sufficient for most use-cases like defending against timing
side-channels to not really try to allow conditional loads (from maybe-invalid
pointers).



I'm not sure if the motivation for this includes trying to make code without
data-dependent branching, to defend against timing side-channels.

But if we do provide something like this, people are going to want to use it
that way.  That's one case where best-effort behaviour at the mercy of the
optimizer for a ternary (or having to manually check the asm) is not great. 
Stack Overflow has gotten a few questions from people looking for guaranteed CMOV
for reasons like that.

So I think we should be wary of exposing functionality that most ISAs don't
have.  OTOH, failing to provide a way to take advantage of functionality that
some ISAs *do* have is not great, e.g. ISO C failing to provide popcnt and
bit-scan (clz / ctz) has been a problem for C for a long time.

But for something like __builtin_clz, emulating on machines that don't have
hardware support still works.  If we're trying to support a guarantee of no
data-dependent branching, that limits the emulation possibilities or makes them
clunkier.  Especially if we want to support ARM's ability to not fault / not
access memory if the condition is false.

The ALU-select part can be emulated with AND/OR, so that's something we can
provide on any target.
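
A minimal sketch of that AND/OR fallback, assuming the condition is already a
0/1 value:

    static inline long select_long(long cond01, long if_true, long if_false)
    {
        long mask = -cond01;                          /* 0 -> all zeros, 1 -> all ones */
        return (if_true & mask) | (if_false & ~mask);
    }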

Folding memory operands into a predicated load on ARM could actually introduce
data-dependent cache access, vs. an unconditional load and a predicated reg-reg
MOV.  So this becomes somewhat thorny, and some design work to figure out what
documented guarantees to provide will be necessary.  Performance use-cases
would certainly rather just have a conditional load in one instruction.

[Bug tree-optimization/98291] New: multiple scalar FP accumulators auto-vectorize worse than scalar, including vector load + merge instead of scalar + high-half insert

2020-12-15 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291

Bug ID: 98291
   Summary: multiple scalar FP accumulators auto-vectorize worse
than scalar, including vector load + merge instead of
scalar + high-half insert
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization, ssemmx
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

An FP reduction loop with 2 scalar accumulators auto-vectorizes into a mess,
instead of effectively mapping each scalar to an element of one vector
accumulator.  (Unless we use -ffast-math, then that happens.  clang gets it
right even without -ffast-math).

double dotprod(const double *a, const double *b, unsigned long long n)
{
  double d1 = 0.0;
  double d2 = 0.0;

  for (unsigned long long i = 0; i < n; i += 2) {
    d1 += a[i] * b[i];
    d2 += a[i + 1] * b[i + 1];
  }

  return (d1 + d2);
}

https://godbolt.org/z/Kq48j9

With -ffast-math the nice sane loop we expect

.L3:
movupd  (%rsi,%rax), %xmm0
movupd  (%rdi,%rax), %xmm3
addq    $1, %rdx
addq    $16, %rax
mulpd   %xmm3, %xmm0
addpd   %xmm0, %xmm1
cmpq    %rcx, %rdx
jb      .L3
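
That corresponds to keeping d1 and d2 in the two lanes of a single XMM
accumulator; written by hand with intrinsics it would be roughly the following
(a sketch, not from the PR; assumes n is even, as the original loop does):

    #include <immintrin.h>
    double dotprod_manual(const double *a, const double *b, unsigned long long n)
    {
        __m128d acc = _mm_setzero_pd();              /* lane 0 = d1, lane 1 = d2 */
        for (unsigned long long i = 0; i < n; i += 2) {
            __m128d va = _mm_loadu_pd(&a[i]);
            __m128d vb = _mm_loadu_pd(&b[i]);
            acc = _mm_add_pd(acc, _mm_mul_pd(va, vb));
        }
        __m128d hi = _mm_unpackhi_pd(acc, acc);      /* bring d2 down */
        return _mm_cvtsd_f64(_mm_add_sd(acc, hi));   /* d1 + d2 */
    }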


without: 

...
main loop
.L4:
movupd  (%rcx,%rax), %xmm1      # 16-byte load
movupd  (%rsi,%rax), %xmm3
movhpd  16(%rcx,%rax), %xmm1    # overwrite the high half of it!!
movhpd  16(%rsi,%rax), %xmm3
mulpd   %xmm3, %xmm1
movupd  16(%rsi,%rax), %xmm3
movlpd  8(%rsi,%rax), %xmm3
addsd   %xmm1, %xmm2
unpckhpd %xmm1, %xmm1
addsd   %xmm1, %xmm2
movupd  16(%rcx,%rax), %xmm1
movlpd  8(%rcx,%rax), %xmm1
addq    $32, %rax
mulpd   %xmm3, %xmm1
addsd   %xmm1, %xmm0
unpckhpd %xmm1, %xmm1
addsd   %xmm1, %xmm0
cmpq    %rdx, %rax
jne     .L4

The overall strategy is insane, but even some of the details are insane.  e.g.
a 16-byte load into XMM1, and then overwriting the high half of that with a
different double before reading it.  That's bad enough, but you'd expect movsd
/ movhpd to manually gather 2 doubles, without introducing the possibility of a
cache-line split load for zero benefit.

Similarly, movupd / movlpd should have just loaded in the other order.  (Or
since they're contiguous, movupd  8(%rsi,%rax), %xmm3 / shufpd.)

So beyond the bad overall strategy (which is likely worse than unrolled
scalar), it might be worth checking for some of this kind of smaller-scale
insanity somewhere later to make it less bad if some other inputs can trigger
similar behaviour.

(This small-scale detecting of movupd / movhpd and using movsd / movhpd could
be a separate bug, but if it's just a symptom of something that should never
happen in the first place then it's not really its own bug at all.)

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

--- Comment #1 from Peter Cordes  ---
Forgot to include https://godbolt.org/z/q44r13

[Bug target/97366] New: [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread peter at cordes dot ca via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Bug ID: 97366
   Summary: [8/9/10/11 Regression] Redundant load with SSE/AVX
vector intrinsics
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: peter at cordes dot ca
  Target Milestone: ---

When you use the same _mm_load_si128 or _mm256_load_si256 result twice,
sometimes GCC loads it *and* uses it as a memory source operand.

I'm not certain this is specific to x86 back-ends, please check bug tags if it
happens elsewhere.  (But it probably doesn't on 3-operand load/store RISC
machines; it looks like one operation chooses to load and then operate, the
other chooses to use the original source as a memory operand.)

#include <immintrin.h>
#include <stdint.h>
void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    for (unsigned i=0 ; i<1024 ; i+=16){
        __m128i in = _mm_load_si128((__m128i*)&input[i]);
        __m128i high = _mm_srli_epi32(in, 4);
        _mm_store_si128((__m128i*)&out[i], _mm_or_si128(in, high));
    }
}

gcc 8 and later -O3 -mavx2, including 11.0.0 20200920, with 

gcc_double_load_128(signed char*, signed char const*):
xorl    %eax, %eax
.L6:
vmovdqa (%rsi,%rax), %xmm1          # load
vpsrld  $4, %xmm1, %xmm0
vpor    (%rsi,%rax), %xmm0, %xmm0   # reload as a memory operand
vmovdqa %xmm0, (%rdi,%rax)
addq    $16, %rax
cmpq    $1024, %rax
jne     .L6
ret

GCC7.5 and earlier use  vpor %xmm1, %xmm0, %xmm0 to use the copy of the
original that was already loaded.

`-march=haswell` happens to fix this for GCC trunk, for this 128-bit version
but not for a __m256i version.

restrict doesn't make a difference, and there's no overlapping anyway.  The two
redundant loads happen back to back, with no store in between them.

Using a memory source operand for vpsrld wasn't an option: the form with a
memory source takes the *count* from  memory, not the data. 
https://www.felixcloutier.com/x86/psllw:pslld:psllq



Note that *without* AVX, the redundant load is a possible win, for code running
on Haswell and later Intel (and AMD) CPUs.  Possibly some heuristic is saving
instructions for the legacy-SSE case (in a way that's probably worse overall)
and hurting the AVX case.

GCC 7.5, -O3  without any -m options
gcc_double_load_128(signed char*, signed char const*):
xorl    %eax, %eax
.L2:
movdqa  (%rsi,%rax), %xmm0
movdqa  %xmm0, %xmm1        # this instruction avoided
psrld   $4, %xmm1
por     %xmm1, %xmm0        # with a memory source reload, in GCC8 and later
movaps  %xmm0, (%rdi,%rax)
addq    $16, %rax
cmpq    $1024, %rax
jne     .L2
rep ret


Using a memory-source POR saves 1 front-end uop by avoiding a register-copy, as
long as the indexed addressing mode can stay micro-fused on Intel.  (Requires
Haswell or later for that to happen, or any AMD.)  But in practice it's
probably worse.  Load-port pressure, and space in the out-of-order scheduler,
as well as code-size, is a problem for using an extra memory-source operand in
the SSE version, with the upside being saving 1 uop for the front-end.  (And
thus in the ROB.)  mov-elimination on modern CPUs means the movdqa register
copy costs no back-end resources (Ivy Bridge and bdver1 onward).

I don't know if GCC trunk is using por  (%rsi,%rax), %xmm0  on purpose for that
reason, or if it's just a coincidence.
I don't think it's a good idea on most CPUs, even if alignment is guaranteed.

This is of course 100% a loss with AVX; we have to `vmovdqa/u` load for the
shift, and it can leave the original value in a register so we're not saving a
vmovdqua.  And it's a bigger loss because indexed memory-source operands
unlaminate from 3-operand instructions even on Haswell/Skylake:
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes/31027695#31027695
so it hurts the front-end as well as wasting cycles on load ports, and taking
up space in the RS (scheduler).

The fact that -mtune=haswell fixes this for 128-bit vectors is interesting, but
it's clearly still a loss in the AVX version for all AVX CPUs.  2 memory ops /
cycle on Zen could become a bottleneck, and it's larger code size.  And
-mtune=haswell *doesn't* fix it for the -mavx2 _m256i version.

There is a possible real advantage in the SSE case, but it's very minor and
outweighed by disadvantages.  Especially for older CPUs like Nehalem that can
only do 1 load / 1 store per clock.  (Although this has so many uops in the
loop that it barely bottlenecks on that.)