[Bug target/106340] flag set from SVE svwhilelt intrinsic not reused in loop

2022-07-20 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340

Yichao Yu  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED

--- Comment #2 from Yichao Yu  ---
Over at the llvm bug report, it was pointed out to me that the standard pattern
is to branch on the result of the ptest intrinsics. That matches the flag
setting of the whilelt family of instructions better, and gcc is already able to
omit the ptest instruction in such cases.
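
For reference, a minimal sketch of what that ptest-based pattern might look like for the loop in this report. The function name `set_ptest` is made up for illustration, and the scalar fallback is only there so that the snippet also builds on targets without SVE; whether the ptest is really folded into the `whilelo` flags depends on the compiler version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical helper: stores 1 into out[0..m).
void set_ptest(uint32_t *__restrict__ out, size_t m)
{
    assert(out != nullptr || m == 0);
#if defined(__ARM_FEATURE_SVE)
    const uint64_t n = m;
    const uint64_t svelen = svcntw();
    const svuint32_t v = svdup_u32(1);
    uint64_t i = 0;
    svbool_t pg = svwhilelt_b32(i, n);
    // Branch on the predicate itself instead of on `i < m`, so that the
    // loop condition is expressed in terms of the whilelt result.
    while (svptest_any(svptrue_b32(), pg)) {
        svst1(pg, &out[i], v);
        i += svelen;
        pg = svwhilelt_b32(i, n);
    }
#else
    // Scalar fallback with the same observable behavior (assumption: only
    // present so the sketch compiles on non-SVE targets).
    for (size_t i = 0; i < m; i++)
        out[i] = 1;
#endif
}
```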

[Bug target/106324] ptrue not reused between vector instructions and predicate instructions

2022-07-18 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106324

--- Comment #3 from Yichao Yu  ---
Actually, I just realized that the not instruction used the .d version as
requested; it is the vector instruction that didn't. I got it reversed in the
original post.

[Bug target/106340] flag set from SVE svwhilelt intrinsic not reused in loop

2022-07-18 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340

--- Comment #1 from Yichao Yu  ---
Also note that this is for code I've tweaked to match the final code as
much as possible. For a complete implementation of this, I expect the loop
transformation done for normal loops to move the whilelt as well, so that
source code like the following would generate pretty much the same code.

```
void set3(uint32_t *__restrict__ out, size_t m)
{
    auto svelen = svcntw();
    auto v = svdup_u32(1);
    for (size_t i = 0; i < m; i += svelen) {
        auto pg = svwhilelt_b32(i, m);
        svst1(pg, &out[i], v);
    }
}
```

Currently, while the cmp was moved to the end of the loop body and the loop
header, the whilelt that is meant to be paired with it was not, so the flag from
the whilelt instruction isn't directly usable as is in the code.

[Bug target/106340] New: flag set from SVE svwhilelt intrinsic not reused in loop

2022-07-18 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340

Bug ID: 106340
   Summary: flag set from SVE svwhilelt intrinsic not reused in
loop
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

I'm experimenting with manually writing VLA loops and trying to match the
assembly code I expect/get from the autovectorizer. One of the main areas where
I can't get it to work is setting the loop predicate using the svwhilelt
intrinsics. The instruction it corresponds to sets the flags, which can be used
directly to terminate the loop. Indeed, when using the autovectorizer, this is
exactly what happens.

```
void set1(uint32_t *__restrict__ out, size_t m)
{
    for (size_t i = 0; i < m; i++) {
        out[i] = 1;
    }
}
```

compiles to

```
cbz     x1, .L1
mov     x2, 0
cntw    x3
whilelo p0.s, xzr, x1
mov     z0.s, #1
.p2align 3,,7
.L3:
st1w    z0.s, p0, [x0, x2, lsl 2]
add     x2, x2, x3
whilelo p0.s, x2, x1
b.any   .L3
.L1:
ret
```

(Here I believe the flag set by the loop header whilelo could also be used
for the jump, but that doesn't save much in this case.)

However, no matter how I try to replicate this with manually written code
using the SVE intrinsics, there is always an additional cmp instruction
generated. The closest I can get is by replicating the structure of the
auto-vectorized loop as much as possible with

```
void set2(uint32_t *__restrict__ out, size_t m)
{
    auto svelen = svcntw();
    auto v = svdup_u32(1);
    if (m != 0) {
        auto pg = svwhilelt_b32(0ul, m);
        for (size_t i = 0; i < m; i += svelen, pg = svwhilelt_b32(i, m)) {
            svst1(pg, &out[i], v);
        }
    }
}
```

which is compiled to

```
cbz     x1, .L9
mov     x2, 0
cntw    x3
whilelo p0.s, xzr, x1
mov     z0.s, #1
.p2align 3,,7
.L11:
st1w    z0.s, p0, [x0, x2, lsl 2]
add     x2, x2, x3
whilelo p0.s, x2, x1
cmp     x1, x2
bhi     .L11
.L9:
ret
```

which is literally the same code down to register allocation except that the
branch following the `whilelo` instruction is replaced with another comparison
and branch.

[Bug target/106329] New: No optimization for SVE pfalse predicate

2022-07-16 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106329

Bug ID: 106329
   Summary: No optimization for SVE pfalse predicate
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

If a known-all-false predicate is used on an SVE intrinsic, the result should
be a complete no-op (merging), undefined (don't-care) or zero (zeroing), and no
actual instruction (other than potentially returning a zero) should be
generated. This does not seem to be happening even when a `svpfalse_b()` is
explicitly passed in as the predicate.

As an example,

```
svfloat64_t add(svfloat64_t a, svfloat64_t b)
{
    return svadd_x(svpfalse_b(), a, b);
}
```

is being compiled to
```
pfalse  p0.b
fadd    z0.d, p0/m, z0.d, z1.d
ret
```

when it could simply be an empty function.

[Bug target/106327] New: side-effect-free _x variance not optimized to unpredicated instruction

2022-07-16 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106327

Bug ID: 106327
   Summary: side-effect-free _x variance not optimized to
unpredicated instruction
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106326 .

According to the Arm C Language Extensions for SVE, when the _x predication is
used,

> The compiler can then pick whichever form of instruction seems to give the 
> best code. This includes using unpredicated instructions, where available and 
> suitable

Because of this, I'm expecting the following to be optimized to a single add
instruction, as if a `svptrue_b64()` predicate were used.

```
svfloat64_t add(svfloat64_t a, svfloat64_t b)
{
    auto und_ok = svcmpge(svptrue_b64(), a, b);
    return svadd_x(und_ok, a, b);
}
```

However, gcc compiles this as _m and generates

```
ptrue   p0.b, all
fcmge   p0.d, p0/z, z0.d, z1.d
fadd    z0.d, p0/m, z0.d, z1.d
```

In general, is there any reason not to treat an `add_x` (and other
side-effect-free functions) with an unknown predicate as an unpredicated one?

[Bug target/106326] New: _m and _z version of SVE instrinsics not optimized to predicate-free version

2022-07-16 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106326

Bug ID: 106326
   Summary: _m and _z version of SVE instrinsics not optimized to
predicate-free version
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

The following code should generate a predicate-free fadd instruction since all
the predicates are true.

```
svfloat64_t test(svfloat64_t a, svfloat64_t b)
{
    return svadd_m(svptrue_b64(), a, b);
}
```

but gcc generates an all-true predicate and uses that instead, i.e.

```
ptrue   p0.b, all
fadd    z0.d, p0/m, z0.d, z1.d
```

The same happens for the `_z` version as well with even worse code generated.

```
ptrue   p0.b, all
movprfx z0.d, p0/z, z0.d
fadd    z0.d, p0/m, z0.d, z1.d
```

This optimization is only done for the `_x` variant. Clang optimizes this for
all variants.
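
To spell out why the predicate-free instruction is a valid replacement, here is a small scalar model of the predication forms. This is not SVE code: the fixed lane count and the names `add_m`, `add_z` and `add_unpred` are made up for illustration. With an all-true predicate every lane is active, so the merging and zeroing results coincide with the unpredicated add.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;            // hypothetical fixed lane count
using Vec  = std::array<double, kLanes>;
using Pred = std::array<bool, kLanes>;

// Merging form (_m): inactive lanes keep the value of the first operand.
Vec add_m(const Pred &pg, const Vec &a, const Vec &b)
{
    Vec r = a;
    for (std::size_t i = 0; i < kLanes; i++)
        if (pg[i])
            r[i] = a[i] + b[i];
    return r;
}

// Zeroing form (_z): inactive lanes are set to zero.
Vec add_z(const Pred &pg, const Vec &a, const Vec &b)
{
    Vec r{};
    for (std::size_t i = 0; i < kLanes; i++)
        if (pg[i])
            r[i] = a[i] + b[i];
    return r;
}

// Unpredicated form: every lane is computed.
Vec add_unpred(const Vec &a, const Vec &b)
{
    Vec r{};
    for (std::size_t i = 0; i < kLanes; i++)
        r[i] = a[i] + b[i];
    return r;
}
```

The forms only differ on inactive lanes, which is also why a partially false predicate cannot be dropped for `_m` and `_z`.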

[Bug target/106324] New: ptrue not reused between vector instructions and predicate instructions

2022-07-16 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106324

Bug ID: 106324
   Summary: ptrue not reused between vector instructions and
predicate instructions
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

The following code has two uses of `svptrue_b64()` and none of the instructions
using them should be clearing it, so only one `ptrue` instruction should be
needed.

```
svfloat64_t test(svbool_t pg, svfloat64_t a, svfloat64_t b)
{
    auto d = svdiv_m(svptrue_b64(), a, b);
    return svmul_m(svnot_z(svptrue_b64(), pg), d, d);
}
```

However, the code generated is,

```
ptrue   p2.b, all
ptrue   p1.d, all
fdiv    z0.d, p2/m, z0.d, z1.d
not     p0.b, p1/z, p0.b
fmul    z0.d, p0/m, z0.d, z0.d
ret
```

which has an extra `ptrue`.

OTOH, clang generates,

```
ptrue   p1.d
fdiv    z0.d, p1/m, z0.d, z1.d
not     p0.b, p1/z, p0.b
fmul    z0.d, p0/m, z0.d, z0.d
ret
```

and the same `ptrue` is reused in both instructions.

This seems to be caused by gcc insisting on using `svptrue_b8` for the svnot
which does not seem necessary here especially since _b64 is explicitly
requested. Changing svptrue_b64 to svptrue_b8 in the code fixes the issue.

[Bug c++/100161] New: Impossible to suppress Wtype-limits warning involving template parameter.

2021-04-20 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100161

Bug ID: 100161
   Summary: Impossible to suppress Wtype-limits warning involving
template parameter.
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

If a comparison involving a template parameter is always true or false, it
should not raise a warning if it could take other values for other template
parameters.

In particular, the type-limits warning from the code below,

```
void f(unsigned);

template<unsigned n>
void g()
{
    for (unsigned i = 0; i < n; i++) {
        f(i);
    }
}

void h()
{
g<0>();
}
```

seems to be impossible to suppress. I think this is a regression from around
the GCC 9 timeframe. (I remember seeing it at roughly the same time as, or
slightly after, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90728)

This is partially related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95148
(which would at least provide a way to suppress the warning).
Also somewhat related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81642
though supposedly the C++ template example given there is fixed.

[Bug tree-optimization/100088] New: ymm store split into two xmm stores

2021-04-14 Thread yyc1992 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100088

Bug ID: 100088
   Summary: ymm store split into two xmm stores
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

The following code

```
__attribute__((target("avx2")))
void fill_avx2(double *__restrict__ data, int n, double value)
{
    for (int i = 0; i < n * 16; i++) {
        data[i] = value;
    }
}
```

compiles to

```
fill_avx2:
sall    $4, %esi
testl   %esi, %esi
jle     .L5
shrl    $2, %esi
vbroadcastsd    %xmm0, %ymm0
movl    %esi, %eax
salq    $5, %rax
addq    %rdi, %rax
.p2align 4,,10
.p2align 3
.L3:
vmovupd %xmm0, (%rdi)
vextractf128    $0x1, %ymm0, 16(%rdi)
addq    $32, %rdi
cmpq    %rax, %rdi
jne     .L3
vzeroupper
.L5:
ret
```

Note that AFAICT

```
vmovupd %xmm0, (%rdi)
vextractf128    $0x1, %ymm0, 16(%rdi)
```

is equivalent to

```
vmovupd %ymm0, (%rdi)
```

This issue does not exist for sse or avx512f. Setting `-march=haswell` or
`-mtune=haswell` on the command line also seems to fix this but neither of
these works when added to the target attribute.

[Bug c/96990] New: Regression in aarch64 struct vector member initialization

2020-09-08 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96990

Bug ID: 96990
   Summary: Regression in aarch64 struct vector member
initialization
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

The following code used to work on gcc 9.3 but stops working with 10.2 with an
error

```
a.c: In function ‘test_aa64_vec_2’:
a.c:19:24: error: incompatible types when initializing type ‘signed char’ using
type ‘int8x8_t’
   19 | struct_aa64_3 x = {v1 + v1, v2 - v2};
  |^~
a.c:19:33: error: incompatible types when initializing type ‘signed char’ using
type ‘float32x2_t’
   19 | struct_aa64_3 x = {v1 + v1, v2 - v2};
  | ^~
```

Either of the "working" versions, or compiling as C++, works.
From the error message it seems that GCC correctly inferred the type of
`v1 + v1` and `v2 - v2` but instead got confused about the field type.
Reversing the order of `v1` and `v2` in the struct causes the error to change
to `float` instead of `signed char`, so it seems that gcc thinks the code is
trying to initialize the first element of the first vector member (with element
type `signed char` or `float`). I thought such an initialization should need an
additional `{}` instead... Given that explicit casting or compiling in C++ mode
helps, I think this is a bug...

```
#include <arm_neon.h>

typedef struct {
    int8x8_t v1;
    float32x2_t v2;
} struct_aa64_3;

struct_aa64_3 test_aa64_vec_2(int8x8_t v1, float32x2_t v2)
{
    // works
    /* int8x8_t vi8 = v1 + v1; */
    /* float32x2_t vf = v2 - v2; */
    /* struct_aa64_3 x = {vi8, vf}; */

    // works
    /* struct_aa64_3 x = {(int8x8_t)(v1 + v1), (float32x2_t)(v2 - v2)}; */

    // not
    struct_aa64_3 x = {v1 + v1, v2 - v2};
    return x;
}
```

[Bug c/96629] spurious maybe uninitialized variable warning with difficult control-flow analysis

2020-09-03 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96629

--- Comment #3 from Yichao Yu  ---
Just curious, is it some particular structure that is upsetting it, or did it
simply hit some depth limit?

[Bug c/96629] New: spurious uninitialized variable warning with branches at -O1 and higher

2020-08-16 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96629

Bug ID: 96629
   Summary: spurious uninitialized variable warning with branches
at -O1 and higher
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Reduced test code:

```
int mem(char *data);
int cond(void);
void f(char *data, unsigned idx, unsigned inc)
{
    char *d2;
    int c = cond();
    if (idx >= 2) {
        if (c)
            d2 = data;
        mem(data);
    }
    else if (inc > 3) {
        if (c)
            d2 = data;
        mem(data);
    }
    else {
        if (c) {
            d2 = data;
        }
    }
    if (*data) {
    }
    else if (c) {
        mem(d2);
    }
}
```

Compiling with `gcc -Wall -Wextra -O{1,2,s,3,fast}` warns about

```
a.c: In function 'f':
a.c:27:9: warning: 'd2' may be used uninitialized in this function
[-Wmaybe-uninitialized]
   27 | mem(d2);
  | ^~~
```

However, it should be clear that `d2` is always assigned when `c` is true. In
fact, it seems that GCC can figure this out in some cases. Changes that
suppress the warning include:

1. Remove any of the `mem(data)` calls.
2. Remove any one of the `if`s (leaving only the if or else branch
unconditionally).
3. Change the first condition to be on `inc` instead.
4. Remove the last `*data` branch.

Version tested:
AArch64: 10.2.0
ARM: 9.1.0
x86_64: 10.1.0
mingw64: 10.2.0
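
Not part of the report, but a sketch of one source-level rewrite that sidesteps the warning. Every branch of the reduced test performs the same `if (c) d2 = data;`, so the assignment can be hoisted into an initializer, which leaves no path on which `d2` is uninitialized. The definitions of `mem` and `cond` below, and the `g_*` globals they use, are hypothetical stand-ins for the external functions so that the snippet is self-contained.

```cpp
#include <cassert>

// Hypothetical stand-ins for the external functions in the test case.
static int g_cond = 0;          // value returned by cond()
static int g_mem_calls = 0;     // number of calls to mem()
static char *g_last = nullptr;  // last argument passed to mem()

int mem(char *data) { g_mem_calls++; g_last = data; return 0; }
int cond() { return g_cond; }

void f(char *data, unsigned idx, unsigned inc)
{
    int c = cond();
    // All three original branches assigned d2 under `if (c)`, so hoisting
    // the assignment preserves behavior and gives d2 an initial value.
    char *d2 = c ? data : nullptr;
    if (idx >= 2 || inc > 3)
        mem(data);
    if (*data) {
    }
    else if (c) {
        mem(d2);
    }
}
```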

[Bug rtl-optimization/96539] Unnecessary no-op copy with Os and tail call with struct argument

2020-08-11 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96539

--- Comment #4 from Yichao Yu  ---
Wow that was fast... thx.

[Bug rtl-optimization/96539] New: Unnecessary no-op copy with Os and tail call with struct argument

2020-08-08 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96539

Bug ID: 96539
   Summary: Unnecessary no-op copy with Os and tail call with
struct argument
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Test C code,

```
struct A {
    int a;
    int b;
    int c;
    int d;
    int e;
    int f;
    void *p1;
    void *p2;
    void *p3;
    void *p4;
    void *p5;
    void *p6;
    void *p7;
};

int k(int a);
int f(int a, int b, int c, void *p, struct A s);

int g(int a, int b, int c, void *p, struct A s)
{
    k(a);
    return f(a, b, c, p, s);
}
```

At `-O2`, the code produced is

```
g:
pushq   %r14
movq    %rcx, %r14
pushq   %r13
movl    %edx, %r13d
pushq   %r12
movl    %esi, %r12d
pushq   %rbp
movl    %edi, %ebp
subq    $8, %rsp
call    k@PLT
addq    $8, %rsp
movq    %r14, %rcx
movl    %r13d, %edx
movl    %r12d, %esi
movl    %ebp, %edi
popq    %rbp
popq    %r12
popq    %r13
popq    %r14
jmp     f@PLT
```

I'm not sure why it spills registers and saves the arguments in them
(maybe for latency of the final call?) but both clang and gcc do that so I
assume that's good for performance. However, when I tried `-Os`, the code
produced is,

```
g:
pushq   %r14
movq    %rcx, %r14
pushq   %r12
movl    %esi, %r12d
pushq   %rbp
movl    %edi, %ebp
subq    $16, %rsp
movl    %edx, 12(%rsp)
call    k@PLT
leaq    48(%rsp), %rdi
movl    $20, %ecx
movq    %rdi, %rsi
rep movsl
movq    %r14, %rcx
movl    %r12d, %esi
movl    %ebp, %edi
movl    12(%rsp), %edx
addq    $16, %rsp
popq    %rbp
popq    %r12
popq    %r14
jmp     f@PLT
```

AFAICT, the

```
movq    %rdi, %rsi
rep movsl
```

is basically always a no-op (moving from and to the same memory location) other
than potentially triggering a memory fault.

The memory being copied in place here is the area where the argument is stored
(80 bytes starting at `rsp + 48`), so maybe it's the copying of the argument
that failed to be removed when it became a no-op for the tail call?
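
As a quick sanity check of the numbers above: `struct A` has six `int`s and seven pointers, which is 80 bytes on x86-64, i.e. the 20 doublewords loaded into `%ecx` for `rep movsl`. The helper names `movsl_count` and `copy_onto_itself` below are made up for illustration; the second one only models the emitted copy (same source and destination) to show that it leaves the object unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct A {
    int a, b, c, d, e, f;
    void *p1, *p2, *p3, *p4, *p5, *p6, *p7;
};

// Count operand for `rep movsl`, which moves 4 bytes per iteration.
constexpr std::size_t movsl_count() { return sizeof(A) / 4; }

// Model of the emitted copy: source and destination are the same address,
// so the object is left exactly as it was.
void copy_onto_itself(A &s) { std::memmove(&s, &s, sizeof s); }

#if defined(__x86_64__)
static_assert(sizeof(A) == 80, "6 * 4 + 7 * 8 bytes, no padding");
static_assert(movsl_count() == 20, "matches `movl $20, %ecx`");
#endif
```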

At `-O1`, the code produced is

```
g:
pushq   %r13
pushq   %r12
pushq   %rbp
pushq   %rbx
subq    $8, %rsp
movl    %edi, %ebx
movl    %esi, %ebp
movl    %edx, %r12d
movq    %rcx, %r13
call    k@PLT
pushq   120(%rsp)
pushq   120(%rsp)
pushq   120(%rsp)
pushq   120(%rsp)
pushq   120(%rsp)
pushq   120(%rsp)
pushq   120(%rsp)
pushq   120(%rsp)
pushq   120(%rsp)
pushq   120(%rsp)
movq    %r13, %rcx
movl    %r12d, %edx
movl    %ebp, %esi
movl    %ebx, %edi
call    f@PLT
addq    $88, %rsp
popq%rbx
popq%rbp
popq%r12
popq%r13
ret
```
which shows the copying of 10 pointer-sized words, which is not a no-op without
the tail call.

[Bug preprocessor/96069] -ffile-prefix-map does not affect print in gfortran

2020-07-08 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069

--- Comment #8 from Yichao Yu  ---
OK, done. It would be nice to mention it on
https://gcc.gnu.org/contribute.html#patches

[Bug preprocessor/96069] -ffile-prefix-map does not affect print in gfortran

2020-07-08 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069

--- Comment #6 from Yichao Yu  ---
https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549411.html

and

https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549413.html

[Bug preprocessor/96069] -ffile-prefix-map does not affect print in gfortran

2020-07-08 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069

--- Comment #4 from Yichao Yu  ---
> Apparently it is.

Yes, but my question is about why should this be "WONTFIX". This feature
(reproducible build) is certainly as useful in fortran as it is in C family.

> Let move the component to 'preprocessor'.

At least for the issue with the fortran code I had, it doesn't seem to be in the
preprocessor. I do agree that other frontends should probably use this too but
I have no idea in which cases they should do it.

Also note that I've already submitted patches to fix this though I haven't got
a reply yet.

[Bug fortran/96069] -ffile-prefix-map does not affect print in gfortran

2020-07-08 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069

--- Comment #2 from Yichao Yu  ---
Why should this feature be c only?

[Bug fortran/96069] New: -ffile-prefix-map does not affect print in gfortran

2020-07-05 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069

Bug ID: 96069
   Summary: -ffile-prefix-map does not affect print in gfortran
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Compiling the following code `a.f`

```
  subroutine f(name)
  implicit none
  character*(*) name
  print *,name
  return
  end
```

with `gfortran -fdebug-prefix-map="${PWD}"=/usr/src/debug
-ffile-prefix-map="${PWD}"=/usr/src/debug -O3 -fPIC "${PWD}/"a.f -o - -S` will
cause the full path to the file to be included in the generated assembly
without respecting the prefix map. This works in C with `-ffile-prefix-map`,
which implies `-fmacro-prefix-map`, but the latter isn't supported by gfortran.

[Bug ipa/95775] Command line argument for target_clones?

2020-06-23 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95775

--- Comment #4 from Yichao Yu  ---
> Hey. My opinion is similar to Richi's. If you really want a highly optimized 
> library, you should rather use a dlopen mechanism with pre-built set of 
> options.

Well, a few things,

1. That sounds like an argument against `target_clones` and `target`. If
dlopen'ing different libraries is your recommended solution then none of these
would be needed.
2. The solution you propose puts all the pressure on the user of the library.
That has a few problems.

   2.1. There are strictly more users than libraries (assuming the library is
used at all), so this forces more (repeated) work to be done.
   2.2. The author of the library, and to a lesser degree the builder of the
library, has the best knowledge of the set of features that can benefit the
library and are the most useful for the deployment environment. The author of
the user code of the library, who has to implement the dispatch/loading logic,
in general has much less complete knowledge of which targets to support.
   2.3. It'll be even worse for code size since this forces each user to carry
their own library, and now all data has to be duplicated as well, in addition
to code. Also because,

3. There's no standard way of doing this AFAICT.

Now (3) is really the main point.
I'm fine with whatever mechanism that allows multiple versions of the code to
be available as long as it requires no more effort/cost from/for the user (and
to a lesser degree the author) of the library.

If one such mechanism is provided by gcc/glibc/binutils so that library writers
don't have to invent their own loading and detection mechanism and won't cause
unnecessary indirection (as cheap as ifunc) and will just work for the user to
either link or dlopen, then I think it doesn't really matter if that's backed
by one file/multiple files or whatever one can come up with.

Currently, the only mechanism available that fits this description AFAICT is
`target_clones`/`ifunc`. Unless there's a roadmap that I'm not aware of to
replace this mechanism with a similar one backed by multiple files I don't
think suggesting such a mechanism is the right approach.

Again, I said in the very first post that I totally agree this won't be the
method that gives absolutely the best performance, but neither is
`target_clones`. I also completely agree that this option can be misused and
the compiler should not do it on its own before getting smarter, but this is
far from the first option that can be misused, and given how cheap memory is
and how multiple loads of the same library don't take more memory, this isn't
even close to being the worst misuse either.
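
To make the cost argument concrete, this is roughly the kind of dispatch logic a library (or its user) ends up hand-writing when `target_clones`/`ifunc` is not used. It is only a sketch under assumptions: the function names are made up, a single AVX2 clone stands in for a real set of targets, and `__builtin_cpu_supports` is the GCC/Clang x86 builtin, so the multi-version path is compiled in only on x86-64 with those compilers.

```cpp
#include <cassert>
#include <cstddef>

// Baseline version, compiled with the translation unit's default flags.
static double sum_default(const double *x, std::size_t n)
{
    double acc = 0;
    for (std::size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

#if defined(__GNUC__) && defined(__x86_64__)
#define HAVE_AVX2_CLONE 1
// Same source, compiled for AVX2 (what a clone would contain).
__attribute__((target("avx2")))
static double sum_avx2(const double *x, std::size_t n)
{
    double acc = 0;
    for (std::size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}
#endif

using SumFn = double (*)(const double *, std::size_t);

// Hand-written resolver: pick the best version supported by this CPU.
static SumFn pick_sum()
{
#ifdef HAVE_AVX2_CLONE
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_default;
}

// Public entry point: resolves once, then calls through a pointer.
double sum(const double *x, std::size_t n)
{
    assert(x != nullptr || n == 0);
    static const SumFn impl = pick_sum();
    return impl(x, n);
}
```

Every function that should be multi-versioned needs its own copy of this boilerplate, which is the repeated work a command line option would avoid.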

[Bug c/95777] Allow specifying more than one target options at the same time in target and target_clones attribute

2020-06-22 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95777

--- Comment #3 from Yichao Yu  ---
And for backward compatibility maybe
`target_clones("(sse4.1,arch=core2),default")` would work?

[Bug c/95777] Allow specifying more than one target options at the same time in target and target_clones attribute

2020-06-22 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95777

--- Comment #2 from Yichao Yu  ---
I only tested this with `target_clones` and it seems that I misread the
documentation for `target`. So this is only an issue with the `target_clones`
attribute. `target` supports this just fine.

So to be more clear, using an example from the doc, it seems impossible to do
the equivalent of `target("sse4.1,arch=core2")` using `target_clones`. Doing
`target_clones("sse4.1,arch=core2")` will create two functions instead of one
(of course, in reality what I might actually want is to make `target_clones` do
`target("sse4.1,arch=core2")` and `target("default")`).

[Bug ipa/95775] Command line argument for target_clones?

2020-06-22 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95775

--- Comment #2 from Yichao Yu  ---
> But it will blow up code-size considerably.
> So without some major work I don't think simply slapping target_clones on 
> each function is going to fly in practice.

I mean, it'll blow up by not much more than the number of targets. I do agree
this is not something that the compiler should just do automatically,
especially not for big libraries, and the user has to ask for it.

However, I don't believe code size consumes most of the memory on any modern
desktop or server system, and when using a shared library different processes
won't even consume much more memory anyway. It's for sure still the user's
choice but OTOH I think the compiler shouldn't have to make this choice for the
user.

Additionally, there are some libraries, like math heavy ones, where virtually
every single function could benefit from this. Those are the ones that I would
like to apply this option to. I'm also hoping, and I forgot to mention this in
the first post, that this can just work on gfortran as well...

> Eventually it should be possible to do sth like target_clones(auto) where 
> with a new option, the target (or the user) can define "default" targets to 
> clone for but the user still figures which are the important functions to 
> optimize

In julia I'm currently using a simple heuristic of detecting floating point
operation, vector operation and loops...

> [and GCC may, via IPA "spread" the cloned cgraph portion a bit].

and I do this in julia too.

[Bug ipa/95796] New: Inlining works between functions with the same target attribute but not target_clones

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95796

Bug ID: 95796
   Summary: Inlining works between functions with the same target
attribute but not target_clones
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

If two functions with the same target attribute call each other, GCC can
inline one into the other (although sometimes incorrectly... PR95790). This
can be shown with the following code (all compiled using `g++ -O2 -S
-fno-exceptions -fno-asynchronous-unwind-tables`).

```
__attribute__ ((target ("default")))
static unsigned foo()
{
  return 1;
}

__attribute__ ((target ("avx")))
static unsigned foo() {
  return 1;
}

__attribute__ ((target ("default")))
unsigned bar()
{
  return foo();
}

__attribute__ ((target ("avx")))
unsigned bar()
{
  return foo();
}
```

which is compiled to

```
.text
.p2align 4
.globl  _Z3barv
.type   _Z3barv, @function
_Z3barv:
movl    $1, %eax
ret
.size   _Z3barv, .-_Z3barv
.p2align 4
.globl  _Z3barv.avx
.type   _Z3barv.avx, @function
_Z3barv.avx:
movl    $1, %eax
ret
.size   _Z3barv.avx, .-_Z3barv.avx
```

OTOH, the equivalent code using `target_clones`

```
__attribute__ ((target_clones ("default,avx")))
static unsigned foo()
{
  return 1;
}

__attribute__ ((target_clones ("default,avx")))
unsigned bar()
{
  return foo();
}
```

compiles to

```
.text
.p2align 4
.type   _ZL3foov.default.1, @function
_ZL3foov.default.1:
movl    $1, %eax
ret
.size   _ZL3foov.default.1, .-_ZL3foov.default.1
.p2align 4
.type   _Z3barv.default.1, @function
_Z3barv.default.1:
jmp _ZL3foov.default.1
.size   _Z3barv.default.1, .-_Z3barv.default.1
.p2align 4
.type   _ZL3foov.avx.0, @function
_ZL3foov.avx.0:
movl    $1, %eax
ret
.size   _ZL3foov.avx.0, .-_ZL3foov.avx.0
.p2align 4
.type   _Z3barv.avx.0, @function
_Z3barv.avx.0:
jmp _ZL3foov.avx.0
.size   _Z3barv.avx.0, .-_Z3barv.avx.0
.section .text._Z3barv.resolver,"axG",@progbits,_Z3barv.resolver,comdat
.p2align 4
.weak   _Z3barv.resolver
.type   _Z3barv.resolver, @function
_Z3barv.resolver:
subq    $8, %rsp
call    __cpu_indicator_init@PLT
movq    __cpu_model@GOTPCREL(%rip), %rax
leaq    _Z3barv.avx.0(%rip), %rdx
testb   $2, 13(%rax)
leaq    _Z3barv.default.1(%rip), %rax
cmovne  %rdx, %rax
addq    $8, %rsp
ret
.size   _Z3barv.resolver, .-_Z3barv.resolver
.globl  _Z3barv
.type   _Z3barv, @gnu_indirect_function
.set    _Z3barv,_Z3barv.resolver
.text
.p2align 4
.type   _ZL3foov.resolver, @function
_ZL3foov.resolver:
subq    $8, %rsp
call    __cpu_indicator_init@PLT
movq    __cpu_model@GOTPCREL(%rip), %rax
leaq    _ZL3foov.avx.0(%rip), %rdx
testb   $2, 13(%rax)
leaq    _ZL3foov.default.1(%rip), %rax
cmovne  %rdx, %rax
addq    $8, %rsp
ret
.size   _ZL3foov.resolver, .-_ZL3foov.resolver
```

instead, which only eliminates the indirect call but does not inline `foo` into
`bar`. (Note that the useless resolver for foo is PR95779.) I believe the two
versions should behave the same...

Ref PR95778 (PLT elimination)
Ref PR71990 (similar title but different. That one is about inlining of the
dispatcher itself IIUC and is not about the case that can already be statically
dispatched. It is also not specific to target_clones like this one is)

[Bug ipa/95790] Incorrect static target dispatch

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

--- Comment #8 from Yichao Yu  ---
And the reason I reported this as a mis-optimization rather than something
completely unsupported is the following code.

```
#include <stdio.h>

// #define disable_opt __attribute__((flatten))
#define disable_opt

disable_opt __attribute__ ((target ("default")))
static unsigned foo(const char *buf, unsigned size) {
  return 1;
}

disable_opt __attribute__ ((target ("avx")))
static unsigned foo(const char *buf, unsigned size) {
  return 2;
}

disable_opt __attribute__ ((target ("avx2")))
static unsigned foo(const char *buf, unsigned size) {
  return 3;
}

__attribute__ ((target ("default")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

__attribute__ ((target ("avx")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

int main()
{
printf("%u\n", bar());
return 0;
}
```

when compiled with `#define disable_opt`, prints the wrong answer `8192` on my
avx2 laptop. OTOH, with `#define disable_opt __attribute__((flatten))` to
disable the inlining that triggers the bug, it prints the correct result of
`12288`. Other ways to force an independent dispatch, like the following one
using a volatile slot, also work.

```
#include <stdio.h>

__attribute__ ((target ("default")))
static unsigned _foo(const char *buf, unsigned size) {
  return 1;
}

__attribute__ ((target ("avx")))
static unsigned _foo(const char *buf, unsigned size) {
  return 2;
}

__attribute__ ((target ("avx2")))
static unsigned _foo(const char *buf, unsigned size) {
  return 3;
}

static unsigned (* volatile foo)(const char *buf, unsigned size) = _foo;

__attribute__ ((target ("default")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

__attribute__ ((target ("avx")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

int main()
{
printf("%u\n", bar());
return 0;
}
```

I think this suggests that the most basic codegen without optimization is
clearly working and that this usage (be it nested multiversioning or not) isn't
something that's simply unsupported. Rather, it's only the optimization that's
wrong.
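
To spell out where the two numbers come from, here is a toy model of the two behaviours. It is only a sketch under my own assumptions and not GCC's implementation: the names (`Cpu`, `resolve_foo`, `bar_dispatched`, `bar_statically_bound`) are made up, and the "resolver" is a plain function instead of an ifunc.

```cpp
#include <cassert>

// Toy model only; none of these names exist in GCC or in the test case.
enum class Cpu { Baseline, Avx, Avx2 };

// Model of foo's resolver: the best version the CPU supports.
// The three versions of foo in the test case return 1, 2 and 3.
inline unsigned resolve_foo(Cpu cpu) {
    switch (cpu) {
    case Cpu::Avx2: return 3;
    case Cpu::Avx:  return 2;
    default:        return 1;
    }
}

// bar only has "default" and "avx" versions, so on an AVX2 CPU bar.avx runs.
inline Cpu bar_version(Cpu cpu) {
    return cpu == Cpu::Baseline ? Cpu::Baseline : Cpu::Avx;
}

// Expected behaviour: every call to foo is resolved against the real CPU.
inline unsigned bar_dispatched(Cpu cpu) {
    unsigned acc = 0;
    for (int i = 0; i < 4096; i++)
        acc += resolve_foo(cpu);
    return acc;
}

// Reported behaviour: foo is bound to the version matching bar's own target.
inline unsigned bar_statically_bound(Cpu cpu) {
    unsigned acc = 0;
    for (int i = 0; i < 4096; i++)
        acc += resolve_foo(bar_version(cpu));
    return acc;
}

// Small self-check of the two values quoted above.
inline void check_model() {
    assert(bar_dispatched(Cpu::Avx2) == 12288);       // 3 * 4096
    assert(bar_statically_bound(Cpu::Avx2) == 8192);  // 2 * 4096
}
```

On a CPU without AVX2 the two functions agree, which is why the difference only shows up when the callee has a version that the caller lacks.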

[Bug ipa/95790] Incorrect static target dispatch

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

--- Comment #7 from Yichao Yu  ---
> Your testcase has nested function multi-versioning.  I don't think it works
at all.  I opened PR 95793.

I'm sorry, but what is nested function multi-versioning? And what's the
difference between the test case here and the one in PR95793?

[Bug ipa/95790] Incorrect static target dispatch

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

--- Comment #5 from Yichao Yu  ---
It’s wrong when running on a target that has avx512f. The unoptimized version
will call the correct foo but the optimized version won’t.

As I said, this is an issue when the full set of targets differs between the
callee and the caller.

[Bug ipa/95790] Incorrect static target dispatch

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

--- Comment #3 from Yichao Yu  ---
And the assembly showing the correct dispatch is


    .file   "a.c"
    .text
    .p2align 4
    .type   _ZL3fooPKcj, @function
_ZL3fooPKcj:
.LFB0:
    .cfi_startproc
    movl    $1, %eax
    ret
    .cfi_endproc
.LFE0:
    .size   _ZL3fooPKcj, .-_ZL3fooPKcj
    .p2align 4
    .type   _ZL3fooPKcj.avx, @function
_ZL3fooPKcj.avx:
.LFB1:
    .cfi_startproc
    movl    $2, %eax
    ret
    .cfi_endproc
.LFE1:
    .size   _ZL3fooPKcj.avx, .-_ZL3fooPKcj.avx
    .p2align 4
    .type   _ZL3fooPKcj.avx512f, @function
_ZL3fooPKcj.avx512f:
.LFB2:
    .cfi_startproc
    movl    $3, %eax
    ret
    .cfi_endproc
.LFE2:
    .size   _ZL3fooPKcj.avx512f, .-_ZL3fooPKcj.avx512f
    .section    .text.unlikely,"ax",@progbits
.LCOLDB0:
    .text
.LHOTB0:
    .p2align 4
    .type   _ZL3fooPKcj.resolver, @function
_ZL3fooPKcj.resolver:
.LFB6:
    .cfi_startproc
    subq    $8, %rsp
    .cfi_def_cfa_offset 16
    call    __cpu_indicator_init@PLT
    movq    __cpu_model@GOTPCREL(%rip), %rax
    movl    12(%rax), %eax
    testb   $-128, %ah
    je  .L8
    leaq    _ZL3fooPKcj.avx512f(%rip), %rax
.L7:
    addq    $8, %rsp
    .cfi_def_cfa_offset 8
    ret
    .cfi_endproc
    .section    .text.unlikely
    .cfi_startproc
    .type   _ZL3fooPKcj.resolver.cold, @function
_ZL3fooPKcj.resolver.cold:
.LFSB6:
.L8:
    .cfi_def_cfa_offset 16
    testb   $2, %ah
    leaq    _ZL3fooPKcj.avx(%rip), %rdx
    leaq    _ZL3fooPKcj(%rip), %rax
    cmovne  %rdx, %rax
    jmp .L7
    .cfi_endproc
.LFE6:
    .text
    .size   _ZL3fooPKcj.resolver, .-_ZL3fooPKcj.resolver
    .section    .text.unlikely
    .size   _ZL3fooPKcj.resolver.cold, .-_ZL3fooPKcj.resolver.cold
.LCOLDE0:
    .text
.LHOTE0:
    .type   _Z11_ZL3fooPKcjPKcj, @gnu_indirect_function
    .set    _Z11_ZL3fooPKcjPKcj,_ZL3fooPKcj.resolver
    .p2align 4
    .globl  _Z3barv
    .type   _Z3barv, @function
_Z3barv:
.LFB3:
    .cfi_startproc
    pushq   %r12
    .cfi_def_cfa_offset 16
    .cfi_offset 12, -16
    xorl    %r12d, %r12d
    pushq   %rbp
    .cfi_def_cfa_offset 24
    .cfi_offset 6, -24
    pushq   %rbx
    .cfi_def_cfa_offset 32
    .cfi_offset 3, -32
    subq    $4112, %rsp
    .cfi_def_cfa_offset 4144
    movq    %fs:40, %rax
    movq    %rax, 4104(%rsp)
    xorl    %eax, %eax
    movq    %rsp, %rbx
    leaq    4096(%rsp), %rbp
    .p2align 4,,10
    .p2align 3
.L12:
    movq    %rbx, %rdi
    movl    $1, %esi
    addq    $1, %rbx
    call    _Z11_ZL3fooPKcjPKcj@PLT
    addl    %eax, %r12d
    cmpq    %rbp, %rbx
    jne .L12
    movq    4104(%rsp), %rax
    subq    %fs:40, %rax
    jne .L16
    addq    $4112, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 32
    movl    %r12d, %eax
    popq    %rbx
    .cfi_def_cfa_offset 24
    popq    %rbp
    .cfi_def_cfa_offset 16
    popq    %r12
    .cfi_def_cfa_offset 8
    ret
.L16:
    .cfi_restore_state
    call    __stack_chk_fail@PLT
    .cfi_endproc
.LFE3:
    .size   _Z3barv, .-_Z3barv
    .p2align 4
    .globl  _Z3barv.avx
    .type   _Z3barv.avx, @function
_Z3barv.avx:
.LFB4:
    .cfi_startproc
    pushq   %r12
    .cfi_def_cfa_offset 16
    .cfi_offset 12, -16
    xorl    %r12d, %r12d
    pushq   %rbp
    .cfi_def_cfa_offset 24
    .cfi_offset 6, -24
    pushq   %rbx
    .cfi_def_cfa_offset 32
    .cfi_offset 3, -32
    subq    $4112, %rsp
    .cfi_def_cfa_offset 4144
    movq    %fs:40, %rax
    movq    %rax, 4104(%rsp)
    xorl    %eax, %eax
    movq    %rsp, %rbx
    leaq    4096(%rsp), %rbp
    .p2align 4,,10
    .p2align 3
.L18:
    movq    %rbx, %rdi
    movl    $1, %esi
    addq    $1, %rbx
    call    _Z11_ZL3fooPKcjPKcj@PLT
    addl    %eax, %r12d
    cmpq    %rbp, %rbx
    jne .L18
    movq    4104(%rsp), %rax
    subq    %fs:40, %rax
    jne .L22
    addq    $4112, %rsp
    .cfi_remember_state
    .cfi_def_cfa_offset 32
    movl    %r12d, %eax
    popq    %rbx
    .cfi_def_cfa_offset 24
    popq    %rbp
    .cfi_def_cfa_offset 16
    popq    %r12
    .cfi_def_cfa_offset 8
    ret
.L22:
    .cfi_restore_state
    call    __stack_chk_fail@PLT
    .cfi_endproc
.LFE4:
    .size   _Z3barv.avx, .-_Z3barv.avx
    .ident  "GCC: (GNU) 10.1.0"
    .section    .note.GNU-stack,"",@progbits

[Bug ipa/95790] Incorrect static target dispatch

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

--- Comment #2 from Yichao Yu  ---
The C++ code attached above produces the following incorrect code with `g++ -O2
-S`

    .file   "a.c"
    .text
    .p2align 4
    .globl  _Z3barv
    .type   _Z3barv, @function
_Z3barv:
.LFB3:
    .cfi_startproc
    movl    $4096, %eax
    ret
    .cfi_endproc
.LFE3:
    .size   _Z3barv, .-_Z3barv
    .p2align 4
    .globl  _Z3barv.avx
    .type   _Z3barv.avx, @function
_Z3barv.avx:
.LFB4:
    .cfi_startproc
    movl    $8192, %eax
    ret
    .cfi_endproc
.LFE4:
    .size   _Z3barv.avx, .-_Z3barv.avx
    .ident  "GCC: (GNU) 10.1.0"
    .section    .note.GNU-stack,"",@progbits



Triggering the bug PR95778 with

__attribute__ ((flatten,target ("default")))
static unsigned foo(const char *buf, unsigned size) {
  return 1;
}

__attribute__ ((flatten,target ("avx")))
static unsigned foo(const char *buf, unsigned size) {
  return 2;
}

__attribute__ ((flatten,target ("avx512f")))
static unsigned foo(const char *buf, unsigned size) {
  return 3;
}

__attribute__ ((target ("default")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

__attribute__ ((target ("avx")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

produces the correct code.

[Bug other/95778] target_clones indirection eliminates requires noinline

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778

--- Comment #4 from Yichao Yu  ---
Yeah, after digging further, the two issues are indeed the same. I initially
didn't think they were, since I wasn't aware of PR95786 (that the visibility
attribute is simply ignored completely...) and thought static was handled
specially.

It also seems that when the target attribute is used directly the inlining can
work. Maybe a pass order issue? That's certainly a different issue, so I'll
file another one, if there isn't one already, when I have time.

[Bug ipa/95790] New: Incorrect static target dispatch

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

Bug ID: 95790
   Summary: Incorrect static target dispatch
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

The indirection elimination code currently only checks that the target of the
specific version matches but doesn't check whether all the targets match.

Modifying from
https://github.com/gcc-mirror/gcc/commit/b8ce8129a560f64f8b2855c4a3812b7c3c0ebf3f#diff-e2d535917af8555baad2e9c8749e96a5

```
__attribute__ ((target ("default")))
static unsigned foo(const char *buf, unsigned size) {
  return 1;
}

__attribute__ ((target ("avx")))
static unsigned foo(const char *buf, unsigned size) {
  return 2;
}

__attribute__ ((target ("avx512f")))
static unsigned foo(const char *buf, unsigned size) {
  return 3;
}

__attribute__ ((target ("default")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}

__attribute__ ((target ("avx")))
unsigned bar() {
  char buf[4096];
  unsigned acc = 0;
  for (int i = 0; i < sizeof(buf); i++) {
    acc += foo(&buf[i], 1);
  }
  return acc;
}
```

With the optimization disabled, which is possible by adding a flatten attribute
to the functions and triggering PR95780 and PR95778, a resolver function is
automatically generated for foo like

```
    .text
.LHOTB0:
    .p2align 4
    .type   _ZL3fooPKcj.resolver, @function
_ZL3fooPKcj.resolver:
    subq    $8, %rsp
    call    __cpu_indicator_init@PLT
    movq    __cpu_model@GOTPCREL(%rip), %rax
    movl    12(%rax), %eax
    testb   $-128, %ah
    je  .L8
    leaq    _ZL3fooPKcj.avx512f(%rip), %rax
.L7:
    addq    $8, %rsp
    ret
    .section    .text.unlikely
    .type   _ZL3fooPKcj.resolver.cold, @function
_ZL3fooPKcj.resolver.cold:
.L8:
    testb   $2, %ah
    leaq    _ZL3fooPKcj.avx(%rip), %rdx
    leaq    _ZL3fooPKcj(%rip), %rax
    cmovne  %rdx, %rax
    jmp .L7
    .text
    .size   _ZL3fooPKcj.resolver, .-_ZL3fooPKcj.resolver
    .section    .text.unlikely
    .size   _ZL3fooPKcj.resolver.cold, .-_ZL3fooPKcj.resolver.cold
.LCOLDE0:
    .text
.LHOTE0:
    .type   _Z11_ZL3fooPKcjPKcj, @gnu_indirect_function
    .set    _Z11_ZL3fooPKcjPKcj,_ZL3fooPKcj.resolver
```

and the calls from bar go through the PLT. This is the correct behavior
(albeit sub-optimal since the default could call the default directly) and
allows the avx512f version of foo to be called on the correct processor from
the avx version of bar.

With the optimization enabled, however, the calls to foo are inlined into bar
and the avx512f version is never used.

This is somewhat a regression caused by
b8ce8129a560f64f8b2855c4a3812b7c3c0ebf3f.

It'll also affect my fix for PR95780 and PR95778.
https://gcc.gnu.org/pipermail/gcc-patches/2020-June/548631.html
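
To make the missing check concrete, here is a rough sketch of the condition I have in mind. It is a hypothetical model and not the code in gcc/multiple_target.c: `TargetSet` and `can_bind_directly` are names I made up, and targets are modelled as plain strings.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model; the real pass works on cgraph nodes and attribute lists.
using TargetSet = std::set<std::string>;

// A caller version may be bound directly to the callee version with the same
// target only if the callee has that version AND both functions are versioned
// for exactly the same set of targets. Otherwise the callee may have a better
// version (e.g. avx512f) that only the resolver can pick.
inline bool can_bind_directly(const TargetSet &caller_versions,
                              const TargetSet &callee_versions,
                              const std::string &caller_target) {
    return callee_versions.count(caller_target) != 0 &&
           caller_versions == callee_versions;
}

// The situation from the test case above.
inline void check_binding_model() {
    const TargetSet bar = {"default", "avx"};
    const TargetSet foo = {"default", "avx", "avx512f"};
    assert(!can_bind_directly(bar, foo, "avx"));  // must go through the resolver
    assert(can_bind_directly(bar, bar, "avx"));   // identical sets: direct call is fine
}
```

This is deliberately conservative: as noted above, the default version could still call the default version directly, which this sketch does not try to capture.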

[Bug tree-optimization/95786] New: Too aggressive target indirection elimination

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95786

Bug ID: 95786
   Summary: Too aggressive target indirection elimination
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

I noticed this issue while debugging PR95778 and PR95780 (ref
https://gcc.gnu.org/pipermail/gcc-patches/2020-June/548631.html)

It seems that the indirection elimination logic does not take into account the
linkage and visibility of the callee and will eliminate the indirection even in
cases where a function without a target attribute would have used the PLT,
which, for example, allows an override from a different library.

The following code generates a direct call between g2 and f2 without going
through the PLT.

```
__attribute__((target_clones("default,avx2"))) int f2(int *p)
{
asm volatile ("" :: "r"(p) : "memory");
return *p;
}

__attribute__((target_clones("default,avx2"))) int g2(int *p)
{
return f2(p);
}
```

but with the target_clones attribute removed, the call uses the PLT.

[Bug other/95778] target_clones indirection eliminates requires noinline

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778

--- Comment #2 from Yichao Yu  ---
Also, the original code example had an error; the code that works properly was

```
static __attribute__((noinline,target_clones("default,avx2"))) int f2(int *p)
{
asm volatile ("" :: "r"(p) : "memory");
return *p;
}

__attribute__((noinline,target_clones("default,avx2"))) int g2(int *p)
{
return f2(p);
}
```

[Bug other/95778] target_clones indirection eliminates requires noinline

2020-06-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778

--- Comment #1 from Yichao Yu  ---
Ah, I think this might be the fix for both this issue and
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95780 . I'll test more and will
try to submit it later.

```
diff --git a/gcc/multiple_target.c b/gcc/multiple_target.c
index c1cfe8ff978..79a4c87545f 100644
--- a/gcc/multiple_target.c
+++ b/gcc/multiple_target.c
@@ -483,7 +483,7 @@ redirect_to_specific_clone (cgraph_node *node)
DECL_ATTRIBUTES (e->callee->decl));

   /* Function is not calling proper target clone.  */
-  if (!attribute_list_equal (attr_target, attr_target2))
+  if (!attribute_value_equal (attr_target, attr_target2))
{
  while (fv2->prev != NULL)
fv2 = fv2->prev;
@@ -494,7 +494,7 @@ redirect_to_specific_clone (cgraph_node *node)
  cgraph_node *callee = fv2->this_node;
  attr_target2 = lookup_attribute ("target",
   DECL_ATTRIBUTES (callee->decl));
- if (attribute_list_equal (attr_target, attr_target2))
+ if (attribute_value_equal (attr_target, attr_target2))
{
  e->redirect_callee (callee);
  cgraph_edge::redirect_call_stmt_to_callee (e);
```

[Bug other/95781] New: Missing dead code elimination when a recursive function is inlined.

2020-06-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95781

Bug ID: 95781
   Summary: Missing dead code elimination when a recursive
function is inlined.
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Code,

```
static int 2(int *p, int k)
{
int res = 0;
if (k > 0)
res += 2(p, k - 1);
return *p + res;
}

int g2(int *p)
{
return 2(p, 3);
}
```

Compiling with -O3 the code produced for `g2` is

```
g2:
    movl    (%rdi), %eax
    sall    $2, %eax
    ret
```

i.e. `*p * 4` that doesn't need to call `2`. However, the code for `2`
is still generated even though it is never used.

It seems that this only happens when the recursive function is sufficiently
complex. Replacing `*p` with a constant or making the `k > 0` branch return
directly produces code that does not have `2` in it. It seems that there's
some smart late optimization pass that doesn't have a global DCE pass
afterwards?

Looks similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80680 but I'm not
sure if they have the same root cause.
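
For reference, the `*p * 4` result can be checked by hand: with `k = 3` the recursion adds `*p` once per level for k = 3, 2, 1 and 0. A small sketch of the same recursion, where `rec` and `g2_model` are stand-in names of mine and not the names from the test case:

```cpp
#include <cassert>

// Stand-in for the recursive helper in the report; the name is made up.
static int rec(const int *p, int k) {
    int res = 0;
    if (k > 0)
        res += rec(p, k - 1);
    return *p + res;  // one *p per recursion level
}

// Same shape as g2 in the report: depth 3 means four levels in total.
static int g2_model(const int *p) {
    return rec(p, 3);
}

static void check_fold() {
    int x = 5;
    assert(g2_model(&x) == 4 * x);  // matches the `sall $2` in the assembly
}
```

So the folded body of `g2` is what I'd expect; the only problem is the leftover out-of-line copy of the helper.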

[Bug other/95780] New: target_clones treats internal visibility different from static functions

2020-06-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95780

Bug ID: 95780
   Summary: target_clones treats internal visibility different
from static functions
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Again using the code in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778. If
the static function `f2` is changed to `visibility("internal")`, i.e.

```

__attribute__((visibility("internal"),noinline,target_clones("default,avx2")))
int f2(int *p)
{
asm volatile ("" :: "r"(p) : "memory");
return *p;
}

__attribute__((noinline,target_clones("default,avx2"))) int g2(int *p)
{
return f2(p);
}
```

the call to `f2` will then use the PLT again. Without `target_clones`, the two
have similar effects and both produce a direct call.

[Bug other/95779] New: Unnecessary dispatch function for static target_clones function.

2020-06-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95779

Bug ID: 95779
   Summary: Unnecessary dispatch function for static target_clones
function.
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Using the code in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778, the full
assembly generated (for the version with both noinline, unwind info disabled) is

```
    .file   "b.c"
    .text
    .p2align 4
    .type   f2.default.1, @function
f2.default.1:
    movl    (%rdi), %eax
    ret
    .size   f2.default.1, .-f2.default.1
    .p2align 4
    .type   g2.default.1, @function
g2.default.1:
    jmp f2.default.1
    .size   g2.default.1, .-g2.default.1
    .p2align 4
    .type   f2.avx2.0, @function
f2.avx2.0:
    movl    (%rdi), %eax
    ret
    .size   f2.avx2.0, .-f2.avx2.0
    .p2align 4
    .type   g2.avx2.0, @function
g2.avx2.0:
    jmp f2.avx2.0
    .size   g2.avx2.0, .-g2.avx2.0
    .section    .text.g2.resolver,"axG",@progbits,g2.resolver,comdat
    .p2align 4
    .weak   g2.resolver
    .type   g2.resolver, @function
g2.resolver:
    subq    $8, %rsp
    call    __cpu_indicator_init@PLT
    movq    __cpu_model@GOTPCREL(%rip), %rax
    leaq    g2.avx2.0(%rip), %rdx
    testb   $4, 13(%rax)
    leaq    g2.default.1(%rip), %rax
    cmovne  %rdx, %rax
    addq    $8, %rsp
    ret
    .size   g2.resolver, .-g2.resolver
    .globl  g2
    .type   g2, @gnu_indirect_function
    .set    g2,g2.resolver
    .text
    .p2align 4
    .type   f2.resolver, @function
f2.resolver:
    subq    $8, %rsp
    call    __cpu_indicator_init@PLT
    movq    __cpu_model@GOTPCREL(%rip), %rax
    leaq    f2.avx2.0(%rip), %rdx
    testb   $4, 13(%rax)
    leaq    f2.default.1(%rip), %rax
    cmovne  %rdx, %rax
    addq    $8, %rsp
    ret
    .size   f2.resolver, .-f2.resolver
    .ident  "GCC: (GNU) 10.1.0"
    .section    .note.GNU-stack,"",@progbits
```

AFAICT the `f2.resolver` is never used anywhere and can be omitted (all callers
of `f2` are statically dispatched).

[Bug other/95778] New: target_clones indirection eliminates requires noinline

2020-06-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778

Bug ID: 95778
   Summary: target_clones indirection eliminates requires noinline
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Compiling

```
static __attribute__((noinline,target_clones("default,avx2"))) int f2(int *p)
{
asm volatile ("" :: "r"(p) : "memory");
return *p;
}

__attribute__((target_clones("default,avx2"))) int g2(int *p)
{
return f2(p);
}
```

with `-fPIC -O3` generates


```
g2.avx2.0:
jmp f2.avx2.0
```

However, if either of the two `noinline` is removed, the generated code becomes,

```
g2.avx2.0:
jmp f2@PLT
```

which cannot get eliminated later
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95776

I think this should be possible to do, and possible without LTO (hence
a slightly different bug than
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95776 even though, if that one is
fixed, turning on LTO can partially fix this).

Also, in this case, the `f2` should be inlinable to `g2`. However, no
combination of `inline`, `always_inline`, `flatten` I've tested can do that,
even though when both functions are marked with `noinline` gcc clearly knows
which function is calling what so it should have no problem inlining.

[Bug c/95777] New: Allow specifying more than one target options at the same time in target and target_clones attribute

2020-06-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95777

Bug ID: 95777
   Summary: Allow specifying more than one target options at the
same time in target and target_clones attribute
   Product: gcc
   Version: 10.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Currently it seems (from the documentation and my own tests) that only a single
option is allowed for each version of the function using `target` and
`target_clones`. This can be a problem for options that are not strict subsets
of each other (e.g. the AVX512 ones IIUC). Of course specifying `cpu=haswell`
and `cpu=skylake` for the same target doesn't make much sense, so some checking
should be in place, but I believe specifying multiple directly testable
features at the same time should be allowed.

A related issue is that, while one can indeed do some of this by specifying an
`arch=`, even if the runtime CPU supports all the features it'll
still not get selected if the name doesn't exactly match (tested with
`arch=haswell` on my Kaby Lake laptop). If a fallback could be implemented to
make this work, that would also be good enough for me at least...
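
For comparison, this is roughly what testing several features at once looks like when done by hand with the x86 `__builtin_cpu_supports` builtin. It is only a sketch of the manual alternative, not what I'm asking the attributes to expand to; `pick_version` and the avx2+fma pairing are just an example I picked.

```cpp
#include <cassert>

// Hypothetical helper: returns 3 when both AVX2 and FMA are available,
// 2 when only AVX2 is, and 1 otherwise (including on non-x86 targets).
inline int pick_version() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  // only strictly needed in constructors/resolvers
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return 3;
    if (__builtin_cpu_supports("avx2"))
        return 2;
#endif
    return 1;
}

// The result depends on the machine, so only the range can be checked.
inline void check_pick_version() {
    int v = pick_version();
    assert(v >= 1 && v <= 3);
}
```

Since each feature is tested directly, this does not depend on the CPU name matching, which is the behaviour I'd like to get from `target`/`target_clones` without writing the dispatch myself.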

[Bug lto/95776] New: Reduce indirection with target_clones at link time (with LTO)

2020-06-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95776

Bug ID: 95776
   Summary: Reduce indirection with target_clones at link time
(with LTO)
   Product: gcc
   Version: 10.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Currently, if a function is not visible outside the final library (static,
or internal or hidden visibility), the call through the PLT will be replaced
with a direct call to the function.

With target_clones, this is also possible within the same compilation unit for
static functions as callees. The caller that has the same cloning attribute
will simply call the cloned function without indirection.

However, this stops working when the two are combined. Even with the maximum
options and attributes to help it (hidden visibility, same compilation unit,
-Wl,-Bsymbolic, LTO), the call to the cloned function from a caller with a
matching cloning attribute still goes through the PLT.

Test code

```
__attribute__((noinline,visibility("hidden"))) int f1(int *p)
{
asm volatile ("" :: "r"(p) : "memory");
return *p;
}

__attribute__((noinline,visibility("hidden"),target_clones("default,avx2")))
int f2(int *p)
{
asm volatile ("" :: "r"(p) : "memory");
return *p;
}

__attribute__((noinline)) int g1(int *p)
{
return f1(p);
}

__attribute__((noinline,target_clones("default,avx2"))) int g2(int *p)
{
return f2(p);
}
```

Compiled with `-fPIC -flto -O3 -Wl,-Bsymbolic -shared`. The `f1` call calls
`f1` directly whereas the two cloned `f2` calls both call `f2@plt`.

The same also applies to inlining, target_clones kills inlining even with lto
on.

I assume this happens because this can only be done at link time which either
didn't get passed enough info to determine this or simply didn't get
implemented? I assume this should be possible since it can be done within a
single compilation unit.

[Bug target/95775] New: Command line argument for target_clones?

2020-06-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95775

Bug ID: 95775
   Summary: Command line argument for target_clones?
   Product: gcc
   Version: 10.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Would it make sense to add a command line argument that is roughly equivalent
to adding `target_clones` to all functions?

In terms of usefulness, I believe it will be a very cheap way for many
libraries to turn on the support with minimal code change. It certainly won't
be as optimized as possible, but neither is the target_clones attribute itself
compared to hand-written implementations using compiler
intrinsics/assembly...

In terms of implementation, I believe most of the issues I've hit when adding
such an attribute to functions have been fixed, so I have little issue using it
now. It'll also be a new feature, so it shouldn't really break any existing code.

And for further improvement, the compiler should have fair knowledge of which
instructions can be/have been used and can omit some of the cloning in order to
reduce code size. I don't think this needs to be included in the first version
though...

And IIUC this is something that icc does automatically? (If that can serve as an
argument for this feature...)

[Bug lto/94659] New: Missing symbol with LTO and target_clones

2020-04-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94659

Bug ID: 94659
   Summary: Missing symbol with LTO and target_clones
   Product: gcc
   Version: 9.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

This is basically the same as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732 except now it only happens
with LTO enabled.

It seems that if a function with `target_clones` attribute isn't used in the
final library and if LTO is enabled, the function will be missing from the
resulting library. Only the `.resolver` symbol appears.

The test code is

```
// b.c

__attribute__((target_clones("default,avx")))
int f1()
{
return 2;
}
```

when compiled with `gcc -g -flto -O3 -fPIC b.c -shared -o libb-lto.so`, the
exported symbols available are,

```
$ objdump -T libb-lto.so

libb-lto.so: file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
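0000000000000000  w   D  *UND*  0000000000000000              _ITM_deregisterTMCloneTable
0000000000000000  w   D  *UND*  0000000000000000              __gmon_start__
0000000000000000  w   D  *UND*  0000000000000000              _ITM_registerTMCloneTable
0000000000000000  w   DF *UND*  0000000000000000  GLIBC_2.2.5 __cxa_finalize
0000000000001730 g    DF .text  000000000000002b  Base        f1.resolver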
```

Compared to the output library from `gcc -g -O3 -fPIC b.c -shared -o libb.so`

```
$ objdump -T libb.so

libb.so: file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
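0000000000000000  w   D  *UND*  0000000000000000              _ITM_deregisterTMCloneTable
0000000000000000  w   D  *UND*  0000000000000000              __gmon_start__
0000000000000000  w   D  *UND*  0000000000000000              _ITM_registerTMCloneTable
0000000000000000  w   DF *UND*  0000000000000000  GLIBC_2.2.5 __cxa_finalize
0000000000001730  w   DF .text  000000000000002b  Base        f1.resolver
0000000000001730 g   iD  .text  000000000000002b  Base        f1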
```

The exported symbol has the wrong name for the LTO version. `dlsym` result
confirms the difference.

If the function is used somewhere else in the library, the resulting symbols
will then look the same as in the non-LTO version.

[Bug ipa/94656] New: target_clones on alias leads to segfault in the compiler

2020-04-18 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94656

Bug ID: 94656
   Summary: target_clones on alias leads to segfault in the
compiler
   Product: gcc
   Version: 9.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
  Target Milestone: ---

Compiling the following code with `gcc -c` leads to a segfault in the compiler
targetclone pass.

```
__attribute__((target_clones("default,avx")))
void f1()
{
}

__attribute__((target_clones("default,avx"))) void f2()
__attribute__((alias("f1")));
```

The error was.

```
during IPA pass: targetclone
src/s_nextafterl.c:6:1: internal compiler error: Segmentation fault
    6 | __attribute__((target_clones("default,avx"))) void f2() __attribute__((alias("f1")));
      | ^
```

Now this came from a hack I was playing around with and I'm not going to argue
whether having target_clones on an alias should be supported (though if the
target agrees I think it would be nice to support it... and if not, a warning
would be better IMHO). However, I don't think a segfault is what should happen
here = =

[Bug libstdc++/92759] New: Typo in libstdcxx/v6/xmethods.py

2019-12-02 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92759

Bug ID: 92759
   Summary: Typo in libstdcxx/v6/xmethods.py
   Product: gcc
   Version: 9.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

I get the following warning when running gdb/rr.

```
/usr/lib/../share/gcc-9.2.0/python/libstdcxx/v6/xmethods.py:731: SyntaxWarning:
list indices must be integers or slices, not str; perhaps you missed a comma?
  refcounts = ['_M_refcount']['_M_pi']
```

Looking at the
[code](https://github.com/gcc-mirror/gcc/blob/daa87973f7a00bf3bb81d0644dd60f4efb83bb65/libstdc%2B%2B-v3/python/libstdcxx/v6/xmethods.py#L731)
I think that line should read

```
refcounts = obj['_M_refcount']['_M_pi']
```

instead.

I could submit a patch but I feel like it'll be faster/easier for someone here
to just fix this

[Bug target/54412] minimal 32-byte stack alignment with -mavx on 64-bit Windows

2019-08-25 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

--- Comment #29 from Yichao Yu  ---
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412#c25

GCC is fully capable of aligning the stack. It just seems that different parts
of it disagree on what the current stack alignment is and whether a
realignment is needed.

[Bug target/90826] Weak symbol does not work reliably on windows

2019-06-10 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90826

--- Comment #2 from Yichao Yu  ---
Also, I just upgraded the compiler on this computer from 7.x to 9.1.0. The
issue appeared before the upgrade as well but I didn't investigate until the
upgrade finished.

[Bug target/90826] Weak symbol does not work reliably on windows

2019-06-10 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90826

--- Comment #1 from Yichao Yu  ---
Oh, forgot to mention that the first assembly was generated with -O3 and adding
`.weak f` to the generated file fixes the issue as well.

[Bug target/90826] New: Weak symbol does not work reliably on windows

2019-06-10 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90826

Bug ID: 90826
   Summary: Weak symbol does not work reliably on windows
   Product: gcc
   Version: 9.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

The following code does not link correctly at any optimization level on
windows with the mingw-w64-x86_64-g++ compiler.

```
#include <stdint.h>

extern "C" void f() __attribute__((weak));

int main()
{
return (int)(uintptr_t)f;
}

```

The assembly generated is

```
    .file   "weak.cpp"
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section    .text.startup,"x"
    .p2align 4
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
.LFB1:
    subq    $56, %rsp
    .seh_stackalloc 56
    .seh_endprologue
    call    __main
    movq    .refptr.f(%rip), %rax
    movq    %rax, 40(%rsp)
    addq    $56, %rsp
    ret
    .seh_endproc
    .ident  "GCC: (Rev2, Built by MSYS2 project) 9.1.0"
    .section    .rdata$.refptr.f, "dr"
    .globl  .refptr.f
    .linkonce   discard
.refptr.f:
    .quad   f

```

and the error,

```
C:\msys64\tmp\ccQkPfOi.o:weak.cpp:(.rdata$.refptr.f[.refptr.f]+0x0): undefined
reference to `f'
```

This should not happen since `f` is declared weak. (I realize that the symbol
resolution happens at compile time for weak symbols, which is fine for me, but I
just want it to compile...)

Another case where the optimization actually makes this work is,

```
#include <stdio.h>

extern "C" void f() __attribute__((weak));

int main()
{
printf("%p\n", f);
return 0;
}
```

With -O0, the assembly generated is

```
    .file   "weak.cpp"
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section .rdata,"dr"
.LC0:
    .ascii "%p\12\0"
    .text
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
.LFB28:
    pushq   %rbp
    .seh_pushreg    %rbp
    movq    %rsp, %rbp
    .seh_setframe   %rbp, 0
    subq    $32, %rsp
    .seh_stackalloc 32
    .seh_endprologue
    call    __main
    movq    .refptr.f(%rip), %rdx
    leaq    .LC0(%rip), %rcx
    call    printf
    movl    $0, %eax
    addq    $32, %rsp
    popq    %rbp
    ret
    .seh_endproc
    .ident  "GCC: (Rev2, Built by MSYS2 project) 9.1.0"
    .def    printf; .scl    2;  .type   32; .endef
    .section    .rdata$.refptr.f, "dr"
    .globl  .refptr.f
    .linkonce   discard
.refptr.f:
    .quad   f
```

with error,

```
C:\msys64\tmp\ccTiwMKh.o:weak.cpp:(.rdata$.refptr.f[.refptr.f]+0x0): undefined
reference to `f'
```

with -O1 or higher, the assembly produced is,
```
    .file   "weak.cpp"
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section .rdata,"dr"
.LC0:
    .ascii "%p\12\0"
    .text
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
.LFB30:
    subq    $40, %rsp
    .seh_stackalloc 40
    .seh_endprologue
    call    __main
    leaq    f(%rip), %rdx
    leaq    .LC0(%rip), %rcx
    call    printf
    movl    $0, %eax
    addq    $40, %rsp
    ret
    .seh_endproc
    .weak   f
    .ident  "GCC: (Rev2, Built by MSYS2 project) 9.1.0"
    .def    f;  .scl    2;  .type   32; .endef
    .def    printf; .scl    2;  .type   32; .endef
    .section    .rdata$.refptr.f, "dr"
    .globl  .refptr.f
    .linkonce   discard
.refptr.f:
    .quad   f
```


The difference between the two assembly outputs is

```
--- weak1.s 2019-06-10 19:42:27.039467600 -0400
+++ weak0.s 2019-06-10 19:42:23.709467500 -0400
@@ -9,21 +9,24 @@
        .def    main;   .scl    2;      .type   32;     .endef
        .seh_proc       main
 main:
-.LFB30:
-       subq    $40, %rsp
-       .seh_stackalloc 40
+.LFB28:
+       pushq   %rbp
+       .seh_pushreg    %rbp
+       movq    %rsp, %rbp
+       .seh_setframe   %rbp, 0
+       subq    $32, %rsp
+       .seh_stackalloc 32
        .seh_endprologue
        call    __main
-       leaq    f(%rip), %rdx
+       movq    .refptr.f(%rip), %rdx
        leaq    .LC0(%rip), %rcx
        call    printf
        movl    $0, %eax
-       addq    $40, %rsp
+       addq    $32, %rsp
+       popq    %rbp
        ret
        .seh_endproc
-       .weak
```

[Bug c/90728] New: False positive Wmemset-elt-size with zero size array

2019-06-03 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90728

Bug ID: 90728
   Summary: False positive Wmemset-elt-size with zero size array
   Product: gcc
   Version: 9.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

The code below comes from a template expansion (when a certain cache feature is
disabled) and all the operations on the `buff` member are no-ops.

```
#include <string.h>

struct A {
    A()
    {
        memset(&buff, 0xff, sizeof(buff));
    }

    int buff[0];
};
```

However, this starts to raise a warning on GCC 9:

```
a.cpp: In constructor 'A::A()':
a.cpp:8:41: warning: 'memset' used with length equal to number of elements
without multiplication by element size [-Wmemset-elt-size]
    8 |         memset(&buff, 0xff, sizeof(buff));
      |                                         ^
```

It seems that the warning logic simply compares the sizes (as well as checking
element size != 1) without taking into account the 0 size case.
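
To make the suspected logic concrete, here is a simplified model of the check as described above, next to a variant that skips zero-element arrays. This is only a sketch of the reported behaviour, not GCC's actual implementation; `warns_naive` and `warns_guarded` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model: warn when the memset length equals the number of
 * elements and the element size is not 1. */
bool warns_naive(size_t len, size_t nelts, size_t elt_size)
{
    return elt_size != 1 && len == nelts;
}

/* Same check, but a zero-element array never triggers the warning. */
bool warns_guarded(size_t len, size_t nelts, size_t elt_size)
{
    return nelts != 0 && warns_naive(len, nelts, elt_size);
}
```

For `int buff[0]`, both the length and the element count are 0, so the naive comparison fires even though no multiplication by the element size could change the result.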

[Bug tree-optimization/89582] Suboptimal code generated for floating point struct in -O3 compare to -O2

2019-04-04 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582

--- Comment #6 from Yichao Yu  ---
For the vfloat test case, isn't the optimum code just

```
addps   %xmm2, %xmm0
addps   %xmm3, %xmm1
retq
```

It's not making full use of the vector but I assume not having to spill is a
win? This is what clang produces.

And for the LLVM early lowering of the calling convention, a less awkward way
is.

```
define { <2 x float>, <2 x float> } @f2({<2 x float>, <2 x float>}, {<2 x float>, <2 x float>}) {
  %v0 = extractvalue { <2 x float>, <2 x float> } %0, 0
  %v1 = extractvalue { <2 x float>, <2 x float> } %0, 1
  %v2 = extractvalue { <2 x float>, <2 x float> } %1, 0
  %v3 = extractvalue { <2 x float>, <2 x float> } %1, 1
  %v5 = fadd <2 x float> %v0, %v2
  %v6 = fadd <2 x float> %v1, %v3
  %v7 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %v5, 0
  %v8 = insertvalue { <2 x float>, <2 x float> } %v7, <2 x float> %v6, 1
  ret { <2 x float>, <2 x float> } %v8
}
```

[Bug target/89606] Extra mov after structure load instructions on aarch64

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89606

--- Comment #1 from Yichao Yu  ---
Compiled a GCC 9 snapshot for pr89607 and the issue is still present.

[Bug target/89607] Missing optimization for store of multiple registers on aarch64

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

--- Comment #8 from Yichao Yu  ---
I see. I don't imagine this to cause a major local speed up though I assume it
should at least not be slower? That's also why I mentioned that this should at
least be done for `-Os`.

[Bug target/89607] Missing optimization for store of multiple registers on aarch64

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

--- Comment #6 from Yichao Yu  ---
> For aarch64, there was talk about adding stp for q registers.

What do you mean? I was initially unsure about it too but I assume it already
exists since clang (and now GCC 9) emits it and the arm arch reference manual
also mentions it without saying that it is only available in a later version.

[Bug target/89607] Missing optimization for store of multiple registers on aarch64

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

--- Comment #5 from Yichao Yu  ---
I just compiled the 9-20190303 snapshot and this indeed seems to be fixed.
Should this be closed now or after GCC 9 is released?

[Bug target/89607] Missing optimization for store of multiple registers on arm and aarch64

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

--- Comment #3 from Yichao Yu  ---
Done pr89614

[Bug target/89614] New: Missing optimization for store of multiple registers on arm

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89614

Bug ID: 89614
   Summary: Missing optimization for store of multiple registers
on arm
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Separated from pr89607 as requested. Test code and result compiled with any
non-zero optimization levels,

```
#include <arm_neon.h>

void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}

void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```

```
f4:
        vld2.32 {d16-d19}, [r1]
        vst1.64 {d16-d19}, [r0:64]
        bx      lr
f5:
        vst1.64 {d0-d1}, [r0:64]
        vstr    d2, [r0, #16]
        vstr    d3, [r0, #24]
        bx      lr
```

I believe `f5` should use a single `vst1.64 {d0-d3}, [r0:64]` just like `f4`.

If for some reason doing that is bad for performance (doubt it...) it should at
least be used for -Os.

[Bug target/89607] Missing optimization for store of multiple registers on arm and aarch64

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

--- Comment #2 from Yichao Yu  ---
Sure. I'll do that.

[Bug target/89607] New: Missing optimization for store of multiple registers on arm and aarch64

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

Bug ID: 89607
   Summary: Missing optimization for store of multiple registers
on arm and aarch64
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Test code, compiled for arm/aarch64 with -O1/-O2/-O3/-Os/-Ofast

```
#include <arm_neon.h>

void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}

void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```

arm:

```
f4:
        vld2.32 {d16-d19}, [r1]
        vst1.64 {d16-d19}, [r0:64]
        bx      lr
f5:
        vst1.64 {d0-d1}, [r0:64]
        vstr    d2, [r0, #16]
        vstr    d3, [r0, #24]
        bx      lr
```

aarch64:

```
f4:
ld2 {v0.4s - v1.4s}, [x1]
str q0, [x0]
str q1, [x0, 16]
ret
f5:
str q0, [x0]
str q1, [x0, 16]
ret
```

For arm, it seems that f5 could follow f4 and use a `vst1.64 {d0-d3}, [r0:64]`
instead. For aarch64, both functions should have used a `stp q0, q1, [x0]`.

Clang produces what I expected on aarch64 but it only uses pair store
instructions on arm, which uses one more instruction for `f4` and one fewer for
`f5`. (I'm not sure why GCC decided to use a pair store and then two single
stores.)

Similar to pr89606, this optimization should at least happen with `-Os` if not
for all other optimization levels.

Tested with 8.2.1 on arm and 8.3.0 on aarch64.

[Bug target/89606] New: Extra mov after structure load instructions on aarch64

2019-03-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89606

Bug ID: 89606
   Summary: Extra mov after structure load instructions on aarch64
   Product: gcc
   Version: 8.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Code to reproduce,

```
#include <arm_neon.h>

#ifdef __aarch64__
float64x2x2_t f(const double *p1, const double *p2)
{
    float64x2x2_t v = vld2q_f64(p1);
    return vld2q_lane_f64(p2, v, 1);
}

float32x2x2_t f2(const float *p1, const float *p2)
{
    float32x2x2_t v = vld2_f32(p1);
    return vld2_lane_f32(p2, v, 1);
}
#endif

void f3(float32x2x2_t *p, const float *p1, const float *p2)
{
    float32x2x2_t v = vld2_f32(p1);
    *p = vld2_lane_f32(p2, v, 1);
}
```

GCC produces (aarch64, -O1/-O2/-O3/-Ofast/-Os),

```
f:
ld2 {v4.2d - v5.2d}, [x0]
mov v0.16b, v4.16b
mov v1.16b, v5.16b
ld2 {v0.d - v1.d}[1], [x1]
ret
f2:
ld2 {v0.2s - v1.2s}, [x0]
mov v2.8b, v0.8b
mov v3.8b, v1.8b
ld2 {v2.s - v3.s}[1], [x1]
mov v1.8b, v3.8b
mov v0.8b, v2.8b
ret
f3:
ld2 {v2.2s - v3.2s}, [x1]
mov v0.8b, v2.8b
mov v1.8b, v3.8b
ld2 {v0.s - v1.s}[1], [x2]
stp d0, d1, [x0]
ret
```

For all three functions, none of the mov's seem necessary. Even if there's
some performance issue when reusing the registers (I highly doubt it...) at
least the `-Os` version should not have those mov's.

Clang produces what I expect in this case,

```
f:
ld2 { v0.2d, v1.2d }, [x0]
ld2 { v0.d, v1.d }[1], [x1]
ret
f2:
ld2 { v0.2s, v1.2s }, [x0]
ld2 { v0.s, v1.s }[1], [x1]
ret
f3:
ld2 { v0.2s, v1.2s }, [x1]
ld2 { v0.s, v1.s }[1], [x2]
stp d0, d1, [x0]
ret
```

Aarch32 doesn't have this issue either with GCC,

```
f3:
vld2.32 {d16-d17}, [r1]
vld2.32 {d16[1], d17[1]}, [r2]
vst1.64 {d16-d17}, [r0:64]
bx  lr
```

so this seems to be aarch64 specific.

[Bug target/89597] New: Inconsistent vector calling convention on windows with Clang and MSVC

2019-03-05 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89597

Bug ID: 89597
   Summary: Inconsistent vector calling convention on windows with
Clang and MSVC
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

For 256bit and 512bit vector return values, Clang and MSVC always return them
in the corresponding registers even without `__vectorcall`. GCC, however,
returns the value by reference. Together with the missing support of
`__vectorcall`[1], this means that the code GCC generates for functions that
return vector values is not compatible with any other compiler. The problem
does not exist for 128bit vectors.

Test case.

```
typedef double vdouble __attribute__((vector_size(32)));
vdouble f(vdouble x, vdouble y)
{
return x + y;
}
```

GCC compiles this to,

```
f:
        vmovapd (%r8), %ymm0
        vaddpd  (%rdx), %ymm0, %ymm0
        movq    %rcx, %rax
        vmovapd %ymm0, (%rcx)
        vzeroupper
        ret
```

Clang compiles this to,

```
f:
vmovapd (%rcx), %ymm0
vaddpd  (%rdx), %ymm0, %ymm0
retq
```

Given the stack alignment issue[2], I wonder if this can be fixed now without
breaking anyone's code. (i.e. everyone that's using it is probably broken
anyway due to the other bug...)

Disclaimer: I did all my tests with clang. I believe MSVC behaves the same,
based on the compiled result I got from someone else; I don't have MSVC to
test it personally.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89485
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

[Bug target/89581] Unneeded stack alignment on windows x86

2019-03-04 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581

--- Comment #1 from Yichao Yu  ---
The problem is still there when compiled with -O2

```
f:
        pushq   %rbp
        vmovq   (%r8), %xmm1
        movq    %rcx, %rax
        vmovq   8(%r8), %xmm0
        vaddsd  (%rdx), %xmm1, %xmm1
        vaddsd  8(%rdx), %xmm0, %xmm0
        movq    %rsp, %rbp
        andq    $-16, %rsp
        vmovsd  %xmm1, (%rcx)
        vmovsd  %xmm0, 8(%rcx)
        leave
        ret
```


but is not there under `-O2` when the arguments and results are passed
explicitly by reference.

```
void f2(vdouble *res, const vdouble *x, const vdouble *y)
{
*res = (vdouble){x->x1 + y->x1, x->x2 + y->x2};
}
```


```
f2:
vmovsd  8(%rdx), %xmm0
vmovsd  (%rdx), %xmm1
vaddsd  8(%r8), %xmm0, %xmm0
vaddsd  (%r8), %xmm1, %xmm1
vmovsd  %xmm0, 8(%rcx)
vmovsd  %xmm1, (%rcx)
```

The problem comes back, however, with the explicit pass by reference version
when compiled under -O3

```
f2:
pushq   %rbp
vmovapd (%rdx), %xmm0
vaddpd  (%r8), %xmm0, %xmm0
movq%rsp, %rbp
andq$-16, %rsp
vmovaps %xmm0, (%rcx)
leave
ret
```

[Bug target/89582] New: Suboptimal code generated for floating point struct in -O3 compare to -O2

2019-03-04 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582

Bug ID: 89582
   Summary: Suboptimal code generated for floating point struct in
-O3 compare to -O2
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

When testing the code for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581 on
linux, I noticed that the generated code seems suboptimal when compiled under
-O3 rather than -O2 on x64.

```
typedef struct {
double x1;
double x2;
} vdouble __attribute__((aligned(16)));

vdouble f(vdouble x, vdouble y)
{
return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}
```

Compiled with `-O2` produces
```
f:
addsd   %xmm3, %xmm1
addsd   %xmm2, %xmm0
ret
```

With `-O3` or `-Ofast`, however, the code produced is,

```
f:
        movq    %xmm0, -40(%rsp)
        movq    %xmm1, -32(%rsp)
        movapd  -40(%rsp), %xmm4
        movq    %xmm2, -24(%rsp)
        movq    %xmm3, -16(%rsp)
        addpd   -24(%rsp), %xmm4
        movaps  %xmm4, -40(%rsp)
        movsd   -32(%rsp), %xmm1
        movsd   -40(%rsp), %xmm0
        ret
```

It seems that gcc tries to use the vector instruction but has to go through the
stack for that. I did a quick benchmark which confirms that the -O3 version is
much slower than the -O2 version.

Clang produces

```
f:
addsd   %xmm2, %xmm0
addsd   %xmm3, %xmm1
retq
```

As long as any optimizations are on, which seems appropriate.

[Bug target/89581] New: Unneeded stack alignment on windows x86

2019-03-04 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581

Bug ID: 89581
   Summary: Unneeded stack alignment on windows x86
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

On windows, when compiling the following code with ` gcc -mavx2 a.c -o - -S -O3
-g0 -fno-asynchronous-unwind-tables -fomit-frame-pointer -Wall -Wextra`

```
typedef struct {
double x1;
double x2;
} vdouble __attribute__((aligned(16)));

vdouble f(vdouble x, vdouble y)
{
return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}
```

I got

```
        pushq   %rbp
        vmovdqa (%r8), %xmm0
        movq    %rcx, %rax
        vaddpd  (%rdx), %xmm0, %xmm0
        movq    %rsp, %rbp
        andq    $-16, %rsp
        vmovaps %xmm0, (%rcx)
        leave
        ret
```

which includes 4 extra instructions to align the stack without actually using
it.

FWIW, clang has a similar problem on linux...
https://bugs.llvm.org/show_bug.cgi?id=40844

Also worth noting that with -O2 all three vector instructions are split into
scalar ones whereas clang does this transformation at -O2...

[Bug target/54412] minimal 32-byte stack alignment with -mavx on 64-bit Windows

2019-02-27 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

--- Comment #24 from Yichao Yu  ---
Oh, and the test case above was compiled with -O3 (and -g -Wall -Wextra).

[Bug target/54412] minimal 32-byte stack alignment with -mavx on 64-bit Windows

2019-02-27 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

Yichao Yu  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |yyc1992 at gmail dot com

--- Comment #23 from Yichao Yu  ---
> It is GCC does not realign the stack at all that is the issue.

I hit another related issue that might confirm this as well.

I noticed this when I tried to manually align the stack with inline assembly.

C++ code reduced from my test case,

```
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

__attribute__((target("avx")))
__attribute__((noinline)) __m256d f(__m256d x, uint32_t a, const double *p)
{
__m256d res;
asm volatile ("vxorpd %0, %0, %0" :
  "=x"(res), "+x"(x), "+r"(a), "+r"(p) ::
  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
  "r11", "rbp");
return res;
}

__attribute__((target("avx")))
__attribute__((noinline)) __m256d f2(__m256d x, uint32_t a, const double *p)
{
__m256d res;
asm volatile ("vxorpd %0, %0, %0" :
  "=x"(res), "+x"(x), "+r"(a), "+r"(p) ::
  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
  "r11", "rbp");
return res;
}

__attribute__((target("avx")))
__attribute__((noinline)) __m256d f(__m256d x, __m256d y, __m256d z,
uint32_t a, const double *p)
{
__m256d res;
asm volatile ("vxorpd %0, %0, %0" :
  "=x"(res), "+x"(x), "+x"(y), "+x"(z), "+r"(a), "+r"(p) ::
  "memory", "rax", "rcx", "rdx", "r8", "r9", "r10",
  "r11", "rbp");
return res;
}

const double points[] = {0, 0.1, 0.2, 0.6};

__attribute__((target("avx"))) void test_avx()
{
f(__m256d{0, 0, 0, 0}, __m256d{0, 0, 0, 0},
   __m256d{0, 0, 0, 0}, 4, points);
f(__m256d{0, 0, 0, 0}, 4, points);
}

__attribute__((target("avx"))) void test_avx2()
{
f2(__m256d{0, 0, 0, 0}, 4, points);
}

static void call_aligned_stack(void (*p)(void))
{
asm volatile ("movq %%rsp, %%rbp\n"
  "andq $-64, %%rsp\n"
  "subq $64, %%rsp\n"
  "callq *%0\n"
  "movq %%rbp, %%rsp\n"
  :: "r"(p)
  : "memory", "rax", "rcx", "rdx", "r8", "r9", "r10", "r11",
"rbp");
}

int main()
{
call_aligned_stack(test_avx);
fprintf(stderr, "\n");
fflush(stderr);
call_aligned_stack(test_avx2);
return 0;
}
```

(The `fprintf` is there only to make it easier to see when the crash happens.)
The stack alignment code makes sure that the stack is aligned to 64 bytes before
making the `call`, which I verified in the debugger. However, when compiled
with GCC 8.2.1 on msys2 (using the mingw-w64-x86_64-gcc package), the `test_avx`
function is happy while the `test_avx2` function is not.

Looking at the generated code, for the crashing function:

```
004015c0 <_Z9test_avx2v>:
  4015c0:   48 83 ec 68             sub    $0x68,%rsp
  4015c4:   c5 f9 57 c0             vxorpd %xmm0,%xmm0,%xmm0
  4015c8:   4c 8d 0d 51 7a 00 00    lea    0x7a51(%rip),%r9        # 409020 <_ZL6points>
  4015cf:   41 b8 04 00 00 00       mov    $0x4,%r8d
  4015d5:   48 8d 4c 24 40          lea    0x40(%rsp),%rcx
  4015da:   48 8d 54 24 20          lea    0x20(%rsp),%rdx
  4015df:   c5 fd 29 44 24 20       vmovapd %ymm0,0x20(%rsp)
  4015e5:   c5 f8 77                vzeroupper
  4015e8:   e8 a3 ff ff ff          callq  401590 <_Z2f2Dv4_djPKd>
  4015ed:   90                      nop
  4015ee:   48 83 c4 68             add    $0x68,%rsp
  4015f2:   c3                      retq
```

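which tries to do a 32-byte-aligned write at a stack offset, relative to the
stack pointer at the initial call instruction, of -8 - 0x68 + 0x20 = -80. Since
the stack pointer was 64-byte aligned at that point, this address is only
16-byte aligned.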
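
The arithmetic can be checked with a small sketch. It assumes, as described above, that the stack pointer is suitably aligned immediately before the `callq`; `store_offset` and `misalignment` are hypothetical helper names.

```c
/* Offset of the vmovapd destination relative to the stack pointer just
 * before the call into test_avx2: 8 bytes for the return address,
 * 0x68 for the frame, then the 0x20(%rsp) displacement. */
long store_offset(void)
{
    return -8 - 0x68 + 0x20;
}

/* Distance of `offset` above the next lower multiple of `align`
 * (0 means aligned), assuming the reference stack pointer is itself
 * a multiple of `align`. */
long misalignment(long offset, long align)
{
    return ((offset % align) + align) % align;
}
```

Here `misalignment(store_offset(), 32)` is 16, so the `vmovapd %ymm0` store faults, while the same offset would be fine for a 16-byte-aligned access.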

OTOH, for the "good" function,

```
00401640 <_Z8test_avxv>:
  401640:   57                      push   %rdi
  401641:   56                      push   %rsi
  401642:   53                      push   %rbx
  401643:   48 81 ec b0 00 00 00    sub    $0xb0,%rsp
  40164a:   c5 d9 57 e4             vxorpd %xmm4,%xmm4,%xmm4
  40164e:   48 8d 3d cb 79 00 00    lea    0x79cb(%rip),%rdi        # 40902
```

[Bug c/89485] New: Support vectorcall calling convention on windows

2019-02-24 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89485

Bug ID: 89485
   Summary: Support vectorcall calling convention on windows
   Product: gcc
   Version: 8.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

I'm very surprised that I didn't find an issue for this, so sorry if this was
discussed/rejected somewhere else.

It appears that both MSVC and clang support a vectorcall calling convention
which is very similar to the one used on linux and passes large vectors in the
corresponding vector registers instead of on the stack. It would be nice if gcc
could support that, both for efficiency and for compatibility.

Ref
https://docs.microsoft.com/en-us/cpp/cpp/vectorcall?view=vs-2017
https://clang.llvm.org/docs/AttributeReference.html#id335

[Bug target/82641] Unable to enable crc32 for a certain function with target attribute on ARM (aarch32)

2018-01-30 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641

--- Comment #20 from Yichao Yu  ---
Just want to mention that the lack of a way to locally change the arch settings
without lying to the compiler is exactly why I reported this issue.

[Bug target/83110] Relocation error when taking address of protected function in shared library.

2017-11-23 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83110

--- Comment #2 from Yichao Yu  ---
What might be invalid about the source?

[Bug target/83110] New: Relocation error when taking address of protected function in shared library.

2017-11-22 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83110

Bug ID: 83110
   Summary: Relocation error when taking address of protected
function in shared library.
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

This is very similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65248
although that one is marked as fixed.
(This could be a dup of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19520 but
I can't really tell...)

The difference from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65248 is that
this now only happens for me with protected functions and not global variables.

The code to reproduce is

```
__attribute__((visibility("protected")))
void f()
{
}

// __attribute__((visibility("protected")))
// int f;

void f2(void (*cb)(void*))
{
    cb((void*)&f);
}
```

Which gives the error
```
% LANG=C g++ a.cpp -o liba.so -pthread -fPIC -shared
/bin/ld: /tmp/ccvUACGZ.o: relocation R_X86_64_PC32 against protected symbol
`_Z1fv' can not be used when making a shared object
/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
```

[Bug target/82641] Unable to enable crc32 for a certain function with target attribute on ARM (aarch32)

2017-11-02 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641

--- Comment #7 from Yichao Yu  ---
It would be great if `+crc` can work if it's not ambiguous. Requiring
`arch=armv8-a+crc` works for me too, and it'll just require more preprocessor
checks.
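
A minimal sketch of the kind of preprocessor check this implies, assuming the predefined macros `__aarch64__` and `__arm__` are used to tell the two targets apart. `CRC_TARGET` and `crc_target_string` are hypothetical names, and other profiles (armv8-r, armv8-m) would need their own branches.

```c
/* Pick the target string for a CRC-enabled function on each architecture. */
#if defined(__aarch64__)
#  define CRC_TARGET "+crc"              /* feature-only form, accepted on AArch64 */
#elif defined(__arm__)
#  define CRC_TARGET "arch=armv8-a+crc"  /* full arch string on 32-bit ARM */
#else
#  define CRC_TARGET ""                  /* not an ARM target: nothing to enable */
#endif

/* Returns the string selected above, e.g. for logging or build diagnostics. */
const char *crc_target_string(void)
{
    return CRC_TARGET;
}
```

On an ARM build the macro would then be used as `__attribute__((target(CRC_TARGET)))` on the function that contains the CRC inline assembly.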

[Bug target/82641] Unable to enable crc32 for a certain function with target attribute on ARM (aarch32)

2017-10-24 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641

--- Comment #3 from Yichao Yu  ---
> ARMv8-a is the only architecture variant where the CRC extension is optional

Not really. There's also armv8-r and armv8-m. Also, I believe code compiled for
armv7-a can run on armv8-a hardware and can also optionally enable armv8
features including CRC extension. I was hoping that GCC can be smart enough to
enable the correct armv8 variant automatically.

Test case is just

```
#include <stdint.h>

#pragma GCC push_options
#pragma GCC target("armv8-a+crc")
__attribute__((target("armv8-a+crc")))
uint32_t crc32cw(uint32_t crc, uint32_t val)
{
    uint32_t res;
    /* asm(".arch armv8-a"); */
    /* asm(".arch_extension crc"); */
    asm("crc32cw %0, %1, %2" : "=r"(res) : "r"(crc), "r"(val));
    /* asm(".arch armv7-a"); */
    return res;
}
#pragma GCC pop_options
```

Compiled with either armv7-a or armv8-a march.

[Bug target/82641] Unable to enable crc32 for a certain function with target attribute on ARM

2017-10-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641

--- Comment #1 from Yichao Yu  ---
I've found a workaround in
https://sourceware.org/ml/binutils/2017-04/msg00171.html but it's extremely
ugly (albeit also very clever...).

[Bug target/82641] New: Unable to enable crc32 for a certain function with target attribute on ARM

2017-10-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641

Bug ID: 82641
   Summary: Unable to enable crc32 for a certain function with
target attribute on ARM
   Product: gcc
   Version: 7.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

The assembler complains about the target not supporting CRC32 instructions for
certain (generic) targets on ARM and AArch64. On AArch64, this can be lifted
with the `target("+crc")` attribute (or pragma though I've only tested the
function attribute) when writing inline assembly code that uses non-default
processor features and cpu-feature dispatch. However, none of these approaches
works on ARM.

There are multiple issues when trying to do this,

1. "+crc" is not accepted as a feature on ARM (32bit), not even when `march` is
set to `armv8-a`. OTOH, "armv8-a+crc" works though that makes supporting
different arch profiles harder...

2. No `.arch` or `.arch_extension` directives are generated in the assembly,
which causes the assembler to complain. This is the case for either function
attribute or pragma.

I've tried manually adding a `.arch armv8-a` and a `.arch_extension crc`
before the function that uses the `crc32` instruction and then resetting it
with `.arch armv7-a` in the assembly code, and it behaves correctly, so I
believe this should be fixable on the GCC side.

[Bug target/80732] target_clones does not work with dlsym

2017-06-19 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732

--- Comment #9 from Yichao Yu  ---
Thanks for the fix!

Does it fix https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78366 at the same
time?

[Bug target/80732] target_clones does not work with dlsym

2017-05-17 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732

--- Comment #6 from Yichao Yu  ---
Good to know. Thanks.

[Bug target/80732] target_clones does not work with dlsym

2017-05-17 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732

--- Comment #4 from Yichao Yu  ---
`double (*pf1)(double, double, double) = dlsym(hdl, "f1.ifunc");`

Wouldn't it be better if GCC generated local functions `f1.default`, `f1.fma`
as implementations and `f1` to replace `f1.ifunc`? It's quite inconvenient if
this detail is exposed.

If one has to use `f1.ifunc`, does it also mean that the declaration of the
function in the header must also have `target_clones` applied?

[Bug target/80732] New: target_clones does not work with dlsym

2017-05-12 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732

Bug ID: 80732
   Summary: target_clones does not work with dlsym
   Product: gcc
   Version: 6.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Compile the code below to an executable with
`gcc -Wall -Wextra -O3 -fPIC -ldl -rdynamic`. On a haswell+ system, the output is

```
1:
0, 4.93038e-32, 0
2:
4.93038e-32, 4.93038e-32, 4.93038e-32
```

This shows that with the manually created ifunc, dlsym, a direct function call,
and taking the function address all produce the same result (the fma version),
whereas with `target_clones` only the direct function call uses the fma version.

This might be related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78366 but
I'm not entirely sure. From that bug report I can understand that this is just
how `target_clones` is currently implemented but I do think this is not a
documentation issue and should be fixed / improved instead since

1. in this case there is user observable inconsistency in the result generated
when different code paths are used. The fast math object should be allowed to
produce slightly inaccurate result but I do think it should produce consistent
result every time the function is called.

2. probably more importantly, this behavior makes the `target_clones` attribute
useless for use in a public interface if the shared library can ever be
dynamically loaded.

```
#include <dlfcn.h>
#include <stdio.h>

__attribute__((target_clones("default","fma"),noinline,optimize("fast-math")))
double f1(double a, double b, double c)
{
return a * b + c;
}

double k1(double a, double b, double c, void **p)
{
*p = f1;
return f1(a, b, c);
}

__attribute__((target("fma"),optimize("fast-math")))
static double f2_fma(double a, double b, double c)
{
return a * b + c;
}

__attribute__((optimize("fast-math")))
static double f2_default(double a, double b, double c)
{
return a * b + c;
}

static void *f2_resolve(void)
{
__builtin_cpu_init ();
if (__builtin_cpu_supports("fma"))
return f2_fma;
else
return f2_default;
}

double f2(double a, double b, double c) __attribute__((ifunc("f2_resolve")));

double k2(double a, double b, double c, void **p)
{
*p = f2;
return f2(a, b, c);
}

int main()
{
volatile double a = 1.0002;
volatile double b = -0.9998;
volatile double c = 1.0;

void *hdl = dlopen(NULL, RTLD_NOW);

printf("1:\n");
double (*pf1)(double, double, double) = dlsym(hdl, "f1");
double (*pk1)(double, double, double, void**) = dlsym(hdl, "k1");
double (*_pf1)(double, double, double);

double v1_1 = pf1(a, b, c);
double v1_2 = pk1(a, b, c, (void**)&_pf1);
double v1_3 = _pf1(a, b, c);
printf("%g, %g, %g\n", v1_1, v1_2, v1_3);

printf("2:\n");
double (*pf2)(double, double, double) = dlsym(hdl, "f2");
double (*pk2)(double, double, double, void**) = dlsym(hdl, "k2");
double (*_pf2)(double, double, double);

double v2_1 = pf2(a, b, c);
double v2_2 = pk2(a, b, c, (void**)&_pf2);
double v2_3 = _pf2(a, b, c);
printf("%g, %g, %g\n", v2_1, v2_2, v2_3);

return 0;
}
```

[Bug target/77728] [5/6 Regression] Miscompilation multiple vector iteration on ARM

2017-04-25 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728

--- Comment #48 from Yichao Yu  ---
Thanks for fixing this. I didn't follow all the comments since I'm not familiar
with the C++ ABI, so just to make sure I understand what's happening: is the
bug caused by an inconsistency in the C++ ABI for certain classes, which can
happen on both ARM and AArch64 (although not for AArch64 in this case)?

Is this now fixed for gcc 7+ for both ARM and AArch64? (Should this be closed
now or only when there's a release?) And btw, when is the estimated release
time of 7.1?

[Bug target/77728] [5/6/7 Regression] Miscompilation multiple vector iteration on ARM

2017-03-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728

--- Comment #6 from Yichao Yu  ---
Anything new here?

[Bug target/77728] [5/6/7 Regression] Miscompilation multiple vector iteration on ARM

2017-01-13 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728

--- Comment #5 from Yichao Yu  ---
Ping again? Anything new, or anything I can help with here?

[Bug middle-end/77996] Miscompilation due to LTO on aarch64

2016-10-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996

--- Comment #12 from Yichao Yu  ---
Since the LLVM miscompilation isn't fixed, is there any way to check the alias
assumptions more programmatically? (I can see that the TrailingObject might
easily introduce something like this but given the complexity it's a little
hard for me to see if that's actually the case.)

[Bug target/77728] [5/6/7 Regression] Miscompilation multiple vector iteration on ARM

2016-10-20 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728

--- Comment #4 from Yichao Yu  ---
Ping. Anything I can help with debugging this?

[Bug middle-end/77996] Miscompilation due to LTO on aarch64

2016-10-16 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996

--- Comment #11 from Yichao Yu  ---
The case pointed out is fixed in https://reviews.llvm.org/rL284336 although as
expected that doesn't fix the error. Still not sure whose bug this is...

[Bug middle-end/77996] Miscompilation due to LTO on aarch64

2016-10-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996

--- Comment #10 from Yichao Yu  ---
That does look like a violation (this particular one should be hidden behind
shared library boundary in the reduced case though). Reported to LLVM at
https://llvm.org/bugs/show_bug.cgi?id=30711 .

[Bug middle-end/77996] Miscompilation due to LTO on aarch64

2016-10-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996

--- Comment #8 from Yichao Yu  ---
> Can you try with -fno-strict-aliasing ?

That seems to fix it for both the original case (LLVM) and the reduced case
(the linked tarball). Is there a way to figure out which aliasing assumption is
the problematic one (either a bug in LLVM's code or in gcc's alias detection)?
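
For readers unfamiliar with why `-fno-strict-aliasing` can change behaviour, here is a minimal sketch of the kind of type punning that the strict aliasing rules affect. It is not taken from LLVM or from the reduced test case; `float_bits_punned` and `float_bits` are hypothetical names used only for illustration. Reading a `float` object through a `uint32_t` lvalue is undefined under the C aliasing rules, so an optimizer that assumes the two types never alias may reorder or drop such an access, while the `memcpy` form is well defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: float is a 32-bit type on the target. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Undefined behaviour under strict aliasing: a float object is read
   through a uint32_t lvalue. Shown for contrast only; do not call it. */
uint32_t float_bits_punned(float *f)
{
    return *(uint32_t *)f;
}

/* Well-defined alternative: copy the object representation. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

With `-fno-strict-aliasing` GCC no longer relies on the type-based assumption, which is consistent with the flag hiding the problem here, though it does not by itself show where the offending access is.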

[Bug middle-end/77996] Miscompilation due to LTO on aarch64

2016-10-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996

--- Comment #6 from Yichao Yu  ---
I've compiled a gcc at 951db45 using the same configuration as archlinux arm
PKGBUILD and I can reproduce the problem using the `code/` in
https://gist.github.com/yuyichao/6c24d4a4bc374425906138359a44479c/raw/f5edb6ae8205d5e4d1eb03a7fb900f15711f/gcc-debug.tar.bz2

[Bug middle-end/77996] Miscompilation due to LTO on aarch64

2016-10-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996

--- Comment #5 from Yichao Yu  ---
Compiling current llvm trunk (r284322) still shows the same error. The script I
used to compile LLVM is here
https://github.com/yuyichao/arch-pkg/blob/master/pkg/all/llvm-svn/PKGBUILD.


Compiling gcc 951db45 now.

[Bug middle-end/77996] Miscompilation due to LTO on aarch64

2016-10-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996

--- Comment #3 from Yichao Yu  ---
> What exact version of LLVM are you trying to compile?  Revision of the LLVM 
> sources including revision of clang, etc.

I was compiling the trunk version. The version I started reducing from was
https://github.com/llvm-mirror/llvm/commit/0885462106134999f8aa80a3a71bfed160910248
but it happens on at least 3 different versions I've tried before this commit.

> Can you try compile GCC from the 6 branch and try again because having just a 
> date might not be enough to reproduce the problem.

The script used to compile GCC is
https://github.com/archlinuxarm/PKGBUILDs/blob/master/core/gcc/PKGBUILD so it
seems to be using commit `c2103c17`. I can also try to compile a more recent
version locally (will take some time).

[Bug lto/77997] Miscompilation due to LTO on aarch64

2016-10-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77997

--- Comment #2 from Yichao Yu  ---
Sorry, the first submission gave me a timeout so I submitted it again.

[Bug lto/77997] New: Miscompilation due to LTO on aarch64

2016-10-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77997

Bug ID: 77997
   Summary: Miscompilation due to LTO on aarch64
   Product: gcc
   Version: 6.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

I'm seeing a miscompilation of LLVM's tablegen on AArch64 by gcc 6.2.1 when LTO
is enabled. I've tried very hard to reduce it but unfortunately it wasn't very
successful this time and the current repro is still 8000 lines of code.

Attached are the source and resulting binaries. (Edit: the tarball is too big
(3M) so I uploaded to gist instead. Please find it here
https://gist.githubusercontent.com/yuyichao/6c24d4a4bc374425906138359a44479c/raw/f5edb6ae8205d5e4d1eb03a7fb900f15711f/report.md)

The `code/` directory has a simple cmake project reduced from the LLVM one (I
can turn that into a makefile or a shell script on request but the current form
should be pretty simple already). To reproduce, make a `build/` directory in
`code/` and run `CFLAGS='-flto -O3' CXXFLAGS='-flto -O3' LDFLAGS='-O3 -flto'
cmake .. -DCMAKE_BUILD_TYPE=Release; make llvm-tblgen; bin/llvm-tblgen`. Removing
`-flto` should get rid of the error in the last command.

Changes in seemingly unrelated lines can also make the error go away. (If
there's anything I learnt from reducing it, the error seems to appear only when
the code is complex). One such change is commenting out
`SCTrans.PredTerm = Preds;` close to the end of `CodeGenSchedule.cpp` (used to
generate the `good/` version included). In fact, removing almost any line in
this file can make the error go away even though not a single line of code
there should be executed.

The `bad/` and the `good/` directories contain compilation results using the
flags mentioned above with the unmodified code and the code with the one line
commented out. They have all the object files, binary files and the disassembly
of the resulting executable/the bad function. The asm's are disassembled from
the final binary since I don't know how to get it directly when compiling with
LTO.

The direct error seems to be in `CodeGenRegister::computeSubRegs` in the branch
before the `printf("5\n")`. The `DenseMap::insert` method (which is called
twice in this function and nowhere else) is inlined but sometimes returns a
corrupted iterator when the inserted key already exists in the map, causing the
check to fail. The diff of the asm of this function is in the top level of the
tarball.

The original repro is compiling LLVM with LTO on AArch64. The compilation should
fail when generating target information for AArch64.

GCC version is GCC binary package from ArchLinux ARM repo. `gcc --version`
gives `gcc (GCC) 6.2.1 20160830`

[Bug lto/77996] New: Miscompilation due to LTO on aarch64

2016-10-15 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996

Bug ID: 77996
   Summary: Miscompilation due to LTO on aarch64
   Product: gcc
   Version: 6.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

I'm seeing a miscompilation of LLVM's tablegen on AArch64 by gcc 6.2.1 when LTO
is enabled. I've tried very hard to reduce it but unfortunately it wasn't very
successful this time and the current repro is still 8000 lines of code.

Attached are the source and resulting binaries.

The code/ directory has a simple cmake project reduced from the LLVM one (I
can turn that into a makefile or a shell script on request but the current form
should be pretty simple already). To reproduce, make a `build/` directory in
`code/` and run `CFLAGS='-flto -O3' CXXFLAGS='-flto -O3' LDFLAGS='-O3 -flto'
cmake .. -DCMAKE_BUILD_TYPE=Release; make llvm-tblgen; bin/llvm-tblgen`. Removing
`-flto` should get rid of the error in the last command.

Changes in seemingly unrelated lines can also make the error go away. (If
there's anything I learnt from reducing it, the error seems to appear only when
the code is complex). One such change is commenting out
`SCTrans.PredTerm = Preds;` close to the end of `CodeGenSchedule.cpp` (used to
generate the good/ version included). In fact, removing almost any line in
this file can make the error go away even though not a single line of code
there should be executed.

The bad/ and the good/ directories contain compilation results using the flags
mentioned above with the unmodified code and the code with the one line
commented out. They have all the object files, binary files and the disassembly
of the resulting executable/the bad function. The asm's are disassembled from
the final binary since I don't know how to get it directly when compiling with
LTO.

The direct error seems to be in `CodeGenRegister::computeSubRegs` in the branch
before the `printf("5\n")`. The `DenseMap::insert` method (which is called
twice in this function and nowhere else) is inlined but sometimes returns a
corrupted iterator when the inserted key already exists in the map, causing the
check to fail. The diff of the asm of this function is in the top level of the
tarball.

The original repro is compiling LLVM with LTO on AArch64. The compilation should
fail when generating target information for AArch64.

GCC version is GCC binary package from ArchLinux ARM repo. `gcc --version`
gives `gcc (GCC) 6.2.1 20160830`

[Bug target/77728] [5/6/7 Regression] Miscompilation multiple vector iteration on ARM

2016-09-26 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728

--- Comment #2 from Yichao Yu  ---
I should add that turning on lto works around the issue both in the simple code
attached and for the original issue I was having in julia (i.e. compiling llvm
with LTO makes the issue go away).

[Bug target/77728] New: Miscompilation multiple vector iteration on ARM

2016-09-24 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728

Bug ID: 77728
   Summary: Miscompilation multiple vector iteration on ARM
   Product: gcc
   Version: 6.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Code to reproduce is at
https://gist.github.com/yuyichao/a66edb9d05d18755fb7587b12e021a8a. The two cpp
files are

```c++
#include <stdint.h>
#include <utility>
#include <vector>

typedef std::vector<std::pair<uint64_t, uint64_t>> DWARFAddressRangesVector;

void dumpRanges(const DWARFAddressRangesVector& Ranges) {
    for (const auto &Range : Ranges) {
        (void)Range;
    }
}

void collectChildrenAddressRanges(DWARFAddressRangesVector& Ranges)
{
    const DWARFAddressRangesVector &DIERanges = DWARFAddressRangesVector();
    Ranges.insert(Ranges.end(), DIERanges.begin(), DIERanges.end());
}
```

```c++
#include <stdint.h>
#include <utility>
#include <vector>

typedef std::vector<std::pair<uint64_t, uint64_t>> DWARFAddressRangesVector;

void collectAddressRanges(DWARFAddressRangesVector &CURanges,
                          const DWARFAddressRangesVector &CUDIERanges)
{
    CURanges.insert(CURanges.end(), CUDIERanges.begin(), CUDIERanges.end());
}

int main()
{
    std::vector<std::pair<uint64_t, uint64_t>> CURanges;
    std::vector<std::pair<uint64_t, uint64_t>> CUDIERanges{{1, 2}};
    collectAddressRanges(CURanges, CUDIERanges);
    return 0;
}
```

Both compiled with `g++ -O2` and linked together. When running the compiled
program, it raises an exception in the `insert`

```
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_range_insert
```

which shouldn't happen. The issue seems to be related to merging duplicated
code since it is important to put the code into two files and the presence of
the second .o file is important even though none of the code in it is used. The
iterations also all have to be on const references to the vector. Removing one
of the `const`s also makes the issue go away.

The g++ is version 6.2.1 from the ArchLinuxARM armv7h repository. This might be
a regression in gcc 5 since other devs using gcc <=4.9 don't seem to have
this issue and I was able to reproduce this on archlinux on 4-5 different
systems with gcc >=5.

This causes https://github.com/JuliaLang/julia/issues/14550

[Bug target/70814] atomic store of __int128 is not lock free on aarch64

2016-06-28 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814

--- Comment #4 from Yichao Yu  ---
Thanks for the explanation. I didn't realize that the load is the problem. Just
curious (since I somehow can't find documentation about it), would `ldaxp`
provide the right semantics without the corresponding store?

[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions

2016-06-07 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #7 from Yichao Yu  ---
If I add `-fvariable-expansion-in-unroller` (omg this option is like half the
command line ;-p ...), the performance matches the clang one after the clang
3.8 regression.

```
% gcc -funroll-loops -fvariable-expansion-in-unroller -Ofast -march=core-avx2 benchmark.c -o benchmark2
% ./benchmark2 
45.588861
% ./benchmark-gcc
80.518152
% ./benchmark-clang38
41.920054
% ./benchmark-clang37
25.093145
```
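
As a rough illustration of what variable expansion is aiming for, the sketch below contrasts the plain reduction with a hand-unrolled version that uses four independent accumulators. This is not GCC's output; `sum32_plain`, `sum32_expanded` and the factor of four are assumptions chosen for the example. Splitting the sum breaks the single dependency chain on `s`, but it also reassociates the floating-point additions, which is why a compiler generally needs something like `-Ofast` before it will do this on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Plain reduction, as in the report: each add waits for the previous one. */
float sum32_plain(const float *a, size_t n)
{
    float s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-written sketch with four partial sums (the factor 4 is arbitrary). */
float sum32_expanded(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements when n is not a multiple of 4 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

For inputs where rounding matters, the two functions can return slightly different results, so the expanded form is only a fair comparison when that reassociation is acceptable.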

[Bug other/71414] 2x slower than clang summing small float array

2016-06-06 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

--- Comment #4 from Yichao Yu  ---
The C code is in the gist linked. `a` is a cacheline aligned pointer and `n` is
1024 so `a` should even fit in L1d, which is 32kB on both processors I
benchmarked.

More precise timing (ns per loop)

6700K

```
% ./benchmark-gcc   
80.553456
% ./benchmark-clang37 
28.81
% ./benchmark-clang38 
41.782532
```

4702HQ

```
% ./benchmark-gcc 
140.744893
% ./benchmark-clang37 
50.835441
% ./benchmark-clang38
70.220946
```

Pasting the whole program over for completeness.
The alignment line gives some weird timing on clang without `-march=core-avx2`
but doesn't change anything too much with `-Ofast -march=core-avx2`.

```
//

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

uint64_t gettime_ns()
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * (uint64_t) 1e9 + t.tv_nsec;
}


__attribute__((noinline)) float sum32(float *a, size_t n)
{
    /* a = (float*)__builtin_assume_aligned(a, 64); */
    float s = 0;
    for (size_t i = 0;i < n;i++)
        s += a[i];
    __asm__ volatile ("" ::: "memory");
    return s;
}

int main()
{
    float *p = aligned_alloc(64, sizeof(float) * 1024);
    memset(p, 0, sizeof(float) * 1024);
    uint64_t start = gettime_ns();
    for (int i = 0;i < 1024 * 1024;i++)
        sum32(p, 1024);
    free(p);
    uint64_t end = gettime_ns();
    printf("%f\n", (end - start) / (1024.0 * 1024.0));
    return 0;
}
```

[Bug other/71414] New: 2x slower than clang summing small float array

2016-06-04 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414

Bug ID: 71414
   Summary: 2x slower than clang  summing small float array
   Product: gcc
   Version: 6.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Ref https://llvm.org/bugs/show_bug.cgi?id=28002

C source code.

```c
__attribute__((noinline)) float sum32(float *a, size_t n)
{
    /* a = (float*)__builtin_assume_aligned(a, 64); */
    float s = 0;
    for (size_t i = 0;i < n;i++)
        s += a[i];
    return s;
}
```


See [this
gist](https://gist.github.com/yuyichao/5b07f71c1f19248ec5511d758532a4b0) for
assembly output by different compilers. GCC appears to be ~2x slower than clang
on the two machines (4702HQ and 6700K) I benchmarked this.

[Bug target/71056] [6/7 Regression] __builtin_bswap32 NEON instruction error with -O3

2016-05-21 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71056

--- Comment #4 from Yichao Yu  ---
(Sorry I'm not sure how to understand that cross link). Is the fix merged?

[Bug target/71056] New: __builtin_bswap32 NEON instruction error with -O3

2016-05-10 Thread yyc1992 at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71056

Bug ID: 71056
   Summary: __builtin_bswap32 NEON instruction error with -O3
   Product: gcc
   Version: 6.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

The following code generates a "NEON instructions not enabled" error when
compiling with `gcc -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -O3 -o
/dev/null -c a.c` on ARM on gcc 6.1.1 (ArchLinuxARM).

```c
#include <stdint.h>
#include <string.h>

extern char *buff;
int f2();
struct T1 {
    int32_t reserved[2];
    uint32_t ip;
    uint16_t cs;
    uint16_t rsrv2;
};
void f3(const char *p)
{
    struct T1 x;
    memcpy(&x, p, sizeof(struct T1));
    x.reserved[0] = __builtin_bswap32(x.reserved[0]);
    x.reserved[1] = __builtin_bswap32(x.reserved[1]);
    x.ip = __builtin_bswap32(x.ip);
    x.cs = x.cs << 8 | x.cs >> 8;
    x.rsrv2 = x.rsrv2 << 8 | x.rsrv2 >> 8;
    if (f2()) {
        memcpy(buff, "\n", 1);
    }
}
```

Error message

```
alarm% gcc -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -O3 -o /dev/null -c a.c
a.c: In function ‘f3’:
a.c:16:21: fatal error: You must enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use these intrinsics.
     x.reserved[0] = __builtin_bswap32(x.reserved[0]);
                     ^~~~
compilation terminated.
```

Note that `NEON` isn't enabled and there's no direct use of NEON
instructions/intrinsics in the code so the NEON instructions must have been
added by the optimizer.

Seemingly subtle changes can make the error disappear. These include:

1. -O3 -> -O2 (ok, this one is not particularly subtle)
2. Remove any of the byteswaps or fields
3. Remove either of the memcpy calls
4. Make the second memcpy unconditional
5. Remove `f2()` (but leave the memcpy condition in some other way)
6. Pass in `x` as argument (either as value or pointer)

The asm generated when compiling with `-mfpu=neon`

```
f3:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        mov     r3, r0
        str     lr, [sp, #-4]!
        sub     sp, sp, #20
        ldr     r2, [r3, #8]    @ unaligned
        ldr     r1, [r3, #4]    @ unaligned
        mov     ip, sp
        ldr     r0, [r0]        @ unaligned
        ldr     r3, [r3, #12]   @ unaligned
        stmia   ip!, {r0, r1, r2, r3}
        mov     r3, r2
        ldrh    ip, [sp, #12]
        rev     r3, r3
        ldrh    r0, [sp, #14]
        vldr    d16, [sp]
        lsr     r1, ip, #8
        str     r3, [sp, #8]
        vrev32.8        d16, d16
        lsr     r2, r0, #8
        orr     r2, r2, r0, lsl #8
        orr     r1, r1, ip, lsl #8
        strh    r2, [sp, #14]   @ movhi
        strh    r1, [sp, #12]   @ movhi
        vstr    d16, [sp]
        bl      f2
        cmp     r0, #0
        movwne  r3, #:lower16:buff
        movtne  r3, #:upper16:buff
        movne   r2, #10
        ldrne   r3, [r3]
        strbne  r2, [r3]
        add     sp, sp, #20
        @ sp needed
        ldr     pc, [sp], #4
        .size   f3, .-f3
        .ident  "GCC: (GNU) 6.1.1 20160501"
```

And it seems that the NEON instruction it wants to generate is `vrev32.8`
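
For reference, `vrev32.8` reverses the byte order inside each 32-bit element of a vector register, so applying it to a 64-bit `d` register byte-swaps two words at once. The scalar sketch below models that effect on the two `reserved` fields of `struct T1`; `rev32_lanes` and `swap16` are hypothetical helper names, and this describes the instruction's effect rather than how GCC decides to emit it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vrev32.8 on a 64-bit register: byte-reverse each of the
   two 32-bit lanes independently. */
void rev32_lanes(uint32_t lanes[2])
{
    assert(lanes != NULL);
    for (int i = 0; i < 2; i++)
        lanes[i] = __builtin_bswap32(lanes[i]);
}

/* The 16-bit swap written with shifts, as in the report; the cast keeps the
   result in 16 bits after integer promotion. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)(v << 8 | v >> 8);
}
```

Calling `rev32_lanes` on `{0x11223344, 0xAABBCCDD}` leaves `{0x44332211, 0xDDCCBBAA}` in the array, which is the pair of `__builtin_bswap32` results the two `reserved` assignments compute.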

The case is simplified from
https://github.com/llvm-mirror/llvm/blob/da4b82ab1387da8c959a4e2439bce10b9cefbc8a/tools/llvm-objdump/MachODump.cpp#L8240-L8263

I don't remember seeing this on gcc 5.
