[Bug target/106340] flag set from SVE svwhilelt intrinsic not reused in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340

Yichao Yu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |INVALID
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #2 from Yichao Yu --- Over at the LLVM bug report, it was pointed out to me that the standard pattern is to branch on the result of the ptest intrinsics. That matches the flag setting of the whilelt family of instructions better, and gcc is already able to omit the ptest instruction in such cases.
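For readers who want to see the ptest-based pattern from this comment spelled out, the following is a minimal sketch. It assumes the ACLE intrinsics `svptest_any`, `svptrue_b32`, `svwhilelt_b32`, `svdup_u32`, `svcntw` and `svst1`; the function name `set_ptest` and the scalar fallback for non-SVE builds are illustrative additions that do not appear in the report. Whether a given gcc version really folds the ptest into the flags set by `whilelo` has not been checked here.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical name; stores 1 into out[0..m) one vector at a time.
void set_ptest(uint32_t *__restrict__ out, size_t m)
{
#if defined(__ARM_FEATURE_SVE)
    const uint64_t svelen = svcntw();   // 32-bit lanes per vector
    const svuint32_t v = svdup_u32(1);
    uint64_t i = 0;
    svbool_t pg = svwhilelt_b32(i, (uint64_t)m);
    // Branch on the predicate itself rather than on `i < m`.
    while (svptest_any(svptrue_b32(), pg)) {
        svst1(pg, out + i, v);
        i += svelen;
        pg = svwhilelt_b32(i, (uint64_t)m);
    }
#else
    // Scalar fallback so the sketch also builds without SVE (assumption).
    for (size_t i = 0; i < m; i++)
        out[i] = 1;
#endif
}
```

The loop condition tests the predicate produced by `svwhilelt_b32`, so there is no separate `i < m` comparison in the source for the compiler to keep alive.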
[Bug target/106324] ptrue not reused between vector instructions and predicate instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106324 --- Comment #3 from Yichao Yu --- Actually, I just realized that the `not` instruction used the .d version as requested and the vector instruction didn't. I got it reversed in the original post.
[Bug target/106340] flag set from SVE svwhilelt intrinsic not reused in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340 --- Comment #1 from Yichao Yu --- Also note that this is for code I've tweaked to match the final code as much as possible. For a complete implementation of this, I expect that the loop transformation done for normal loops should move the whilelt as well, so that source code like the following would generate pretty much the same code.

```
void set3(uint32_t *__restrict__ out, size_t m)
{
    auto svelen = svcntw();
    auto v = svdup_u32(1);
    for (size_t i = 0; i < m; i += svelen) {
        auto pg = svwhilelt_b32(i, m);
        svst1(pg, &out[i], v);
    }
}
```

Currently, while the cmp was moved to the end of the loop body and to the loop header, the whilelt that is meant to be paired with it was not, so the flag from the whilelt instruction isn't directly usable as-is in the code.
[Bug target/106340] New: flag set from SVE svwhilelt intrinsic not reused in loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106340

Bug ID: 106340
Summary: flag set from SVE svwhilelt intrinsic not reused in loop
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

I'm experimenting with manually writing VLA loops and trying to match the assembly code that I expect or that the autovectorizer produces. One of the main areas where I can't get it to work is setting the loop predicate using the svwhilelt intrinsics. The instruction it corresponds to sets the flags, which can be used directly to terminate the loop. Indeed, when using the autovectorizer, this is exactly what happens.

```
void set1(uint32_t *__restrict__ out, size_t m)
{
    for (size_t i = 0; i < m; i++) {
        out[i] = 1;
    }
}
```

compiles to

```
        cbz     x1, .L1
        mov     x2, 0
        cntw    x3
        whilelo p0.s, xzr, x1
        mov     z0.s, #1
        .p2align 3,,7
.L3:
        st1w    z0.s, p0, [x0, x2, lsl 2]
        add     x2, x2, x3
        whilelo p0.s, x2, x1
        b.any   .L3
.L1:
        ret
```

(Here I believe the flag set by the loop-header whilelo could also be used for the jump, but that doesn't save much in this case.)

However, no matter how I try to replicate this with manually written code using the SVE intrinsics, there is always an additional cmp instruction generated.

The closest I can get is by replicating the structure of the auto-vectorized loop as much as possible with,

```
void set2(uint32_t *__restrict__ out, size_t m)
{
    auto svelen = svcntw();
    auto v = svdup_u32(1);
    if (m != 0) {
        auto pg = svwhilelt_b32(0ul, m);
        for (size_t i = 0; i < m; i += svelen, pg = svwhilelt_b32(i, m)) {
            svst1(pg, &out[i], v);
        }
    }
}
```

which is compiled to

```
        cbz     x1, .L9
        mov     x2, 0
        cntw    x3
        whilelo p0.s, xzr, x1
        mov     z0.s, #1
        .p2align 3,,7
.L11:
        st1w    z0.s, p0, [x0, x2, lsl 2]
        add     x2, x2, x3
        whilelo p0.s, x2, x1
        cmp     x1, x2
        bhi     .L11
.L9:
        ret
```

which is literally the same code, down to register allocation, except that the branch following the `whilelo` instruction is replaced with another comparison and branch.
[Bug target/106329] New: No optimization for SVE pfalse predicate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106329

Bug ID: 106329
Summary: No optimization for SVE pfalse predicate
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

If a known-all-false predicate is used on an SVE intrinsic, the result should be a full no-op, undefined, or zero (depending on the predication variant), and no actual instruction (other than potentially returning a zero) should be generated. This does not seem to be happening even when `svpfalse_b()` is explicitly passed in as the predicate. As an example,

```
svfloat64_t add(svfloat64_t a, svfloat64_t b)
{
    return svadd_x(svpfalse_b(), a, b);
}
```

is being compiled to

```
        pfalse  p0.b
        fadd    z0.d, p0/m, z0.d, z1.d
        ret
```

when it could simply be an empty function.
[Bug target/106327] New: side-effect-free _x variance not optimized to unpredicated instruction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106327

Bug ID: 106327
Summary: side-effect-free _x variance not optimized to unpredicated instruction
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106326 .

According to the Arm C Language Extensions for SVE, when the _x predication is used,

> The compiler can then pick whichever form of instruction seems to give the
> best code. This includes using unpredicated instructions, where available and
> suitable

Because of this, I'm expecting the following to be optimized to a single add instruction, as if a `svptrue_b64()` predicate were used.

```
svfloat64_t add(svfloat64_t a, svfloat64_t b)
{
    auto und_ok = svcmpge(svptrue_b64(), a, b);
    return svadd_x(und_ok, a, b);
}
```

However, gcc compiles this as _m and generates

```
        ptrue   p0.b, all
        fcmge   p0.d, p0/z, z0.d, z1.d
        fadd    z0.d, p0/m, z0.d, z1.d
```

In general, is there any reason not to treat an `add_x` (and other side-effect-free functions) with an unknown predicate as an unpredicated one?
[Bug target/106326] New: _m and _z version of SVE instrinsics not optimized to predicate-free version
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106326

Bug ID: 106326
Summary: _m and _z version of SVE instrinsics not optimized to predicate-free version
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

The following code should generate a predicate-free fadd instruction since all the predicate lanes are true.

```
svfloat64_t test(svfloat64_t a, svfloat64_t b)
{
    return svadd_m(svptrue_b64(), a, b);
}
```

but gcc instead generates an all-true predicate and uses that, i.e.

```
        ptrue   p0.b, all
        fadd    z0.d, p0/m, z0.d, z1.d
```

The same happens for the `_z` version as well, with even worse code generated.

```
        ptrue   p0.b, all
        movprfx z0.d, p0/z, z0.d
        fadd    z0.d, p0/m, z0.d, z1.d
```

This optimization is only done for the `_x` variant. Clang optimizes this for all variants.
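The argument in this report (and in the related PR106327 and PR106329) rests on how the predication variants treat inactive lanes. The following is a plain C++ scalar model of that behaviour, not the ACLE API: the fixed lane count `kLanes` and the names `Vec`, `Pred`, `add`, `add_m` and `add_z` are all hypothetical, and the `_x` variant is left out because its inactive lanes are unspecified.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 4;            // hypothetical fixed vector length
using Vec = std::array<double, kLanes>;
using Pred = std::array<bool, kLanes>;

// Unpredicated add: every lane is computed.
Vec add(const Vec &a, const Vec &b)
{
    Vec r{};
    for (std::size_t i = 0; i < kLanes; i++)
        r[i] = a[i] + b[i];
    return r;
}

// Model of _m (merging): inactive lanes keep the first operand.
Vec add_m(const Pred &pg, const Vec &a, const Vec &b)
{
    Vec r = a;
    for (std::size_t i = 0; i < kLanes; i++)
        if (pg[i])
            r[i] = a[i] + b[i];
    return r;
}

// Model of _z (zeroing): inactive lanes become zero.
Vec add_z(const Pred &pg, const Vec &a, const Vec &b)
{
    Vec r{};                                 // value-initialized to all zeros
    for (std::size_t i = 0; i < kLanes; i++)
        if (pg[i])
            r[i] = a[i] + b[i];
    return r;
}
```

In this model an all-true predicate makes `add_m` and `add_z` indistinguishable from `add`, which is the case this report asks gcc to fold. An all-false predicate reduces `add_m` to returning `a` and `add_z` to returning zeros, which is the case PR106329 asks for.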
[Bug target/106324] New: ptrue not reused between vector instructions and predicate instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106324

Bug ID: 106324
Summary: ptrue not reused between vector instructions and predicate instructions
Product: gcc
Version: 12.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

The following code has two uses of `svptrue_b64()` and none of the instructions using them should be clearing it, so only one `ptrue` instruction should be needed.

```
svfloat64_t test(svbool_t pg, svfloat64_t a, svfloat64_t b)
{
    auto d = svdiv_m(svptrue_b64(), a, b);
    return svmul_m(svnot_z(svptrue_b64(), pg), d, d);
}
```

However, the code generated is,

```
        ptrue   p2.b, all
        ptrue   p1.d, all
        fdiv    z0.d, p2/m, z0.d, z1.d
        not     p0.b, p1/z, p0.b
        fmul    z0.d, p0/m, z0.d, z0.d
        ret
```

which has an extra `ptrue`. OTOH, clang generates,

```
        ptrue   p1.d
        fdiv    z0.d, p1/m, z0.d, z1.d
        not     p0.b, p1/z, p0.b
        fmul    z0.d, p0/m, z0.d, z0.d
        ret
```

and the same `ptrue` is reused in both instructions. This seems to be caused by gcc insisting on using `svptrue_b8` for the svnot, which does not seem necessary here, especially since _b64 is explicitly requested. Changing svptrue_b64 to svptrue_b8 in the code fixes the issue.
[Bug c++/100161] New: Impossible to suppress Wtype-limits warning involving template parameter.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100161

Bug ID: 100161
Summary: Impossible to suppress Wtype-limits warning involving template parameter.
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

If a comparison involving a template parameter is always true or false, it should not raise a warning if it could take other values for other template arguments. In particular, the type-limits warning from the code below,

```
void f(unsigned);

template<unsigned n>
void g()
{
    for (unsigned i = 0; i < n; i++) {
        f(i);
    }
}

void h()
{
    g<0>();
}
```

seems to be impossible to suppress. I think this is a regression from around the GCC 9 time frame. (I remember seeing it at roughly the same time as, or slightly after, https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90728)

This is partially related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95148 (which would at least provide a way to suppress the warning). It is also somewhat related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81642, though supposedly the C++ template example given there is fixed.
[Bug tree-optimization/100088] New: ymm store split into two xmm stores
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100088

Bug ID: 100088
Summary: ymm store split into two xmm stores
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

The following code

```
__attribute__((target("avx2")))
void fill_avx2(double *__restrict__ data, int n, double value)
{
    for (int i = 0; i < n * 16; i++) {
        data[i] = value;
    }
}
```

compiles to

```
fill_avx2:
        sall    $4, %esi
        testl   %esi, %esi
        jle     .L5
        shrl    $2, %esi
        vbroadcastsd    %xmm0, %ymm0
        movl    %esi, %eax
        salq    $5, %rax
        addq    %rdi, %rax
        .p2align 4,,10
        .p2align 3
.L3:
        vmovupd %xmm0, (%rdi)
        vextractf128    $0x1, %ymm0, 16(%rdi)
        addq    $32, %rdi
        cmpq    %rax, %rdi
        jne     .L3
        vzeroupper
.L5:
        ret
```

Note that AFAICT

```
        vmovupd %xmm0, (%rdi)
        vextractf128    $0x1, %ymm0, 16(%rdi)
```

is equivalent to

```
        vmovupd %ymm0, (%rdi)
```

This issue does not exist for sse or avx512f. Setting `-march=haswell` or `-mtune=haswell` on the command line also seems to fix this, but neither of these works when added to the target attribute.
[Bug c/96990] New: Regression in aarch64 struct vector member initialization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96990

Bug ID: 96990
Summary: Regression in aarch64 struct vector member initialization
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

The following code used to work on gcc 9.3 but fails with 10.2 with the error

```
a.c: In function ‘test_aa64_vec_2’:
a.c:19:24: error: incompatible types when initializing type ‘signed char’ using type ‘int8x8_t’
   19 |     struct_aa64_3 x = {v1 + v1, v2 - v2};
      |                        ^~
a.c:19:33: error: incompatible types when initializing type ‘signed char’ using type ‘float32x2_t’
   19 |     struct_aa64_3 x = {v1 + v1, v2 - v2};
      |                                 ^~
```

Either of the "works" versions in the code below, or compiling as C++, works.

From the error message it seems that GCC correctly inferred the type of `v1 + v1` and `v2 - v2` but got confused about the field type. Reversing the order of `v1` and `v2` in the struct causes the error to mention `float` instead of `signed char`, so it seems that gcc thinks the code is trying to initialize the elements of the first vector member (with element type `signed char` or `float` respectively). I thought such an initialization should need an additional `{}`... Given that explicit casting or compiling in C++ mode helps, I think this is a bug...

```
#include <arm_neon.h>

typedef struct {
    int8x8_t v1;
    float32x2_t v2;
} struct_aa64_3;

struct_aa64_3 test_aa64_vec_2(int8x8_t v1, float32x2_t v2)
{
    // works
    /* int8x8_t vi8 = v1 + v1; */
    /* float32x2_t vf = v2 - v2; */
    /* struct_aa64_3 x = {vi8, vf}; */
    // works
    /* struct_aa64_3 x = {(int8x8_t)(v1 + v1), (float32x2_t)(v2 - v2)}; */
    // not
    struct_aa64_3 x = {v1 + v1, v2 - v2};
    return x;
}
```
[Bug c/96629] spurious maybe uninitialized variable warning with difficult control-flow analysis
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96629 --- Comment #3 from Yichao Yu --- Just curious, is it some particular structure that is upsetting it, or did it simply hit some depth limit?
[Bug c/96629] New: spurious uninitialized variable warning with branches at -O1 and higher
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96629

Bug ID: 96629
Summary: spurious uninitialized variable warning with branches at -O1 and higher
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Reduced test code:

```
int mem(char *data);
int cond(void);

void f(char *data, unsigned idx, unsigned inc)
{
    char *d2;
    int c = cond();

    if (idx >= 2) {
        if (c)
            d2 = data;
        mem(data);
    }
    else if (inc > 3) {
        if (c)
            d2 = data;
        mem(data);
    }
    else {
        if (c) {
            d2 = data;
        }
    }
    if (*data) {
    }
    else if (c) {
        mem(d2);
    }
}
```

Compiling with `gcc -Wall -Wextra -O{1,2,s,3,fast}` warns about

```
a.c: In function 'f':
a.c:27:9: warning: 'd2' may be used uninitialized in this function [-Wmaybe-uninitialized]
   27 |         mem(d2);
      |         ^~~
```

However, it should be clear that `d2` is always assigned when `c` is true. In fact, it seems that GCC can figure this out in some cases. Changes that suppress the warning include:

1. Removing any of the `mem(data)` calls.
2. Removing any one of the `if`s (leaving only the if or else branch unconditionally).
3. Changing the first condition to be on inc instead.
4. Removing the last `*data` branch.

Version tested:
AArch64: 10.2.0
ARM: 9.1.0
x86_64: 10.1.0
mingw64: 10.2.0
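Until the analysis improves, one conventional way to avoid this kind of diagnostic is to give the variable an explicit initializer, since the warning only concerns possibly uninitialized reads. The sketch below applies that to the reduced test case; it is a workaround added for illustration, not something proposed in the report, and the bodies of `mem` and `cond` are hypothetical stand-ins so the example is self-contained.

```cpp
// Hypothetical stand-ins for the external functions in the report.
static int mem_calls = 0;
int mem(char *data) { ++mem_calls; return data != nullptr; }
int cond() { return 1; }

void f(char *data, unsigned idx, unsigned inc)
{
    char *d2 = nullptr;  // only change from the reduced test case
    int c = cond();

    if (idx >= 2) {
        if (c)
            d2 = data;
        mem(data);
    } else if (inc > 3) {
        if (c)
            d2 = data;
        mem(data);
    } else {
        if (c) {
            d2 = data;
        }
    }
    if (*data) {
    } else if (c) {
        mem(d2);         // d2 now always holds a defined value here
    }
}
```

The trade-off is that an initializer also hides genuine missing-assignment bugs in later edits, which is presumably why the report asks for the analysis itself to be fixed.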
[Bug rtl-optimization/96539] Unnecessary no-op copy with Os and tail call with struct argument
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96539 --- Comment #4 from Yichao Yu --- Wow that was fast... thx.
[Bug rtl-optimization/96539] New: Unnecessary no-op copy with Os and tail call with struct argument
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96539

Bug ID: 96539
Summary: Unnecessary no-op copy with Os and tail call with struct argument
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Test C code,

```
struct A {
    int a;
    int b;
    int c;
    int d;
    int e;
    int f;
    void *p1;
    void *p2;
    void *p3;
    void *p4;
    void *p5;
    void *p6;
    void *p7;
};

int k(int a);
int f(int a, int b, int c, void *p, struct A s);

int g(int a, int b, int c, void *p, struct A s)
{
    k(a);
    return f(a, b, c, p, s);
}
```

At `-O2`, the code produced is

```
g:
        pushq   %r14
        movq    %rcx, %r14
        pushq   %r13
        movl    %edx, %r13d
        pushq   %r12
        movl    %esi, %r12d
        pushq   %rbp
        movl    %edi, %ebp
        subq    $8, %rsp
        call    k@PLT
        addq    $8, %rsp
        movq    %r14, %rcx
        movl    %r13d, %edx
        movl    %r12d, %esi
        movl    %ebp, %edi
        popq    %rbp
        popq    %r12
        popq    %r13
        popq    %r14
        jmp     f@PLT
```

I'm not sure why it spills those registers and saves the arguments in them (maybe for latency of the final call?), but both clang and gcc do that, so I assume it's good for performance.

However, when I tried `-Os`, the code produced is,

```
g:
        pushq   %r14
        movq    %rcx, %r14
        pushq   %r12
        movl    %esi, %r12d
        pushq   %rbp
        movl    %edi, %ebp
        subq    $16, %rsp
        movl    %edx, 12(%rsp)
        call    k@PLT
        leaq    48(%rsp), %rdi
        movl    $20, %ecx
        movq    %rdi, %rsi
        rep movsl
        movq    %r14, %rcx
        movl    %r12d, %esi
        movl    %ebp, %edi
        movl    12(%rsp), %edx
        addq    $16, %rsp
        popq    %rbp
        popq    %r12
        popq    %r14
        jmp     f@PLT
```

AFAICT, the

```
        movq    %rdi, %rsi
        rep movsl
```

is basically always a no-op (moving from and to the same memory location) other than potentially triggering a memory fault. The memory being copied in place here is the area where the argument is stored (80 bytes starting at `rsp + 48`), so maybe it's the copy of the argument that failed to be removed when it became a no-op for the tail call?

At `-O1`, the code produced is

```
g:
        pushq   %r13
        pushq   %r12
        pushq   %rbp
        pushq   %rbx
        subq    $8, %rsp
        movl    %edi, %ebx
        movl    %esi, %ebp
        movl    %edx, %r12d
        movq    %rcx, %r13
        call    k@PLT
        pushq   120(%rsp)
        pushq   120(%rsp)
        pushq   120(%rsp)
        pushq   120(%rsp)
        pushq   120(%rsp)
        pushq   120(%rsp)
        pushq   120(%rsp)
        pushq   120(%rsp)
        pushq   120(%rsp)
        pushq   120(%rsp)
        movq    %r13, %rcx
        movl    %r12d, %edx
        movl    %ebp, %esi
        movl    %ebx, %edi
        call    f@PLT
        addq    $88, %rsp
        popq    %rbx
        popq    %rbp
        popq    %r12
        popq    %r13
        ret
```

which shows the copy of 10 pointer-sized words, which is not a no-op when there is no tail call.
[Bug preprocessor/96069] -ffile-prefix-map does not affect print in gfortran
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069 --- Comment #8 from Yichao Yu --- OK, done. It would be nice to mention it on https://gcc.gnu.org/contribute.html#patches
[Bug preprocessor/96069] -ffile-prefix-map does not affect print in gfortran
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069 --- Comment #6 from Yichao Yu --- https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549411.html and https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549413.html
[Bug preprocessor/96069] -ffile-prefix-map does not affect print in gfortran
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069 --- Comment #4 from Yichao Yu ---
> Apparently it is.

Yes, but my question is about why this should be "WONTFIX". This feature (reproducible builds) is certainly as useful in Fortran as it is in the C family.

> Let move the component to 'preprocessor'.

At least for the issue I had with the Fortran code, it doesn't seem to be in the preprocessor. I do agree that other frontends should probably use this too, but I have no idea in which cases they should do it.

Also note that I've already submitted patches to fix this, though I haven't got a reply yet.
[Bug fortran/96069] -ffile-prefix-map does not affect print in gfortran
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069 --- Comment #2 from Yichao Yu --- Why should this feature be c only?
[Bug fortran/96069] New: -ffile-prefix-map does not affect print in gfortran
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96069

Bug ID: 96069
Summary: -ffile-prefix-map does not affect print in gfortran
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Compiling the following code `a.f`

```
      subroutine f(name)
      implicit none
      character*(*) name
      print *,name
      return
      end
```

with `gfortran -fdebug-prefix-map="${PWD}"=/usr/src/debug -ffile-prefix-map="${PWD}"=/usr/src/debug -O3 -fPIC "${PWD}/"a.f -o - -S` causes the full path to the file to be included in the generated assembly without respecting the prefix map. This works in C with `-ffile-prefix-map`, which implies `-fmacro-prefix-map`, but the latter isn't supported by gfortran.
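For comparison, here is a small sketch of the C-family side of this, where the embedded path comes from the `__FILE__` macro and is therefore covered by `-fmacro-prefix-map`. The helper names `source_path` and `print_source_path` are hypothetical and only exist to make the embedded string observable; how the Fortran runtime obtains the file name for `print` is not shown here.

```cpp
#include <cstdio>

// Hypothetical helper: returns the path the compiler baked into this
// translation unit through the __FILE__ macro.
const char *source_path()
{
    return __FILE__;
}

// Hypothetical helper: prints that path, roughly analogous to the file
// name that ends up in the assembly for the Fortran `print` statement.
void print_source_path()
{
    std::printf("%s\n", source_path());
}
```

Building this with `g++ -ffile-prefix-map="${PWD}"=/usr/src/debug -c "${PWD}/a.cpp"` (the file name `a.cpp` is an assumption) should leave a string starting with `/usr/src/debug` in the object file instead of the build directory.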
[Bug ipa/95775] Command line argument for target_clones?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95775 --- Comment #4 from Yichao Yu ---
> Hey. My opinion is similar to Richi's. If you really want a highly optimized
> library, you should rather use a dlopen mechanism with pre-built set of
> options.

Well, a few things,

1. That sounds like an argument against `target_clones` and `target`. If dlopen'ing different libraries is your recommended solution then none of these would be needed.
2. The solution you propose puts all the pressure on the user of the library. That has a few problems.
   2.1. There are strictly more users than libraries (assuming the library is used at all), so this forces more (repeated) work to be done.
   2.2. The author of the library, and to a lesser degree the builder of the library, has the best knowledge of the set of features that can benefit the library or are most useful for the deployment environment. The author of the code using the library, who has to implement the dispatch/loading logic, in general has much less complete knowledge of which targets to support.
   2.3. It'll be even worse for code size, since this forces each user to carry their own library, and now all data has to be duplicated as well, in addition to code.

   Also because,
3. There's no standard way of doing this AFAICT.

Now (3) is really the main point. I'm fine with whatever mechanism allows multiple versions of the code to be available, as long as it requires no more effort/cost from/for the user (and to a lesser degree the author) of the library. If one such mechanism is provided by gcc/glibc/binutils, so that library writers don't have to invent their own loading and detection mechanism, and it won't cause unnecessary indirection (as cheap as ifunc) and will just work for the user to either link or dlopen, then I think it doesn't really matter if that's backed by one file, multiple files, or whatever one can come up with. Currently, the only mechanism available that fits this description AFAICT is `target_clones`/`ifunc`.
Unless there's a roadmap that I'm not aware of to replace this mechanism with a similar one backed by multiple files, I don't think suggesting such a mechanism is the right approach.

Again, I said in the very first post that I totally agree this won't be the method that gives absolutely the best performance, but neither is `target_clones`. I also completely agree that this option can be misused and that the compiler should not do it on its own before getting smarter. But this is far from the first option that can be misused, and given how cheap memory is and how loading the same library multiple times doesn't take more memory, this isn't even close to being the worst misuse either.
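To make the cost argument in this comment concrete, the following is a rough sketch of the kind of hand-written dispatch that a library or its users end up maintaining when no compiler-provided mechanism is used. It assumes the GCC/Clang x86 builtin `__builtin_cpu_supports` and the `target` function attribute; the names `sum`, `sum_generic`, `sum_avx2`, `pick_sum` and `sum_fn` are hypothetical.

```cpp
#include <cstddef>

// Portable baseline implementation.
static double sum_generic(const double *x, std::size_t n)
{
    double acc = 0.0;
    for (std::size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define HAVE_X86_DISPATCH 1
// Same source, compiled with AVX2 enabled for this function only.
__attribute__((target("avx2")))
static double sum_avx2(const double *x, std::size_t n)
{
    double acc = 0.0;
    for (std::size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}
#endif

using sum_fn = double (*)(const double *, std::size_t);

// Hand-written resolver: choose an implementation from CPU features.
static sum_fn pick_sum()
{
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2;
#endif
    return sum_generic;
}

// Public entry point: resolves once, then calls through a pointer.
double sum(const double *x, std::size_t n)
{
    static const sum_fn impl = pick_sum();
    return impl(x, n);
}
```

Every multiversioned function needs its own copy of this boilerplate, and each call goes through a function pointer. With `target_clones` the duplicate bodies and the resolver are generated by the compiler and bound through ifunc, which is the convenience the comment argues for.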
[Bug c/95777] Allow specifying more than one target options at the same time in target and target_clones attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95777 --- Comment #3 from Yichao Yu --- And for backward compatibility maybe `target_clones("(sse4.1,arch=core2),default")` would work?
[Bug c/95777] Allow specifying more than one target options at the same time in target and target_clones attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95777 --- Comment #2 from Yichao Yu --- I only tested this with `target_clones`, and it seems that I misread the documentation for `target`. So this is only an issue with the `target_clones` attribute; `target` supports this just fine. To be more clear, using an example from the doc, it seems impossible to do the equivalent of `target("sse4.1,arch=core2")` using `target_clones`. Doing `target_clones("sse4.1,arch=core2")` will create two functions instead of one. (Of course, in reality what I might actually want is to make `target_clones` do `target("sse4.1,arch=core2")` and `target("default")`.)
[Bug ipa/95775] Command line argument for target_clones?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95775 --- Comment #2 from Yichao Yu ---
> But it will blow up code-size considerably.
> So without some major work I don't think simply slapping target_clones on
> each function is going to fly in practice.

I mean, it'll blow up by not much more than the number of targets. I do agree this is not something that the compiler should just do automatically, especially not for big libraries, and the user has to ask for it. However, I don't believe code size consumes most of the memory on any modern desktop or server system, and when using shared libraries different processes won't even consume much more memory anyway. It's for sure still the user's choice, but OTOH I think the compiler shouldn't have to make this choice for the user.

Additionally, there are some libraries, like math-heavy ones, where virtually every single function could benefit from this. Those are the ones that I would like to apply this option to. I'm also hoping, and I forgot to mention this in the first post, that this can just work on gfortran as well...

> Eventually it should be possible to do sth like target_clones(auto) where
> with a new option, the target (or the user) can define "default" targets to
> clone for but the user still figures which are the important functions to
> optimize

In julia I'm currently using a simple heuristic of detecting floating point operations, vector operations and loops...

> [and GCC may, via IPA "spread" the cloned cgraph portion a bit].

and I do this in julia too.
[Bug ipa/95796] New: Inlining works between functions with the same target attribute but not target_clones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95796

Bug ID: 95796
Summary: Inlining works between functions with the same target attribute but not target_clones
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: ipa
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
Target Milestone: ---

If two functions with the same target attribute call each other, GCC can inline one into the other (although sometimes incorrectly... PR95790). This can be shown with the following code (all compilation using `g++ -O2 -S -fno-exceptions -fno-asynchronous-unwind-tables`).

```
__attribute__ ((target ("default")))
static unsigned foo()
{
    return 1;
}

__attribute__ ((target ("avx")))
static unsigned foo()
{
    return 1;
}

__attribute__ ((target ("default")))
unsigned bar()
{
    return foo();
}

__attribute__ ((target ("avx")))
unsigned bar()
{
    return foo();
}
```

which is compiled to

```
        .text
        .p2align 4
        .globl  _Z3barv
        .type   _Z3barv, @function
_Z3barv:
        movl    $1, %eax
        ret
        .size   _Z3barv, .-_Z3barv
        .p2align 4
        .globl  _Z3barv.avx
        .type   _Z3barv.avx, @function
_Z3barv.avx:
        movl    $1, %eax
        ret
        .size   _Z3barv.avx, .-_Z3barv.avx
```

OTOH, the equivalent code using `target_clones`

```
__attribute__ ((target_clones ("default,avx")))
static unsigned foo()
{
    return 1;
}

__attribute__ ((target_clones ("default,avx")))
unsigned bar()
{
    return foo();
}
```

compiles to

```
        .text
        .p2align 4
        .type   _ZL3foov.default.1, @function
_ZL3foov.default.1:
        movl    $1, %eax
        ret
        .size   _ZL3foov.default.1, .-_ZL3foov.default.1
        .p2align 4
        .type   _Z3barv.default.1, @function
_Z3barv.default.1:
        jmp     _ZL3foov.default.1
        .size   _Z3barv.default.1, .-_Z3barv.default.1
        .p2align 4
        .type   _ZL3foov.avx.0, @function
_ZL3foov.avx.0:
        movl    $1, %eax
        ret
        .size   _ZL3foov.avx.0, .-_ZL3foov.avx.0
        .p2align 4
        .type   _Z3barv.avx.0, @function
_Z3barv.avx.0:
        jmp     _ZL3foov.avx.0
        .size   _Z3barv.avx.0, .-_Z3barv.avx.0
        .section        .text._Z3barv.resolver,"axG",@progbits,_Z3barv.resolver,comdat
        .p2align 4
        .weak   _Z3barv.resolver
        .type   _Z3barv.resolver, @function
_Z3barv.resolver:
        subq    $8, %rsp
        call    __cpu_indicator_init@PLT
        movq    __cpu_model@GOTPCREL(%rip), %rax
        leaq    _Z3barv.avx.0(%rip), %rdx
        testb   $2, 13(%rax)
        leaq    _Z3barv.default.1(%rip), %rax
        cmovne  %rdx, %rax
        addq    $8, %rsp
        ret
        .size   _Z3barv.resolver, .-_Z3barv.resolver
        .globl  _Z3barv
        .type   _Z3barv, @gnu_indirect_function
        .set    _Z3barv,_Z3barv.resolver
        .text
        .p2align 4
        .type   _ZL3foov.resolver, @function
_ZL3foov.resolver:
        subq    $8, %rsp
        call    __cpu_indicator_init@PLT
        movq    __cpu_model@GOTPCREL(%rip), %rax
        leaq    _ZL3foov.avx.0(%rip), %rdx
        testb   $2, 13(%rax)
        leaq    _ZL3foov.default.1(%rip), %rax
        cmovne  %rdx, %rax
        addq    $8, %rsp
        ret
        .size   _ZL3foov.resolver, .-_ZL3foov.resolver
```

instead, which only eliminates the indirect call but does not inline `foo` into `bar`. (Note that the useless resolver for foo is PR95779.) I believe the two versions should behave the same...

Ref PR95778 (PLT elimination)
Ref PR71990 (similar title but different. That one is about inlining of the dispatcher itself IIUC and is not about the case that can already be statically dispatched. It is also not specific to target_clones like this one is.)
[Bug ipa/95790] Incorrect static target dispatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790 --- Comment #8 from Yichao Yu --- And the reason I reported this as a mis-optimization rather than something completely unsupported is that the following code,

```
#include <stdio.h>

// #define disable_opt __attribute__((flatten))
#define disable_opt

disable_opt __attribute__ ((target ("default")))
static unsigned foo(const char *buf, unsigned size)
{
    return 1;
}

disable_opt __attribute__ ((target ("avx")))
static unsigned foo(const char *buf, unsigned size)
{
    return 2;
}

disable_opt __attribute__ ((target ("avx2")))
static unsigned foo(const char *buf, unsigned size)
{
    return 3;
}

__attribute__ ((target ("default")))
unsigned bar()
{
    char buf[4096];
    unsigned acc = 0;
    for (int i = 0; i < sizeof(buf); i++) {
        acc += foo(&buf[i], 1);
    }
    return acc;
}

__attribute__ ((target ("avx")))
unsigned bar()
{
    char buf[4096];
    unsigned acc = 0;
    for (int i = 0; i < sizeof(buf); i++) {
        acc += foo(&buf[i], 1);
    }
    return acc;
}

int main()
{
    printf("%u\n", bar());
    return 0;
}
```

when compiled with `#define disable_opt`, prints the wrong answer `8192` on my avx2 laptop. OTOH, with `#define disable_opt __attribute__((flatten))` to disable the inlining that triggers the bug, it prints the correct result of 12288. Other ways of forcing an independent dispatch, like the following using a volatile slot, also work.

```
#include <stdio.h>

__attribute__ ((target ("default")))
static unsigned _foo(const char *buf, unsigned size)
{
    return 1;
}

__attribute__ ((target ("avx")))
static unsigned _foo(const char *buf, unsigned size)
{
    return 2;
}

__attribute__ ((target ("avx2")))
static unsigned _foo(const char *buf, unsigned size)
{
    return 3;
}

static unsigned (* volatile foo)(const char *buf, unsigned size) = _foo;

__attribute__ ((target ("default")))
unsigned bar()
{
    char buf[4096];
    unsigned acc = 0;
    for (int i = 0; i < sizeof(buf); i++) {
        acc += foo(&buf[i], 1);
    }
    return acc;
}

__attribute__ ((target ("avx")))
unsigned bar()
{
    char buf[4096];
    unsigned acc = 0;
    for (int i = 0; i < sizeof(buf); i++) {
        acc += foo(&buf[i], 1);
    }
    return acc;
}

int main()
{
    printf("%u\n", bar());
    return 0;
}
```

I think this suggests that the most basic codegen without optimization is clearly working, and this usage (be it nested multiversioning or not) isn't something that's just not supported. Rather, it's only the optimization that's wrong.
[Bug ipa/95790] Incorrect static target dispatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790 --- Comment #7 from Yichao Yu ---
> Your testcase has nested function multi-versioning. I don't think it works at all. I opened PR 95793.

I'm sorry, but what is nested function multi-versioning? And what's the difference between the test case here and the one in PR95793?
[Bug ipa/95790] Incorrect static target dispatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790 --- Comment #5 from Yichao Yu --- It's wrong when running on a target that has avx512f. The unoptimized version will call the correct foo but the optimized one won't. As I said, this is an issue when the sets of targets differ between the callee and the caller.
[Bug ipa/95790] Incorrect static target dispatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790 --- Comment #3 from Yichao Yu --- And the assembly showing the correct dispatch is

            .file   "a.c"
            .text
            .p2align 4
            .type   _ZL3fooPKcj, @function
    _ZL3fooPKcj:
    .LFB0:
            .cfi_startproc
            movl    $1, %eax
            ret
            .cfi_endproc
    .LFE0:
            .size   _ZL3fooPKcj, .-_ZL3fooPKcj
            .p2align 4
            .type   _ZL3fooPKcj.avx, @function
    _ZL3fooPKcj.avx:
    .LFB1:
            .cfi_startproc
            movl    $2, %eax
            ret
            .cfi_endproc
    .LFE1:
            .size   _ZL3fooPKcj.avx, .-_ZL3fooPKcj.avx
            .p2align 4
            .type   _ZL3fooPKcj.avx512f, @function
    _ZL3fooPKcj.avx512f:
    .LFB2:
            .cfi_startproc
            movl    $3, %eax
            ret
            .cfi_endproc
    .LFE2:
            .size   _ZL3fooPKcj.avx512f, .-_ZL3fooPKcj.avx512f
            .section        .text.unlikely,"ax",@progbits
    .LCOLDB0:
            .text
    .LHOTB0:
            .p2align 4
            .type   _ZL3fooPKcj.resolver, @function
    _ZL3fooPKcj.resolver:
    .LFB6:
            .cfi_startproc
            subq    $8, %rsp
            .cfi_def_cfa_offset 16
            call    __cpu_indicator_init@PLT
            movq    __cpu_model@GOTPCREL(%rip), %rax
            movl    12(%rax), %eax
            testb   $-128, %ah
            je      .L8
            leaq    _ZL3fooPKcj.avx512f(%rip), %rax
    .L7:
            addq    $8, %rsp
            .cfi_def_cfa_offset 8
            ret
            .cfi_endproc
            .section        .text.unlikely
            .cfi_startproc
            .type   _ZL3fooPKcj.resolver.cold, @function
    _ZL3fooPKcj.resolver.cold:
    .LFSB6:
    .L8:
            .cfi_def_cfa_offset 16
            testb   $2, %ah
            leaq    _ZL3fooPKcj.avx(%rip), %rdx
            leaq    _ZL3fooPKcj(%rip), %rax
            cmovne  %rdx, %rax
            jmp     .L7
            .cfi_endproc
    .LFE6:
            .text
            .size   _ZL3fooPKcj.resolver, .-_ZL3fooPKcj.resolver
            .section        .text.unlikely
            .size   _ZL3fooPKcj.resolver.cold, .-_ZL3fooPKcj.resolver.cold
    .LCOLDE0:
            .text
    .LHOTE0:
            .type   _Z11_ZL3fooPKcjPKcj, @gnu_indirect_function
            .set    _Z11_ZL3fooPKcjPKcj,_ZL3fooPKcj.resolver
            .p2align 4
            .globl  _Z3barv
            .type   _Z3barv, @function
    _Z3barv:
    .LFB3:
            .cfi_startproc
            pushq   %r12
            .cfi_def_cfa_offset 16
            .cfi_offset 12, -16
            xorl    %r12d, %r12d
            pushq   %rbp
            .cfi_def_cfa_offset 24
            .cfi_offset 6, -24
            pushq   %rbx
            .cfi_def_cfa_offset 32
            .cfi_offset 3, -32
            subq    $4112, %rsp
            .cfi_def_cfa_offset 4144
            movq    %fs:40, %rax
            movq    %rax, 4104(%rsp)
            xorl    %eax, %eax
            movq    %rsp, %rbx
            leaq    4096(%rsp), %rbp
            .p2align 4,,10
            .p2align 3
    .L12:
            movq    %rbx, %rdi
            movl    $1, %esi
            addq    $1, %rbx
            call    _Z11_ZL3fooPKcjPKcj@PLT
            addl    %eax, %r12d
            cmpq    %rbp, %rbx
            jne     .L12
            movq    4104(%rsp), %rax
            subq    %fs:40, %rax
            jne     .L16
            addq    $4112, %rsp
            .cfi_remember_state
            .cfi_def_cfa_offset 32
            movl    %r12d, %eax
            popq    %rbx
            .cfi_def_cfa_offset 24
            popq    %rbp
            .cfi_def_cfa_offset 16
            popq    %r12
            .cfi_def_cfa_offset 8
            ret
    .L16:
            .cfi_restore_state
            call    __stack_chk_fail@PLT
            .cfi_endproc
    .LFE3:
            .size   _Z3barv, .-_Z3barv
            .p2align 4
            .globl  _Z3barv.avx
            .type   _Z3barv.avx, @function
    _Z3barv.avx:
    .LFB4:
            .cfi_startproc
            pushq   %r12
            .cfi_def_cfa_offset 16
            .cfi_offset 12, -16
            xorl    %r12d, %r12d
            pushq   %rbp
            .cfi_def_cfa_offset 24
            .cfi_offset 6, -24
            pushq   %rbx
            .cfi_def_cfa_offset 32
            .cfi_offset 3, -32
            subq    $4112, %rsp
            .cfi_def_cfa_offset 4144
            movq    %fs:40, %rax
            movq    %rax, 4104(%rsp)
            xorl    %eax, %eax
            movq    %rsp, %rbx
            leaq    4096(%rsp), %rbp
            .p2align 4,,10
            .p2align 3
    .L18:
            movq    %rbx, %rdi
            movl    $1, %esi
            addq    $1, %rbx
            call    _Z11_ZL3fooPKcjPKcj@PLT
            addl    %eax, %r12d
            cmpq    %rbp, %rbx
            jne     .L18
            movq    4104(%rsp), %rax
            subq    %fs:40, %rax
            jne     .L22
            addq    $4112, %rsp
            .cfi_remember_state
            .cfi_def_cfa_offset 32
            movl    %r12d, %eax
            popq    %rbx
            .cfi_def_cfa_offset 24
            popq    %rbp
            .cfi_def_cfa_offset 16
            popq    %r12
            .cfi_def_cfa_offset 8
            ret
    .L22:
            .cfi_restore_state
            call    __stack_chk_fail@PLT
            .cfi_endproc
    .LFE4:
            .size   _Z3barv.avx, .-_Z3barv.avx
            .ident  "GCC: (GNU) 10.1.0"
            .section        .note.GNU-stack,"",@progbits
[Bug ipa/95790] Incorrect static target dispatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

--- Comment #2 from Yichao Yu ---
The C++ code attached above produces the following incorrect code with `g++ -O2 -S`:

    .file "a.c"
    .text
    .p2align 4
    .globl _Z3barv
    .type _Z3barv, @function
_Z3barv:
.LFB3:
    .cfi_startproc
    movl $4096, %eax
    ret
    .cfi_endproc
.LFE3:
    .size _Z3barv, .-_Z3barv
    .p2align 4
    .globl _Z3barv.avx
    .type _Z3barv.avx, @function
_Z3barv.avx:
.LFB4:
    .cfi_startproc
    movl $8192, %eax
    ret
    .cfi_endproc
.LFE4:
    .size _Z3barv.avx, .-_Z3barv.avx
    .ident "GCC: (GNU) 10.1.0"
    .section .note.GNU-stack,"",@progbits

Triggering the bug PR95778 with

__attribute__ ((flatten,target ("default")))
static unsigned foo(const char *buf, unsigned size) {
    return 1;
}

__attribute__ ((flatten,target ("avx")))
static unsigned foo(const char *buf, unsigned size) {
    return 2;
}

__attribute__ ((flatten,target ("avx512f")))
static unsigned foo(const char *buf, unsigned size) {
    return 3;
}

__attribute__ ((target ("default")))
unsigned bar() {
    char buf[4096];
    unsigned acc = 0;
    for (int i = 0; i < sizeof(buf); i++) {
        acc += foo(&buf[i], 1);
    }
    return acc;
}

__attribute__ ((target ("avx")))
unsigned bar() {
    char buf[4096];
    unsigned acc = 0;
    for (int i = 0; i < sizeof(buf); i++) {
        acc += foo(&buf[i], 1);
    }
    return acc;
}

produces the correct code.
[Bug other/95778] target_clones indirection eliminates requires noinline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778

--- Comment #4 from Yichao Yu ---
Yeah, after digging further, the two issues are indeed the same. I initially didn't think they were, since I hadn't noticed PR95786 (the visibility attribute is simply ignored completely...) and thought static was handled specially.

It also seems that inlining can work when the target attribute is used directly. Maybe a pass-ordering issue? That's certainly a different issue, so I'll file another one, if there isn't one already, when I have time.
[Bug ipa/95790] New: Incorrect static target dispatch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95790

Bug ID: 95790
Summary: Incorrect static target dispatch
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: ipa
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
Target Milestone: ---

The indirection elimination code currently only checks that the target of the specific version matches; it doesn't check whether all the targets match.

Modifying from https://github.com/gcc-mirror/gcc/commit/b8ce8129a560f64f8b2855c4a3812b7c3c0ebf3f#diff-e2d535917af8555baad2e9c8749e96a5

```
__attribute__ ((target ("default")))
static unsigned foo(const char *buf, unsigned size) {
    return 1;
}

__attribute__ ((target ("avx")))
static unsigned foo(const char *buf, unsigned size) {
    return 2;
}

__attribute__ ((target ("avx512f")))
static unsigned foo(const char *buf, unsigned size) {
    return 3;
}

__attribute__ ((target ("default")))
unsigned bar() {
    char buf[4096];
    unsigned acc = 0;
    for (int i = 0; i < sizeof(buf); i++) {
        acc += foo(&buf[i], 1);
    }
    return acc;
}

__attribute__ ((target ("avx")))
unsigned bar() {
    char buf[4096];
    unsigned acc = 0;
    for (int i = 0; i < sizeof(buf); i++) {
        acc += foo(&buf[i], 1);
    }
    return acc;
}
```

With the optimization disabled, which is possible by adding a flatten attribute to the functions and triggering PR95780 and PR95778, a resolver function is automatically generated for foo like

```
    .text
.LHOTB0:
    .p2align 4
    .type _ZL3fooPKcj.resolver, @function
_ZL3fooPKcj.resolver:
    subq $8, %rsp
    call __cpu_indicator_init@PLT
    movq __cpu_model@GOTPCREL(%rip), %rax
    movl 12(%rax), %eax
    testb $-128, %ah
    je .L8
    leaq _ZL3fooPKcj.avx512f(%rip), %rax
.L7:
    addq $8, %rsp
    ret
    .section .text.unlikely
    .type _ZL3fooPKcj.resolver.cold, @function
_ZL3fooPKcj.resolver.cold:
.L8:
    testb $2, %ah
    leaq _ZL3fooPKcj.avx(%rip), %rdx
    leaq _ZL3fooPKcj(%rip), %rax
    cmovne %rdx, %rax
    jmp .L7
    .text
    .size _ZL3fooPKcj.resolver, .-_ZL3fooPKcj.resolver
    .section .text.unlikely
    .size _ZL3fooPKcj.resolver.cold, .-_ZL3fooPKcj.resolver.cold
.LCOLDE0:
    .text
.LHOTE0:
    .type _Z11_ZL3fooPKcjPKcj, @gnu_indirect_function
    .set _Z11_ZL3fooPKcjPKcj,_ZL3fooPKcj.resolver
```

and the calls from bar go through the PLT. This is the correct behavior (albeit sub-optimal, since the default could call the default directly) and allows the avx512f version of foo to be called on the correct processor from the avx version of bar.

With the optimization enabled, however, the calls to foo are inlined into bar and the avx512f version is never used.

This is somewhat of a regression caused by b8ce8129a560f64f8b2855c4a3812b7c3c0ebf3f. It'll also affect my fix for PR95780 and PR95778. https://gcc.gnu.org/pipermail/gcc-patches/2020-June/548631.html
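The selection order encoded by the resolver assembly in this report (try avx512f first, then avx, then fall back to the default) can be sketched in C. This is only an illustration and not the code GCC generates: it assumes GCC's x86 builtins `__builtin_cpu_init` and `__builtin_cpu_supports`, and the names `foo_default`, `foo_avx`, `foo_avx512f`, `resolve_foo` and `call_foo` are hypothetical.

```c
#include <assert.h>

/* Hypothetical stand-ins for the three versions of foo in the report. */
static unsigned foo_default(const char *buf, unsigned size) { (void)buf; (void)size; return 1; }
static unsigned foo_avx(const char *buf, unsigned size)     { (void)buf; (void)size; return 2; }
static unsigned foo_avx512f(const char *buf, unsigned size) { (void)buf; (void)size; return 3; }

typedef unsigned (*foo_fn)(const char *, unsigned);

/* Pick the most specific version the running CPU supports.
   On non-x86 targets this sketch always falls back to the default. */
static foo_fn resolve_foo(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f"))
        return foo_avx512f;
    if (__builtin_cpu_supports("avx"))
        return foo_avx;
#endif
    /* Keep the unused versions referenced so the sketch compiles cleanly everywhere. */
    (void)foo_avx;
    (void)foo_avx512f;
    return foo_default;
}

/* What a call through the ifunc amounts to: resolve, then call. */
unsigned call_foo(const char *buf, unsigned size)
{
    foo_fn fn = resolve_foo();
    assert(fn != 0);
    return fn(buf, size);
}
```

The point of the report is that this choice depends on the running CPU, not on the caller's own target: an avx version of bar running on an AVX-512 machine should still reach the avx512f version of foo, which inlining the avx version of foo into it prevents.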
[Bug tree-optimization/95786] New: Too aggressive target indirection elimination
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95786

Bug ID: 95786
Summary: Too aggressive target indirection elimination
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

I noticed this issue while debugging PR95778 and PR95780 (ref https://gcc.gnu.org/pipermail/gcc-patches/2020-June/548631.html).

It seems that the indirection elimination logic does not take into account the linkage and visibility of the callee. It will eliminate the indirection even in cases where a function without a target attribute would have used the PLT and, for example, allowed an override from a different library.

The following code generates a direct call between g2 and f2 without going through the PLT.

```
__attribute__((target_clones("default,avx2")))
int f2(int *p)
{
    asm volatile ("" :: "r"(p) : "memory");
    return *p;
}

__attribute__((target_clones("default,avx2")))
int g2(int *p)
{
    return f2(p);
}
```

but removing the target_clones attribute uses the PLT.
[Bug other/95778] target_clones indirection eliminates requires noinline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778

--- Comment #2 from Yichao Yu ---
Also, the original code example had an error; the code that works properly was

```
static __attribute__((noinline,target_clones("default,avx2")))
int f2(int *p)
{
    asm volatile ("" :: "r"(p) : "memory");
    return *p;
}

__attribute__((noinline,target_clones("default,avx2")))
int g2(int *p)
{
    return f2(p);
}
```
[Bug other/95778] target_clones indirection eliminates requires noinline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778

--- Comment #1 from Yichao Yu ---
Ah, I think this might be the fix for both this issue and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95780 . I'll test more and will try to submit it later.

```
diff --git a/gcc/multiple_target.c b/gcc/multiple_target.c
index c1cfe8ff978..79a4c87545f 100644
--- a/gcc/multiple_target.c
+++ b/gcc/multiple_target.c
@@ -483,7 +483,7 @@ redirect_to_specific_clone (cgraph_node *node)
                                             DECL_ATTRIBUTES (e->callee->decl));

          /* Function is not calling proper target clone.  */
-         if (!attribute_list_equal (attr_target, attr_target2))
+         if (!attribute_value_equal (attr_target, attr_target2))
            {
              while (fv2->prev != NULL)
                fv2 = fv2->prev;
@@ -494,7 +494,7 @@ redirect_to_specific_clone (cgraph_node *node)
                  cgraph_node *callee = fv2->this_node;
                  attr_target2 = lookup_attribute ("target",
                                                   DECL_ATTRIBUTES (callee->decl));
-                 if (attribute_list_equal (attr_target, attr_target2))
+                 if (attribute_value_equal (attr_target, attr_target2))
                    {
                      e->redirect_callee (callee);
                      cgraph_edge::redirect_call_stmt_to_callee (e);
```
[Bug other/95781] New: Missing dead code elimination when a recursive function is inlined.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95781

Bug ID: 95781
Summary: Missing dead code elimination when a recursive function is inlined.
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Code,

```
static int f2(int *p, int k)
{
    int res = 0;
    if (k > 0)
        res += f2(p, k - 1);
    return *p + res;
}

int g2(int *p)
{
    return f2(p, 3);
}
```

Compiling with -O3, the code produced for `g2` is

```
g2:
    movl (%rdi), %eax
    sall $2, %eax
    ret
```

i.e. `*p * 4`, which doesn't need to call `f2`. However, the code for `f2` is still generated even though it is never used.

It seems that this only happens when the recursive function is sufficiently complex. Replacing `*p` with a constant, or making the `k > 0` branch return directly, produces code that does not have `f2` in it. It seems that there's some smart late optimization pass that doesn't have a global DCE pass afterwards?

This looks similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80680 but I'm not sure if they have the same root cause.
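To see why the optimized `g2` reduces to `*p * 4`: the recursive helper adds `*p` once per level, and a depth of 3 gives four levels. The sketch below spells that out in plain C. It assumes the recursive helper is named `f2`; `g2_folded` and `check_g2` are hypothetical names added only for the comparison.

```c
#include <assert.h>

/* Recursive helper from the report: adds *p once per level, k + 1 levels in total. */
static int f2(int *p, int k)
{
    int res = 0;
    if (k > 0)
        res += f2(p, k - 1);
    return *p + res;
}

/* Entry point from the report: depth 3, so *p is summed four times. */
int g2(int *p)
{
    return f2(p, 3);
}

/* Hand-folded form matching the two-instruction body (load, shift left by 2). */
int g2_folded(int *p)
{
    return *p * 4;
}

/* Checks that the recursive and folded forms agree for one input value. */
void check_g2(int value)
{
    int x = value;
    assert(g2(&x) == g2_folded(&x));
}
```

Since `f2` is static and every call has been folded away, nothing in the translation unit refers to it any more, which is why its leftover body is dead code.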
[Bug other/95780] New: target_clones treats internal visibility different from static functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95780

Bug ID: 95780
Summary: target_clones treats internal visibility different from static functions
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Again using the code in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778. If the static function `f2` is changed to `visibility("internal")`, i.e.

```
__attribute__((visibility("internal"),noinline,target_clones("default,avx2")))
int f2(int *p)
{
    asm volatile ("" :: "r"(p) : "memory");
    return *p;
}

__attribute__((noinline,target_clones("default,avx2")))
int g2(int *p)
{
    return f2(p);
}
```

the call to `f2` will then use the PLT again. Without `target_clones`, the two have similar effects and both produce a direct call.
[Bug other/95779] New: Unnecessary dispatch function for static target_clones function.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95779

Bug ID: 95779
Summary: Unnecessary dispatch function for static target_clones function.
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Using the code in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778, the full assembly generated (the version with both noinline, unwind info disabled) is

```
    .file "b.c"
    .text
    .p2align 4
    .type f2.default.1, @function
f2.default.1:
    movl (%rdi), %eax
    ret
    .size f2.default.1, .-f2.default.1
    .p2align 4
    .type g2.default.1, @function
g2.default.1:
    jmp f2.default.1
    .size g2.default.1, .-g2.default.1
    .p2align 4
    .type f2.avx2.0, @function
f2.avx2.0:
    movl (%rdi), %eax
    ret
    .size f2.avx2.0, .-f2.avx2.0
    .p2align 4
    .type g2.avx2.0, @function
g2.avx2.0:
    jmp f2.avx2.0
    .size g2.avx2.0, .-g2.avx2.0
    .section .text.g2.resolver,"axG",@progbits,g2.resolver,comdat
    .p2align 4
    .weak g2.resolver
    .type g2.resolver, @function
g2.resolver:
    subq $8, %rsp
    call __cpu_indicator_init@PLT
    movq __cpu_model@GOTPCREL(%rip), %rax
    leaq g2.avx2.0(%rip), %rdx
    testb $4, 13(%rax)
    leaq g2.default.1(%rip), %rax
    cmovne %rdx, %rax
    addq $8, %rsp
    ret
    .size g2.resolver, .-g2.resolver
    .globl g2
    .type g2, @gnu_indirect_function
    .set g2,g2.resolver
    .text
    .p2align 4
    .type f2.resolver, @function
f2.resolver:
    subq $8, %rsp
    call __cpu_indicator_init@PLT
    movq __cpu_model@GOTPCREL(%rip), %rax
    leaq f2.avx2.0(%rip), %rdx
    testb $4, 13(%rax)
    leaq f2.default.1(%rip), %rax
    cmovne %rdx, %rax
    addq $8, %rsp
    ret
    .size f2.resolver, .-f2.resolver
    .ident "GCC: (GNU) 10.1.0"
    .section .note.GNU-stack,"",@progbits
```

AFAICT the `f2.resolver` is never used anywhere and can be omitted (all callers of `f2` are statically dispatched).
[Bug other/95778] New: target_clones indirection eliminates requires noinline
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95778

Bug ID: 95778
Summary: target_clones indirection eliminates requires noinline
Product: gcc
Version: 10.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Compiling

```
static __attribute__((noinline,target_clones("default,avx2")))
int f2(int *p)
{
    asm volatile ("" :: "r"(p) : "memory");
    return *p;
}

__attribute__((target_clones("default,avx2")))
int g2(int *p)
{
    return f2(p);
}
```

with `-fPIC -O3` generates

```
g2.avx2.0:
    jmp f2.avx2.0
```

However, if either of the two `noinline` attributes is removed, the generated code becomes

```
g2.avx2.0:
    jmp f2@PLT
```

which cannot be eliminated later (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95776).

I think eliminating the indirection should be possible, and possible without LTO (hence a slightly different bug from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95776, even though if that one is fixed, turning on LTO can partially fix this).

Also, in this case, `f2` should be inlinable into `g2`. However, no combination of `inline`, `always_inline` and `flatten` that I've tested can do that, even though when both functions are marked with `noinline` gcc clearly knows which function is calling what, so it should have no problem inlining.
[Bug c/95777] New: Allow specifying more than one target options at the same time in target and target_clones attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95777

Bug ID: 95777
Summary: Allow specifying more than one target options at the same time in target and target_clones attribute
Product: gcc
Version: 10.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Currently it seems (from the documentation and my own tests) that only a single option is allowed for each version of the function using `target` and `target_clones`. This can be a problem for options that are not strict subsets of each other (e.g. the AVX512 ones, IIUC). Of course, specifying `cpu=haswell` and `cpu=skylake` for the same target doesn't make much sense, so some checking should be in place, but I believe specifying multiple directly testable features at the same time should be allowed.

A related issue is that while one can indeed do some of this by specifying an `arch=`, the version will still not get selected if the name doesn't exactly match, even if the runtime CPU supports all the features (tested with `arch=haswell` on my Kaby Lake laptop). If a fallback could be implemented to make this work, that would also be good enough, for me at least...
[Bug lto/95776] New: Reduce indirection with target_clones at link time (with LTO)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95776

Bug ID: 95776
Summary: Reduce indirection with target_clones at link time (with LTO)
Product: gcc
Version: 10.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: lto
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
Target Milestone: ---

Currently, if a function is not visible outside the final library (static, or internal or hidden visibility), the call through the PLT will be replaced with a direct call to the function.

With target_clones, this is also possible within the same compilation unit for static functions as callees. A caller that has the same cloning attribute will simply call the cloned function without indirection.

However, this stops working when the two are combined. Even with the maximum options and attributes to help it (hidden visibility, same compilation unit, -Wl,-Bsymbolic, LTO), the call to the cloned function from a caller with a matching cloning attribute still goes through the PLT.

Test code

```
__attribute__((noinline,visibility("hidden")))
int f1(int *p)
{
    asm volatile ("" :: "r"(p) : "memory");
    return *p;
}

__attribute__((noinline,visibility("hidden"),target_clones("default,avx2")))
int f2(int *p)
{
    asm volatile ("" :: "r"(p) : "memory");
    return *p;
}

__attribute__((noinline))
int g1(int *p)
{
    return f1(p);
}

__attribute__((noinline,target_clones("default,avx2")))
int g2(int *p)
{
    return f2(p);
}
```

Compiled with `-fPIC -flto -O3 -Wl,-Bsymbolic -shared`. The `f1` call calls `f1` directly, whereas the two cloned `f2` calls both call `f2@plt`. The same also applies to inlining: target_clones kills inlining even with LTO on.

I assume this happens because this can only be done at link time, which either didn't get passed enough info to determine this or simply didn't get implemented? I assume this should be possible since it can be done within a single compilation unit.
[Bug target/95775] New: Command line argument for target_clones?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95775

Bug ID: 95775
Summary: Command line argument for target_clones?
Product: gcc
Version: 10.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Would it make sense to add a command line argument that is roughly equivalent to adding `target_clones` to all functions?

In terms of usefulness, I believe it would be a very cheap way for many libraries to turn on the support with minimal code change. It certainly won't be as optimized as the best possible, but neither is the target_clones attribute itself compared to hand-written implementations using compiler intrinsics/assembly...

In terms of implementation, I believe most of the issues I've hit when adding this attribute to functions have been fixed, so I have little issue using it now. It would also be a new feature, so it shouldn't really break any existing code.

And as a further improvement, the compiler should have fair knowledge of which instructions can be/have been used and can omit some of the cloning in order to reduce code size. I don't think this needs to be included in the first version though...

And IIUC this is something that icc does automatically? (If that can serve as an argument for this feature...)
[Bug lto/94659] New: Missing symbol with LTO and target_clones
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94659

Bug ID: 94659
Summary: Missing symbol with LTO and target_clones
Product: gcc
Version: 9.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: lto
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
Target Milestone: ---

This is basically the same as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732 except that now it only happens with LTO enabled.

It seems that if a function with the `target_clones` attribute isn't used in the final library and LTO is enabled, the function will be missing from the resulting library. Only the `.resolver` symbol appears. The test code is

```
// b.c
__attribute__((target_clones("default,avx")))
int f1()
{
    return 2;
}
```

When compiled with `gcc -g -flto -O3 -fPIC b.c -shared -o libb-lto.so`, the exported symbols available are

```
$ objdump -T libb-lto.so

libb-lto.so: file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
w D *UND* _ITM_deregisterTMCloneTable
w D *UND* __gmon_start__
w D *UND* _ITM_registerTMCloneTable
w DF *UND* GLIBC_2.2.5 __cxa_finalize
1730 g DF .text 002b Base f1.resolver
```

Compare this to the output library from `gcc -g -O3 -fPIC b.c -shared -o libb.so`

```
$ objdump -T libb.so

libb.so: file format elf64-x86-64

DYNAMIC SYMBOL TABLE:
w D *UND* _ITM_deregisterTMCloneTable
w D *UND* __gmon_start__
w D *UND* _ITM_registerTMCloneTable
w DF *UND* GLIBC_2.2.5 __cxa_finalize
1730 w DF .text 002b Base f1.resolver
1730 g iD .text 002b Base f1
```

The exported symbol has the wrong name for the LTO version. The `dlsym` result confirms the difference.

If the function is used somewhere else in the library, the resulting symbols then look the same as in the non-LTO version.
[Bug ipa/94656] New: target_clones on alias leads to segfault in the compiler
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94656

Bug ID: 94656
Summary: target_clones on alias leads to segfault in the compiler
Product: gcc
Version: 9.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: ipa
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
CC: marxin at gcc dot gnu.org
Target Milestone: ---

Compiling the following code with `gcc -c` leads to a segfault in the compiler's targetclone pass.

```
__attribute__((target_clones("default,avx"))) void f1()
{
}

__attribute__((target_clones("default,avx"))) void f2() __attribute__((alias("f1")));
```

The error was

```
during IPA pass: targetclone
src/s_nextafterl.c:6:1: internal compiler error: Segmentation fault
    6 | __attribute__((target_clones("default,avx"))) void f2() __attribute__((alias("f1")));
      | ^
```

Now, this came from a hack I was playing around with, and I'm not going to argue about whether having target_clones on an alias should be supported (though if the targets agree I think it would be nice to support it... and if they don't agree, a warning would be better IMHO). However, I don't think a segfault is what should happen here = =
[Bug libstdc++/92759] New: Typo in libstdcxx/v6/xmethods.py
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92759

Bug ID: 92759
Summary: Typo in libstdcxx/v6/xmethods.py
Product: gcc
Version: 9.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

I get the following warning when running gdb/rr.

```
/usr/lib/../share/gcc-9.2.0/python/libstdcxx/v6/xmethods.py:731: SyntaxWarning: list indices must be integers or slices, not str; perhaps you missed a comma?
  refcounts = ['_M_refcount']['_M_pi']
```

Looking at the [code](https://github.com/gcc-mirror/gcc/blob/daa87973f7a00bf3bb81d0644dd60f4efb83bb65/libstdc%2B%2B-v3/python/libstdcxx/v6/xmethods.py#L731) I think that line should read

```
refcounts = obj['_M_refcount']['_M_pi']
```

instead.

I could submit a patch but I feel like it'll be faster/easier for someone here to just fix this.
[Bug target/54412] minimal 32-byte stack alignment with -mavx on 64-bit Windows
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412

--- Comment #29 from Yichao Yu ---
See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412#c25

GCC is fully capable of aligning the stack. It just seems that different parts of it disagree on what the current stack alignment is and whether a realignment is needed.
[Bug target/90826] Weak symbol does not work reliably on windows
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90826 --- Comment #2 from Yichao Yu --- Also, I just upgraded the compiler on this computer from 7.x to 9.1.0. The issue appeared before the upgrade as well but I didn't investigate until the upgrade finished.
[Bug target/90826] Weak symbol does not work reliably on windows
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90826 --- Comment #1 from Yichao Yu --- Oh, forgot to mention that the first assembly was generated with -O3 and adding `.weak f` to the generated file fixes the issue as well.
[Bug target/90826] New: Weak symbol does not work reliably on windows
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90826

Bug ID: 90826
Summary: Weak symbol does not work reliably on windows
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

The following code does not link correctly with all optimization levels on windows with the mingw-w64-x86_64-g++ compiler.

```
#include <stdint.h>

extern "C" void f() __attribute__((weak));

int main()
{
    return (int)(uintptr_t)f;
}
```

The assembly generated is

```
    .file "weak.cpp"
    .text
    .def __main; .scl 2; .type 32; .endef
    .section .text.startup,"x"
    .p2align 4
    .globl main
    .def main; .scl 2; .type 32; .endef
    .seh_proc main
main:
.LFB1:
    subq $56, %rsp
    .seh_stackalloc 56
    .seh_endprologue
    call __main
    movq .refptr.f(%rip), %rax
    movq %rax, 40(%rsp)
    addq $56, %rsp
    ret
    .seh_endproc
    .ident "GCC: (Rev2, Built by MSYS2 project) 9.1.0"
    .section .rdata$.refptr.f, "dr"
    .globl .refptr.f
    .linkonce discard
.refptr.f:
    .quad f
```

and the error,

```
C:\msys64\tmp\ccQkPfOi.o:weak.cpp:(.rdata$.refptr.f[.refptr.f]+0x0): undefined reference to `f'
```

This should not happen since `f` is declared weak. (I realized that the symbol resolution happens at compile time for weak symbols, which is fine for me, but I just want it to compile...)
Another case, where the optimization actually makes this work, is

```
#include <stdio.h>

extern "C" void f() __attribute__((weak));

int main()
{
    printf("%p\n", f);
    return 0;
}
```

With -O0, the assembly generated is

```
    .file "weak.cpp"
    .text
    .def __main; .scl 2; .type 32; .endef
    .section .rdata,"dr"
.LC0:
    .ascii "%p\12\0"
    .text
    .globl main
    .def main; .scl 2; .type 32; .endef
    .seh_proc main
main:
.LFB28:
    pushq %rbp
    .seh_pushreg %rbp
    movq %rsp, %rbp
    .seh_setframe %rbp, 0
    subq $32, %rsp
    .seh_stackalloc 32
    .seh_endprologue
    call __main
    movq .refptr.f(%rip), %rdx
    leaq .LC0(%rip), %rcx
    call printf
    movl $0, %eax
    addq $32, %rsp
    popq %rbp
    ret
    .seh_endproc
    .ident "GCC: (Rev2, Built by MSYS2 project) 9.1.0"
    .def printf; .scl 2; .type 32; .endef
    .section .rdata$.refptr.f, "dr"
    .globl .refptr.f
    .linkonce discard
.refptr.f:
    .quad f
```

with error,

```
C:\msys64\tmp\ccTiwMKh.o:weak.cpp:(.rdata$.refptr.f[.refptr.f]+0x0): undefined reference to `f'
```

With -O1 or higher, the assembly produced is

```
    .file "weak.cpp"
    .text
    .def __main; .scl 2; .type 32; .endef
    .section .rdata,"dr"
.LC0:
    .ascii "%p\12\0"
    .text
    .globl main
    .def main; .scl 2; .type 32; .endef
    .seh_proc main
main:
.LFB30:
    subq $40, %rsp
    .seh_stackalloc 40
    .seh_endprologue
    call __main
    leaq f(%rip), %rdx
    leaq .LC0(%rip), %rcx
    call printf
    movl $0, %eax
    addq $40, %rsp
    ret
    .seh_endproc
    .weak f
    .ident "GCC: (Rev2, Built by MSYS2 project) 9.1.0"
    .def f; .scl 2; .type 32; .endef
    .def printf; .scl 2; .type 32; .endef
    .section .rdata$.refptr.f, "dr"
    .globl .refptr.f
    .linkonce discard
.refptr.f:
    .quad f
```

The difference between the two assembly outputs is

```
--- weak1.s 2019-06-10 19:42:27.039467600 -0400
+++ weak0.s 2019-06-10 19:42:23.709467500 -0400
@@ -9,21 +9,24 @@
     .def main; .scl 2; .type 32; .endef
     .seh_proc main
 main:
-.LFB30:
-    subq $40, %rsp
-    .seh_stackalloc 40
+.LFB28:
+    pushq %rbp
+    .seh_pushreg %rbp
+    movq %rsp, %rbp
+    .seh_setframe %rbp, 0
+    subq $32, %rsp
+    .seh_stackalloc 32
     .seh_endprologue
     call __main
-    leaq f(%rip), %rdx
+    movq .refptr.f(%rip), %rdx
     leaq .LC0(%rip), %rcx
     call printf
     movl $0, %eax
-    addq $40, %rsp
+    addq $32, %rsp
+    popq %rbp
     ret
     .seh_endproc
-    .weak
```
[Bug c/90728] New: False positive Wmemset-elt-size with zero size array
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90728

Bug ID: 90728
Summary: False positive Wmemset-elt-size with zero size array
Product: gcc
Version: 9.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

The code below comes from a template expansion (when a certain cache feature is disabled) and all the operations on the `buff` member are no-ops.

```
#include <string.h>

struct A {
    A()
    {
        memset(&buff, 0xff, sizeof(buff));
    }
    int buff[0];
};
```

However, this starts to raise a warning on GCC 9

```
a.cpp: In constructor 'A::A()':
a.cpp:8:41: warning: 'memset' used with length equal to number of elements without multiplication by element size [-Wmemset-elt-size]
    8 |         memset(&buff, 0xff, sizeof(buff));
      |                                         ^
```

It seems that the warning logic simply compares the size (as well as checking element size != 1) without taking into account the 0 size case.
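The condition described in this report can be modelled in a few lines of C. This is a hypothetical model of the check as the report describes it, not GCC's actual implementation; the function names `warns_as_described` and `warns_with_zero_check` are made up for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the check as described in the report: warn when the
   memset length equals the number of elements and the element size is not 1.
   This is not GCC's actual code. */
bool warns_as_described(size_t len, size_t nelts, size_t elt_size)
{
    return len == nelts && elt_size != 1;
}

/* Same model with the zero-element case excluded, which is the case the
   report says is not taken into account. */
bool warns_with_zero_check(size_t len, size_t nelts, size_t elt_size)
{
    return nelts != 0 && len == nelts && elt_size != 1;
}

/* For `int buff[0]`, sizeof(buff) is 0 and the element count is 0, so the
   two values trivially compare equal and the first model fires. */
void check_zero_size_case(void)
{
    assert(warns_as_described(0, 0, sizeof(int)));
    assert(!warns_with_zero_check(0, 0, sizeof(int)));
}
```

Under this model a genuine mistake, such as passing a length of 8 for an `int[8]`, is still reported by both variants, while the zero-size array only trips the first one.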
[Bug tree-optimization/89582] Suboptimal code generated for floating point struct in -O3 compare to -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582

--- Comment #6 from Yichao Yu ---
For the vfloat test case, isn't the optimum code just

```
addps %xmm2, %xmm0
addps %xmm3, %xmm1
retq
```

It's not making full use of the vector but I assume not having to spill is a win? This is what clang produces.

And for the LLVM early lowering of the calling convention, a less awkward way is

```
define { <2 x float>, <2 x float> } @f2({<2 x float>, <2 x float>}, {<2 x float>, <2 x float>}) {
  %v0 = extractvalue { <2 x float>, <2 x float> } %0, 0
  %v1 = extractvalue { <2 x float>, <2 x float> } %0, 1
  %v2 = extractvalue { <2 x float>, <2 x float> } %1, 0
  %v3 = extractvalue { <2 x float>, <2 x float> } %1, 1
  %v5 = fadd <2 x float> %v0, %v2
  %v6 = fadd <2 x float> %v1, %v3
  %v7 = insertvalue { <2 x float>, <2 x float> } undef, <2 x float> %v5, 0
  %v8 = insertvalue { <2 x float>, <2 x float> } %v7, <2 x float> %v6, 1
  ret { <2 x float>, <2 x float> } %v8
}
```
[Bug target/89606] Extra mov after structure load instructions on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89606 --- Comment #1 from Yichao Yu --- Compiled a GCC 9 snapshot for pr89607 and the issue is still present.
[Bug target/89607] Missing optimization for store of multiple registers on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

--- Comment #8 from Yichao Yu ---
I see. I don't expect this to cause a major local speedup, though I assume it should at least not be slower? That's also why I mentioned that this should at least be done for `-Os`.
[Bug target/89607] Missing optimization for store of multiple registers on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

--- Comment #6 from Yichao Yu ---
> For aarch64, there was talk about adding stp for q registers.

What do you mean? I was initially unsure about it too, but I assume it already exists, since clang (and now GCC 9) emits it and the Arm architecture reference manual also mentions it without saying that it is only available in a later version.
[Bug target/89607] Missing optimization for store of multiple registers on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

--- Comment #5 from Yichao Yu ---
I just compiled the 9-20190303 snapshot and this indeed seems to be fixed. Should this be closed now or after GCC 9 is released?
[Bug target/89607] Missing optimization for store of multiple registers on arm and aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607 --- Comment #3 from Yichao Yu --- Done pr89614
[Bug target/89614] New: Missing optimization for store of multiple registers on arm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89614

Bug ID: 89614
Summary: Missing optimization for store of multiple registers on arm
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Separated from pr89607 as requested. Test code and result, compiled with any non-zero optimization level:

```
#include <arm_neon.h>

void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}

void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```

```
f4:
    vld2.32 {d16-d19}, [r1]
    vst1.64 {d16-d19}, [r0:64]
    bx lr
f5:
    vst1.64 {d0-d1}, [r0:64]
    vstr d2, [r0, #16]
    vstr d3, [r0, #24]
    bx lr
```

I believe `f5` should use a single `vst1.64 {d0-d3}, [r0:64]` just like `f4`. If for some reason doing that is bad for performance (doubt it...), it should at least be used for -Os.
[Bug target/89607] Missing optimization for store of multiple registers on arm and aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607 --- Comment #2 from Yichao Yu --- Sure. I'll do that.
[Bug target/89607] New: Missing optimization for store of multiple registers on arm and aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607 Bug ID: 89607 Summary: Missing optimization for store of multiple registers on arm and aarch64 Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

Test code, compiled for arm/aarch64 with -O1/-O2/-O3/-Os/-Ofast:

```
#include <arm_neon.h>

void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}

void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```

arm:

```
f4:
        vld2.32 {d16-d19}, [r1]
        vst1.64 {d16-d19}, [r0:64]
        bx      lr
f5:
        vst1.64 {d0-d1}, [r0:64]
        vstr    d2, [r0, #16]
        vstr    d3, [r0, #24]
        bx      lr
```

aarch64:

```
f4:
        ld2     {v0.4s - v1.4s}, [x1]
        str     q0, [x0]
        str     q1, [x0, 16]
        ret
f5:
        str     q0, [x0]
        str     q1, [x0, 16]
        ret
```

For arm, it seems that `f5` could follow `f4` and use a `vst1.64 {d0-d3}, [r0:64]` instead. For aarch64, both functions should have used a `stp q0, q1, [x0]`. Clang produces what I expected on aarch64, but it only uses pair store instructions on arm, which uses one more instruction for `f4` and one fewer for `f5`. (I'm not sure why GCC decided to use a pair store and then two single stores.) Similar to pr89606, this optimization should at least happen with `-Os` if not for all other optimization levels. Tested with 8.2.1 on arm and 8.3.0 on aarch64.
[Bug target/89606] New: Extra mov after structure load instructions on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89606 Bug ID: 89606 Summary: Extra mov after structure load instructions on aarch64 Product: gcc Version: 8.3.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

Code to reproduce:

```
#include <arm_neon.h>

#ifdef __aarch64__
float64x2x2_t f(const double *p1, const double *p2)
{
    float64x2x2_t v = vld2q_f64(p1);
    return vld2q_lane_f64(p2, v, 1);
}

float32x2x2_t f2(const float *p1, const float *p2)
{
    float32x2x2_t v = vld2_f32(p1);
    return vld2_lane_f32(p2, v, 1);
}
#endif

void f3(float32x2x2_t *p, const float *p1, const float *p2)
{
    float32x2x2_t v = vld2_f32(p1);
    *p = vld2_lane_f32(p2, v, 1);
}
```

GCC produces (aarch64, -O1/-O2/-O3/-Ofast/-Os):

```
f:
        ld2     {v4.2d - v5.2d}, [x0]
        mov     v0.16b, v4.16b
        mov     v1.16b, v5.16b
        ld2     {v0.d - v1.d}[1], [x1]
        ret
f2:
        ld2     {v0.2s - v1.2s}, [x0]
        mov     v2.8b, v0.8b
        mov     v3.8b, v1.8b
        ld2     {v2.s - v3.s}[1], [x1]
        mov     v1.8b, v3.8b
        mov     v0.8b, v2.8b
        ret
f3:
        ld2     {v2.2s - v3.2s}, [x1]
        mov     v0.8b, v2.8b
        mov     v1.8b, v3.8b
        ld2     {v0.s - v1.s}[1], [x2]
        stp     d0, d1, [x0]
        ret
```

For all three functions, none of the `mov`s seem necessary. Even if there's some performance issue when reusing the registers (I highly doubt it...), at least the `-Os` version should not have those `mov`s. Clang produces what I expect in this case:

```
f:
        ld2     { v0.2d, v1.2d }, [x0]
        ld2     { v0.d, v1.d }[1], [x1]
        ret
f2:
        ld2     { v0.2s, v1.2s }, [x0]
        ld2     { v0.s, v1.s }[1], [x1]
        ret
f3:
        ld2     { v0.2s, v1.2s }, [x1]
        ld2     { v0.s, v1.s }[1], [x2]
        stp     d0, d1, [x0]
        ret
```

Aarch32 doesn't have this issue with GCC either:

```
f3:
        vld2.32 {d16-d17}, [r1]
        vld2.32 {d16[1], d17[1]}, [r2]
        vst1.64 {d16-d17}, [r0:64]
        bx      lr
```

so this seems to be aarch64 specific.
[Bug target/89597] New: Inconsistent vector calling convention on windows with Clang and MSVC
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89597 Bug ID: 89597 Summary: Inconsistent vector calling convention on windows with Clang and MSVC Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

For 256bit and 512bit vector return values, Clang and MSVC always return them in the corresponding registers even without `__vectorcall`. GCC, however, returns the value by reference. Together with the missing support for `__vectorcall`[1], this means that the code GCC generates for functions that return vector values is not compatible with any other compiler. The problem does not exist for 128bit vectors.

Test case:

```
typedef double vdouble __attribute__((vector_size(32)));

vdouble f(vdouble x, vdouble y)
{
    return x + y;
}
```

GCC compiles this to

```
f:
        vmovapd (%r8), %ymm0
        vaddpd  (%rdx), %ymm0, %ymm0
        movq    %rcx, %rax
        vmovapd %ymm0, (%rcx)
        vzeroupper
        ret
```

Clang compiles this to

```
f:
        vmovapd (%rcx), %ymm0
        vaddpd  (%rdx), %ymm0, %ymm0
        retq
```

Given the stack alignment issue[2], I wonder if this can be fixed now without breaking anyone's code (i.e. everyone that's using it is probably broken anyway due to the other bug...).

Disclaimer: I did all my tests with clang. I believe MSVC behaves the same, based on the compiled result I got from someone else; I don't have MSVC to test it personally.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89485 [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412
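Until the by-value conventions agree, one way to sidestep the difference is to keep 256bit vectors out of the by-value part of any interface shared between compilers. The following is only a sketch of that idea: the names `vadd_by_ref` and `demo_sum` are made up for illustration, and the assumption that all three compilers treat the pointer arguments identically has not been verified here.

```c
/* Same vector type as in the test case above (GCC/Clang vector extension). */
typedef double vdouble __attribute__((vector_size(32)));

/* Hypothetical by-reference variant of f(): only pointers cross the
 * function boundary, so the vector return convention is not involved. */
void vadd_by_ref(vdouble *res, const vdouble *x, const vdouble *y)
{
    *res = *x + *y;
}

/* Small usage example: adds two vectors and sums the four lanes. */
double demo_sum(void)
{
    vdouble x = {1.0, 2.0, 3.0, 4.0};
    vdouble y = {10.0, 20.0, 30.0, 40.0};
    vdouble r;

    vadd_by_ref(&r, &x, &y);
    return r[0] + r[1] + r[2] + r[3];
}
```

The cost is an extra store and load at each call, so this is a compatibility workaround rather than a replacement for a matching calling convention.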
[Bug target/89581] Unneeded stack alignment on windows x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581 --- Comment #1 from Yichao Yu --- The problem is still there when compiled with -O2

```
f:
        pushq   %rbp
        vmovq   (%r8), %xmm1
        movq    %rcx, %rax
        vmovq   8(%r8), %xmm0
        vaddsd  (%rdx), %xmm1, %xmm1
        vaddsd  8(%rdx), %xmm0, %xmm0
        movq    %rsp, %rbp
        andq    $-16, %rsp
        vmovsd  %xmm1, (%rcx)
        vmovsd  %xmm0, 8(%rcx)
        leave
        ret
```

but is not there under `-O2` when the arguments and results are passed explicitly by reference.

```
void f2(vdouble *res, const vdouble *x, const vdouble *y)
{
    *res = (vdouble){x->x1 + y->x1, x->x2 + y->x2};
}
```

```
f2:
        vmovsd  8(%rdx), %xmm0
        vmovsd  (%rdx), %xmm1
        vaddsd  8(%r8), %xmm0, %xmm0
        vaddsd  (%r8), %xmm1, %xmm1
        vmovsd  %xmm0, 8(%rcx)
        vmovsd  %xmm1, (%rcx)
```

The problem comes back, however, with the explicit pass by reference version when compiled under -O3

```
f2:
        pushq   %rbp
        vmovapd (%rdx), %xmm0
        vaddpd  (%r8), %xmm0, %xmm0
        movq    %rsp, %rbp
        andq    $-16, %rsp
        vmovaps %xmm0, (%rcx)
        leave
        ret
```
[Bug target/89582] New: Suboptimal code generated for floating point struct in -O3 compare to -O2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89582 Bug ID: 89582 Summary: Suboptimal code generated for floating point struct in -O3 compare to -O2 Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

When testing the code for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581 on linux, I noticed that the code seems suboptimal when compiled under -O3 rather than -O2 on linux x64.

```
typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}
```

Compiling with `-O2` produces

```
f:
        addsd   %xmm3, %xmm1
        addsd   %xmm2, %xmm0
        ret
```

With `-O3` or `-Ofast`, however, the code produced is

```
f:
        movq    %xmm0, -40(%rsp)
        movq    %xmm1, -32(%rsp)
        movapd  -40(%rsp), %xmm4
        movq    %xmm2, -24(%rsp)
        movq    %xmm3, -16(%rsp)
        addpd   -24(%rsp), %xmm4
        movaps  %xmm4, -40(%rsp)
        movsd   -32(%rsp), %xmm1
        movsd   -40(%rsp), %xmm0
        ret
```

It seems that gcc tries to use the vector instruction but has to go through the stack to do that. I did a quick benchmark which confirms that the -O3 version is much slower than the -O2 version. Clang produces

```
f:
        addsd   %xmm2, %xmm0
        addsd   %xmm3, %xmm1
        retq
```

as long as any optimizations are on, which seems appropriate.
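The benchmark itself was not included in the report. A harness along the following lines would be enough to compare the two optimization levels; it is a rough sketch only, and `bench_f`, its parameters, and the choice of `clock()` are assumptions for illustration rather than what was actually run.

```c
#include <stddef.h>
#include <time.h>

typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

/* noinline so that every iteration goes through the by-value call. */
__attribute__((noinline))
vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}

/* Hypothetical harness: calls f() n times, optionally reports the elapsed
 * CPU time through *seconds, and returns x1 + x2 of the accumulator so the
 * loop cannot be optimized away. */
double bench_f(long n, double *seconds)
{
    vdouble acc = {0.0, 0.0};
    const vdouble step = {1.0, 2.0};
    clock_t t0 = clock();

    for (long i = 0; i < n; i++)
        acc = f(acc, step);

    clock_t t1 = clock();
    if (seconds != NULL)
        *seconds = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return acc.x1 + acc.x2;
}
```

Building the same file once with `-O2` and once with `-O3` and calling `bench_f` with a large `n` should make any difference between the two listings visible.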
[Bug target/89581] New: Unneeded stack alignment on windows x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89581 Bug ID: 89581 Summary: Unneeded stack alignment on windows x86 Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

On windows, when compiling the following code with `gcc -mavx2 a.c -o - -S -O3 -g0 -fno-asynchronous-unwind-tables -fomit-frame-pointer -Wall -Wextra`

```
typedef struct {
    double x1;
    double x2;
} vdouble __attribute__((aligned(16)));

vdouble f(vdouble x, vdouble y)
{
    return (vdouble){x.x1 + y.x1, x.x2 + y.x2};
}
```

I got

```
        pushq   %rbp
        vmovdqa (%r8), %xmm0
        movq    %rcx, %rax
        vaddpd  (%rdx), %xmm0, %xmm0
        movq    %rsp, %rbp
        andq    $-16, %rsp
        vmovaps %xmm0, (%rcx)
        leave
        ret
```

which includes 4 extra instructions to align the stack without actually using it. FWIW, clang has a similar problem on linux... https://bugs.llvm.org/show_bug.cgi?id=40844

Also worth noting that with -O2 all three vector instructions are split into scalar ones, whereas clang does this transformation at -O2...
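For readers less used to the idiom, the `andq $-16, %rsp` in the listing rounds the stack pointer down to a 16-byte boundary, since `-16` is `~15` in two's complement. The same arithmetic in C, as a small sketch (the helper names `align_down` and `is_aligned` are made up for illustration):

```c
#include <stdint.h>

/* Round addr down to a multiple of align; align must be a power of two.
 * align_down(sp, 16) computes the same value as `andq $-16, %rsp`. */
uintptr_t align_down(uintptr_t addr, uintptr_t align)
{
    return addr & ~(align - 1);
}

/* Nonzero if addr is already a multiple of align (a power of two). */
int is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr & (align - 1)) == 0;
}
```

Since the function never touches the realigned stack, the realignment and the frame pointer setup around it are pure overhead here.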
[Bug target/54412] minimal 32-byte stack alignment with -mavx on 64-bit Windows
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412 --- Comment #24 from Yichao Yu --- Oh, and the test case above was compiled with -O3 (and -g -Wall -Wextra).
[Bug target/54412] minimal 32-byte stack alignment with -mavx on 64-bit Windows
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412 Yichao Yu changed: What|Removed |Added CC||yyc1992 at gmail dot com --- Comment #23 from Yichao Yu --- > It is GCC does not realign the stack at all that is the issue. I hit another related issue that might confirm this as well. I noticed this when I tried to manually align the stack with inline assembly. C++ code reduced from my test case, ``` #include #include #include __attribute__((target("avx"))) __attribute__((noinline)) __m256d f(__m256d x, uint32_t a, const double *p) { __m256d res; asm volatile ("vxorpd %0, %0, %0" : "=x"(res), "+x"(x), "+r"(a), "+r"(p) :: "memory", "rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rbp"); return res; } __attribute__((target("avx"))) __attribute__((noinline)) __m256d f2(__m256d x, uint32_t a, const double *p) { __m256d res; asm volatile ("vxorpd %0, %0, %0" : "=x"(res), "+x"(x), "+r"(a), "+r"(p) :: "memory", "rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rbp"); return res; } __attribute__((target("avx"))) __attribute__((noinline)) __m256d f(__m256d x, __m256d y, __m256d z, uint32_t a, const double *p) { __m256d res; asm volatile ("vxorpd %0, %0, %0" : "=x"(res), "+x"(x), "+x"(y), "+x"(z), "+r"(a), "+r"(p) :: "memory", "rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rbp"); return res; } const double points[] = {0, 0.1, 0.2, 0.6}; __attribute__((target("avx"))) void test_avx() { f(__m256d{0, 0, 0, 0}, __m256d{0, 0, 0, 0}, __m256d{0, 0, 0, 0}, 4, points); f(__m256d{0, 0, 0, 0}, 4, points); } __attribute__((target("avx"))) void test_avx2() { f2(__m256d{0, 0, 0, 0}, 4, points); } static void call_aligned_stack(void (*p)(void)) { asm volatile ("movq %%rsp, %%rbp\n" "andq $-64, %%rsp\n" "subq $64, %%rsp\n" "callq *%0\n" "movq %%rbp, %%rsp\n" :: "r"(p) : "memory", "rax", "rcx", "rdx", "r8", "r9", "r10", "r11", "rbp"); } int main() { call_aligned_stack(test_avx); fprintf(stderr, "\n"); fflush(stderr); call_aligned_stack(test_avx2); return 0; } ``` (The `fprintf` is there only to make 
it easier to see when the crash happens.) The stack alignment code makes sure that the stack is aligned to 64bytes before making the `call`, which is verified in the debugger, however, when compiled with GCC 8.2.1 on msys2 (using the mingw-w64-x86_64-gcc package) the `test_avx` function is happy while `test_avx2` function is not. Looking at the generated code, for the crashing function: ``` 004015c0 <_Z9test_avx2v>: 4015c0: 48 83 ec 68 sub$0x68,%rsp 4015c4: c5 f9 57 c0 vxorpd %xmm0,%xmm0,%xmm0 4015c8: 4c 8d 0d 51 7a 00 00lea0x7a51(%rip),%r9# 409020 <_ZL6points> 4015cf: 41 b8 04 00 00 00 mov$0x4,%r8d 4015d5: 48 8d 4c 24 40 lea0x40(%rsp),%rcx 4015da: 48 8d 54 24 20 lea0x20(%rsp),%rdx 4015df: c5 fd 29 44 24 20 vmovapd %ymm0,0x20(%rsp) 4015e5: c5 f8 77vzeroupper 4015e8: e8 a3 ff ff ff callq 401590 <_Z2f2Dv4_djPKd> 4015ed: 90 nop 4015ee: 48 83 c4 68 add$0x68,%rsp 4015f2: c3 retq ``` which tries to write with 32byte alignment with a stack offset from the initial call instruction: -8 - 0x68 + 0x20 = -80. OTOH, for the "good" function, ``` 00401640 <_Z8test_avxv>: 401640: 57 push %rdi 401641: 56 push %rsi 401642: 53 push %rbx 401643: 48 81 ec b0 00 00 00sub$0xb0,%rsp 40164a: c5 d9 57 e4 vxorpd %xmm4,%xmm4,%xmm4 40164e: 48 8d 3d cb 79 00 00lea0x79cb(%rip),%rdi# 40902
[Bug c/89485] New: Support vectorcall calling convention on windows
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89485 Bug ID: 89485 Summary: Support vectorcall calling convention on windows Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: --- I'm very surprised that I didn't find an issue for this, so sorry if this was discussed/rejected somewhere else. It appears that both MSVC and clang support a vectorcall calling convention, which is very similar to the one used on linux and passes large vectors in the corresponding vector registers instead of on the stack. It would be nice if gcc could support that, both for efficiency and for compatibility. Ref https://docs.microsoft.com/en-us/cpp/cpp/vectorcall?view=vs-2017 https://clang.llvm.org/docs/AttributeReference.html#id335
[Bug target/82641] Unable to enable crc32 for a certain function with target attribute on ARM (aarch32)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641 --- Comment #20 from Yichao Yu --- Just want to mention that the lack of a way to locally change the arch settings without lying to the compiler is exactly why I reported this issue.
[Bug target/83110] Relocation error when taking address of protected function in shared library.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83110 --- Comment #2 from Yichao Yu --- What might be invalid about the source?
[Bug target/83110] New: Relocation error when taking address of protected function in shared library.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83110 Bug ID: 83110 Summary: Relocation error when taking address of protected function in shared library. Product: gcc Version: 7.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

This is very similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65248 although that one is marked as fixed. (This could be a dup of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19520 but I can't really tell...) The difference from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65248 is that this now only happens for me with protected functions and not global variables. The code to reproduce is

```
__attribute__((visibility("protected"))) void f()
{
}

// __attribute__((visibility("protected")))
// int f;

void f2(void (*cb)(void*))
{
    cb((void*)&f);
}
```

which gives the error

```
% LANG=C g++ a.cpp -o liba.so -pthread -fPIC -shared
/bin/ld: /tmp/ccvUACGZ.o: relocation R_X86_64_PC32 against protected symbol `_Z1fv' can not be used when making a shared object
/bin/ld: final link failed: Bad value
collect2: error: ld returned 1 exit status
```
[Bug target/82641] Unable to enable crc32 for a certain function with target attribute on ARM (aarch32)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641 --- Comment #7 from Yichao Yu --- It would be great if `+crc` can work if it's not ambiguous. Requiring `arch=armv8-a+crc` works for me too, and it'll just require more preprocessor checks.
[Bug target/82641] Unable to enable crc32 for a certain function with target attribute on ARM (aarch32)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641 --- Comment #3 from Yichao Yu ---

> ARMv8-a is the only architecture variant where the CRC extension is optional

Not really. There's also armv8-r and armv8-m. Also, I believe code compiled for armv7-a can run on armv8-a hardware and can also optionally enable armv8 features including the CRC extension. I was hoping that GCC could be smart enough to enable the correct armv8 variant automatically. Test case is just

```
#include <stdint.h>

#pragma GCC push_options
#pragma GCC target("armv8-a+crc")
__attribute__((target("armv8-a+crc")))
uint32_t crc32cw(uint32_t crc, uint32_t val)
{
    uint32_t res;
    /* asm(".arch armv8-a"); */
    /* asm(".arch_extension crc"); */
    asm("crc32cw %0, %1, %2" : "=r"(res) : "r"(crc), "r"(val));
    /* asm(".arch armv7-a"); */
    return res;
}
#pragma GCC pop_options
```

Compiled with either armv7-a or armv8-a march.
[Bug target/82641] Unable to enable crc32 for a certain function with target attribute on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641 --- Comment #1 from Yichao Yu --- I've found a workaround in https://sourceware.org/ml/binutils/2017-04/msg00171.html but it's extremely ugly (albeit also very clever...).
[Bug target/82641] New: Unable to enable crc32 for a certain function with target attribute on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82641 Bug ID: 82641 Summary: Unable to enable crc32 for a certain function with target attribute on ARM Product: gcc Version: 7.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

The assembler complains about the target not supporting CRC32 instructions for certain (generic) targets on ARM and AArch64. On AArch64, this can be lifted with the `target("+crc")` attribute (or pragma, though I've only tested the function attribute) when writing inline assembly code that uses non-default processor features and cpu-feature dispatch. However, none of these approaches work on ARM. There are multiple issues when trying to do this:

1. "+crc" is not accepted as a feature on ARM (32bit), not even when `march` is set to `armv8-a`. OTOH, "armv8-a+crc" works, though that makes supporting different arch profiles harder...
2. No `.arch` or `.arch_feature` directives are generated in the assembly, which causes the assembler to complain. This is the case for either function attribute or pragma.

I've tried manually adding a `.arch armv8-a` and a `.arch_extension crc` before the function that uses the `crc32` instruction and then resetting it back with `.arch armv7-a` in the assembly code, and it behaves correctly, so I believe this should be fixable on the GCC side.
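When dispatching on cpu features, a portable fallback is needed anyway for hardware without the extension, and it is also handy for checking an inline assembly version. Below is a minimal bitwise sketch that is intended to compute the same update as the `crc32cb`/`crc32cw` instructions (reflected CRC-32C polynomial, no initial or final inversion). All function names are made up for illustration, and the match with the hardware instructions is an assumption that should be confirmed on a real target.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY_REFLECTED 0x82F63B78u

/* Bitwise CRC update over the low nbits bits of val (8 or 32 here). */
static uint32_t crc32c_sw_bits(uint32_t crc, uint32_t val, int nbits)
{
    crc ^= val;
    for (int i = 0; i < nbits; i++)
        crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & (0u - (crc & 1u)));
    return crc;
}

/* Software stand-in for a byte-wide CRC-32C update. */
uint32_t crc32cb_sw(uint32_t crc, uint8_t val)
{
    return crc32c_sw_bits(crc, val, 8);
}

/* Software stand-in for a word-wide CRC-32C update. */
uint32_t crc32cw_sw(uint32_t crc, uint32_t val)
{
    return crc32c_sw_bits(crc, val, 32);
}

/* Conventional CRC-32C of a buffer: initial value and final XOR are
 * both 0xFFFFFFFF, applied around the raw per-byte update. */
uint32_t crc32c_buf_sw(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
        crc = crc32cb_sw(crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}
```

A dispatcher would call the inline assembly version when the CPU reports the CRC extension and fall back to `crc32cw_sw` otherwise; comparing the two on a few inputs is a cheap sanity check.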
[Bug target/80732] target_clones does not work with dlsym
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732 --- Comment #9 from Yichao Yu --- Thanks for the fix! Does it fix https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78366 at the same time?
[Bug target/80732] target_clones does not work with dlsym
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732 --- Comment #6 from Yichao Yu --- Good to know. Thanks.
[Bug target/80732] target_clones does not work with dlsym
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732 --- Comment #4 from Yichao Yu --- `double (*pf1)(double, double, double) = dlsym(hdl, "f1.ifunc");` Wouldn't it be better if GCC generated local functions `f1.default` and `f1.fma` as implementations and `f1` to replace `f1.ifunc`? It's quite inconvenient if this detail is exposed. If one has to use `f1.ifunc`, does it also mean that the declaration of the function in the header must also have `target_clones` applied?
[Bug target/80732] New: target_clones does not work with dlsym
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80732 Bug ID: 80732 Summary: target_clones does not work with dlsym Product: gcc Version: 6.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

Compiling the code below to an executable with `gcc -Wall -Wextra -O3 -fPIC -ldl -rdynamic`. On a haswell+ system, the output is

```
1:
0, 4.93038e-32, 0
2:
4.93038e-32, 4.93038e-32, 4.93038e-32
```

showing that with the manually created ifunc, dlsym, direct function call, and accessing the function address all produce the same result (the fma version), whereas with `target_clones` only the direct function call uses the fma version. This might be related to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78366 but I'm not entirely sure. From that bug report I can understand that this is just how `target_clones` is currently implemented, but I do think this is not a documentation issue and should be fixed / improved instead since

1. in this case there is a user-observable inconsistency in the result generated when different code paths are used. The fast math object should be allowed to produce slightly inaccurate results, but I do think it should produce a consistent result every time the function is called.
2. probably more importantly, this behavior makes the `target_clones` attribute useless for use in a public interface if the shared library can ever be dynamically loaded.
```
#include <dlfcn.h>
#include <stdio.h>

__attribute__((target_clones("default","fma"),noinline,optimize("fast-math")))
double f1(double a, double b, double c)
{
    return a * b + c;
}

double k1(double a, double b, double c, void **p)
{
    *p = f1;
    return f1(a, b, c);
}

__attribute__((target("fma"),optimize("fast-math")))
static double f2_fma(double a, double b, double c)
{
    return a * b + c;
}

__attribute__((optimize("fast-math")))
static double f2_default(double a, double b, double c)
{
    return a * b + c;
}

static void *f2_resolve(void)
{
    __builtin_cpu_init ();
    if (__builtin_cpu_supports("fma"))
        return f2_fma;
    else
        return f2_default;
}

double f2(double a, double b, double c) __attribute__((ifunc("f2_resolve")));

double k2(double a, double b, double c, void **p)
{
    *p = f2;
    return f2(a, b, c);
}

int main()
{
    volatile double a = 1.0002;
    volatile double b = -0.9998;
    volatile double c = 1.0;
    void *hdl = dlopen(NULL, RTLD_NOW);

    printf("1:\n");
    double (*pf1)(double, double, double) = dlsym(hdl, "f1");
    double (*pk1)(double, double, double, void**) = dlsym(hdl, "k1");
    double (*_pf1)(double, double, double);
    double v1_1 = pf1(a, b, c);
    double v1_2 = pk1(a, b, c, (void**)&_pf1);
    double v1_3 = _pf1(a, b, c);
    printf("%g, %g, %g\n", v1_1, v1_2, v1_3);

    printf("2:\n");
    double (*pf2)(double, double, double) = dlsym(hdl, "f2");
    double (*pk2)(double, double, double, void**) = dlsym(hdl, "k2");
    double (*_pf2)(double, double, double);
    double v2_1 = pf2(a, b, c);
    double v2_2 = pk2(a, b, c, (void**)&_pf2);
    double v2_3 = _pf2(a, b, c);
    printf("%g, %g, %g\n", v2_1, v2_2, v2_3);
    return 0;
}
```
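For a public interface that has to behave the same through `dlsym`, a direct call, and a taken address, one option that avoids both `target_clones` and `ifunc` is to export an ordinary function that forwards through a lazily resolved function pointer. This is a sketch under stated assumptions, not a drop-in replacement: the names `muladd`, `muladd_resolve`, and friends are made up, the lazy initialization is not thread-safe as written, and every call pays for one extra indirect jump.

```c
#include <stddef.h>

typedef double (*muladd_fn)(double, double, double);

/* Baseline implementation, usable on any CPU. */
static double muladd_default(double a, double b, double c)
{
    return a * b + c;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* FMA implementation; only called after the runtime check below. */
__attribute__((target("fma")))
static double muladd_fma(double a, double b, double c)
{
    return __builtin_fma(a, b, c);
}

static muladd_fn muladd_resolve(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma") ? muladd_fma : muladd_default;
}
#else
/* No runtime dispatch on other targets in this sketch. */
static muladd_fn muladd_resolve(void)
{
    return muladd_default;
}
#endif

/* The only exported symbol. Its address is the same however it is
 * looked up, so every caller goes through the same dispatch. */
double muladd(double a, double b, double c)
{
    static muladd_fn impl = NULL;

    if (impl == NULL)
        impl = muladd_resolve();
    return impl(a, b, c);
}
```

Because `muladd` is a normal symbol, a header can declare it without any attribute, which also answers the question of what users of the library need to see.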
[Bug target/77728] [5/6 Regression] Miscompilation multiple vector iteration on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728 --- Comment #48 from Yichao Yu --- Thanks for fixing this. I didn't follow all the comments since I'm not familiar with the C++ ABI, so just to make sure I understand what's happening: is the bug caused by an inconsistency in the C++ ABI for certain classes, which can happen on both ARM and AArch64 (although not for AArch64 in this case)? Is this now fixed for gcc 7+ for both ARM and AArch64? (Should this be closed now or only when there's a release?) And btw, what is the estimated release time of 7.1?
[Bug target/77728] [5/6/7 Regression] Miscompilation multiple vector iteration on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728 --- Comment #6 from Yichao Yu --- Anything new here?
[Bug target/77728] [5/6/7 Regression] Miscompilation multiple vector iteration on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728 --- Comment #5 from Yichao Yu --- Ping again? Anything new or I can help with here?
[Bug middle-end/77996] Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996 --- Comment #12 from Yichao Yu --- Since the LLVM miscompilation isn't fixed, is there any way to check the alias assumptions more programmatically? (I can see that the TrailingObject might easily introduce something like this but given the complexity it's a little hard for me to see if that's actually the case.)
[Bug target/77728] [5/6/7 Regression] Miscompilation multiple vector iteration on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728 --- Comment #4 from Yichao Yu --- Ping. Anything I can help with debugging this?
[Bug middle-end/77996] Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996 --- Comment #11 from Yichao Yu --- The case pointed out is fixed in https://reviews.llvm.org/rL284336 although, as expected, that doesn't fix the error. Still not sure whose bug this is...
[Bug middle-end/77996] Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996 --- Comment #10 from Yichao Yu --- That does look like a violation (this particular one should be hidden behind the shared library boundary in the reduced case though). Reported to LLVM at https://llvm.org/bugs/show_bug.cgi?id=30711 .
[Bug middle-end/77996] Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996 --- Comment #8 from Yichao Yu --- > Can you try with -fno-strict-aliasing ? That seems to fix it for both the original case (LLVM) and the reduced case (the linked tarball). Is there a way to figure out the problematic (either bug in LLVM's code or gcc's alias detection) aliasing assumption?
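For context on what `-fno-strict-aliasing` changes, here is a generic illustration of the kind of access the compiler is otherwise allowed to assume never happens. It is not the code from LLVM (the actual violation is not shown in this thread), and the function names are made up; it also assumes a 32-bit IEEE-754 `float`.

```c
#include <stdint.h>
#include <string.h>

/* Undefined behavior under strict aliasing: a float object is read through
 * a uint32_t lvalue, so the optimizer may assume the two never overlap.
 * Shown for contrast only; nothing here calls it. */
uint32_t float_bits_punned(const float *f)
{
    return *(const uint32_t *)f;
}

/* Well-defined alternative: copy the object representation. Compilers
 * typically turn this memcpy into a single register move. */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;

    memcpy(&u, &f, sizeof u);
    return u;
}
```

With `-fno-strict-aliasing` the first form is compiled as written, which is why the flag can hide a bug of this kind without pointing at where it is.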
[Bug middle-end/77996] Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996 --- Comment #6 from Yichao Yu --- I've compiled a gcc at 951db45 using the same configuration as archlinux arm PKGBUILD and I can reproduce the problem using the `code/` in https://gist.github.com/yuyichao/6c24d4a4bc374425906138359a44479c/raw/f5edb6ae8205d5e4d1eb03a7fb900f15711f/gcc-debug.tar.bz2
[Bug middle-end/77996] Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996 --- Comment #5 from Yichao Yu --- Compiling current llvm trunk (r284322) still shows the same error. The script I used to compile LLVM is here https://github.com/yuyichao/arch-pkg/blob/master/pkg/all/llvm-svn/PKGBUILD. Compiling gcc 951db45 now.
[Bug middle-end/77996] Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996 --- Comment #3 from Yichao Yu ---

> What exact version of LLVM are you trying to compile? Revision of the LLVM
> sources including revision of clang, etc.

I was compiling the trunk version. The version I started reducing from was https://github.com/llvm-mirror/llvm/commit/0885462106134999f8aa80a3a71bfed160910248 but it happens on at least 3 different versions I've tried before this commit.

> Can you try compile GCC from the 6 branch and try again because having just a
> date might not be enough to reproduce the problem.

The script used to compile GCC is https://github.com/archlinuxarm/PKGBUILDs/blob/master/core/gcc/PKGBUILD so it seems to be using commit `c2103c17`. I can also try to compile a more recent version locally (will take some time).
[Bug lto/77997] Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77997 --- Comment #2 from Yichao Yu --- Sorry, the first submission timed out on me, so I submitted it again.
[Bug lto/77997] New: Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77997 Bug ID: 77997 Summary: Miscompilation due to LTO on aarch64 Product: gcc Version: 6.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: lto Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

I'm seeing a miscompilation of LLVM's tablegen on AArch64 by gcc 6.2.1 when LTO is enabled. I've tried very hard to reduce it but unfortunately it wasn't very successful this time and the current repro is still 8000 lines of code. Attached are the source and resulting binaries. (Edit: the tarball is too big (3M) so I uploaded it to gist instead. Please find it here https://gist.githubusercontent.com/yuyichao/6c24d4a4bc374425906138359a44479c/raw/f5edb6ae8205d5e4d1eb03a7fb900f15711f/report.md)

The `code/` directory has a simple cmake project reduced from the LLVM one (I can turn that into a makefile or a shell script on request but the current form should be pretty simple already). To reproduce, make a `build/` directory in `code/` and run `CFLAGS='-flto -O3' CXXFLAGS='-flto -O3' LDFLAGS='-O3 -flto' cmake .. -DCMAKE_BUILD_TYPE=Release; make llvm-tblgen; bin/llvm-tblgen`. Removing the `-flto` should get rid of the error in the last command.

Changes in seemingly unrelated lines can also make the error go away. (If there's anything I learnt from reducing it, the error seems to appear only when the code is complex.) One such change is commenting out `SCTrans.PredTerm = Preds;` close to the end of `CodeGenSchedule.cpp` (used to generate the `good/` version included). In fact, removing almost any line in this file can make the error go away even though not a single line of code there should be executed.

The `bad/` and the `good/` directories contain compilation results using the flags mentioned above with the unmodified code and the code with the one line commented out. They have all the object files, binary files and the disassembly of the resulting executable/the bad function. The asm files are disassembled from the final binary since I don't know how to get them directly when compiling with LTO.

The direct error seems to be in `CodeGenRegister::computeSubRegs` in the branch before the `printf("5\n")`. The `DenseMap::insert` method (which is called twice in this function and nowhere else) is inlined but sometimes returns a corrupted iterator when the inserted key already exists in the map, causing the check to fail. The diff of the asm of this function is in the toplevel of the tarball.

The original repro is compiling LLVM with LTO on AArch64. The compilation should fail when generating target information for AArch64. The GCC version is the GCC binary package from the ArchLinux ARM repo. `gcc --version` gives `gcc (GCC) 6.2.1 20160830`
[Bug lto/77996] New: Miscompilation due to LTO on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77996 Bug ID: 77996 Summary: Miscompilation due to LTO on aarch64 Product: gcc Version: 6.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: lto Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: ---

I'm seeing a miscompilation of LLVM's tablegen on AArch64 by gcc 6.2.1 when LTO is enabled. I've tried very hard to reduce it but unfortunately it wasn't very successful this time and the current repro is still 8000 lines of code. Attached are the source and resulting binaries.

The `code/` directory has a simple cmake project reduced from the LLVM one (I can turn that into a makefile or a shell script on request but the current form should be pretty simple already). To reproduce, make a `build/` directory in `code/` and run `CFLAGS='-flto -O3' CXXFLAGS='-flto -O3' LDFLAGS='-O3 -flto' cmake .. -DCMAKE_BUILD_TYPE=Release; make llvm-tblgen; bin/llvm-tblgen`. Removing the `-flto` should get rid of the error in the last command.

Changes in seemingly unrelated lines can also make the error go away. (If there's anything I learnt from reducing it, the error seems to appear only when the code is complex.) One such change is commenting out `SCTrans.PredTerm = Preds;` close to the end of `CodeGenSchedule.cpp` (used to generate the `good/` version included). In fact, removing almost any line in this file can make the error go away even though not a single line of code there should be executed.

The `bad/` and the `good/` directories contain compilation results using the flags mentioned above with the unmodified code and the code with the one line commented out. They have all the object files, binary files and the disassembly of the resulting executable/the bad function. The asm files are disassembled from the final binary since I don't know how to get them directly when compiling with LTO.

The direct error seems to be in `CodeGenRegister::computeSubRegs` in the branch before the `printf("5\n")`. The `DenseMap::insert` method (which is called twice in this function and nowhere else) is inlined but sometimes returns a corrupted iterator when the inserted key already exists in the map, causing the check to fail. The diff of the asm of this function is in the toplevel of the tarball.

The original repro is compiling LLVM with LTO on AArch64. The compilation should fail when generating target information for AArch64. The GCC version is the GCC binary package from the ArchLinux ARM repo. `gcc --version` gives `gcc (GCC) 6.2.1 20160830`
[Bug target/77728] [5/6/7 Regression] Miscompilation multiple vector iteration on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728 --- Comment #2 from Yichao Yu --- I should add that turning on lto works around the issue both in the simple code attached and for the original issue I was having in julia (i.e. compiling llvm with LTO makes the issue go away).
[Bug target/77728] New: Miscompilation multiple vector iteration on ARM
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77728 Bug ID: 77728 Summary: Miscompilation multiple vector iteration on ARM Product: gcc Version: 6.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: --- Code to reproduce is at https://gist.github.com/yuyichao/a66edb9d05d18755fb7587b12e021a8a. The two cpp files are
```c++
#include <stdint.h>
#include <vector>

typedef std::vector<std::pair<uint64_t, uint64_t>> DWARFAddressRangesVector;

void dumpRanges(const DWARFAddressRangesVector& Ranges)
{
    for (const auto &Range : Ranges) {
        (void)Range;
    }
}

void collectChildrenAddressRanges(DWARFAddressRangesVector& Ranges)
{
    const DWARFAddressRangesVector &DIERanges = DWARFAddressRangesVector();
    Ranges.insert(Ranges.end(), DIERanges.begin(), DIERanges.end());
}
```
```c++
#include <stdint.h>
#include <vector>

typedef std::vector<std::pair<uint64_t, uint64_t>> DWARFAddressRangesVector;

void collectAddressRanges(DWARFAddressRangesVector &CURanges,
                          const DWARFAddressRangesVector &CUDIERanges)
{
    CURanges.insert(CURanges.end(), CUDIERanges.begin(), CUDIERanges.end());
}

int main()
{
    std::vector<std::pair<uint64_t, uint64_t>> CURanges;
    std::vector<std::pair<uint64_t, uint64_t>> CUDIERanges{{1, 2}};
    collectAddressRanges(CURanges, CUDIERanges);
    return 0;
}
```
Both are compiled with `g++ -O2` and linked together. When running the compiled program, it raises an exception in the `insert`
```
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_range_insert
```
which shouldn't happen. The issue seems to be related to merging duplicated code, since it is important to put the code into two files, and the presence of the second .o file is important even though none of the code in it is used. The iterations also all have to be on const references to the vector. Removing one of the `const` also makes the issue go away. The g++ is version 6.2.1 from the ArchLinuxARM armv7h repository. 
This might be a regression in gcc 5, since other devs using gcc <=4.9 don't seem to have this issue and I was able to reproduce this on Arch Linux on 4-5 different systems with gcc >=5. This causes https://github.com/JuliaLang/julia/issues/14550
[Bug target/70814] atomic store of __int128 is not lock free on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70814 --- Comment #4 from Yichao Yu --- Thanks for the explanation. I didn't realize that the load is the problem. Just curious (since I somehow can't find documentation about it), would `ldaxp` provide the right semantics without the corresponding store?
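For anyone wanting to check what their own toolchain claims about 16-byte atomics, one option is to query the `__atomic_always_lock_free` builtin, which is resolved at compile time. This is only a minimal sketch, assuming a GCC-compatible compiler and a 64-bit target that provides `__int128`; the two function names are made up for illustration, and the answer depends on the target and flags in use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: does the compiler guarantee lock-free 4-byte atomics? */
bool int32_always_lock_free(void)
{
    /* Second argument 0 means "assume typical alignment for this size". */
    return __atomic_always_lock_free(sizeof(int32_t), 0);
}

/* Hypothetical helper: same question for 16-byte (__int128) objects.
   Assumes the target provides __int128 at all. */
bool int128_always_lock_free(void)
{
    return __atomic_always_lock_free(sizeof(__int128), 0);
}
```

A false result from the second function means the compiler does not promise lock-free 16-byte atomics for that configuration. It says nothing about which instruction sequence a given store or load would use.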
[Bug tree-optimization/71414] 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414 --- Comment #7 from Yichao Yu --- If I add `-fvariable-expansion-in-unroller` (omg, this option is like half the command line ;-p ...), the performance matches the clang one after the clang 3.8 regression.
```
% gcc -funroll-loops -fvariable-expansion-in-unroller -Ofast -march=core-avx2 benchmark.c -o benchmark2
% ./benchmark2
45.588861
% ./benchmark-gcc
80.518152
% ./benchmark-clang38
41.920054
% ./benchmark-clang37
25.093145
```
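For readers unfamiliar with the option: as far as I understand it, `-fvariable-expansion-in-unroller` lets the unroller split an accumulator such as `s` into several independent copies that are combined after the loop, which shortens the dependency chain of the float additions. The sketch below does the same thing by hand with four partial sums. The name `sum32_4acc` and the factor of four are assumptions chosen for illustration, not what GCC actually emits.

```c
#include <stddef.h>

/* Hand-written variant of sum32 with four independent accumulators.
   Hypothetical name; this only illustrates accumulator splitting. */
float sum32_4acc(const float *a, size_t n)
{
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main loop: four additions per iteration that do not depend on each other. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Tail: handle the remaining 0 to 3 elements. */
    for (; i < n; i++)
        s0 += a[i];

    /* Combine the partial sums once, after the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

Note that this reorders the float additions, so the result can differ in rounding from the straightforward loop. As far as I know, that is why a compiler only applies this kind of transformation to floats under `-Ofast`/`-ffast-math`.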
[Bug other/71414] 2x slower than clang summing small float array
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414 --- Comment #4 from Yichao Yu --- The C code is in the gist linked. `a` is a cacheline-aligned pointer and `n` is 1024, so `a` should even fit in L1d, which is 32kB on both processors I benchmarked. More precise timing (ns per loop):

6700K
```
% ./benchmark-gcc
80.553456
% ./benchmark-clang37
28.81
% ./benchmark-clang38
41.782532
```
4702HQ
```
% ./benchmark-gcc
140.744893
% ./benchmark-clang37
50.835441
% ./benchmark-clang38
70.220946
```
Pasting the whole program over for completeness. The alignment line gives some weird timing on clang without `-march=core-avx2` but doesn't change anything much with `-Ofast -march=core-avx2`.
```
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

uint64_t gettime_ns()
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * (uint64_t) 1e9 + t.tv_nsec;
}

__attribute__((noinline)) float sum32(float *a, size_t n)
{
    /* a = (float*)__builtin_assume_aligned(a, 64); */
    float s = 0;
    for (size_t i = 0;i < n;i++)
        s += a[i];
    /* compiler barrier so the loop is not optimized across calls */
    __asm__ volatile ("" ::: "memory");
    return s;
}

int main()
{
    float *p = aligned_alloc(64, sizeof(float) * 1024);
    memset(p, 0, sizeof(float) * 1024);
    uint64_t start = gettime_ns();
    for (int i = 0;i < 1024 * 1024;i++)
        sum32(p, 1024);
    free(p);
    uint64_t end = gettime_ns();
    /* average ns per call of sum32 */
    printf("%f\n", (end - start) / (1024.0 * 1024.0));
    return 0;
}
```
[Bug other/71414] New: 2x slower than clang summing small float array
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71414 Bug ID: 71414 Summary: 2x slower than clang summing small float array Product: gcc Version: 6.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: other Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: --- Ref https://llvm.org/bugs/show_bug.cgi?id=28002 C source code:
```c
#include <stddef.h>

__attribute__((noinline)) float sum32(float *a, size_t n)
{
    /* a = (float*)__builtin_assume_aligned(a, 64); */
    float s = 0;
    for (size_t i = 0;i < n;i++)
        s += a[i];
    return s;
}
```
See [this gist](https://gist.github.com/yuyichao/5b07f71c1f19248ec5511d758532a4b0) for assembly output by different compilers. GCC appears to be ~2x slower than clang on the two machines (4702HQ and 6700K) I benchmarked this on.
[Bug target/71056] [6/7 Regression] __builtin_bswap32 NEON instruction error with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71056 --- Comment #4 from Yichao Yu --- (Sorry I'm not sure how to understand that cross link). Is the fix merged?
[Bug target/71056] New: __builtin_bswap32 NEON instruction error with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71056 Bug ID: 71056 Summary: __builtin_bswap32 NEON instruction error with -O3 Product: gcc Version: 6.1.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: yyc1992 at gmail dot com Target Milestone: --- The following code generates a "NEON instructions not enabled" error when compiling with `gcc -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -O3 -o /dev/null -c a.c` on ARM on gcc 6.1.1 (ArchLinuxARM).
```c
#include <stdint.h>
#include <string.h>

extern char *buff;
int f2();
struct T1 {
    int32_t reserved[2];
    uint32_t ip;
    uint16_t cs;
    uint16_t rsrv2;
};
void f3(const char *p)
{
    struct T1 x;
    memcpy(&x, p, sizeof(struct T1));
    x.reserved[0] = __builtin_bswap32(x.reserved[0]);
    x.reserved[1] = __builtin_bswap32(x.reserved[1]);
    x.ip = __builtin_bswap32(x.ip);
    x.cs = x.cs << 8 | x.cs >> 8;
    x.rsrv2 = x.rsrv2 << 8 | x.rsrv2 >> 8;
    if (f2()) {
        memcpy(buff, "\n", 1);
    }
}
```
Error message
```
alarm% gcc -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -O3 -o /dev/null -c a.c
a.c: In function ‘f3’:
a.c:16:21: fatal error: You must enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use these intrinsics.
     x.reserved[0] = __builtin_bswap32(x.reserved[0]);
                     ^~~~
compilation terminated.
```
Note that NEON isn't enabled and there's no direct use of NEON instructions/intrinsics in the code, so the NEON instructions must have been added by the optimizer. Seemingly subtle changes can make the error disappear. These include:
1. -O3 -> -O2 (ok, this one is not particularly subtle)
2. Remove any of the byteswaps or fields
3. Remove any of the memcpy calls
4. Make the second memcpy unconditional
5. Remove `f2()` (but leave the memcpy conditional in some other way)
6. Pass in `x` as an argument (either by value or by pointer)

The asm generated when compiling with `-mfpu=neon`
```
f3:
    @ args = 0, pretend = 0, frame = 16
    @ frame_needed = 0, uses_anonymous_args = 0
    mov     r3, r0
    str     lr, [sp, #-4]!
    sub     sp, sp, #20
    ldr     r2, [r3, #8]    @ unaligned
    ldr     r1, [r3, #4]    @ unaligned
    mov     ip, sp
    ldr     r0, [r0]        @ unaligned
    ldr     r3, [r3, #12]   @ unaligned
    stmia   ip!, {r0, r1, r2, r3}
    mov     r3, r2
    ldrh    ip, [sp, #12]
    rev     r3, r3
    ldrh    r0, [sp, #14]
    vldr    d16, [sp]
    lsr     r1, ip, #8
    str     r3, [sp, #8]
    vrev32.8 d16, d16
    lsr     r2, r0, #8
    orr     r2, r2, r0, lsl #8
    orr     r1, r1, ip, lsl #8
    strh    r2, [sp, #14]   @ movhi
    strh    r1, [sp, #12]   @ movhi
    vstr    d16, [sp]
    bl      f2
    cmp     r0, #0
    movwne  r3, #:lower16:buff
    movtne  r3, #:upper16:buff
    movne   r2, #10
    ldrne   r3, [r3]
    strbne  r2, [r3]
    add     sp, sp, #20
    @ sp needed
    ldr     pc, [sp], #4
    .size   f3, .-f3
    .ident  "GCC: (GNU) 6.1.1 20160501"
```
And it seems that the NEON instruction it wants to generate is `vrev32.8`. The case is simplified from https://github.com/llvm-mirror/llvm/blob/da4b82ab1387da8c959a4e2439bce10b9cefbc8a/tools/llvm-objdump/MachODump.cpp#L8240-L8263 I don't remember seeing this on gcc 5.
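For reference, the two byte-swap idioms that appear in the reproducer can be written as small standalone helpers. This is only a sketch for checking what the source asks for, assuming a GCC-compatible compiler; the helper names `swap16_shift` and `swap32_shift` are hypothetical and do not come from the bug report or from LLVM.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the `x.cs << 8 | x.cs >> 8` idiom.
   The operand is promoted to int, so truncate back to 16 bits explicitly. */
static inline uint16_t swap16_shift(uint16_t v)
{
    return (uint16_t)(v << 8 | v >> 8);
}

/* Hypothetical portable equivalent of __builtin_bswap32 using shifts and masks. */
static inline uint32_t swap32_shift(uint32_t v)
{
    return (v << 24)
         | ((v << 8) & 0x00ff0000u)
         | ((v >> 8) & 0x0000ff00u)
         | (v >> 24);
}
```

Both helpers are plain scalar operations, and nothing in them requires NEON. In the asm above, the swap of `x.ip` is done with a scalar `rev`, while the two adjacent `reserved` words are the ones handled together by `vrev32.8`.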