[Bug middle-end/95189] [9/10 Regression] memcmp being wrongly stripped like strcmp

2020-10-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95189

Alexander Monakov  changed:

           What    |Removed |Added
-----------------------------------
      Known to fail|        |9.3.0
      Known to work|9.3.0   |

--- Comment #19 from Alexander Monakov  ---
That was set early on and not adjusted when new testcases were found. Changed
now.

The original testcase in comment #0 happened to work with 9.3, but breaks if
'float z[1]' is changed to 'char z[4]'. Testcases from the dup and from comment
#7 also fail with 9.3.
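
For readers without the testcases at hand, here is a minimal sketch of why memcmp cannot be folded like strcmp. It is not one of the testcases from this PR; the arrays and function names are made up for illustration. strcmp stops at the first NUL byte, while memcmp has to compare all n bytes, including those after an embedded zero:

```c
#include <string.h>

/* Two 4-byte objects that agree up to and including an embedded NUL
   and differ only after it.  Names are illustrative only.  */
static const char p[4] = { 'a', 0, 'b', 0 };
static const char q[4] = { 'a', 0, 'c', 0 };

/* strcmp stops at the first NUL, so it sees two equal strings "a".  */
int equal_as_strings(void)
{
    return strcmp(p, q) == 0;
}

/* memcmp must look at all four bytes, so it sees 'b' vs 'c'.  */
int equal_as_bytes(void)
{
    return memcmp(p, q, sizeof p) == 0;
}
```

Any folding that treats the memcmp call as if the comparison ended at the first zero byte would make the two functions agree, which is wrong.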

[Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call

2020-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #11 from Alexander Monakov  ---
Yes, that.

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

--- Comment #5 from Alexander Monakov  ---
afaict LRA is just following IRA decisions, and IRA allocates that pseudo to
memory due to costs.

Not sure where strange cost is coming from, but it depends on x86 tuning
options: with -mtune=skylake we get the expected code, with -mtune=haswell we
get 128-bit vectors right and extra load for 256-bit, with -mtune=generic both
cases have extra loads.

[Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call

2020-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

--- Comment #8 from Alexander Monakov  ---
No, -msoft-stack-reserve-local is really meant to be in bytes: it may not
exceed the amount of .local memory reserved by CUDA driver (which is just 1-2
KB, unless overridden via cuCtxSetLimit, which nvptx-run.c does, but
plugin-nvptx.c does not).

Keep in mind that .local memory reservation is multiplied by number of active
contexts, which could be in range 2-3 when the code was written: 128KB
local memory per active thread would imply a 2.5GB allocation on the GPU.

[Bug target/97203] [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call

2020-10-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97203

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov  ---
(In reply to Tom de Vries from comment #4)
> So, I think calling functions from simd code is atm not supported for nvptx.
> 
> Stack variables in simd code are mapped on a per-thread stack rather than on
> the
> usual per-warp stack.
> 
> The functions are compiled with the usual per-warp stack, so calling those
> functions from simd might mean the different lanes are gonna disagree about
> what the value in a stack variable should be.

This is inaccurate. In -msoft-stack mode there's no baked-in assumption that
stacks are always per-warp. The "soft stack" pointer can point either to global
memory (outside of SIMD regions), or to local memory (inside SIMD regions). The
pointer is switched between per-warp global memory and per-lane local memory by
nvptx.c:nvptx_output_softstack_switch.

The main requirement is that functions callable from OpenMP offloaded code are
compiled for -mgomp multilib variant. The design allows calling functions even
from inside SIMD regions, and it should be supported.

It is very disappointing that the first reaction was "I think ... is not
supported" without reaching out and asking questions. Lack of efficient
communication was a huge issue when OpenMP offloading support was contributed,
and it's disappointing to see it again years later.

[Bug libgomp/97291] [SIMT] Move SIMT_XCHG_* out of non-uniform execution region

2020-10-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97291

--- Comment #1 from Alexander Monakov  ---
Reshuffling statements and piling up extra abstraction doesn't help solve the
core issue that GIMPLE passes can duplicate any basic block, but basic blocks
of SIMT loop epilogue should be protected from that. More generally, arbitrary
duplication demonstrably causes miscompilation even without SIMT: PR 80053.

[Bug target/97366] [8/9/10/11 Regression] Redundant load with SSE/AVX vector intrinsics

2020-10-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
Intrinsics being type-agnostic cause vector subregs to appear before register
allocation: the pseudo coming from the load has mode V2DI, the shift needs to
be done in mode V4SI, the bitwise-or and the store are done in mode V2DI again.
Subreg in the bitwise-or appears to be handled inefficiently. Didn't dig deeper
as to what happens during allocation.

FWIW, using generic vectors avoids introducing such mismatches, and
indeed the variant coded with generic vectors does not have extra loads. For
your original code you'll have to convert between generic vectors and __m128i
to use the shuffle intrinsic. The last paragraphs in "Vector Extensions"
chapter [1] suggest using a union for that purpose in C; in C++ reinterpreting
via union is formally UB, so another approach could be used (probably simply
converting via assignment).

[1] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

#include <stdint.h>

typedef uint32_t u32v4 __attribute__((vector_size(16)));
void gcc_double_load_128(int8_t *__restrict out, const int8_t *__restrict input)
{
    u32v4 *vin = (u32v4 *)input;
    u32v4 *vout = (u32v4 *)out;
    for (unsigned i=0 ; i<1024; i+=16) {
        u32v4 in = *vin++;
        *vout++ = in | (in >> 4);
    }
}

Above code on Compiler Explorer: https://godbolt.org/z/MKPvxb

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #9 from Alexander Monakov  ---
(In reply to Richard Biener from comment #8)
> Note that currently RTL expansion forces a local vector typed variable
> to the stack (instead of allocating a pseudo) when there are
> variable-index accesses to it.  That might be a reason to also handle
> slightly "expensive" extract cases.  But I guess later falling back
> to a stack slot via a splitter or LRA will lead to worse code.

Indeed, but I struggle to see a good reason to bind the entire lifetime of a
variable to memory just because one operation requires that. Cannot GCC instead
create a fresh temporary early at RTL-expand (not split) time for each extract
operation, letting the original variable live in a pseudo, and binding only
that short-lived temporary to memory?

It can result in extra copies if the temporary needs to be loaded from memory
anyway, but I think passes like RTL CSE should be able to propagate them.

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #14 from Alexander Monakov  ---
I see, there are more weaknesses than I thought. For CSE (or rather fwprop?) I
was thinking about a simpler case where the extracted-from value is loaded from
memory, but even in trivial cases RTL optimizers cannot clean it up today (so
it wouldn't get any better with separate temporaries):

#define N 16
typedef int T;
typedef T V __attribute__((vector_size(N)));
T f(V *px, long i)
{
    V x = *px;
    return x[i];
}

f:
        movdqa  (%rdi), %xmm0
        movaps  %xmm0, -24(%rsp)
        movl    -24(%rsp,%rsi,4), %eax
        ret

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #11 from Alexander Monakov  ---
Yeah, for inserts such a tactic would be inappropriate due to bad store
forwarding stalls anyway. As you've shown in earlier comments, inserts have a
very nice generic way to expand them (that does not touch stack).
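
For illustration, one stack-free way to express a variable-position insert with GCC generic vectors. I am assuming this compare-and-blend shape is the kind of expansion meant above; the type and function names are mine, not from the earlier comments:

```c
typedef int v4si __attribute__((vector_size(16)));

/* Return a copy of V with element IDX (0..3) replaced by VAL, without
   going through memory: compare an index vector against the broadcast
   index to build a lane mask, then blend.  */
v4si insert_elem(v4si v, int idx, int val)
{
    const v4si lanes = { 0, 1, 2, 3 };
    v4si bidx = { idx, idx, idx, idx };
    v4si bval = { val, val, val, val };
    v4si mask = (lanes == bidx);   /* -1 in the selected lane, 0 elsewhere */
    return (v & ~mask) | (bval & mask);
}
```

Every step is a plain vector operation, so there is no store followed by a wider reload and hence no store-forwarding stall to worry about.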

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #16 from Alexander Monakov  ---
Mostly because prior to register allocation the compiler does not naturally see
that x = *mem + a*b will need an extra mov when both 'a' and 'b' are live (as
in that case registers allocated for them cannot be immediately reused for
'x').

(this is why RTL combine merges the two instructions, but on the isolated
testcase below we already have the combined form thanks to tree-TER)

Isolated testcase, by the way (gcc -O2 -mfma):

double f(double t, double a, double b, double c[], long i)
{
    t = __builtin_fma(a, b, c[i]);
    asm("" :: "x"(t), "x"(a), "x"(b));
    return t;
}

[Bug target/97127] FMA3 code transformation leads to slowdown on Skylake

2020-09-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #17 from Alexander Monakov  ---
To me this suggests that in fact it's okay to carry the combined form in RTL up
to register allocation, but RA should decompose it to load+fma instead of
inserting a register copy that preserves the live operand.

(not a fan of "RA should deal with it" theme, but here it does look like the
most appropriate resolution)

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
FWIW, Peter Cordes provides an overview of available approaches for extraction
depending on vector length and ISA extensions (up to AVX2, not including
AVX-512) in this StackOverflow answer:
https://stackoverflow.com/a/51414330/4755075

TL;DR: generally through store+load; possible alternatives:
 128b:
  SSSE3: pshufb  (1-byte elements)
  SSSE3: imul+add+pshufb (any element size)
  AVX: vpermilp[sd] (4 or 8-byte elements)
 256b:
  AVX2: vpermps (4-byte elements)

In all cases a (v)movd is needed to move the index to a vector register, and
potentially another (v)movd if the result is needed in a general register.

The basic store+load tactic may look worse latency-wise, but can be better
throughput-wise (especially with multiple extractions from the same vector, as
then the store needs to be done just once, as Peter mentioned).

Why is it important in RTL to do this without referencing the stack?

[Bug libstdc++/98226] Slow std::countr_one

2020-12-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98226

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #10 from Alexander Monakov  ---
(In reply to Oleg Zaikin from comment #6)
> There
> we have a function firstzero(unsigned x) that returns 2^i, with i the
> position of the first 0 of x, and 0 iff there is no 0. Its implementation is:
>   unsigned firstzero(const unsigned x) noexcept {
> #if __cplusplus > 201703L
> return x == unsigned(-1) ? 0 : unsigned(1) << std::countr_one(x);
> #else
> const unsigned y = x+1; return (y ^ x) & y;
> #endif
>   }

But why are you trying to use a more complex branchy expression in C++20 mode
when you already have a more efficient expression as a "fallback"?

Note that a cheaper way is available:

return (x+1) & ~x;

(though gcc can optimize the '(y ^ x) & y' you have to the same machine code)
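
To make the comparison concrete, here are the two branch-free forms side by side in plain C. The function names are mine; the expressions are the ones quoted above:

```c
/* The "fallback" expression from the quoted code.  */
unsigned firstzero_xor(unsigned x)
{
    unsigned y = x + 1;
    return (y ^ x) & y;
}

/* The shorter form: x+1 sets the lowest zero bit of x and clears the
   ones below it; masking with ~x keeps only that newly set bit.  */
unsigned firstzero_andn(unsigned x)
{
    return (x + 1) & ~x;
}
```

Both return 0 for an all-ones argument because x + 1 wraps to 0, so the explicit test against unsigned(-1) in the countr_one variant has no counterpart here.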

[Bug inline-asm/97708] Inline asm does not use the local register asm specified with register ... asm() as input

2020-11-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97708

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #24 from Alexander Monakov  ---
Segher, did you really mean to mark the bug resolved/fixed?

FWIW, I think Jakub is using an overly broad interpretation of the intended
behavior. Stretching that logic, it's possible to argue that it's okay for GCC
to put an operand in a different register than its asm specification says as
long as the constraint matches. But that would lead to wrong code.

Given that the only supported use of local register variables is passing
operands to inline asm in specific registers, I really think that GCC shouldn't
silently change the operand's location like that. The mismatching constraint
could be a result of a typo (or something like a botched refactoring), and the
compiler should help the user catch such errors.

[Bug target/97734] GCC using branches when a conditional move would be better

2020-11-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97734

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
By the time RTL ce2 pass runs the RTL is suitable for if-conversion, but the
pass rejects the transform (probably due to costs, not visible in dumps so
impossible to tell without gdb'ing the compiler).

Note that this cmov lengthens the loop-carried data dependency on 'i', so it's
only beneficial on workloads where the control dependency it replaces
corresponds to an unpredictable branch. And GCC has no way to know that.

For a recent counter-example (cmov dramatically slowing down a loop with a
trivially predictable branch) please see
https://stackoverflow.com/a/64285902/4755075

At risk of needlessly repeating myself: I think what people doing such research
really need is __builtin_branchless_select that is properly guaranteed to be
branchless (selection statement on GIMPLE and branchless sequence on RTL).
Otherwise they spend time tweaking their code to make cmov appear where they
need it.
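
As an example of the kind of source-level tweaking meant here, a mask-based select in plain C. This is only a sketch: the function name is made up, no __builtin_branchless_select exists today, and nothing in this code guarantees that the compiler emits a branchless sequence for it:

```c
#include <stdint.h>

/* Select A when COND is nonzero, B otherwise, using a mask instead of
   a conditional expression.  The compiler remains free to turn this
   back into a branch or a cmov as it sees fit.  */
uint64_t select_by_mask(int cond, uint64_t a, uint64_t b)
{
    uint64_t mask = -(uint64_t)(cond != 0);   /* all ones or all zeros */
    return (a & mask) | (b & ~mask);
}
```

A real builtin would carry the "must stay branchless" intent through GIMPLE and RTL, which this idiom cannot do.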

[Bug inline-asm/97708] Inline asm does not use the local register asm specified with register ... asm() as input

2020-11-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97708

--- Comment #30 from Alexander Monakov  ---
Asm operand binding should work by looking at bound lvalue: "c"(a) binds an
lvalue so if 'a' is a register var the compiler must remember its associated
register; "c"(a+0) binds an rvalue, so what kind of variable 'a' is is
irrelevant.

The way it should work internally is at gimplification time the compiler should
produce an additional asm statement argument that stores the associated
register names for each operand. For example, for

  register int a asm("%eax");
  asm("" : "r"(a+0), "r"(a), "r"(0));

gcc could internally produce a string ",%eax," signifying that operand 1 is
bound to %eax and operands 0 and 2 are not bound to any particular register.

Then all passes up to IRA don't need to give any particular care for asm
operands, and IRA can use the string to place operands in appropriate registers
(and diagnose a mismatch).

This is also the only proper way to fix PR 87984 as far as I can tell.
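
To show the encoding described above, a small standalone sketch that builds such a string from per-operand register names. This is a hypothetical helper for illustration, not GCC code, and the exact string format inside the compiler would be up to the implementation:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not GCC code: REGS[i] is the register name bound
   to operand i, or NULL if the operand is not bound to a register.
   Writes a comma-separated list into BUF and returns 0, or returns -1
   if BUF (of size BUFSZ) is too small.  */
int build_binding_string(const char *const regs[], size_t n,
                         char *buf, size_t bufsz)
{
    size_t len = 0;
    if (bufsz == 0)
        return -1;
    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        const char *name = regs[i] ? regs[i] : "";
        size_t namelen = strlen(name);
        size_t need = namelen + (i > 0);   /* name plus separator */
        if (len + need + 1 > bufsz)
            return -1;
        if (i > 0)
            buf[len++] = ',';
        memcpy(buf + len, name, namelen);
        len += namelen;
        buf[len] = '\0';
    }
    return 0;
}
```

For the three-operand asm above, passing { NULL, "%eax", NULL } produces ",%eax,": an empty field for each unbound operand and the register name for the bound one.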

[Bug libgomp/98258] Can't compile programs for both OpenMP (CPU) + OpenACC (GPU)

2021-01-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98258

--- Comment #8 from Alexander Monakov  ---
(In reply to Chinoune from comment #7)
> $ gfortran-10 -O3 -fopenmp -fopenacc -c bug_omp_acc.f90
> $ gfortran-10 bug_omp_acc.o -lgomp -o test.x

Contrary to my suggestion, you have omitted -fopenacc from the second command
line, why?

[Bug libgomp/98258] Can't compile programs for both OpenMP (CPU) + OpenACC (GPU)

2021-01-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98258

--- Comment #10 from Alexander Monakov  ---
Thanks for checking. As for this:

> Please, stop suggesting untested workarounds.

Yes, I should have mentioned those are untested. I was typing the response late
at night without access to offloading-capable GCC, just from memory and
understanding how it works. The suggestions didn't require any extraordinary
actions from you. The quoted part was rude. Please be more considerate of
maintainers' time.


I think one of the suggested workarounds should be fixed to work, probably the
-foffload=-fno-openmp one.

[Bug libgomp/98258] Can't compile programs for both OpenMP (CPU) + OpenACC (GPU)

2021-01-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98258

--- Comment #5 from Alexander Monakov  ---
One possible solution is -foffload=-fno-openmp

Another possible solution is separate compilation and linking, with only
OpenACC enabled at link step (needs explicit -lgomp):

gfortran -fopenmp -fopenacc bug_omp_acc.f90 -c -o test.o
gfortran -fopenacc test.o -lgomp -o test.x

[Bug tree-optimization/98906] New: [8/9/10/11 Regression] Miscompiles code even at -O1

2021-01-31 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98906

Bug ID: 98906
   Summary: [8/9/10/11 Regression] Miscompiles code even at -O1
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50097
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50097&action=edit
testcase

The attached testcase is clean w.r.t ASan and UBSan. At -O1+, 'main' is
miscompiled to a single basic block reporting an error on initial loop
iteration, since gcc-6, while -Og and '-O1 -fno-inline' yield expected code.

.optimized dump is wrong, so one of GIMPLE passes is the culprit, but a bit
hard to see which one exactly.

[Bug tree-optimization/98906] [8/9/10/11 Regression] Miscompiles code even at -O1

2021-02-01 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98906

--- Comment #6 from Alexander Monakov  ---
Ah, -fsanitize=float-cast-overflow catches it, but it needs to be enabled
explicitly (not implied by -fsanitize=undefined). Thank you!
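
For reference, the class of problem that sanitizer reports: converting a floating-point value to an integer type is undefined in C when the truncated value does not fit. The attached testcase is not reproduced here; below is a generic sketch of a guarded conversion, with a made-up function name, assuming 'int' is at most 32 bits wide:

```c
#include <limits.h>

/* Convert D to int only when the truncated value is representable.
   Returns 1 and stores the result on success, 0 otherwise (including
   for NaN, since both comparisons are then false).  Assumes int is at
   most 32 bits wide, so both bounds are exactly representable as
   double.  */
int double_to_int_checked(double d, int *out)
{
    if (d > (double)INT_MIN - 1.0 && d < (double)INT_MAX + 1.0) {
        *out = (int)d;
        return 1;
    }
    return 0;
}
```

Code that cannot rule out such values statically needs a check of this kind before the cast.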

[Bug middle-end/100593] [ELF] -fno-pic: Use GOT to take address of an external default visibility function

2021-05-18 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593

--- Comment #7 from Alexander Monakov  ---
Thanks. I agree that inferring address significance on the linker side is
problematic.

Thinking about your original request, I was about to say that it would be very
reasonable to do under -fno-plt flag, but then I found it was already
implemented for x86-64 in gcc-7 and for 32-bit x86 in gcc-8. Compiling

int f();
void *g()
{
  return f;
}

with -fno-pic -fno-plt yields

g:
        movq    f@GOTPCREL(%rip), %rax
        ret

(yields GOTPCRELX relocation) and

g:
        movl    f@GOT, %eax
        ret

on 32-bit (yields GOT32X relocation), so on x86 it's already implemented?

[Bug middle-end/100593] [ELF] -fno-pic: Use GOT to take address of an external default visibility function

2021-05-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
It is not necessary to change -fno-pic code generation to gain most of the
-Bsymbolic benefit: as you say, the most important point is to avoid jumping
via PLT trampolines (or, with -fno-plt, GOT loads) for function calls, so the
linker could do -Bsymbolic relaxation for sites where address doesn't matter
(calls and jumps) while keeping a dynamic relocation for address loads? Under
some new option of course, like -Bsymbolic-plt. Right?

[Bug middle-end/100593] [ELF] -fno-pic: Use GOT to take address of an external default visibility function

2021-05-17 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593

--- Comment #3 from Alexander Monakov  ---
I understand what you're saying, but it seems we're talking past each other.

I agree that if a library is linked with any -Bsymbolic* flag, the main
executable is at risk of broken address uniqueness unless it uses GOT
indirection.

I am saying that if the library was linked with a more restrictive variant of
-Bsymbolic (that I called -Bsymbolic-plt), it would still get most of the
benefit of -Bsymbolic, while remaining compatible with unmodified executables.

Would you agree?

[Bug middle-end/100593] [ELF] -fno-pic: Use GOT to take address of an external default visibility function

2021-05-17 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593

--- Comment #5 from Alexander Monakov  ---
Hm, I still don't think I'm misunderstanding what you're saying. I'm familiar
with the ELF standard (and FWIW I have read your blog posts on related
matters). I am responding to this sentiment from the opening comment:

> I believe ld -Bsymbolic-functions can materialize most of the savings other
> implementations provide, without introducing complex things to ELF.
> However, since -Bsymbolic-functions doesn't play well with -fno-pic's
> canonical PLT entries, we should fix -fno-pic.

I am saying that fixing -fno-pic is not the only possible way forward. Rather,
a restricted -Bsymbolic-functions that relaxes relocations that are not
address-significant would still deliver some (but not all) of the benefits for
unchanged -fno-pic executables.

> You misunderstand this. Emitting GOT-generating relocation in -fno-pic mode
> is the only way to avoid canonical PLT entry, if the function turns out to
> be defined in a shared object. No -Bsymbolic variant can make this
> compatible.

Well, if you frame the goal as "eliminate canonical PLT entries", then yes, but
that in itself surely is not the end goal? The end goals are reducing startup
time (which my idea helps only partially since it may bind direct calls but not
e.g. vtable definitions) and runtime overheads (where again my proposal is
weaker but not significantly so, assuming address loads are rarely on hot
paths).


To clarify once more. I am not outright rejecting the idea in your opening
comment. I am saying that there potentially is a lighter-weight alternative,
which may be implementable purely in the linker, and still gets most of the
benefit you're promoting (like in your Clang example). Which is nice, because
it can be rolled out sooner, individual libraries/distros/users can opt-in and
experiment as they like, etc.

[Bug c/100483] Extend -fno-semantic-interposition to global variables

2021-05-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100483

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
I'm afraid this is potentially misunderstanding what the word 'semantic' in
-fno-semantic-interposition implies. I am not the author, but I always
understood this like so:

GCC is concerned with two aspects of ELF interposition: address interposition
(for address uniqueness) and functionality interposition (e.g. hooking malloc).
For optimization, the compiler cares a lot about the latter (it blocks inlining
and other optimizations), but not so much about the former (taking an address
of a global is rarely on the hot paths, so it's not critical to convert GOT
loads to pc-relative relocations).

So GCC splits ELF interposition concerns to 'address interposition' and
'semantic interposition', maintains the ability to perform the former (so
address uniqueness is not broken), and allows the programmer to promise that
semantic interposition (interposing a function with another function that acts
differently) does not happen.

To illustrate, compiling

void f(){
  asm("#");
}
void *g(){
  f();
  return f;
}

with -O2 -fpic -fno-semantic-interposition yields

f:
        #
        ret
g:
        #
        movq    f@GOTPCREL(%rip), %rax
        ret

i.e. the call is inlined, but taking the address goes through the GOT.

[Bug c/100618] Add a -fno-semantic-interposition variant which allows variable interposition

2021-05-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100618

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
The implied split is 'code,data,tls-data' rather than 'functions,variables'
(there are constants too, which are not variables, but should be treated like
variables here; and TLS data does not rely on copy relocations).

Since the original option was intended to be used mainly in the negative form,
I think it may be less confusing to introduce

-f[no-]semantic-code-interposition

[Bug c/100618] Add a -fno-semantic-interposition variant which allows variable interposition

2021-05-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100618

--- Comment #3 from Alexander Monakov  ---
Furthermore, as discussed in bug 100483, this request appears to be based on a
misunderstanding of what the 'semantic-' part of the option is about. It does not
affect assembly/linker-level binding mechanism, so things like presence of copy
relocations should not be affected.

[Bug middle-end/100593] [ELF] -fno-pic: Use GOT to take address of an external default visibility function

2021-05-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100593

--- Comment #10 from Alexander Monakov  ---
Is there something wrong or undesirable with making this under -fno-plt (or the
noplt attribute as in your example)?

(after all, it is a kind of PLT-avoidance transformation, just for addressing
rather than direct calling/jumping)

[Bug libgomp/100573] [OpenMP] 'omp target teams' fails with nvptx and GCN offloading: FAIL libgomp.c-c++-common/for-3.c + for-9.c

2021-05-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100573

--- Comment #17 from Alexander Monakov  ---
Yes, I'd agree normally it's present in the offload table, but ideally if
you're trying to stub out the call, it should not be present in the offload
table.

I think Tobias is saying that on GIMPLE this function does not have 'omp target
entrypoint' attribute attached to it? If so, that's causing a problem, because
the backend will not synthesize the corresponding PTX .global function.

Each function named in the offload table should be 'omp target entrypoint'.

[Bug libgomp/100573] [OpenMP] 'omp target teams' fails with nvptx and GCN offloading: FAIL libgomp.c-c++-common/for-3.c + for-9.c

2021-05-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100573

--- Comment #19 from Alexander Monakov  ---
Ah, does the issue arise because foo._omp_fn.0 is (before the patch) callable
in two contexts, in one it's called from host and should be 'omp target
entrypoint', and in the other it's called from offloaded code and bears 'omp
declare target'?

If so, I think omp-expand code should make 'omp target entrypoint' prevail over
'omp declare target'?

[Bug libgomp/100573] [OpenMP] 'omp target teams' fails with nvptx and GCN offloading: FAIL libgomp.c-c++-common/for-3.c + for-9.c

2021-05-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100573

--- Comment #14 from Alexander Monakov  ---
I would break in gdb on cuModuleGetFunction and

  x/s $rdx

to print the failing symbol (it's the third argument to the function).

It seems the "inner" entrypoint (which your patch attempted to nullify) is
still registered in offload tables, so the plugin takes its name from the
offload table and attempts to look it up in the offloaded code?

[Bug tree-optimization/100363] gcc generating wider load/store than warranted at -O3

2021-05-01 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
The github issue has a more relevant code quote:

#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS <-- this is enabled for ARCv2
        PUP(sout) = PUP(sfrom);
#else
        PUP(sout) = UP_UNALIGNED(sfrom);
#endif


Most likely the issue is that sout/sfrom are misaligned at runtime, while the
vectorized code somewhere relies on them being sufficiently aligned for a
'short'.

It is unsafe to dereference a misaligned pointer. The pointed-to-type must have
reduced alignment:

typedef unsigned short u16_u __attribute__((aligned(1)));

u16_u *sout = ...

u16_u *sfrom = (void *)(from - OFF);

(without -ffreestanding, memcpy/memmove is a portable way to express a
misaligned access)

https://trust-in-soft.com/blog/2020/04/06/gcc-always-assumes-aligned-pointer-accesses/
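
A minimal sketch of the memcpy route mentioned in parentheses above (helper names are mine, not from the code under discussion):

```c
#include <stdint.h>
#include <string.h>

/* Read a 16-bit value from P, which may be misaligned.  memcpy makes
   no alignment assumption about its arguments.  */
uint16_t load_u16(const void *p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 16-bit value to P, which may be misaligned.  */
void store_u16(void *p, uint16_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Compilers commonly turn such fixed-size memcpy calls into plain loads and stores on targets where unaligned access is cheap, but that is an optimization and not something the code relies on.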

[Bug c/93031] Wish: When the underlying ISA does not force pointer alignment, option to make GCC not assume it

2021-05-03 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93031

--- Comment #7 from Alexander Monakov  ---
In comment #2 I touched upon a potentially more practical way to offer
-fno-strict-alignment:

Run early work with ABI alignments: compute __alignof correctly, lay out
composite types as required by ABI, and assign alignments to variables
(including stack variables and function parameters). Then make a pass over
types and reduce their alignment. This way, optimizations see a universe where
types have alignment 1, and variables are defined as if they had an explicit
attribute-align with increased alignment (and likewise for structure fields).

[Bug other/99903] 32-bit x86 frontends randomly crash while reporting timing on Windows

2021-05-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99903

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
     Ever confirmed|1       |0
             Status|WAITING |UNCONFIRMED
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
32-bit Linux should also be affected (perhaps with less probability if clock()
is more precise). It is surprising that we track time in a 'double'; a 64-bit
integer storing nanoseconds would be more appropriate.

Removing WAITING, thanks.

[Bug c++/99728] code pessimization when using wrapper classes around SIMD types

2021-03-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99728

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
My testcase changes __m256d to __v2df to avoid __may_alias__, and changes
overloaded operators to make Tvsimple objects passed by value everywhere (in an
attempt to simplify GIMPLE IR).

[Bug target/99582] No intrinsics to access rcl or rcr instruction on x86_64

2021-03-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99582

Alexander Monakov  changed:

           What    |Removed |Added
----------------------------------------------------------
                 CC|        |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
RCL and RCR are supported via microcode sequencer on Intel and involve many (9)
uops on modern AMD, so they are quite slow in comparison to simple
shifts/rotates. Would library developers still want to use them despite the
poor performance? Equivalent code with "classic" shifts should be more
efficient.

https://uops.info/html-instr/RCL_R64_CL.html
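
To show what "classic" shifts look like for the simplest case, a sketch of rotate-through-carry by one bit in plain C. Function names are mine, the carry is modelled as an explicit argument, and I make no claim about the code GCC generates for it:

```c
#include <stdint.h>

/* Rotate X left by one bit through carry: the old top bit becomes the
   new carry and the old carry enters at bit 0.  */
uint64_t rcl1(uint64_t x, unsigned carry_in, unsigned *carry_out)
{
    *carry_out = (unsigned)(x >> 63);
    return (x << 1) | (carry_in & 1u);
}

/* Rotate X right by one bit through carry: the old bit 0 becomes the
   new carry and the old carry enters at bit 63.  */
uint64_t rcr1(uint64_t x, unsigned carry_in, unsigned *carry_out)
{
    *carry_out = (unsigned)(x & 1u);
    return (x >> 1) | ((uint64_t)(carry_in & 1u) << 63);
}
```

Larger rotate counts need more care around the 65-bit wraparound and are not covered by this sketch.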

[Bug middle-end/99619] New: fails to infer local-dynamic TLS model from hidden visibility

2021-03-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99619

Bug ID: 99619
   Summary: fails to infer local-dynamic TLS model from hidden
visibility
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

Thread-local variables with hidden visibility don't need to use the
"general-dynamic" TLS model: they can use "local-dynamic" model, which is more
efficient when more than one variable is accessed. This is documented in "ELF
handling for thread-local storage".

Testcase:

__attribute__((visibility("hidden")))
extern __thread int a, b;

int f()
{
    return a + b;
}

clang -O2 -fpic emits:
f:
        .cfi_startproc
        push    rax
        .cfi_def_cfa_offset 16
        lea     rdi, [rip + a@TLSLD]
        call    __tls_get_addr@PLT
        mov     rcx, rax
        mov     eax, dword ptr [rax + b@DTPOFF]
        add     eax, dword ptr [rcx + a@DTPOFF]
        pop     rcx
        .cfi_def_cfa_offset 8
        ret

gcc -O2 -fpic emits:
f:
        .cfi_startproc
        push    rbx
        .cfi_def_cfa_offset 16
        .cfi_offset 3, -16
        data16  lea rdi, a@tlsgd[rip]
        .value  0x6666
        rex64
        call    __tls_get_addr@PLT
        mov     rbx, rax
        data16  lea rdi, b@tlsgd[rip]
        .value  0x6666
        rex64
        call    __tls_get_addr@PLT
        mov     eax, DWORD PTR [rax]
        add     eax, DWORD PTR [rbx]
        pop     rbx
        .cfi_def_cfa_offset 8
        ret

[Bug rtl-optimization/99469] ICE: qsort checking failed with selective scheduling on aarch64

2021-03-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99469

Alexander Monakov  changed:

   What|Removed |Added

 Blocks||82407

--- Comment #2 from Alexander Monakov  ---
Related: PR 84566


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82407
[Bug 82407] [meta-bug] qsort_chk fallout tracking

[Bug rtl-optimization/99462] Enhance scheduling to split instructions

2021-03-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99462

--- Comment #3 from Alexander Monakov  ---
(For context, the above patch was for PR 98856, but it's based on incorrect
latency analysis; see bug 98856 comment #38.)

Right now schedulers cannot easily split instructions for that purpose: it
would require computing the dependency graph more accurately. Currently,
dependencies and priorities are computed with respect to instructions as a
whole, whereas intelligent splitting would require tracking latencies with
respect to individual inputs.

sel-sched does not split, but it can perform "renaming" which basically
overcomes anti-dependencies by scheduling the desired instruction before the
conflicting write (by choosing a different output register), and a reg-reg move
later.

I think on modern x86 profitability of such splitting is quite dubious, because
it would increase the amount of instructions and uops flowing in the CPU
front-end and entering the renamer (which is one of narrowest points in the
pipeline). Especially on AMD, where not only load-op, but also load-op-store
instructions are renamed as a single uop (which is then sent to two or three
execution units).

I think in common cases where the overall critical path is unchanged (as in the
given examples of pinsrq and various load-op instructions), GCC should simply
continue emitting the combined form.

[Bug tree-optimization/98856] [11 Regression] botan AES-128/XTS is slower by ~17% since r11-6649-g285fa338b06b804e72997c4d876ecf08a9c083af

2021-03-08 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #38 from Alexander Monakov  ---
Late to the party, but latency analysis of vpinsrq starting from comment #18 is
incorrect: its latency is different with respect to operands.

For example, on Zen 2 latency with respect to GPR operand is long (6 cycles,
one more than gpr->xmm move latency), while latency with respect to XMM operand
is just one cycle, same as punpcklqdq. See uops.info, which also shows that
vpinsrq involves 2 uops, and it's easy to guess what they are: first uop is for
gpr->xmm inter-unit move (latency 5), and the second is SSE merge:

  https://uops.info/html-instr/VPINSRQ_XMM_XMM_R64_I8.html
  https://uops.info/html-instr/VMOVD_XMM_R32.html

So in the CPU backend there's not much difference between

movq
pinsrq

and

movq
movq
punpcklqdq

both have same uops and overall latency (1 + movq latency).

(though on Intel starting from Haswell pinsrq oddly has latency 2 w.r.t xmm
operand, but on Ice Lake it is again 1 cycle).

[Bug rtl-optimization/86096] [8 Regression] ICE: qsort checking failed (error: qsort comparator non-negative on sorted output: 0)

2021-02-16 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86096

--- Comment #8 from Alexander Monakov  ---
It was fixed on the trunk only, so as the title says it remains an issue on the
gcc-8 branch (which is still open). Bugzilla doesn't have separate resolutions
for different branches, so we cannot have this "RESOLVED" on the gcc-9/10/trunk
and "WONTFIX" on the gcc-8 branch.

[Bug rtl-optimization/100225] [8/9/10/11/12 Regression] ICE in add_cross_iteration_register_deps, at ddg.c:291

2021-04-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100225

Alexander Monakov  changed:

   What|Removed |Added

 Blocks|85099   |
 CC||amonakov at gcc dot gnu.org,
   ||zhroma at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Hi Martin, this is a modulo-scheduling bug; I think you added "Blocks:
sel-sched" by mistake — removing, and Cc'ing Roman.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85099
[Bug 85099] [meta-bug] selective scheduling issues

[Bug middle-end/102276] New: -ftrivial-auto-var-init fails to initialize a variable, causes a spurious warning

2021-09-10 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102276

Bug ID: 102276
   Summary: -ftrivial-auto-var-init fails to initialize a
variable, causes a spurious warning
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

int g(int *);
int f1()
{
    switch (0) {
        int x;
        default:
        return g(&x);
    }
}
int f2()
{
    goto L;
    {
        int x;
        L:
        return g(&x);
    }
}

Compiling with -O2 -ftrivial-auto-var-init=pattern causes spurious

warning: statement will never be executed [-Wswitch-unreachable]
5 | int x;
  | ^


In both f1 and f2, resulting assembly does not in fact initialize 'x':

f1:
        subq    $24, %rsp
        leaq    12(%rsp), %rdi
        call    g
        addq    $24, %rsp
        ret
f2:
        subq    $24, %rsp
        leaq    12(%rsp), %rdi
        call    g
        addq    $24, %rsp
        ret

[Bug middle-end/102206] amd zen hosts running zen-optimized gcc: gimplification ICE after r10-7284

2021-09-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102206

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #13 from Alexander Monakov  ---
Sergei Trofimovich made substantial progress diagnosing this on the Gentoo side,
and according to his findings the crash is due to reading stack canary from a
wrong location. This indicates that the bug is not in GCC, but in the CPU or
maybe the kernel.

Please see comments 73 and 74 in the Gentoo bugreport:
https://bugs.gentoo.org/724314#c73

[Bug middle-end/102276] -ftrivial-auto-var-init fails to initialize a variable, causes a spurious warning

2021-09-13 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102276

--- Comment #2 from Alexander Monakov  ---
That -ftrivial-auto-var-init places an initialization at the point of the
declaration is an implementation detail: there's no initializer in the testcase
itself, so it is valid C and C++ (spelling this out for the avoidance of
doubt).

[Bug target/93934] Unnecessary fld of uninitialized float stack variable results in ub of valid C++ code

2021-10-13 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93934

--- Comment #14 from Alexander Monakov  ---
Zoltan, excuse me, could you please clarify what specifically you are worried
about? Your bug title says "results in UB" and the opening comment said "the
behavior [..] is unpredictable", but as far as I see that is not the case here,
it's (as you also said) only raising an FPU exception where it shouldn't.
Unless the program inspects exception flags or enables signal delivery on FPU
exceptions, it won't notice. What is your main concern?

[Bug middle-end/21111] IA-64 NaT consumption faults due to uninitialized register reads

2021-10-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21111

--- Comment #18 from Alexander Monakov  ---
From my perspective, the main blocker for a nice and clean solution is lack of
"birth" statements on GIMPLE.

Without them, expansion to RTL would either need to insert initialization at
the top of the function (which is silly, extends lifetimes of pseudos that only
live in a small region, complicating RA), or compute something like a lowest
common dominator of all uses and place an initialization there. But perhaps
that's the right way if "birth statements" aren't happening?

Or is there some other approach? Like not trying to insert a single
initialization, but instead substituting a zero in place of each use of the
default def individually?

[Bug hsa/86948] Internal compiler error compiling brig.dg/test/gimple/mulhi.hsail

2021-12-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86948

--- Comment #8 from Alexander Monakov  ---
How does your patch expand 64-bit highpart multiply (i.e. with 128-bit full
product) on 32-bit targets?
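
For reference, the unsigned variant of such a highpart multiply can be written
with only 32x32->64 multiplies, which is roughly what a 32-bit target has
available. This is a sketch of the arithmetic, not the expansion used by the
patch under discussion; umulhi64 is a hypothetical name:

```c
#include <stdint.h>

/* High 64 bits of the full 128-bit product a * b (unsigned), built from
   four 32x32->64 partial products.  */
static uint64_t umulhi64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t p1 = a_lo * b_hi;   /* weight 2^32 */
    uint64_t p2 = a_hi * b_lo;   /* weight 2^32 */
    uint64_t p3 = a_hi * b_hi;   /* weight 2^64 */

    /* Sum the three 32-bit quantities that land in bits 32..63 of the
       product; the sum is below 2^34, so it cannot overflow, and its
       upper part is the carry into bit 64.  */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

A signed highpart multiply needs an additional correction step on top of this,
which is omitted here.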

[Bug bootstrap/91972] Bootstrap should use -Wmissing-declarations

2021-11-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91972

--- Comment #7 from Alexander Monakov  ---
As I understand, only the gcc subdirectory changed implementation language from
C to C++, so, yes (as far as this bug is concerned).

[Bug middle-end/80053] Label with address taken should prevent duplication of containing basic block

2021-07-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80053

--- Comment #13 from Alexander Monakov  ---
Yes, I'm talking only about labels which are potential branch targets; of
course, after the jumps have been DCE'd it is not really observable where the
label points to. Unfortunately after four years I do not remember which line in
the RTL machinery made me think RTL is careful about their containing blocks.

Okay, so the main motivation appears to be this: the address of a label may be
passed down to some callee, and that callee couldn't possibly use it in a goto.
So GCC doesn't want to pin down such labels and their containing blocks. This
makes sense.

Effectively GCC wants to distinguish two kinds of labels, ones that can
potentially be used as a goto target in the current function, and those that
can't.

I am not clear though if each pass is supposed to be careful about
computed-goto-like constructs (both plain and asm goto) independently, instead
of having can_duplicate_bb_p cfg hook return false for blocks with
jumpable+addressable labels (and then just using that hook in the pass)? My
testcase forced a miscompilation in the unswitching pass, but other passes
duplicate blocks too, and they may be similarly "vulnerable".
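
To make the distinction concrete, here is a sketch in GNU C (using the "labels
as values" extension) of a function whose labels are both addressable and
reachable via a computed goto. The function and label names are made up for
illustration and are unrelated to the testcase in this bug:

```c
/* GNU C extension: label addresses (&&label) and computed goto.  */
static int dispatch(int op)
{
    /* Both labels have their address taken AND can be jumped to in this
       function, so the blocks they start are the "jumpable+addressable"
       kind discussed above.  */
    static void *const targets[] = { &&op_inc, &&op_neg };
    int x = 10;

    goto *targets[op & 1];

op_inc:
    return x + 1;
op_neg:
    return -x;
}
```

If a pass duplicated the block starting at op_inc, the address stored in
targets[] could only refer to one of the copies, which is why such blocks would
need care.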

[Bug middle-end/80053] Label with address taken should prevent duplication of containing basic block

2021-07-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80053

Alexander Monakov  changed:

   What|Removed |Added

 Resolution|INVALID |---
 Status|RESOLVED|NEW

--- Comment #11 from Alexander Monakov  ---
In the comment you're pointing to Jakub admits that the compiler should account
for asm goto, which is exactly what the testcase is using.

More importantly, the crux of the issue is not whether I can produce a testcase
that satisfies you. The issue is that GIMPLE is (for no good reason as far as
I'm aware) less strict about this than RTL: RTL forbids duplication of such
blocks, GIMPLE does not. There's no explanation anywhere why the lax GIMPLE
behavior is not going to cause miscompilations. Please understand that and
don't rush to close.

[Bug middle-end/80053] Label with address taken should prevent duplication of containing basic block

2021-07-24 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80053

Alexander Monakov  changed:

   What|Removed |Added

   Last reconfirmed||2021-07-24
 Resolution|INVALID |---
 Status|RESOLVED|NEW
 Ever confirmed|0   |1

--- Comment #9 from Alexander Monakov  ---
The documentation you're pointing to makes the testcase from comment #2
invalid, and I agree that's the right solution (address of a label remains
valid only as long as its containing function has not returned, similar to
automatic variables), but what is going on in comment #0 is still broken,
please don't close this bug.

[Bug middle-end/80053] Label with address taken should prevent duplication of containing basic block

2021-07-27 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80053

--- Comment #15 from Alexander Monakov  ---
(In reply to Richard Biener from comment #14)
> I think the original asm goto case clearly remains and this is a difficult
> to handle case since the label address only appears as regular input and the
> goto target is statically represented in the CFG.  The testcase is
> miscompiled at -O2 already.
> 
> I think asm goto is prone to such miscompilation in general if combined with
> label addresses as inputs.  I don't think it was supposed to be used in this
> way so we might want to simply amend documentation to make such uses
> undefined ...  in fact one might read
> 
> "The
> @var{GotoLabels} section in an @code{asm goto} statement contains
> a comma-separated
> list of all C labels to which the assembler code may jump."
> 
> that jumps must jump to one of the labels literally (in the way documented
> later).

I don't see a contradiction? 'lp' holds the address of 'l'; label 'l' is listed
in the asm. It doesn't jump to anywhere but 'l'.

[Bug ipa/95558] [9/10/11/12 Regression] Invalid IPA optimizations based on weak definition

2022-01-17 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95558

--- Comment #10 from Alexander Monakov  ---
As comment #5 mentioned, it is still broken, you just need -fno-inline in
addition to -O2 for the original testcase. Andrew's remark is quite useful for
situations like this, you know :)

[Bug c/111884] New: unsigned char no longer aliases anything under -std=c2x

2023-10-19 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111884

Bug ID: 111884
   Summary: unsigned char no longer aliases anything under
-std=c2x
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

int f(int i)
{
    int f = 1;
    return i[(unsigned char *)&f];
}
int g(int i)
{
    int f = 1;
    return i[(signed char *)&f];
}
int h(int i)
{
    int f = 1;
    return i[(char *)&f];
}


gcc -O2 -std=c2x compiles 'f' as though inspecting representation via an
'unsigned char *' is not valid (with a confusing warning under -Wall).

[Bug c/112367] wrong rounding of sum of floating-point constants

2023-11-03 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112367

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
As far as I can tell this was broken for all targets before gcc-12, and fixed
for all targets starting from gcc-12.

Paul, can this bug be closed?

[Bug c++/66487] sanitizer/warnings for lifetime DSE

2023-10-30 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66487

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #26 from Alexander Monakov  ---
RFC patch for detecting lifetime-dse issues via Valgrind (rather than MSan):
https://inbox.sourceware.org/gcc-patches/20231024141124.210708-1-exactl...@ispras.ru/

[Bug target/111655] [11/12/13/14 Regression] wrong code generated for __builtin_signbit and 0./0. on x86-64 -O2

2023-10-02 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org
 Ever confirmed|0   |1
Summary|wrong code generated for|[11/12/13/14 Regression]
   |__builtin_signbit and 0./0. |wrong code generated for
   |on x86-64 -O2   |__builtin_signbit and 0./0.
   ||on x86-64 -O2
 Resolution|DUPLICATE   |---
 Status|RESOLVED|NEW
   Last reconfirmed||2023-10-02

--- Comment #9 from Alexander Monakov  ---
It's true that the sign of 0./0 is unpredictable, but we can fold it only when
the division is being eliminated by the folding. 

It's fine to fold

  t = 0./0;
  s = __builtin_signbit(t);

to

  s = 0

with t eliminated from IR, but it's not OK to fold

  t = 0./0
  s = __builtin_signbit(t);

to

  t = 0./0
  s = 0

because the resulting program runs as if 0./0 was evaluated twice, first to a
positive NaN (which was used for signbit), then to a negative NaN (which fed
the following computations). This is not allowed.

This bug was incorrectly classified as a dup. The fix is either to not fold
this, or fold only when we know that the division will be eliminated (e.g. the
only use was in the signbit). Reopening.

[Bug middle-end/51446] -fno-trapping-math generates NaN constant with different sign

2023-10-02 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51446

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #21 from Alexander Monakov  ---
Bug 111655 is not a dup, I left a comment and reopened.

[Bug middle-end/111655] [11/12/13/14 Regression] wrong code generated for __builtin_signbit and 0./0. on x86-64 -O2

2023-10-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655

--- Comment #11 from Alexander Monakov  ---
(In reply to Richard Biener from comment #10)
> And this conservatively has to apply to all FP divisions where we might infer
> "nonnegative" unless we can also infer !zerop?

Yes, I think the logic in tree_binary_nonnegative_warnv_p is incorrect for
floating-point division. Likewise for multiplication: it returns true for 'x *
x', but when x is a NaN, 'x * x' is also a NaN (potentially with the same
sign).


> On the side of replacing all uses I'd error on simply not folding.

Yes, as preceding transforms might have duplicated the division already. We can
only do such folding very early, when we are sure no duplication might have
taken place.


> Note 6.5.5/6 says "In both operations, if the value of the second operand is
> zero, the behavior is undefined." only remotely implying this doesn't
> apply to non-integer types (remotely by including modulo behavior in this
> sentence).
> 
> Possibly in some other place the C standard makes FP division by zero subject
> to other rules.

I think the intention is that Annex F makes it follow IEEE rules (returns an
Inf/NaN and sets FE_DIVBYZERO/FE_INVALID), but it doesn't seem to be clearly
worded, afaict.

[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3

2023-10-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
This is predcom. It's easier to see what's going wrong with

#pragma GCC unroll 99

added to the innermost loop so it's unrolled at -O2 and comparing

g++ -O2 -fpredictive-commoning --param=max-unroll-times=1

vs. plain g++ -O2 (or diffing 'pcom' dump against the preceding pass).

[Bug tree-optimization/111694] [13/14 Regression] Wrong behavior for signbit of negative zero when optimizing

2023-10-04 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111694

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org
  Component|web |tree-optimization
Summary|Wrong behavior for signbit  |[13/14 Regression] Wrong
   |of negative zero when   |behavior for signbit of
   |optimizing  |negative zero when
   ||optimizing
   Keywords||wrong-code

--- Comment #1 from Alexander Monakov  ---
Reduced:

#define signbit(x) __builtin_signbit(x)

static void test(double l, double r)
{
  if (l == r && (signbit(l) || signbit(r)))
    ;
  else
    __builtin_abort();
}

int main()
{
  test(0.0, -0.0);
}

[Bug middle-end/111701] New: [11/12/13/14 Regression] wrong code for __builtin_signbit(x*x)

2023-10-05 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111701

Bug ID: 111701
   Summary: [11/12/13/14 Regression] wrong code for
__builtin_signbit(x*x)
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, eggert at cs dot ucla.edu,
rguenth at gcc dot gnu.org, unassigned at gcc dot gnu.org
Depends on: 111655
  Target Milestone: ---
Target: x86_64-linux-gnu

+++ This bug was initially created as a clone of Bug #111655 +++

See bug 111655 comment 11: we incorrectly deduce nonnegative_p for
floating-point 'x * x', and the following aborts:

__attribute__((noipa))
static int f(float *x)
{
    *x *= *x;
    return __builtin_signbit(*x);
}

int main()
{
    float x = -__builtin_nan("");
    int s = f(&x);
    if (s != __builtin_signbit(x))
        __builtin_abort();
}


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111655
[Bug 111655] [11/12/13/14 Regression] wrong code generated for
__builtin_signbit and 0./0. on x86-64 -O2

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2023-10-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
Looks that way — even though __seg_gs AS is a subset of the generic AS, the
compiler has no way to find out the base of __seg_gs.

We already skip registering TLS data with ASan:

__thread int x;

int foo (void)
{
  return x;
}

[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly

2023-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
I'm afraid hybrid CPUs with varying ISA feature sets are not practical for the
current ecosystem: you wouldn't be able to reschedule from a higher- to
lower-capable core. Not to mention scenarios like Mesa on-disk llvmpipe shader
cache.

"Always" probing all cores is not a good idea (the compiler would have to
manually reschedule itself to all cores, of which there could be hundreds).
Plus, portable API for such probing across available cores does not exist
afaik.

I think releasing an x86 hybrid CPU with varying capabilities across cores
would require substantial preparatory work in the kernel and likely in the
userland as well, so probably best to leave it until the time comes and
specifics of what can differ are known.

[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly

2023-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768

--- Comment #9 from Alexander Monakov  ---
(In reply to Arsen Arsenović from comment #8)
> indeed (but I believe it did happen with Alder Lake already, by accident,
> with AVX512 on P-cores but not on E-cores).

AFAIK on those Alder Lake CPUs you could only get AVX-512 by disabling E-cores
in the BIOS, so you couldn't boot in a configuration where both the E-cores and
AVX-512 on P-cores are available.

[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly

2023-10-12 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768

--- Comment #11 from Alexander Monakov  ---
(In reply to Hongtao.liu from comment #10)
> > indeed (but I believe it did happen with Alder Lake already, by accident,
> > with AVX512 on P-cores but not on E-cores).
> 
> AVX512 is physically fused off for Alderlake P-core, P-core and E-core share
> the same ISA level(AVX2).

I think Arsen means initial Alder Lake batches, where AVX-512 wasn't yet fused
off (but BIOS support was unofficial/experimental anyway).

[Bug ipa/111643] __attribute__((flatten)) with -O1 runs out of memory (killed cc1)

2023-10-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111643

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #10 from Alexander Monakov  ---
(In reply to Lukas Grätz from comment #9)
> I also wondered whether
> 
> int bar_alias (void) { return bar_original(); }
> 
> could be a portable alternative to attribute alias. Except that current GCC
> does not translate it that way.

That's because function addresses are significant and so

  &bar_alias == &bar_original

must evaluate to false, but would be true for aliases.

In theory compilers could do better by introducing fall-through aliases:
https://gcc.gnu.org/wiki/cauldron2019talks?action=AttachFile&do=view&target=fallthrough-aliases.pdf
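
A small sketch of why the wrapper cannot simply be emitted as an alias. The
function names follow the quoted snippet; the bodies and the helper
addresses_differ are placeholders for illustration:

```c
int bar_original(void) { return 42; }

/* A distinct function that merely forwards to bar_original.  */
int bar_alias(void) { return bar_original(); }

/* Pointers to two different functions must compare unequal in C, so
   this returns 1.  If bar_alias were emitted as a true alias of
   bar_original, both names would resolve to one address and the
   comparison would observe that.  */
int addresses_differ(void)
{
    int (*p)(void) = bar_alias;
    int (*q)(void) = bar_original;
    return p != q;
}
```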

[Bug sanitizer/111736] Address sanitizer is not compatible with named address spaces

2023-10-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111736

--- Comment #3 from Alexander Monakov  ---
Sorry, the second half of my comment is confusing. To clarify, ASan works fine
for TLS data (the compiler knows that TLS base is at fs:0; libsanitizer uses
some hacks to initialize shadow for TLS anyway, so it seems explicit
registration is not needed).

The difference is, &x produces an address in the generic address space by using
the knowledge that fs:0 stores the segment base. For __seg_{fs,gs} that can't
be done, and &x is the offset w.r.t segment base.

[Bug tree-optimization/111694] [13/14 Regression] Wrong behavior for signbit of negative zero when optimizing

2023-10-09 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111694

--- Comment #7 from Alexander Monakov  ---
No backport for gcc-13 planned?

[Bug target/111768] X86: -march=native does not support alder lake big.little cache infor correctly

2023-10-11 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111768

--- Comment #5 from Alexander Monakov  ---
I think it's similar to attempting -march=native under distcc, which is already
warned about on Gentoo wiki: https://wiki.gentoo.org/wiki/Distcc

The difference here is that Intel so far decided to make ISA feature set the
same between 'performance' and 'power-efficient' cores, so the differences for
-march=native detection are minimal.

Intel also added a cpuid bit for hybrid CPUs, so in principle native arch
detection could inspect that bit and then override l1-cache-size to 32 KiB
(having the exact size in the param is not important, specifying a lower value
is ok), or just drop it and let cc1 fall back to the default value (64) from
params.opt.

Short term, I would advise users to add --param=l1-cache-size=32 after
-march=native in CFLAGS.

[Bug c/111210] Wrong code at -Os on x86_64-linux-gnu since r12-4849-gf19791565d7

2023-08-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111210

Alexander Monakov  changed:

   What|Removed |Added

 Resolution|--- |INVALID
 Status|UNCONFIRMED |RESOLVED
 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
'c' is called with 'd' pointing to 'long e[2]', so

  return *(int *)(d + 1);

is an aliasing violation (dereferencing a pointer to an incompatible type).
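
If the intent is to reinterpret the bytes of the 'long' array as an 'int', one
well-defined way to express that is memcpy. A minimal sketch follows; load_int
is a hypothetical helper and is not part of the reported testcase:

```c
#include <string.h>

/* Read an int from arbitrary storage by copying its bytes.  Unlike
   dereferencing a casted pointer, this does not depend on the declared
   type of the object being read.  */
static int load_int(const void *p)
{
    int v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

For example, load_int(e + 1) with 'long e[2]' reads the first sizeof(int) bytes
of e[1]; which half of the 'long' that is depends on the target's byte order.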

[Bug c/111210] Wrong code at -Os on x86_64-linux-gnu since r12-4849-gf19791565d7

2023-08-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111210

--- Comment #4 from Alexander Monakov  ---
The testcase is small enough to notice the issue by inspection.

Note that you get the "expected" answer with -fno-strict-aliasing, and as
explained in https://gcc.gnu.org/bugs/ it is one of the things you should check
when submitting a bugreport:

Before reporting that GCC compiles your code incorrectly, compile it with gcc
-Wall -Wextra and see whether this shows anything wrong with your code.
Similarly, if compiling with -fno-strict-aliasing -fwrapv
-fno-aggressive-loop-optimizations makes a difference, or if compiling with
-fsanitize=undefined produces any run-time errors, then your code is probably
not correct.

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

--- Comment #6 from Alexander Monakov  ---
Thanks.

i5-1335U has two "performance cores" (with HT, four logical CPUs) and eight
"efficiency cores". They have different micro-architecture. Are you binding the
benchmark to some core in particular?

On the "performance cores", 'add rbx, 1' can be eliminated ("executed" with
zero latency), this optimization appeared in the Alder Lake generation with the
"Golden Cove" uarch and was found by Andreas Abel. There are limitations (e.g.
it works for 64-bit additions but not 32-bit, the addend must be an immediate
less than 1024).

Of course, it is better to have 'add rbx, 1' instead of 'add rbx, rax' in this
loop on any CPU ('mov eax, 1' competes for ALU ports with other instructions,
so when it's delayed due to contention the dependent 'add rbx, rax; movsx rax,
[rbx]' get delayed too), but ascribing the difference to compiler scheduling on
a CPU that does out-of-order dynamic scheduling is strange.

[Bug middle-end/111009] [12/13/14 regression] -fno-strict-overflow erroneously elides null pointer checks and causes SIGSEGV on perf from linux-6.4.10

2023-08-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111009

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
Triggered by GIMPLE loop invariant motion lifting

  a_9 = &dso_8(D)->maj;

across a (dso != NULL) test.

[Bug rtl-optimization/111101] -finline-small-functions may invert FP arguments breaking FP bit accuracy in case of NaNs

2023-08-22 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111101

Alexander Monakov  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID
 CC||amonakov at gcc dot gnu.org

--- Comment #1 from Alexander Monakov  ---
0x7fe5ed65 is a quiet NaN, not signaling (it differs from the input 0x7fa5ed65
sNaN by the leading mantissa bit 0x00400000).

IEEE-754 does not pin down which of the two payloads should be propagated when
both operands are NaNs, and neither do language standards, so for GCC,
floating-point addition and similar operations are commutative.

Observed NaN payloads are not predictable and may change depending on
optimization level, choice of x87 vs. SSE instructions, etc. This is not a bug.
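
The quiet/signaling distinction above can be checked directly on the bit
patterns. The following is only a sketch with hypothetical helper names (none
of them come from the report), assuming the usual binary32 encoding where the
leading mantissa bit marks a quiet NaN:

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; they operate on raw binary32
   bit patterns, not on float values. */
#define F32_EXP_MASK   0x7f800000u
#define F32_MANT_MASK  0x007fffffu
#define F32_QUIET_BIT  0x00400000u  /* leading mantissa bit */

/* NaN: exponent all ones and a non-zero mantissa. */
static int f32_is_nan(uint32_t bits)
{
    return (bits & F32_EXP_MASK) == F32_EXP_MASK
           && (bits & F32_MANT_MASK) != 0;
}

/* Quiet NaN: a NaN with the leading mantissa bit set. */
static int f32_is_quiet_nan(uint32_t bits)
{
    return f32_is_nan(bits) && (bits & F32_QUIET_BIT) != 0;
}

/* Signaling NaN: a NaN with the leading mantissa bit clear. */
static int f32_is_signaling_nan(uint32_t bits)
{
    return f32_is_nan(bits) && (bits & F32_QUIET_BIT) == 0;
}
```

Applied to the two values from this report, 0x7fe5ed65 classifies as quiet and
0x7fa5ed65 as signaling; the rest of the payload is identical.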

[Bug rtl-optimization/111143] [missed optimization] unlikely code slows down diffutils x86-64 ASCII processing

2023-08-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111143

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
(In reply to Paul Eggert from comment #0)
> The "movl $1, %eax" immediately followed by "addq %rax, %rbx" is poorly
> scheduled; the resulting dependency makes the code run quite a bit slower
> than it should. Replacing it with "addq $1, %rbx" and readjusting the
> surrounding code accordingly, as is done in the attached file
> code-mcel-opt.s, causes the benchmark to run 38% faster on my laptop's Intel
> i5-1335U.

This is a mischaracterization. The modified loop has one uop less, because you
are replacing 'mov eax, 1; add rbx, rax' with 'add rbx, 1'.

To evaluate the scheduling aspect, keep 'mov eax, 1' while changing
'add rbx, rax' to 'add rbx, 1'.

There are two separate loop-carried data dependencies, both one cycle per
iteration (addition chains over r12 and rbx).

[Bug c++/104631] Visibility of static member s yields duplicate symbols.

2022-04-22 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104631

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
Note that what matters is not the type of the member, but whether template
parameters have hidden visibility, as the following example demonstrates:

struct S {
};

template<typename T>
struct TS {
__attribute__((visibility("default")))
static int i;
__attribute__((visibility("default")))
static S s;
};

template<typename T>
int TS<T>::i{};

template<typename T>
S TS<T>::s{};

template struct TS<int>;
template struct TS<S>;

[Bug target/82242] IRA spills allocno in loop body if it crosses throwing call outside the loop

2023-11-10 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82242

--- Comment #5 from Alexander Monakov  ---
The small testcase from comment 3 is now improved on trunk, possibly thanks to
work in PR 110215.

[Bug target/61810] init-regs.c papers over issues elsewhere

2022-05-20 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61810

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #8 from Alexander Monakov  ---
(In reply to Richard Biener from comment #7)
> But it looks like the testcase is broken:
> 
> __attribute__((always_inline, target("avx2")))
> static __m256i
> load8bit_4x4_avx2(const uint8_t *const src, const uint32_t stride)
> { 
>   __m128i src01, src23;
>   src01 = _mm_cvtsi32_si128(*(int32_t*)(src + 0 * stride));
>   src23 = _mm_insert_epi32(src23, *(int32_t *)(src + 3 * stride), 1);
>   return _mm256_setr_m128i(src01, src23);
> }
> 
> it seems to expect that src23 is zero before inserting the data?

If you look in the original PR 104441 testcase, it has sensible code:

static __m256i __attribute__((always_inline))
load8bit_4x4_avx2(const uint8_t *const src, const uint32_t stride)
{
  __m128i src01, src23;
  src01 = _mm_cvtsi32_si128(*(int32_t*)(src + 0 * stride));
  src01 = _mm_insert_epi32(src01, *(int32_t *)(src + 1 * stride), 1);
  src23 = _mm_cvtsi32_si128(*(int32_t*)(src + 2 * stride));
  src23 = _mm_insert_epi32(src23, *(int32_t *)(src + 3 * stride), 1);
  return _mm256_setr_m128i(src01, src23);
}

[Bug target/105513] [9/10/11/12/13 Regression] Unnecessary SSE spill since r9-5748-g1d4b4f4979171ef0

2022-05-20 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105513

--- Comment #7 from Alexander Monakov  ---
The second sequence is 3 uops vs 1/2 (issued/executed) uops in first, and on
Haswell and Skylake it ties up port 5 for two cycles.

Unclear if you're microbenchmarking latency or throughput, but in any case on
Haswell and Skylake you should see a close to 2x difference.

[Bug bootstrap/105688] Cannot build GCC 11.3 on Fedora 36

2022-05-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105688

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #12 from Alexander Monakov  ---
(In reply to Andrew Pinski from comment #7)
> /usr/bin/ld:
> /tmp/OBJDIR/x86_64-pc-linux-gnu/libstdc++-v3/src/.libs/libstdc++.so.6:
> version `GLIBCXX_3.4.30' not found (required by /usr/bin/ld)
> 
> 
> The problem is not related to GCC directly but rather ld being linked
> against a newer version of libstdc++ and now you just compiled an older
> version of libstdc++ and that is in the LD_LIBRARY_PATH some how ...  Which
> should not happen 

Could libtool be erroneously populating LD_LIBRARY_PATH?

> If anything this should be reported to binutils and have ld (I suspect gold
> here) use -static-libstdc++ -static-libgcc while linking just the same way
> GCC does.

No, this doesn't make sense, ld shouldn't work around wrong LD_LIBRARY_PATH
setting.

[Bug target/105700] GCC miscompiles? wine when using -march=pentium-m

2022-05-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105700

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
It seems you're already getting some good advice on the Wine Bugzilla (thanks
for linking it in the URL field).

There should be a note in dmesg when a process segfaults outside of a debugger.
If you run wine without gdb, and winedevice.exe crashes, is there a
corresponding message in dmesg?

Hopefully with the help of Wine folks you'll manage to attach GDB properly to
observe the crash, but one other thing you could do is bisect the miscompiled
binary: if you have two Wine installations, one broken (with -march=pentium-m)
and one working fine (without the flag), then you can take half of binaries
from one and half from another and see if it still crashes. Depending on the
outcome you know which half contains a broken binary. Continuing this process,
you can narrow it down to one file.

[Bug target/105700] GCC miscompiles? wine when using -march=pentium-m

2022-05-23 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105700

--- Comment #5 from Alexander Monakov  ---
(In reply to Artem S. Tashkinov from comment #4)
> > There should be a note in dmesg when a process segfaults outside of a
> > debugger. If you run wine without gdb, and winedevice.exe crashes, is there
> > a corresponding message in dmesg?
> 
> Just this:
> 
> [] Process 577885 (winedevice.exe) of user 1000 dumped core.
> Stack trace of thread 577888:
> #0  0xf7d51e7d n/a (n/a + 0x0)
> #1  0xf7d528f7 n/a (n/a + 0x0)
> ELF object binary architecture: Intel 80386

This is what systemd-coredump prints. Are you sure the kernel is not printing a
notification in dmesg? It may include useful information such as register state
and binary code around the failing instruction. On my system it looks like
this:

a.out[13922]: segfault at 0 ip 08049000 sp ffdc8520 error 4 in
a.out[8048000+2000]
Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  00 00 00 00 0f 0b 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00

> That's a good but quite herculean in terms of effort idea. If nothing else 
> works, I will try it.

Each step reduces the number of suspicious binaries by half, so only 7 steps
for 128 binaries.
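
As a rough sketch of that arithmetic (the helper name is hypothetical), the
worst-case number of bisection steps is the number of times the suspect set
can be halved before a single binary remains:

```c
/* Hypothetical helper: number of halving steps needed to narrow
   n suspect binaries down to one. */
static unsigned bisect_steps(unsigned n)
{
    unsigned steps = 0;
    while (n > 1) {
        n = (n + 1) / 2;   /* worst case: the culprit is in the larger half */
        steps++;
    }
    return steps;
}
```

For 128 binaries this gives 7; one more binary (129) pushes it to 8.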

[Bug tree-optimization/106019] New: Surprising SLP failure on trivial code

2022-06-17 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106019

Bug ID: 106019
   Summary: Surprising SLP failure on trivial code
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
  Target Milestone: ---

In the following code, 'f' is not SLP-vectorized, but 'g' is. From a brief look
at slp2 dump, looks like dependence analysis for p[i] vs. p[i+1] fails?

void f(double *p, long i)
{
    p[i+0] += 1;
    p[i+1] += 1;
}
void g(double *p, long i)
{
    double *q = p + i;
    q[0] += 1;
    q[1] += 1;
}

[Bug c/105863] RFE: __attribute__((incbin("file"))) or __builtin_incbin("file")

2022-06-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105863

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
This is #embed, see https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2967.htm

[Bug lto/91299] LTO inlines a weak definition in presence of a non-weak definition from an ELF file

2022-07-20 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91299

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #9 from Alexander Monakov  ---
I can reproduce it with gcc-10.2. Why is main 'overwritable', but foo is
'available'?

cat /tmp/cchaUSjV.ltrans0.o.079i.inline

;; Function main (main, funcdef_no=0, decl_uid=4385, cgraph_uid=1,
symbol_order=1) (executed once)

weakdef.c:5:5: note: Inlining foo/0 to main/1 with frequency 1.00
foo/0 (foo) @0x7f51a512d168
  Type: function definition analyzed
  Visibility: preempted_reg external public weak
  References:
  Referring:
  Function foo/0 is inline copy in main/1
  Availability: available
  Unit id: 2
  Function flags: count:1073741824 (estimated locally) body nonfreeing_fn
  Called by: main/1 (inlined) (1073741824 (estimated locally),1.00 per call)
  Calls:
main/1 (main) @0x7f51a512d000
  Type: function definition analyzed
  Visibility: externally_visible prevailing_def public
  References:
  Referring:
  Availability: overwritable
  Unit id: 2
  Function flags: count:1073741824 (estimated locally) body
only_called_at_startup nonfreeing_fn executed_once
  Called by:
  Calls: foo/0 (inlined) (1073741824 (estimated locally),1.00 per call)

[Bug rtl-optimization/101347] [11/12 Regression] ICE in cfg_layout_initialize with __builtin_setjmp and -fprofile-generate -fprofile-use

2022-07-20 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101347

Alexander Monakov  changed:

   What|Removed |Added

Summary|[11/12/13 Regression] ICE   |[11/12 Regression] ICE in
   |in cfg_layout_initialize|cfg_layout_initialize with
   |with __builtin_setjmp and   |__builtin_setjmp and
   |-fprofile-generate  |-fprofile-generate
   |-fprofile-use   |-fprofile-use

--- Comment #6 from Alexander Monakov  ---
Should be fixed on the trunk, suggestions regarding backports welcome.

[Bug tree-optimization/106422] [13 Regression] ice in duplicate_block, at cfghooks.cc:1115

2022-07-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106422

--- Comment #7 from Alexander Monakov  ---
I think item 2 from comment #3 (jump threading) still needs to be solved
independently of what is decided about item 1 (leaf functions resuming earlier
returns_twice call).

---

The problem with 'leaf' is that Glibc is taking the longjmp stipulation from
the docs ("a leaf function is not allowed to [...] longjmp into the unit")
literally:

/* All functions, except those with callbacks or those that
   synchronize memory, are leaf functions.  */
# if __GNUC_PREREQ (4, 6) && !defined _LIBC
#  define __LEAF , __leaf__
#  define __LEAF_ATTR __attribute__ ((__leaf__))
# else
#  define __LEAF
#  define __LEAF_ATTR
# endif


and marks many, many functions including 'execve' leaf, while 'execve'
obviously resumes an earlier vfork (if successful).

---

In the original testcase we have 'free' which arguably should not resume any
returns_twice function, so GIMPLE representation is correct and jump threading
needs a fix (or can_duplicate_block_p if we decide it's too conservative and
ought to check presence of abnormal edges). Please let me know if I should
split off the problem with leaf functions in a separate bug.

[Bug tree-optimization/106422] [13 Regression] ice in duplicate_block, at cfghooks.cc:1115

2022-07-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106422

--- Comment #8 from Alexander Monakov  ---
I mean the minimized testcase, the original attachment does execve/_exit after
vfork.

[Bug ipa/106437] Glibc marks functions that resume a returns_twice call as leaf

2022-07-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106437

--- Comment #1 from Alexander Monakov  ---
With the exception of '_exit', the exit family of functions (exit, _Exit,
quick_exit) is also marked leaf, despite exit and quick_exit invoking
atexit/on_exit/at_quick_exit handlers. Only _Exit is specified not to invoke
handlers. All four can resume a vfork.

[Bug tree-optimization/106422] [13 Regression] ice in duplicate_block, at cfghooks.cc:1115

2022-07-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106422

--- Comment #11 from Alexander Monakov  ---
A cleaner testcase for jump threading (still ICEs despite presence of
ABNORMAL_DISPATCHER):

void vfork() __attribute__((__leaf__));
void semanage_reload_policy(char *arg, void cb(void)) {
  if (!arg) {
    cb();
    return;
  }
  vfork();
  if (arg)
    __builtin_free(arg);
}

[Bug ipa/106437] New: Glibc marks functions that resume a returns_twice call as leaf

2022-07-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106437

Bug ID: 106437
   Summary: Glibc marks functions that resume a returns_twice call
as leaf
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: ipa
  Assignee: unassigned at gcc dot gnu.org
  Reporter: amonakov at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, asolokha at gmx dot com,
dcb314 at hotmail dot com, hubicka at gcc dot gnu.org,
marxin at gcc dot gnu.org, rguenth at gcc dot gnu.org,
unassigned at gcc dot gnu.org
  Target Milestone: ---

In tree-cfg.cc:call_can_make_abnormal_goto GCC implements an assumption that
any function with the 'leaf' attribute will not transfer control to a
returns_twice function. This behavior dates back to the introduction of
attribute-leaf, but the documentation says:

> leaf functions are not allowed to call callback function passed to it from
> current compilation unit or directly call functions exported by the unit or
> longjmp into the unit

So the manual was talking about longjmp exclusively, even though it probably
meant resumption of returns_twice calls in general.

Today Glibc headers are marking functions that can resume vfork as leaf, execve
being the biggest problem since it resumes vfork without being technically UB;
functions such as 'raise' and 'kill' can also resume vfork by terminating the
current process (but pedantically it is UB to invoke them in vfork context).

(there's also the point that 'raise' can invoke signal handlers synchronously,
and I agree with Richard that it makes it non-leaf; it's been discussed as
Glibc issue previously, the most recent instance seems to be here:
https://sourceware.org/bugzilla/show_bug.cgi?id=26802 ; ISTR there was a
discussion on GCC side also, earlier)

Presence of attribute-leaf makes GCC omit modeling of control flow transfer via
ABNORMAL_DISPATCHER, potentially causing miscompilation.

Testcase with execve, notice absence of abnormal edges on GIMPLE:

#include <sys/types.h>
#include <unistd.h>

int main()
{
    if (!vfork())
        for (;;) execve("/bin/false", 0, 0);
}

[Bug tree-optimization/106422] [13 Regression] ice in duplicate_block, at cfghooks.cc:1115

2022-07-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106422

--- Comment #10 from Alexander Monakov  ---
The leaf issue is now PR 106437.

[Bug lto/91299] LTO inlines a weak definition in presence of a non-weak definition from an ELF file

2022-07-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91299

--- Comment #11 from Alexander Monakov  ---
Marxin, you've marked this as WAITING, can you please re-evaluate? The nice
testcase from comment #2 is reproducible on trunk as well.

[Bug target/105135] [11/12/13 Regression] Optimization regression for handrolled branchless assignment since r11-4717-g3e190757fa332d32

2022-07-26 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105135

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #5 from Alexander Monakov  ---
Regarding Clang's code, the key part is not use of 8-bit operations, but setbe
(2 uops) vs. setb (1 uop):

cmpb    $25, %cl
setbe   %al

vs

cmpb    $26, %al
setb    %al

(note comparison against 25 or 26).
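
At the source level the two forms are the same predicate on an unsigned byte.
A small sketch (function names are made up for illustration; which instruction
a given compiler actually selects for each form is not guaranteed):

```c
#include <stdint.h>

/* Both functions compute the same predicate for every 8-bit input;
   only the comparison constant and the condition differ. */
static int is_below_or_equal_25(uint8_t c)
{
    return c <= 25;   /* "below or equal" form: compare against 25 */
}

static int is_below_26(uint8_t c)
{
    return c < 26;    /* "below" form: compare against 26 */
}

/* Exhaustively check equivalence; returns 1 if the forms agree everywhere. */
static int forms_agree(void)
{
    for (unsigned v = 0; v < 256; v++)
        if (is_below_or_equal_25((uint8_t)v) != is_below_26((uint8_t)v))
            return 0;
    return 1;
}
```

Since the forms are interchangeable, a compiler is free to pick whichever
condition is cheaper on the target.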

---

Regarding cmov being a lottery, unless you mean Pentium4, then not really, it's
just 1 or 2 uops, each latency 1 or 2. uops.info has very nice summaries:

https://uops.info/html-instr/CMOVB_R32_R32.html
https://uops.info/html-instr/CMOVBE_R32_R32.html

[Bug rtl-optimization/101347] [11/12/13 Regression] ICE in cfg_layout_initialize with __builtin_setjmp and -fprofile-generate -fprofile-use

2022-07-14 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101347

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #4 from Alexander Monakov  ---
The label at __builtin_setjmp_receiver was added to
nonlocal_goto_handler_labels twice (because __builtin_setjmp_setup was
duplicated), but remove_node_from_insn_list removed only the first copy. A
simple check would have caught this early:

@@ -2928,6 +2899,7 @@ remove_node_from_insn_list (const rtx_insn *node,
rtx_insn_list **listp)
  else
*listp = temp->next ();

+ gcc_checking_assert (!in_insn_list_p (temp->next (), node));
  return;
}
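
To illustrate the failure mode in isolation, here is a sketch using a plain
singly linked list (the struct and function names are hypothetical stand-ins,
not GCC's rtx_insn_list API): a remove-first routine leaves the second copy
of a duplicated entry behind.

```c
#include <stddef.h>

/* Hypothetical stand-in for an insn list node; not GCC's rtx_insn_list. */
struct node {
    int value;
    struct node *next;
};

/* Unlink only the first node holding `value`, as a remove-first routine does. */
static void remove_first(struct node **listp, int value)
{
    for (; *listp; listp = &(*listp)->next)
        if ((*listp)->value == value) {
            *listp = (*listp)->next;
            return;
        }
}

/* Return 1 if `value` is still present in the list. */
static int contains(const struct node *list, int value)
{
    for (; list; list = list->next)
        if (list->value == value)
            return 1;
    return 0;
}
```

If the same value was registered twice, contains() still returns 1 after a
single remove_first() call, which is the kind of condition the checking
assert above is meant to flag.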


I think a reasonable solution is to move registration of receiver label from
expansion of __builtin_setjmp_setup to expansion of __builtin_setjmp_receiver:

@@ -7467,15 +7467,7 @@ expand_builtin (tree exp, rtx target, rtx subtarget,
machine_mode mode,
  tree label = TREE_OPERAND (CALL_EXPR_ARG (exp, 1), 0);
  rtx_insn *label_r = label_rtx (label);

- /* This is copied from the handling of non-local gotos.  */
  expand_builtin_setjmp_setup (buf_addr, label_r);
- nonlocal_goto_handler_labels
-   = gen_rtx_INSN_LIST (VOIDmode, label_r,
-nonlocal_goto_handler_labels);
- /* ??? Do not let expand_label treat us as such since we would
-not want to be both on the list of non-local labels and on
-the list of forced labels.  */
- FORCED_LABEL (label) = 0;
  return const0_rtx;
}
   break;
@@ -7488,6 +7480,13 @@ expand_builtin (tree exp, rtx target, rtx subtarget,
machine_mode mode,
  rtx_insn *label_r = label_rtx (label);

  expand_builtin_setjmp_receiver (label_r);
+ nonlocal_goto_handler_labels
+   = gen_rtx_INSN_LIST (VOIDmode, label_r,
+nonlocal_goto_handler_labels);
+ /* ??? Do not let expand_label treat us as such since we would
+not want to be both on the list of non-local labels and on
+the list of forced labels.  */
+ FORCED_LABEL (label) = 0;
  return const0_rtx;
}
   break;

[Bug target/106277] missed-optimization: redundant movzx

2022-07-13 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106277

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #2 from Alexander Monakov  ---
You probably mean the addition, not the load.

It cannot: it really is an 8-bit addition, and if pseudo 91 is allocated to
e.g. AH or another "high" 8-bit register, there really will need to be an
explicit zero-extend.
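
A source-level sketch of the distinction (the testcase from the report is not
shown here, so the function names and shapes below are assumptions): an 8-bit
addition wraps modulo 256, and its result still has to be zero-extended when
used as a wider value.

```c
#include <stdint.h>

/* 8-bit addition: the sum is truncated to 8 bits, then widened. */
static uint32_t add8_then_widen(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);  /* truncating 8-bit add */
    return sum;                      /* zero-extension to 32 bits */
}

/* 32-bit addition of the same inputs: the carry out of bit 7 is kept. */
static uint32_t add32(uint8_t a, uint8_t b)
{
    return (uint32_t)a + b;
}
```

With inputs 200 and 100 the first function yields 44 and the second 300, so
the two operations are not interchangeable.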
