[Bug middle-end/116651] New: Memory allocation elision for std::vector like cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116651 Bug ID: 116651 Summary: Memory allocation elision for std::vector like cases Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Code close to the following was noted in many user applications: bool test1(const std::vector& in) { return in == std::vector{"*"}; } Here people wish to make sure that the vector contains only a single "*" element. In other words they assume that the above code snippet would be optimized to something like: bool test2(const std::vector& in) { return in.size() == 1 && in[0] == "*"; } Unfortunately that does not happen: https://godbolt.org/z/r59a4nobP Note that all the functions are inlined however the new+delete are not elided. Minimized reproducer: https://godbolt.org/z/jvcEd8zo6
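The two forms from the report can be compared side by side. This is only a sketch: the element type is not visible in the text above, so `std::vector<std::string>` is an assumption, and `check_vector_forms` is a hypothetical helper added for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Assumption: the element type is std::string (not visible in the report text).
inline bool test1(const std::vector<std::string>& in) {
  // Builds a temporary vector, which normally allocates.
  return in == std::vector<std::string>{"*"};
}

inline bool test2(const std::vector<std::string>& in) {
  // Hand-written equivalent without the temporary vector.
  return in.size() == 1 && in[0] == "*";
}

// Hypothetical helper: both forms give the same answer on a few inputs.
inline void check_vector_forms() {
  const std::vector<std::string> samples[] = {{}, {"*"}, {"x"}, {"*", "*"}};
  for (const auto& s : samples) assert(test1(s) == test2(s));
}
```

Both functions return the same result; the report is about the compiler not deriving the second form from the first.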
[Bug middle-end/115309] New: Simple coroutine based generator is not optimized well
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115309

Bug ID: 115309
Summary: Simple coroutine based generator is not optimized well
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the following minimal C++ coroutine based generator:

#include <coroutine>

namespace {

struct generator {
  struct promise_type {
    using handle = std::coroutine_handle<promise_type>;
    unsigned value{};

    generator get_return_object() noexcept {
      return generator{handle::from_promise(*this)};
    }
    std::suspend_never initial_suspend() noexcept { return {}; }
    std::suspend_always final_suspend() noexcept { return {}; }
    void return_void() noexcept {}
    void unhandled_exception() { __builtin_abort(); }
    std::suspend_always yield_value(unsigned v) noexcept {
      value = v;
      return {};
    }
  };

  ~generator() noexcept { m_coro.destroy(); }
  unsigned operator*() { return m_coro.promise().value; }

private:
  promise_type::handle m_coro;
  explicit generator(promise_type::handle coro) noexcept: m_coro{coro} {}
};

generator generate_1() { co_yield 1; }

}

unsigned test() {
  auto gen = generate_1();
  return *gen;
}

The expected assembly is:

test():
    mov eax, 1
    ret

However, trunk GCC with `-O2 -std=c++23` flags generates 60+ instructions with dynamic memory allocations and function calls.

Godbolt playground: https://godbolt.org/z/6PvfTfx9n

It looks like the main part of the problem is the missing allocation elision for the coroutine.

Note that the same problem arises with the Standard C++ std::generator: https://godbolt.org/z/EvEPT7d1T
[Bug middle-end/114661] New: Bit operations not optimized to multiplication
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114661

Bug ID: 114661
Summary: Bit operations not optimized to multiplication
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the example:

unsigned mul(unsigned char c) {
  if (c > 3) __builtin_unreachable();
  return c << 18 | c << 15 | c << 12 | c << 9 | c << 6 | c << 3 | c;
}

GCC with -O2 generates the following assembly:

mul(unsigned char):
    movzx edi, dil
    lea edx, [rdi+rdi*8]
    lea eax, [0+rdx*8]
    mov ecx, edx
    sal edx, 15
    or eax, edi
    sal ecx, 9
    or eax, ecx
    or eax, edx
    ret

However it could be optimized to just:

mul(unsigned char):
    imul eax, edi, 299593
    ret

Compiling with -Os does not help.

Godbolt playground: https://godbolt.org/z/YszzMbovK

P.S.: without `c << 18 | c << 15 |` the bit operations are transformed to multiplication.
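The constant 299593 follows from the shape of the expression: `c` is at most 3, so it fits in 2 bits, and the shift amounts are 3 apart, so the shifted copies never overlap. The OR is therefore a sum, and the sum is `c * (1 + 2^3 + 2^6 + 2^9 + 2^12 + 2^15 + 2^18)`. A small sketch that checks this for every permitted input (the helper names are illustrative and not from the report):

```cpp
#include <cassert>

// The expression from the report, without the __builtin_unreachable() hint.
constexpr unsigned mul_shifts(unsigned char c) {
  return c << 18 | c << 15 | c << 12 | c << 9 | c << 6 | c << 3 | c;
}

// The proposed single multiplication.
constexpr unsigned mul_const(unsigned char c) { return c * 299593u; }

// 1 + 8 + 64 + 512 + 4096 + 32768 + 262144 == 299593
static_assert((1u | 1u << 3 | 1u << 6 | 1u << 9 | 1u << 12 | 1u << 15 | 1u << 18) == 299593u,
              "the multiplier is the OR of the shifted ones");

// Hypothetical helper: compare both forms for c in [0, 3].
inline bool mul_forms_agree() {
  for (unsigned c = 0; c <= 3; ++c) {
    const auto uc = static_cast<unsigned char>(c);
    if (mul_shifts(uc) != mul_const(uc)) return false;
  }
  return true;
}

inline void check_mul_forms() { assert(mul_forms_agree()); }
```

For `c > 3` the two forms may differ, which is why the range hint matters.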
[Bug middle-end/114660] Exponentiating by squaring not performed for x * y * y * y * y
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114660 --- Comment #1 from Antony Polukhin --- The above godbolt link is for an old version of GCC; here is one for 14.0: https://godbolt.org/z/dTPYY1T9W
[Bug middle-end/114660] New: Exponentiating by squaring not performed for x * y * y * y * y
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114660

Bug ID: 114660
Summary: Exponentiating by squaring not performed for x * y * y * y * y
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

For the following code:

int mul(int x, int y) {
  return x * y * y * y * y;
}

with -O2 GCC produces the following assembly:

mul(int, int):
    mov eax, edi
    imul eax, esi
    imul eax, esi
    imul eax, esi
    imul eax, esi
    ret

However, more optimal code could be generated with fewer multiplications:

mul(int, int):
    mov eax, edi
    imul esi, esi
    imul eax, esi
    imul eax, esi
    ret

Godbolt playground: https://godbolt.org/z/6dP11jPfx
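At the source level, the shorter assembly corresponds to squaring `y` once and reusing the square. A minimal sketch, assuming unsigned operands so that wrap-around is well defined (the report uses `int`; the function names here are illustrative):

```cpp
#include <cassert>

// Four multiplications, as written in the report (unsigned is an assumption).
constexpr unsigned mul_naive(unsigned x, unsigned y) { return x * y * y * y * y; }

// Three multiplications: one to square y, two to combine.
constexpr unsigned mul_squared(unsigned x, unsigned y) {
  const unsigned y2 = y * y;
  return x * y2 * y2;
}

static_assert(mul_naive(3, 5) == 1875u, "3 * 5^4");
static_assert(mul_squared(3, 5) == 1875u, "3 * (5^2)^2");

// Hypothetical helper: the forms also agree when the product wraps.
inline void check_mul_squared() {
  assert(mul_naive(7u, 100000u) == mul_squared(7u, 100000u));
}
```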
[Bug middle-end/114559] New: After function inlining some optimizations missing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114559

Bug ID: 114559
Summary: After function inlining some optimizations missing
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the example:

template <class Func>
int AtomicUpdate(int& atomic, Func updater) {
  int old_value = atomic;
  while (true) {
    const int new_value = updater(int{old_value});
    if (old_value == new_value) return old_value;
    if (__atomic_compare_exchange_n(&atomic, &old_value, new_value, 1, 5, 5)) return new_value;
  }
}

int AtomicMin(int& atomic, int value) {
  return AtomicUpdate(atomic, [value](int old_value) {
    return value < old_value ? value : old_value;
  });
}

With -O2 GCC produces the assembly:

AtomicMin(int&, int):
    mov eax, DWORD PTR [rdi]
.L3:
    cmp esi, eax
    mov edx, eax
    cmovle edx, esi
    jge .L4
    lock cmpxchg DWORD PTR [rdi], edx
    jne .L3
.L1:
    mov eax, edx
    ret
.L4:
    mov edx, eax
    jmp .L1

However, a more optimal assembly is possible:

AtomicMin(int&, int):               # @AtomicMin(int&, int)
    mov eax, dword ptr [rdi]
.LBB0_1:                            # =>This Inner Loop Header: Depth=1
    cmp eax, esi
    jle .LBB0_4
    lock cmpxchg dword ptr [rdi], esi
    jne .LBB0_1
    mov eax, esi
.LBB0_4:
    ret

Note that manual inlining of the lambda improves the codegen:

int AtomicMin(int& atomic, int value) {
  int old_value = atomic;
  while (true) {
    const int new_value = (value < old_value ? value : old_value);
    if (old_value == new_value) return old_value;
    if (__atomic_compare_exchange_n(&atomic, &old_value, new_value, 1, 5, 5)) return new_value;
  }
}

This results in:

AtomicMin(int&, int):
    mov eax, DWORD PTR [rdi]
.L3:
    cmp esi, eax
    mov edx, eax
    cmovle edx, esi
    jge .L1
    lock cmpxchg DWORD PTR [rdi], edx
    jne .L3
.L1:
    mov eax, edx
    ret

Godbolt playground: https://godbolt.org/z/G6YEGb15q
[Bug middle-end/114391] catch() and immediate throw; could be optimized to noop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114391

--- Comment #2 from Antony Polukhin ---

> Is there something to optimize when foo() cannot be tail-called?

Yes. Just `catch (...) { throw; }`, no more restrictions. I do not even think that it has to be the outermost EH region:

void foo();
void bar();

void test() {
  try {
    foo();
  } catch (...) {
    throw;
  }
  bar();
}

is fine to optimize to just

void test() {
  foo();
  bar();
}
[Bug middle-end/114391] New: catch() and immediate throw; could be optimized to noop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114391

Bug ID: 114391
Summary: catch() and immediate throw; could be optimized to noop
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the example:

void foo();

void test() {
  try {
    foo();
  } catch (...) {
    throw;
  }
}

At the moment, the compiler at -O2 generates the assembly:

test():
    sub rsp, 24
    call foo()
    add rsp, 24
    ret
    mov rdi, rax
    jmp .L2
test() [clone .cold]:
.L2:
    call __cxa_begin_catch
    call __cxa_rethrow
    mov QWORD PTR [rsp+8], rax
    call __cxa_end_catch
    mov rdi, QWORD PTR [rsp+8]
    call _Unwind_Resume

However, an optimal assembly is:

test():
    jmp foo()

Please add an optimization that removes catch() + immediate throw.

Such code is often encountered in release builds, because invariant checks or debug logging are compiled out depending on NDEBUG:

void test() {
  try {
    foo();
  } catch (...) {
#ifndef NDEBUG
    std::cerr << "Unhandled exception!" << std::endl
              << boost::current_exception_diagnostic_information();
#endif
    throw;
  }
}

Godbolt playground: https://godbolt.org/z/qdG91cMe1
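From the point of view of a caller that handles the exception, the wrapper with `catch (...) { throw; }` and the plain call behave the same. The sketch below illustrates only that caller-visible behaviour; `foo_stub`, `with_rethrow`, `plain` and `escapes` are hypothetical names, and the stub stands in for the external `foo()` of the report.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the external foo() from the report.
inline void foo_stub(bool fail) {
  if (fail) throw std::runtime_error("boom");
}

// Catches everything and immediately rethrows, as in the report.
inline void with_rethrow(bool fail) {
  try {
    foo_stub(fail);
  } catch (...) {
    throw;
  }
}

// The form the report would like the compiler to produce.
inline void plain(bool fail) { foo_stub(fail); }

// Returns true if a std::runtime_error propagates out of f(fail).
template <class F>
bool escapes(F f, bool fail) {
  try {
    f(fail);
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}

inline void check_rethrow_forms() {
  assert(escapes(with_rethrow, true) == escapes(plain, true));
  assert(escapes(with_rethrow, false) == escapes(plain, false));
}
```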
[Bug middle-end/114270] New: Integer multiplication on floating point constant with conversion back to integer is not optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114270

Bug ID: 114270
Summary: Integer multiplication on floating point constant with conversion back to integer is not optimized
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the following example:

unsigned test(unsigned x) {
  return (unsigned)(x * 0.5);
}

With -O2 GCC generates code with an actual conversion to floating point and a multiplication:

test(unsigned int):
    mov edi, edi
    pxor xmm0, xmm0
    cvtsi2sd xmm0, rdi
    mulsd xmm0, QWORD PTR .LC0[rip]
    cvttsd2si rax, xmm0
    ret

However, the multiplication does not overflow and the floating point constant is a normal number. More optimal code would look like the following:

test(unsigned int):
    mov eax, edi
    shr eax
    ret

Probably the optimization could be used for
* any multiplication of an integer by a positive fp-number less than or equal to 1.0
* any division of an integer by a positive fp-number greater than or equal to 1.0
if the result is converted back to integer
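For the `0.5` case the equivalence is easy to see: every 32-bit unsigned value is exactly representable as a `double`, halving it is exact, and the conversion back truncates toward zero, which matches a right shift by one. A sketch under the assumption of a 32-bit unsigned type (spelled `std::uint32_t` here; the function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// The form from the report: go through double and truncate back.
inline std::uint32_t half_fp(std::uint32_t x) {
  return static_cast<std::uint32_t>(x * 0.5);
}

// The proposed integer-only form.
constexpr std::uint32_t half_shift(std::uint32_t x) { return x >> 1; }

// Hypothetical helper: compare both forms on boundary values.
inline void check_half_forms() {
  const std::uint32_t samples[] = {0u, 1u, 2u, 3u, 0x7FFFFFFFu, 0x80000001u, 0xFFFFFFFFu};
  for (std::uint32_t x : samples) assert(half_fp(x) == half_shift(x));
}
```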
[Bug middle-end/113959] New: Optimize `__builtin_isnan(x) || __builtin_isinf(x)` to `__builtin_isfinite(x)`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113959

Bug ID: 113959
Summary: Optimize `__builtin_isnan(x) || __builtin_isinf(x)` to `__builtin_isfinite(x)`
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Sometimes people check for a non-finite number using `__builtin_isnan(x) || __builtin_isinf(x)`. However, the equivalent `!__builtin_isfinite(x)` produces better assembly. Please add the optimization.

Godbolt playground: https://godbolt.org/z/5r38169fn
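A value is either finite, infinite, or NaN, so "NaN or infinite" is the negation of "finite". The sketch below uses the `<cmath>` classification functions rather than the builtins, and the function names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Two classification checks, as people often write it.
inline bool not_finite_two_checks(double x) { return std::isnan(x) || std::isinf(x); }

// Single check with the same meaning.
inline bool not_finite_one_check(double x) { return !std::isfinite(x); }

// Hypothetical helper: compare both forms on representative values.
inline void check_finite_forms() {
  using lim = std::numeric_limits<double>;
  const double samples[] = {0.0,         -0.0,           1.5,
                            lim::max(),  lim::denorm_min(), lim::infinity(),
                            -lim::infinity(), lim::quiet_NaN()};
  for (double x : samples) assert(not_finite_two_checks(x) == not_finite_one_check(x));
}
```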
[Bug tree-optimization/112683] New: Optimizing memcpy range by extending to word bounds
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112683

Bug ID: 112683
Summary: Optimizing memcpy range by extending to word bounds
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the minimized source code from libstdc++

```
struct string {
  unsigned long _M_string_length;
  enum { _S_local_capacity = 15 };
  char _M_local_buf[_S_local_capacity + 1];
};

string copy(const string& __str) noexcept {
  string result;
  if (__str._M_string_length > __str._S_local_capacity)
    __builtin_unreachable();
  result._M_string_length = __str._M_string_length;
  __builtin_memcpy(result._M_local_buf, __str._M_local_buf,
                   __str._M_string_length + 1);
  return result;
}
```

Right now GCC with -O2 emits a long assembly with ~50 instructions https://godbolt.org/z/a89bh17hd

However, note that
* the `result._M_local_buf` is uninitialized,
* there are at most 16 bytes to copy to `result._M_local_buf`, which is 16 bytes in size

So the compiler could optimize the code to always copy 16 bytes. The behavior change is not observable by the user, as the uninitialized bytes could contain any data, including the same bytes as `__str._M_local_buf`.

As a result of always copying 16 bytes, the assembly becomes more than 7 times shorter and the conditional jumps go away: https://godbolt.org/z/r5GPYTs4Y
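The difference between the two copies can be written out by hand. This is a sketch only: `small_string`, `copy_exact`, `copy_whole` and `make_small` are hypothetical names that mirror the minimized struct, and the source buffer is zero-initialized here so that every copied byte has a defined value.

```cpp
#include <cassert>
#include <cstring>

// Mirrors the minimized struct from the report (names are illustrative).
struct small_string {
  unsigned long length;
  char local_buf[16];
};

// Run-time length: the compiler has to handle every size from 1 to 16.
inline small_string copy_exact(const small_string& s) {
  small_string r;
  r.length = s.length;
  std::memcpy(r.local_buf, s.local_buf, s.length + 1);
  return r;
}

// Compile-time length: always copy the whole 16-byte buffer.
inline small_string copy_whole(const small_string& s) {
  small_string r;
  r.length = s.length;
  std::memcpy(r.local_buf, s.local_buf, sizeof r.local_buf);
  return r;
}

// Hypothetical helper: builds a fully initialized short string.
inline small_string make_small(const char* text) {
  small_string s{};  // zero the buffer so copy_whole reads defined bytes
  s.length = static_cast<unsigned long>(std::strlen(text));
  assert(s.length <= sizeof s.local_buf - 1);
  std::memcpy(s.local_buf, text, s.length + 1);
  return s;
}
```

Both copies yield the same string contents; only the bytes past the terminator can differ, and those are not part of the string's value.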
[Bug libstdc++/112682] New: More efficient std::basic_string move construction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112682

Bug ID: 112682
Summary: More efficient std::basic_string move construction
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

A few places in bits/basic_string.h have the following code:

```
if (__str._M_is_local())
  {
    _M_init_local_buf();
    traits_type::copy(_M_local_buf, __str._M_local_buf,
                      __str.length() + 1);
  }
```

Despite knowing that `__str.length()` is not greater than 15, the compiler emits (and inlines) a memcpy call. That results in quite a big set of instructions https://godbolt.org/z/j35MMfxzq

Replacing `__str.length() + 1` with `_S_local_capacity + 1` explicitly forces the compiler to copy the whole `__str._M_local_buf`. As a result the assembly becomes almost 5 times shorter, without any function calls or multiple conditional jumps https://godbolt.org/z/bfq8bxra9

P.S.: I am not sure whether it is allowed to copy uninitialized data via traits_type::copy, or whether the sanitizers would be happy with such a copy attempt.
[Bug tree-optimization/112584] New: Suboptimal stack usage on third memcpy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112584

Bug ID: 112584
Summary: Suboptimal stack usage on third memcpy
Product: gcc
Version: 13.2.1
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the example:

struct string_view {
  const char* data;
  unsigned long size;
};

void AppendToCharArray(char*& data, string_view s1, string_view s2, string_view s3) {
  __builtin_memcpy(data, s1.data, s1.size);
  data += s1.size;
  __builtin_memcpy(data, s2.data, s2.size);
  data += s2.size;
  __builtin_memcpy(data, s3.data, s3.size);
  data += s3.size;
}

With -O2 GCC generates assembly with 6 push and 6 pop instructions. However, better assembly is possible:

    push r15
    push r14
    push r12
    push rbx
    push rax
    mov rbx, r8
    mov r14, rcx
    mov r15, rdx
    mov r12, rdi
    mov rdi, qword ptr [rdi]
    call memcpy
    add r15, qword ptr [r12]
    mov qword ptr [r12], r15
    mov rdi, r15
    mov rsi, r14
    mov rdx, rbx
    call memcpy
    add rbx, qword ptr [r12]
    mov qword ptr [r12], rbx
    mov rsi, qword ptr [rsp + 48]
    mov r14, qword ptr [rsp + 56]
    mov rdi, rbx
    mov rdx, r14
    call memcpy
    add qword ptr [r12], r14
    add rsp, 8
    pop rbx
    pop r12
    pop r14
    pop r15
    ret

Godbolt playground: https://godbolt.org/z/EY8E1GGPz
[Bug libstdc++/112440] New: Compiler does not grok basic_string::resize and basic_string::reserve if _CharT is char
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112440

Bug ID: 112440
Summary: Compiler does not grok basic_string::resize and basic_string::reserve if _CharT is char
Product: gcc
Version: 13.2.1
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the example:

#include <string>

void test1(std::size_t summ) {
  std::string result;
  result.resize(summ);
  if (result.size() > summ) {
    __builtin_abort();
  }
}

The resulting assembly contains `call abort` and code to check the string size: https://godbolt.org/z/zcj3Pc3G8

It looks like this is due to char* aliasing with the string internals; switching to std::u8string removes the `call abort` related assembly: https://godbolt.org/z/a6bKaqqn5

I've failed to come up with a generic solution, but it looks like adding __builtin_unreachable() to the end of basic_string::resize and basic_string::reserve helps: https://godbolt.org/z/vWcjqGK94

P.S.: such hints help to shorten the assembly for reserve+append*n cases https://godbolt.org/z/nsEGsWdP3 , https://godbolt.org/z/qMf4b7dd8 , https://godbolt.org/z/1r6dd6d5M which are quite common
[Bug c++/111690] New: Redefinition of operator == not detected with friend <=>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111690

Bug ID: 111690
Summary: Redefinition of operator == not detected with friend <=>
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: accepts-invalid
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the example:

```
#include <compare>

struct Foo {
  friend auto operator<=>(const Foo&, const Foo&) = default;
};

bool operator==(const Foo& x, const Foo& y) noexcept {
  return true;
}

void Test() {
  Foo{} == Foo{};
}
```

If my reading of [class.compare.default] p4 is correct, then an == operator function is already declared implicitly due to operator<=>. So there should be an error about redeclaring or redefining operator==.

Godbolt playground: https://godbolt.org/z/YP5vEMeYs
[Bug target/110170] Sub-optimal conditional jumps in conditional-swap with floating point
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110170 --- Comment #13 from Antony Polukhin --- There's a typo at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/testsuite/g%2B%2B.target/i386/pr110170.C;h=e638b12a5ee2264ecef77acca86432a9f24b103b;hb=d41a57c46df6f8f7dae0c0a8b349e734806a837b#l87 It should be `|| !test3() || !test3r()` rather than `|| !test3() || !test4r()`
[Bug target/110457] Unnecessary movsx eax, dil
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110457 --- Comment #4 from Antony Polukhin --- Oh, if there's a disagreement, I'm fine with closing this issue as invalid/later/won't_fix
[Bug tree-optimization/110459] New: Trivial on stack variable was not optimized away
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110459

Bug ID: 110459
Summary: Trivial on stack variable was not optimized away
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the example:

struct array {
  char data[4];
};

auto sample2(char c) {
  array buffer = {c, 0, 0, 0};
  return buffer;
}

With GCC-14 and -O2 it produces the following assembly:

sample2(char):
    xor eax, eax
    mov BYTE PTR [rsp-22], 0
    mov WORD PTR [rsp-24], ax
    mov eax, DWORD PTR [rsp-24]
    sal eax, 8
    mov al, dil
    ret

It could be further optimized to just:

sample2(char):
    movzx eax, dil
    ret

Godbolt playground: https://godbolt.org/z/nxKhvo3ns
[Bug target/110457] Unnecessary movsx eax, dil
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110457

--- Comment #1 from Antony Polukhin ---

> However, it could be shortened to just:

sample1(char):
    imul eax,edi,0x10111
    ret ; missed in previous message
[Bug target/110457] New: Unnecessary movsx eax, dil
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110457

Bug ID: 110457
Summary: Unnecessary movsx eax, dil
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

For the following code

int sample1(char c) {
  return (c << 4) + (c << 8) + (c << 16) + c;
}

GCC-14 with -O2 generates the assembly:

sample1(char):
    movsx eax,dil
    imul eax,eax,0x10111
    ret

However, it could be shortened to just:

sample1(char):
    imul eax,edi,0x10111

Godbolt playground: https://godbolt.org/z/7GGdedEY8
[Bug c++/110363] New: New use-after-move warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110363

Bug ID: 110363
Summary: New use-after-move warning
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: diagnostic
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

There is a quite common rule: "after an object has been moved from, it is in an indeterminate state; it should either be destroyed or have a new value assigned to it". The C++ Standard Library follows that rule, and many libraries and companies follow it too.

Please introduce some '-Wuse-after-move' that warns if an object could be used after a move:

struct resource {
  resource(resource&&) noexcept;
  ~resource();
  void kill_it() && noexcept;
  void should_warn_use_after_move() const & noexcept;
};

void should_warn_use_after_move(resource& r) noexcept;

void do_something(resource r) {
  static_cast<resource&&>(r).kill_it();  // moved out
  should_warn_use_after_move(r);         // warn
  r.should_warn_use_after_move();        // warn
}

A related request on stackoverflow: https://stackoverflow.com/questions/72532377/g-detect-use-after-stdmove
[Bug tree-optimization/110362] New: Range information on lower bytes of __uint128_t
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110362 Bug ID: 110362 Summary: Range information on lower bytes of __uint128_t Product: gcc Version: 14.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Consider the following example: int test (__uint128_t a, __uint128_t b) { __uint128_t __a = b | (a << 32); return __a & 0x; } At the moment GCC-14 with -O2 generates the following assembly: test(unsigned __int128, unsigned __int128): mov rsi, rdi mov rax, rdx sal rsi, 32 or rax, rsi ret Which could be simplified to just: test(unsigned __int128, unsigned __int128): mov rax, rdx ret Godbolt playground: https://godbolt.org/z/K9x5vnhxq
[Bug target/110170] Sub-optimal conditional jumps in conditional-swap with floating point
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110170 --- Comment #2 from Antony Polukhin --- -fno-trapping-math had no effect Some tests with nans seem to produce the same results for both code snippets: https://godbolt.org/z/GaKM3EhMq
[Bug tree-optimization/110170] New: Sub-optimal conditional jumps in conditional-swap with floating point
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110170

Bug ID: 110170
Summary: Sub-optimal conditional jumps in conditional-swap with floating point
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Some of the C++ algorithms are written in an attempt to avoid conditional jumps in tight loops. For example, code close to the following can be seen in libc++:

void __cond_swap(double* __x, double* __y) {
  bool __r = (*__x < *__y);
  auto __tmp = __r ? *__x : *__y;
  *__y = __r ? *__y : *__x;
  *__x = __tmp;
}

GCC-14 with -O2 and -march=x86-64 options generates the following code:

__cond_swap(double*, double*):
    movsd xmm1, QWORD PTR [rdi]
    movsd xmm0, QWORD PTR [rsi]
    comisd xmm0, xmm1
    jbe .L2
    movq rax, xmm1
    movapd xmm1, xmm0
    movq xmm0, rax
.L2:
    movsd QWORD PTR [rsi], xmm1
    movsd QWORD PTR [rdi], xmm0
    ret

A conditional jump could probably be avoided in the following way:

__cond_swap(double*, double*):
    movsd xmm0, qword ptr [rdi]
    movsd xmm1, qword ptr [rsi]
    movapd xmm2, xmm0
    minsd xmm2, xmm1
    maxsd xmm1, xmm0
    movsd qword ptr [rsi], xmm1
    movsd qword ptr [rdi], xmm2
    ret

Playground: https://godbolt.org/z/v3jW67x91
[Bug tree-optimization/109931] Knowledge on literal not used in optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109931

--- Comment #3 from Antony Polukhin ---

> But that's because nothing in the function asserts this? Without fully
> specializing and unrolling on the constant "hello" argument at least.

Yes, I was hoping for that unrolling to happen.

Probably a more simplified case:

constexpr bool EqualICase(const char* lowercase, const char* y) noexcept {
  for (;;) {
    const auto lowercase_c = *lowercase;
    if (!lowercase_c) return true;
    if (lowercase_c != *y) {
      return false;
    }
    ++lowercase;
    ++y;
  }
}

bool test2(const char* y) {
  return EqualICase("he", y);
}

With range info for loads from read-only constants I'd expect this to become just

test2(char const*):
    cmp BYTE PTR [rdi], 104
    jne .L3
    cmp BYTE PTR [rdi+1], 101
    sete al
    ret
.L3:
    xor eax, eax
    ret

rather than an actual loop with checks for \0

Godbolt: https://godbolt.org/z/z6rTYEzWx
[Bug tree-optimization/109931] New: Knowledge on literal not used in optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109931

Bug ID: 109931
Summary: Knowledge on literal not used in optimization
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

A function for comparing a lower-cased string with a runtime string of known size:

constexpr bool ICaseEqualLowercase(const char* lowercase, const char* y,
                                   unsigned size) noexcept {
  constexpr char kLowerToUpperMask = static_cast<char>(~unsigned{32});
  for (unsigned i = 0; i < size; ++i) {
    const auto lowercase_c = lowercase[i];
    if (lowercase_c != y[i]) {
      if (!('a' <= lowercase_c && lowercase_c <= 'z') ||
          (lowercase_c & kLowerToUpperMask) != y[i]) {
        return false;
      }
    }
  }
  return true;
}

bool test2(const char* y) {
  return ICaseEqualLowercase("hello", y, 5);
}

With GCC trunk and -O2, GCC fails to understand that all the characters of `lowercase` are lowercase ASCII and that the expression `!('a' <= lowercase_c && lowercase_c <= 'z')` is always `false`. Because of that, additional instructions are emitted in the loop:

    lea esi, [rdx-97]
    cmp sil, 25
    ja .L6

Godbolt playground: https://godbolt.org/z/xrc1T4oeW
[Bug tree-optimization/109829] New: Optimizing __builtin_signbit(x) ? -x : x or abs for FP
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109829

Bug ID: 109829
Summary: Optimizing __builtin_signbit(x) ? -x : x or abs for FP
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

Consider the following 2 functions:

__float128 abs1(__float128 x) {
  return __builtin_fabsf128(x);
}

__float128 abs2(__float128 x) {
  return __builtin_signbit(x) ? -x : x;
}

They should provide the same results, however the codegen is different:

abs1(__float128):
    pand xmm0, XMMWORD PTR .LC0[rip]
    ret
abs2(__float128):
    movmskps eax, xmm0
    test al, 8
    je .L4
    pxor xmm0, XMMWORD PTR .LC1[rip]
.L4:
    ret

It looks like match.pd misses the __builtin_signbit(x) ? -x : x -> __builtin_fabs*(x) pattern.

Playground: https://godbolt.org/z/bsxeozGqv
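Both functions clear the sign bit: `fabs` does so directly, and the conditional form negates only when the sign bit is set. A sketch of the same pair for `double` (an assumption made for portability; the report uses `__float128`), with illustrative function names:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Library absolute value.
inline double abs_fabs(double x) { return std::fabs(x); }

// Sign-bit test followed by a conditional negation, as in the report.
inline double abs_signbit(double x) { return std::signbit(x) ? -x : x; }

// Hypothetical helper: same value and same (cleared) sign bit for both forms.
inline void check_abs_forms() {
  const double inf = std::numeric_limits<double>::infinity();
  const double samples[] = {0.0, -0.0, 1.5, -1.5, inf, -inf};
  for (double x : samples) {
    assert(abs_fabs(x) == abs_signbit(x));
    assert(!std::signbit(abs_fabs(x)));
    assert(!std::signbit(abs_signbit(x)));
  }
}
```

Note that a plain `x < 0 ? -x : x` would not be equivalent, because it leaves `-0.0` with its sign bit set.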
[Bug middle-end/108465] New: Optimize (a < b) == (b < a) to a == b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108465

Bug ID: 108465
Summary: Optimize (a < b) == (b < a) to a == b
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

For GCC 12 the following code with -O2:

int compare_eq(int a, int b) {
  return ((a < b) == (b < a));
}

compiles into the following assembly:

compare_eq(int, int):
    cmp edi, esi
    setl dl
    setg al
    cmp dl, al
    sete al
    movzx eax, al
    ret

Which is suboptimal. More optimal assembly would be:

compare_eq(int, int):
    xor eax, eax
    cmp edi, esi
    sete al
    ret

Godbolt Playground: https://godbolt.org/z/4sfcTjjjb

Motivation: in generic C++ code the comparison is often done via a functor. The algorithm is only allowed to use that functor:

if (__comp(a, b) == __comp(b, a)) {
  return x;
} else if (__comp(b, a)) {
  return y;
}

Because of that, with the inlined functor the comparison becomes ((a < b) == (b < a))
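For integers the reasoning is short: if `a == b`, both comparisons are false and therefore equal; otherwise exactly one of them is true and they differ. A small sketch that checks this over a range of values (`compare_eq_simple` and the helper are illustrative names):

```cpp
#include <cassert>

// The form from the report.
constexpr int compare_eq(int a, int b) { return ((a < b) == (b < a)); }

// The simplified form.
constexpr int compare_eq_simple(int a, int b) { return a == b; }

static_assert(compare_eq(2, 2) == 1, "equal values: both comparisons are false");
static_assert(compare_eq(1, 2) == 0, "different values: exactly one is true");

// Hypothetical helper: compare both forms on a small grid of values.
inline bool compare_forms_agree() {
  for (int a = -8; a <= 8; ++a)
    for (int b = -8; b <= 8; ++b)
      if (compare_eq(a, b) != compare_eq_simple(a, b)) return false;
  return true;
}

inline void check_compare_forms() { assert(compare_forms_agree()); }
```

The identity relies on a total order, so it does not carry over to floating point values that may be NaN.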
[Bug libstdc++/71579] type_traits miss checks for type completeness in some traits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71579 --- Comment #22 from Antony Polukhin --- > Maybe we should consider dropping all the static assertions from traits that > are implemented using a compiler built-in. Sounds like the right thing to do. > Our type trait and the __has_virtual_destructor built-in both seem to get > this wrong, rejecting Incomplete[2], which is not a class type, and so > doesn't need to be complete (or maybe the precondition is wrong and there's a > library defect?) The library precondition seems right. As I read it, the trait just checks for `virtual` on the destructor. If there's no destructor - it is fine, no `virtual` on it.
[Bug libstdc++/104361] New: Biased Reference Counting for the standard library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104361

Bug ID: 104361
Summary: Biased Reference Counting for the standard library
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: antoshkka at gmail dot com
Target Milestone: ---

There is a paper titled "Biased Reference Counting: Minimizing Atomic Operations in Garbage Collection" that shows how to speed up reference counting on some platforms by more than 20% https://dl.acm.org/doi/pdf/10.1145/3243176.3243195 .

The paper does not discuss speeding up C++, but it is based on the observation that most objects are only accessed by a single thread, which allows most RC operations to be performed non-atomically. That observation fits std::shared_ptr usage patterns.

Such a change seems to be an ABI break for shared_ptr; however, maybe it could be used for stop_token and other new reference counted types.
[Bug c++/103745] New: Warn on throwing an exception not derived from std::exception
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103745 Bug ID: 103745 Summary: Warn on throwing an exception not derived from std::exception Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Throwing an exception that is derived from std::exception is a common practice. Cases when that practice should be skipped are very rare. However, many beginners do not know about that and erroneously do not derive their exceptions from std::exception. There are also cases when classes have close names and users throw the wrong type by a typo. Please add a warning about throwing an exception not derived from std::exception. Godbolt playground: https://godbolt.org/z/7Phf3nafW
[Bug tree-optimization/19661] unnecessary atexit calls emitted for static objects with empty destructors
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19661 --- Comment #10 from Antony Polukhin --- Any progress? Multiple compilers already eliminate the atexit call. Moreover, some of the compilers even eliminate the guard variable after that https://godbolt.org/z/dbdfMrroa Note that the atexit elimination would benefit libstdc++, as the latter now uses a bunch of constant_init instances that have an empty destructor in libstdc++-v3/src/c++17/memory_resource.cc and libstdc++-v3/src/c++11/system_error.cc . It would be possible to eliminate the atexit calls for those cases and speed up startup times https://godbolt.org/z/MKaWKevzq
[Bug middle-end/101253] New: Optimize i % C1 == C0 || i % C1*C2 == C0 to i % C1 == C0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101253 Bug ID: 101253 Summary: Optimize i % C1 == C0 || i % C1*C2 == C0 to i % C1 == C0 Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Consider the following code bool test_naive(short i) { return i % 100 == 0 || i % 400 == 0; } It could be optimized into bool test_optim(short i) { return i % 100 == 0; } Godbolt playground: https://godbolt.org/z/zW49qcs7G P.S.: Inspired by the manual optimizations in libstdc++ https://github.com/gcc-mirror/gcc/commit/b92d12d3fe3f1aa56d190d960e40c62869a6cfbb
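The second test is redundant because 400 is a multiple of 100: any `i` with `i % 400 == 0` also has `i % 100 == 0`. Since `short` has a small range, the equivalence can be checked exhaustively; the helper name below is illustrative.

```cpp
#include <cassert>
#include <climits>

// The two forms from the report.
constexpr bool test_naive(short i) { return i % 100 == 0 || i % 400 == 0; }
constexpr bool test_optim(short i) { return i % 100 == 0; }

// Hypothetical helper: compare both forms for every value of short.
inline bool mod_forms_agree_for_all_shorts() {
  for (int v = SHRT_MIN; v <= SHRT_MAX; ++v) {
    const short i = static_cast<short>(v);
    if (test_naive(i) != test_optim(i)) return false;
  }
  return true;
}

inline void check_mod_forms() { assert(mod_forms_agree_for_all_shorts()); }
```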
[Bug middle-end/101252] New: Optimize (b ? i % C0 : i % C1) into i & (b ? C0-1 : C1-1) for power of 2 C0 and C1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101252 Bug ID: 101252 Summary: Optimize (b ? i % C0 : i % C1) into i & (b ? C0-1 : C1-1) for power of 2 C0 and C1 Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Consider the following code: bool test_naive0(bool b, short i) { return (b ? i % 4 : i % 16)==0; } It could be optimized into: bool test_optim0(bool b, short i) { return (i & (b ? 3 : 15))==0; } Godbolt playground: https://godbolt.org/z/8vj999M3c P.S.: Inspired by the manual optimizations in libstdc++ https://github.com/gcc-mirror/gcc/commit/b92d12d3fe3f1aa56d190d960e40c62869a6cfbb
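A quick exhaustive check of this transformation, with the masks following the C0-1 / C1-1 pattern from the summary (3 for modulus 4 when `b` is true, 15 for modulus 16 otherwise). The `check_masks` helper is an illustrative addition:

```cpp
#include <cassert>
#include <limits>

bool test_naive0(bool b, short i) { return (b ? i % 4 : i % 16) == 0; }
bool test_optim0(bool b, short i) { return (i & (b ? 3 : 15)) == 0; }

// Compares both functions for both values of b and every short value.
void check_masks() {
    for (int v = std::numeric_limits<short>::min();
         v <= std::numeric_limits<short>::max(); ++v) {
        const short i = static_cast<short>(v);
        assert(test_naive0(true, i) == test_optim0(true, i));
        assert(test_naive0(false, i) == test_optim0(false, i));
    }
}
```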
[Bug middle-end/101251] New: Optimize i % (b ? C0 : C1) into i & (b ? C0-1 : C1-1) for power of 2 C0 and C1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101251 Bug ID: 101251 Summary: Optimize i % (b ? C0 : C1) into i & (b ? C0-1 : C1-1) for power of 2 C0 and C1 Product: gcc Version: 12.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Consider the following code: bool test_optim01(bool b, short i) { return i % (b ? 4 : 16)==0; } It could be optimized into: bool test_optim0(bool b, short i) { return (i & (b ? 3 : 15))==0; } Godbolt playground: https://godbolt.org/z/j15br4Kd4 P.S.: Inspired by the manual optimizations in libstdc++ https://github.com/gcc-mirror/gcc/commit/b92d12d3fe3f1aa56d190d960e40c62869a6cfbb
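One detail worth spelling out: for a signed `i`, the remainder and the masked value are not equal in general, they only agree on being zero or non-zero. So the rewrite is valid for the `== 0` form in the example, not for the raw value. A sketch with hypothetical helper names (`rem`, `masked`, `check_zero_tests`):

```cpp
#include <cassert>
#include <limits>

int rem(bool b, short i)    { return i % (b ? 4 : 16); }
int masked(bool b, short i) { return i & (b ? 3 : 15); }

void check_zero_tests() {
    // For negative inputs the raw values differ: the remainder keeps the
    // sign of the dividend, the mask yields a non-negative value.
    assert(rem(true, -1) == -1);
    assert(masked(true, -1) == 3);

    // The comparison with zero nevertheless agrees for every short value.
    for (int v = std::numeric_limits<short>::min();
         v <= std::numeric_limits<short>::max(); ++v) {
        const short i = static_cast<short>(v);
        assert((rem(true, i) == 0) == (masked(true, i) == 0));
        assert((rem(false, i) == 0) == (masked(false, i) == 0));
    }
}
```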
[Bug c++/58487] Missed return value optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58487 Antony Polukhin changed: What|Removed |Added CC||antoshkka at gmail dot com --- Comment #3 from Antony Polukhin --- Minimized example, move constructor should not be called: struct A { A() = default; A(A&&); }; A test() { if (true) { A a; return a; } else { return A{}; } } Godbolt playground: https://godbolt.org/z/4Pzq83WWY
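To observe the behaviour at run time rather than in the assembly, the minimized example can be given a counting move constructor. This is a sketch: the `moves` counter, `make_a` and `check_moves` are additions for illustration, and the count is compiler dependent (0 when the named return value optimization applies, 1 when the move constructor is called for `return a;`).

```cpp
#include <cassert>

struct A {
    A() = default;
    A(A&&) noexcept { ++moves; }  // counts how often the move constructor runs
    static int moves;
};
int A::moves = 0;

// Same shape as the minimized example: a named local returned from one
// branch, a prvalue returned from the other.
A make_a() {
    if (true) {
        A a;
        return a;
    } else {
        return A{};
    }
}

void check_moves() {
    A::moves = 0;
    A result = make_a();
    (void)result;
    // 0 with NRVO; 1 if the compiler falls back to the move constructor.
    assert(A::moves <= 1);
}
```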
[Bug c++/58050] No return value optimization when calling static function through unnamed temporary
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58050 Antony Polukhin changed: What|Removed |Added CC||antoshkka at gmail dot com --- Comment #1 from Antony Polukhin --- This was fixed in GCC-10.1 https://godbolt.org/z/b4ohfnK3x
[Bug c++/100746] New: NRVO should not introduce aliasing
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100746 Bug ID: 100746 Summary: NRVO should not introduce aliasing Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Consider the example: struct NrvoPassed { NrvoPassed() = default; NrvoPassed(const NrvoPassed&); NrvoPassed(NrvoPassed&&); int i = 0; }; auto test(int* data) { NrvoPassed x; *data = 3; if (x.i != 0) __builtin_abort(); return x; } Resulting assembly contains a call to `abort`: test(int*): mov DWORD PTR [rdi], 0 mov DWORD PTR [rsi], 3 mov edx, DWORD PTR [rdi] test edx, edx jne .L3 mov rax, rdi ret test(int*) [clone .cold]: .L3: push rax call abort The optimizer thinks that the value of `x.i` is aliased by `data`; however, `x` is a local variable and its address could not leak before the object is constructed. Some other compilers already have the proposed optimization: https://godbolt.org/z/aqdveadnE Adding `__restrict` to `data` fixes the codegen: test2(int*): mov DWORD PTR [rdi], 0 mov rax, rdi mov DWORD PTR [rsi], 3 ret Probably `__restrict` should always be added to the storage address passed for NRVO.
[Bug libstdc++/89120] std::minmax_element 2.5 times slower than hand written loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89120 --- Comment #2 from Antony Polukhin --- Long story short: I've found no way to improve the standard library code to always work faster. I'm in favor of closing this ticket as invalid/won't fix. Long story: I've tried to add a specialization of the minmax_element algorithm for std::less comparators and arithmetic types. That specialization was doing more comparisons but in a more predictable way. On big datasets the performance increased, but it decreased on small datasets. Then I've tried another approach. If the comparison of __first with __next is barely predictable, then just avoid branching on it. Portable solution: bool __b = __comp(__next, __first); _ForwardIterator __pots[3] = {__first, __next, __first}; _ForwardIterator __pot_min = *(__pots + __b); _ForwardIterator __pot_max = *(__pots + __b + 1); Special case for random access iterators: bool __b = __comp(__next, __first); _ForwardIterator __pot_min = __first, __pot_max = __next; __pot_min += __b; __pot_max -= __b; Unfortunately both those approaches add some overhead for small datasets.
Another disadvantage is that those approaches produce opposite results on different compilers.

GCC-9 performance gets better on big datasets:

Benchmark                        Time          CPU           Iterations
naive_minmax/2                   3 ns          3 ns          247522237
naive_minmax/8                   7 ns          7 ns          103044422
naive_minmax/262144              1715635 ns    1710406 ns    407
naive_minmax/1048576             6970755 ns    6947034 ns    101
branchless_minmax/2              8 ns          8 ns          81324904
branchless_minmax/8              30 ns         30 ns         23494608
branchless_minmax/262144         457287 ns     456412 ns     1529
branchless_minmax/1048576        4267914 ns    4219969 ns    363

Clang-9 performance degrades on big datasets:

Benchmark                        Time          CPU           Iterations
naive_minmax/2                   2 ns          2 ns          380928404
naive_minmax/8                   7 ns          7 ns          92642970
naive_minmax/262144              262921 ns     262288 ns     2630
naive_minmax/1048576             1149407 ns    1147626 ns    618
branchless_minmax/2              2 ns          2 ns          307146020
branchless_minmax/8              10 ns         10 ns         74417142
branchless_minmax/262144         425880 ns     425241 ns     1637
branchless_minmax/1048576        1747785 ns    1745725 ns    397

Final attempt. Different compilers optimize the algorithm differently. Clang shows good performance on big datasets with >4k elements, GCC on medium-sized datasets with 128-1k elements. Maybe providing more info on probabilities could help both compilers to produce better code. But it looks like the heuristics already deduce the probabilities to be close to 0.5: __builtin_expect_with_probability(__b, true, 0.5) changed nothing in the assembly https://godbolt.org/z/PqWoaKfhW
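For readers who want to try the portable variant, here is a self-contained sketch of the pairwise selection through a three-element array. It is not the patch that was benchmarked: `branchless_minmax_sketch` and `minmax_values` are hypothetical names, only `operator<` is supported, and tie-breaking is not guaranteed to match `std::minmax_element` exactly.

```cpp
#include <cassert>
#include <iterator>
#include <utility>
#include <vector>

// Processes elements in pairs. Within a pair, the candidate minimum and
// maximum are picked by indexing an array with the comparison result
// instead of branching on it.
template <class ForwardIt>
std::pair<ForwardIt, ForwardIt> branchless_minmax_sketch(ForwardIt first, ForwardIt last) {
    if (first == last) return {last, last};
    ForwardIt min_it = first, max_it = first;
    ++first;
    while (first != last) {
        ForwardIt next = std::next(first);
        if (next == last) {  // odd tail: a single element is left
            if (*first < *min_it) min_it = first;
            else if (!(*first < *max_it)) max_it = first;
            break;
        }
        const bool b = *next < *first;
        ForwardIt pots[3] = {first, next, first};
        ForwardIt pot_min = pots[b];      // b == false: first, b == true: next
        ForwardIt pot_max = pots[b + 1];  // b == false: next,  b == true: first
        if (*pot_min < *min_it) min_it = pot_min;
        if (!(*pot_max < *max_it)) max_it = pot_max;
        first = std::next(next);
    }
    return {min_it, max_it};
}

// Convenience wrapper; precondition: v is not empty.
std::pair<int, int> minmax_values(const std::vector<int>& v) {
    assert(!v.empty());
    auto r = branchless_minmax_sketch(v.begin(), v.end());
    return {*r.first, *r.second};
}
```

The two remaining comparisons against the running minimum and maximum are still ordinary branches; as the benchmark numbers above suggest, whether this layout helps depends heavily on the compiler and the dataset size.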
[Bug c++/80542] Warn about accidental copying of data in range based for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80542 --- Comment #2 from Antony Polukhin --- This issue could be closed. GCC 11 has the required -Wrange-loop-construct warning: https://godbolt.org/z/343M6WMjb
[Bug libstdc++/99612] New: Remove "#pragma GCC system_header" from atomic file to warn on incorrect memory order
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99612 Bug ID: 99612 Summary: Remove "#pragma GCC system_header" from atomic file to warn on incorrect memory order Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- GCC has -Winvalid-memory-model that warns if a wrong memory model is used with an atomic: auto ret = a.load(std::memory_order_release); // warning a.store(10, std::memory_order_acquire); // warning Unfortunately, that warning does not work by default, because the <atomic> header has a "#pragma GCC system_header" in it. The only way to get the warning is to use -Wsystem-headers, which unleashes all the warnings from all system headers. Playground: https://godbolt.org/z/Wca5ef
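For contrast with the two flagged calls, a sketch of the valid pairing: a load may use relaxed, consume, acquire or seq_cst ordering, and a store may use relaxed, release or seq_cst. The wrapper names `load_value`, `store_value` and `check_atomic` are illustrative only.

```cpp
#include <atomic>
#include <cassert>

// Acquire is a valid ordering for a load (release and acq_rel are not).
int load_value(const std::atomic<int>& a) {
    return a.load(std::memory_order_acquire);
}

// Release is a valid ordering for a store (consume, acquire and acq_rel are not).
void store_value(std::atomic<int>& a, int v) {
    a.store(v, std::memory_order_release);
}

void check_atomic() {
    std::atomic<int> a{0};
    store_value(a, 10);
    assert(load_value(a) == 10);
}
```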
[Bug middle-end/98817] Optimize if (a != b) a = b;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98817 --- Comment #5 from Antony Polukhin --- Please, close as invalid
[Bug middle-end/98817] Optimize if (a != b) a = b;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98817 --- Comment #2 from Antony Polukhin --- (In reply to Jakub Jelinek from comment #1) > I'm not sure about this. Turning it into an unconditional store would mean > that the memory the reference points to must be writable, that might not be > always the case. Fair point. How about emitting cmov instead of cmp+je?
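The objection quoted above can be made concrete. In the sketch below (hypothetical names `assign_if_different`, `k_answer`, `call_on_const_object`), the reference is bound to a const object that an implementation may place in read-only memory. The call is well-defined as written because the values are equal and no store is executed; an unconditional store would write to that object.

```cpp
#include <cassert>

// Same shape as the function in the report.
void assign_if_different(int& result, int value) {
    if (result != value) {
        result = value;
    }
}

// A const object; it may live in a read-only section.
const int k_answer = 42;

bool call_on_const_object() {
    // Casting away const and only reading through the reference is allowed.
    // No write happens here because the values compare equal.
    assign_if_different(const_cast<int&>(k_answer), 42);
    return k_answer == 42;
}
```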
[Bug middle-end/98817] New: Optimize if (a != b) a = b;
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98817 Bug ID: 98817 Summary: Optimize if (a != b) a = b; Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Consider the example: void arithmetic(int& result, int value) { if (result != value) { result = value; } } GCC generates the following assembly: arithmetic(int&, int): cmp DWORD PTR [rdi], esi je .L1 mov DWORD PTR [rdi], esi .L1: ret The assembly seems suboptimal, because 1) cmov could be used 2) conditional jump could be totally removed, reducing the binary size and leaving only one mov instruction: arithmetic(int&, int): mov DWORD PTR [rdi], esi ret Godbolt playground https://godbolt.org/z/Pdz7eP with above sample and std::vector::clear() sample that would also benefit from the above optimization.
[Bug c++/98814] New: Add fix-it hints for missing asterisk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98814 Bug ID: 98814 Summary: Add fix-it hints for missing asterisk Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Iterators and pointers are quite common in C++ code, but newbies tend to forget to dereference them: struct my_vector { void push_back(int); }; struct my_iterator { int operator*(); }; void sample(my_vector& vec, my_iterator it) { vec.push_back(it); } A fix-it hint would be helpful for cases when no matching function is found but the argument has an operator*() that returns a matching type. More examples via godbolt playground: https://godbolt.org/z/dsrqj8
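What the requested fix-it hint would presumably suggest is inserting the `*`. A sketch of the corrected call, using `std::vector` in place of the report's `my_vector` and `my_iterator` so that it links and runs (`sample_fixed` and `check_sample_fixed` are hypothetical names):

```cpp
#include <cassert>
#include <vector>

// Corrected form: the iterator is dereferenced before being passed on.
void sample_fixed(std::vector<int>& vec, std::vector<int>::const_iterator it) {
    vec.push_back(*it);
}

void check_sample_fixed() {
    const std::vector<int> source{7};
    std::vector<int> destination;
    sample_fixed(destination, source.begin());
    assert(destination.size() == 1 && destination[0] == 7);
}
```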
[Bug c++/98768] New: Improve diagnostics for incorrect result type checking "-> Type" in concepts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98768 Bug ID: 98768 Summary: Improve diagnostics for incorrect result type checking "-> Type" in concepts Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Consider the example: template concept Callable0Arg = requires(Function func) { func() -> T; }; The expression "-> T" is valid only if "func()" returns a pointer to a type that has a member "T". At the same time there is an unused "typename T" in the concept definition. For such cases a warning like "Unused `T` in concept definition. Did you mean `-> std::same_as<T>`" would be really helpful.
[Bug c++/98767] New: Function signature lost in concept diagnostic message
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98767 Bug ID: 98767 Summary: Function signature lost in concept diagnostic message Product: gcc Version: 11.0 Status: UNCONFIRMED Keywords: diagnostic Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: antoshkka at gmail dot com Target Milestone: --- Consider the example: template concept Callable1Arg = requires(Function func, T value) { func(value); }; // Should fail and fails: static_assert(Callable1Arg); The diagnostic has the following line: "in requirements with 'Function func', 'T value' [with T = bool; Function = int (*)()]" However, the type of Function is "int (*)(int*)", not "int (*)()". Godbolt playground: https://godbolt.org/z/afKqq5
[Bug tree-optimization/78427] missed optimization of loop condition
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78427 Antony Polukhin changed: What|Removed |Added CC||antoshkka at gmail dot com --- Comment #4 from Antony Polukhin --- Any progress?