[Bug target/114943] New: X86 AVX2: inefficient code generated to convert SIMD Vectors
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114943

            Bug ID: 114943
           Summary: X86 AVX2: inefficient code generated to convert SIMD Vectors
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

In the example below (see https://godbolt.org/z/qnfT4fE5G), covert and covert3 produce code that looks inefficient to me compared with covert2 (and with clang) for target x86-64-v3.

#define VECTOR_EXT(N) __attribute__((vector_size(N)))
typedef float VECTOR_EXT(16) float32x4_t;
typedef double VECTOR_EXT(32) float64x4_t;

float32x4_t f1,f2,f3,f4,f;
float64x4_t d1,d2,d3,d4,d;

void covert() {
  for (int i=0;i<4;++i) {
    d1[i] = f1[i];
    d2[i] = f2[i];
    d3[i] = f3[i];
    d4[i] = f4[i];
  }
}

void covert2() {
  for (int i=0;i<4;++i) d1[i] = f1[i];
  for (int i=0;i<4;++i) d2[i] = f2[i];
  for (int i=0;i<4;++i) d3[i] = f3[i];
  for (int i=0;i<4;++i) d4[i] = f4[i];
}

void covert3() {
  d1 = __builtin_convertvector(f1,float64x4_t);
}
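
To compare the two styles of widening in isolation, here is a minimal sketch, assuming the GCC/clang vector extensions used in the report. The helper names widen_loop, widen_builtin and same_lanes are hypothetical and are not part of the report; vectors are passed by reference only to avoid ABI warnings when AVX is not enabled. It only checks that both forms compute the same lanes and says nothing about the generated code.

```cpp
// GCC/clang vector extensions, same element types and sizes as in the report.
typedef float float32x4_t __attribute__((vector_size(16)));
typedef double float64x4_t __attribute__((vector_size(32)));

// Element-wise widening, as in the loop-based functions of the report.
void widen_loop(const float32x4_t& f, float64x4_t& d) {
  for (int i = 0; i < 4; ++i) d[i] = f[i];
}

// Whole-vector widening through the builtin, as in covert3().
void widen_builtin(const float32x4_t& f, float64x4_t& d) {
  d = __builtin_convertvector(f, float64x4_t);
}

// Hypothetical helper: lane-by-lane comparison of two double vectors.
bool same_lanes(const float64x4_t& a, const float64x4_t& b) {
  for (int i = 0; i < 4; ++i)
    if (a[i] != b[i]) return false;
  return true;
}
```

Calling widen_loop and widen_builtin on the same input and passing the two results to same_lanes is enough to confirm that the forms are interchangeable; the difference reported here is only in the instructions chosen.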
[Bug target/114484] #include <xmmintrin.h> changes ::abs in std::abs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

--- Comment #9 from vincenzo Innocente ---
We observe that including xmmintrin.h changes the behaviour of some code, notably abs(x) when x is float or double. This also depends on the platform, as xmmintrin.h is x86_64 specific.

Yes, it has been like that for 20 years, and people have always wondered why abs(x) was behaving differently in different parts of the code; now they are asking why it behaves differently on x86_64 and ARM.

The workaround is obvious: use std::abs.

I personally find it very uncomfortable that including xmmintrin.h (even through a cascade of includes) changes the behaviour of "abs(x)".
If everybody on the GCC side is comfortable with this situation, we will just take note and try to be stricter with visual inspection of the code.
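
The practical difference between the two meanings of abs(x) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the conversion to int is written out explicitly so that the result does not depend on which headers happen to be included.

```cpp
#include <cmath>
#include <cstdlib>

// std::abs has floating-point overloads, so the fractional part survives.
double abs_as_double(double x) { return std::abs(x); }

// Emulates the case where only the C declaration `int abs(int)` is visible:
// the argument is converted to int first, then the absolute value is taken.
int abs_as_int(double x) { return std::abs(static_cast<int>(x)); }
```

With x = -0.5 the first function returns 0.5 and the second returns 0, which is the kind of silent change described in the comment.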
[Bug target/114484] #include <xmmintrin.h> changes ::abs in std::abs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484 --- Comment #4 from vincenzo Innocente --- in C++ one is supposed to #include not I do not think that there is an explicit version of C++ headers for the intrinsics that avoids the conflicts between C and C++.
[Bug c++/114484] #include <xmmintrin.h> changes ::abs in std::abs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484 --- Comment #2 from vincenzo Innocente --- *** Bug 114483 has been marked as a duplicate of this bug. ***
[Bug c++/114483] #include <xmmintrin.h> changes ::abs in std::abs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114483

vincenzo Innocente changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |DUPLICATE

--- Comment #1 from vincenzo Innocente ---
please close this

*** This bug has been marked as a duplicate of bug 114484 ***
[Bug c++/114484] #include <xmmintrin.h> changes ::abs in std::abs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

--- Comment #1 from vincenzo Innocente ---
xmmintrin.h includes mm_malloc.h, which does #include <stdlib.h>, which in turn has
using std::abs;
(among others).

See https://godbolt.org/z/cxo65rnr9
or this excerpt from a c++ -E dump:
```
# 32 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/xmmintrin.h" 2 3 4
# 1 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/mm_malloc.h" 1 3 4
# 27 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/mm_malloc.h" 3 4
# 1 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/stdlib.h" 1 3 4
# 36 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/stdlib.h" 3 4
# 1 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/cstdlib" 1 3 4
# 39 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/cstdlib" 3 4
# 40 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/cstdlib" 3
# 37 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/stdlib.h" 2 3 4

using std::abort; using std::atexit; using std::exit; using std::at_quick_exit;
using std::quick_exit; using std::div_t; using std::ldiv_t; using std::abs;
using std::atof; using std::atoi; using std::atol; using std::bsearch;
using std::calloc; using std::div; using std::free; using std::getenv;
using std::labs; using std::ldiv; using std::malloc; using std::mblen;
using std::mbstowcs; using std::mbtowc; using std::qsort; using std::rand;
using std::realloc; using std::srand; using std::strtod; using std::strtol;
using std::strtoul; using std::system; using std::wcstombs; using std::wctomb;
# 28 "/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/mm_malloc.h" 2 3 4
```
[Bug c++/114484] New: #include <xmmintrin.h> changes ::abs in std::abs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

            Bug ID: 114484
           Summary: #include <xmmintrin.h> changes ::abs in std::abs
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---
[Bug c++/114483] New: #include <xmmintrin.h> changes ::abs in std::abs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114483

            Bug ID: 114483
           Summary: #include <xmmintrin.h> changes ::abs in std::abs
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---
[Bug tree-optimization/114363] inconsistent optimization of pow(x,2)+pow(y,2)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114363 --- Comment #4 from vincenzo Innocente --- Thanks Harald, I missed the point that float z = pow(double(x),2) and float z = x*x would indeed produce exactly the same result, while in all other cases of course not.
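
The point can be illustrated with a small sketch (function names are hypothetical). The product of two float values has at most 48 significant bits, so it is exact in double, and rounding that exact product to float gives the same value as the float multiplication. For a sum of squares, the double version rounds to float once at the end while the float version rounds after every operation, so the two may differ and the compiler cannot replace one with the other by default.

```cpp
#include <cmath>

// Square computed in double through pow, then rounded once to float.
float square_via_pow(float x) {
  return static_cast<float>(std::pow(static_cast<double>(x), 2));
}

// Square computed directly in float.
float square_direct(float x) { return x * x; }

// Sum of squares in double, rounded to float only at the end.
float sumsq_in_double(float x, float y) {
  const double dx = x, dy = y;
  return static_cast<float>(dx * dx + dy * dy);
}

// Sum of squares entirely in float: each operation rounds to float.
float sumsq_in_float(float x, float y) { return x * x + y * y; }
```

For inputs such as 3.0f and 4.0f every intermediate value is exactly representable, so all variants agree; differences in the sum only show up when the intermediate float roundings lose bits.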
[Bug tree-optimization/114363] New: inconsistent optimization of pow(x,2)+pow(y,2)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114363

            Bug ID: 114363
           Summary: inconsistent optimization of pow(x,2)+pow(y,2)
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

While pow(x,2) is optimized into x*x (for float x), in pow(x,2)+pow(y,2) x and y are first promoted to double, which I find inconsistent.
See https://godbolt.org/z/rYfoaxr89
[Bug libstdc++/112649] New: [c++23] in presence of inline functions and debug-info stacktrace reports the deepest callee
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112649

            Bug ID: 112649
           Summary: [c++23] in presence of inline functions and debug-info stacktrace reports the deepest callee
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

Created attachment 56657
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56657=edit
a small demo demonstrating the description.

Feature or defect? Or a missing feature in std::stacktrace...

What I find disturbing is that the "symbol name" is different for the very same pc depending on whether the code has been compiled with "-g" or not: with debug-info it is set to the deepest callee, without it to the outermost caller.

Maybe it is an issue for the libstd committee?

DEMO:
just compile the attached demo program with
c++ -std=c++23 stackTraceDemo.cpp -lstdc++exp -O2 -DINCLUDE='' -g
and run it, then do the same without -g.
One can also run gdb to compare with the demo output.
```
gdb ./a.out
b instrumentedFunc
run
where
```

Details:
in libstdc++-v3/src/c++23/stacktrace.cc

  bool
  stacktrace_entry::_Info::_M_populate(native_handle_type pc)
  {
    auto cb = [](void* self, uintptr_t, const char* filename, int lineno,
                 const char* function) -> int
    {
      auto& info = *static_cast<_Info*>(self);
      info._M_set_desc(function);
      info._M_set_file(filename);
      if (info._M_line)
        *info._M_line = lineno;
      return function != nullptr;
    };
    const auto state = init();
    if (::__glibcxx_backtrace_pcinfo(state, pc, +cb, err_handler, this))
      return true;

According to the documentation of __glibcxx_backtrace_pcinfo:

/* Given PC, a program counter in the current program, call the
   callback function with filename, line number, and function name
   information.  This will normally call the callback function exactly
   once.  However, if the PC happens to describe an inlined call, and
   the debugging information contains the necessary information, then
   this may call the callback function multiple times.  This will make
   at least one call to either CALLBACK or ERROR_CALLBACK.  This
   returns the first non-zero value returned by CALLBACK, or 0.  */

From my tests, the last sentence means that if the callback returns 0 it may be called again. The callback in the current implementation returns non-zero as soon as a function name is available, so it will be called just once even in the presence of inline functions, and therefore the stacktrace entry will be set to the deepest callee.
If one waits until the last call (always returning "false"), one will be able to set the entry to the outermost caller or even record the full call chain (as GDB does). This last option does not seem to fit the std::stacktrace interface.

--
here is the output of the demo (I prefer to print the stacktrace reversed)
#  is from the stacktrace entry
>> is from __glibcxx_backtrace_pcinfo returning always "false"

[innocent@patatrack01 demos]$ c++ -std=c++23 stackTraceDemo.cpp -lstdc++exp -O2 -DINCLUDE=''; ./a.out
#0 0x :0
#1 0x40164d _start :0
#2 0x7f4412f23d84 __libc_start_main :0
#3 0x40159a main :0
#4 0x401eeb func(int) :0
#5 0x401ab0 instrumentedFunc(int) :0
10
[innocent@patatrack01 demos]$ c++ -std=c++23 stackTraceDemo.cpp -lstdc++exp -O2 -DINCLUDE='' -g; ./a.out
#0 0x :0
#1 0x40164d _start :0
#2 0x7ff80f90ed84 __libc_start_main :0
#3 0x40159a main /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:116
>> 1 main /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:116
#4 0x401eeb nestedFunc2(int) /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:101
>> 1 _Z11nestedFunc2i >> /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:101
>> 2 _Z10nestedFunci >> /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:106
>> 3 _Z4funci /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:112
#5 0x401ab0 instrumentedFunc(int) /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:91
>> 1 _Z16instrumentedFunci >> /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:91
10
[innocent@patatrack01 demos]$ gdb ./a.out
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.el8
...
Reading symbols from ./a.out...done.
(gdb) b instrumentedFunc
Breakpoint 1 at 0x401a90: file /afs/cern.ch/work/i/innocent/public/w5/include/c++/14.0.0/bits/new_allocator.h, line 88.
(gdb) run
Starting program: /data/user/innocent/MallocProfiler/demos/a.out

Breakpoint 1, instrumentedFunc (c=4) at /afs/cern.ch/work/i/innocent/public/w5/include/c++/14.0.0/bits/new_allocator.h:88
88      __new_allocator() _GLIBCXX_USE_NOEX
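
The effect of the callback's return value described in this report can be sketched without libbacktrace. The following is a simulation under stated assumptions, not the library's code: Frame, walk_pc, deepest_callee and outermost_caller are hypothetical names, and walk_pc only mimics the documented contract (records for one pc are delivered innermost first, and the walk stops at the first non-zero return value).

```cpp
#include <string>
#include <vector>

// One record per callback invocation for a single pc (hypothetical type):
// the deepest inlined callee comes first, the outermost caller last.
struct Frame {
  std::string function;
  int line;
};

// Mimics the documented contract: invoke cb for each record and stop at
// the first non-zero return value, which becomes the result.
template <typename Callback>
int walk_pc(const std::vector<Frame>& frames, Callback cb) {
  for (const Frame& f : frames) {
    if (int r = cb(f)) return r;
  }
  return 0;
}

// Returns non-zero immediately, so only the first record is seen:
// the entry ends up describing the deepest callee.
std::string deepest_callee(const std::vector<Frame>& frames) {
  std::string name;
  walk_pc(frames, [&](const Frame& f) { name = f.function; return 1; });
  return name;
}

// Always returns 0, so every record is visited and the last one wins:
// the entry ends up describing the outermost caller.
std::string outermost_caller(const std::vector<Frame>& frames) {
  std::string name;
  walk_pc(frames, [&](const Frame& f) { name = f.function; return 0; });
  return name;
}
```

Fed with the three records shown for entry #4 above (nestedFunc2, nestedFunc, func), the first helper yields nestedFunc2(int) and the second func(int), matching the two behaviours observed with and without -g.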
[Bug libstdc++/112348] [C++23] defect in struct hash<basic_stacktrace<_Allocator>>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112348

--- Comment #1 from vincenzo Innocente ---
This patch works for me

diff --git a/libstdc++-v3/include/std/stacktrace b/libstdc++-v3/include/std/stacktrace
index da0e48d3532..9a0d0b16068 100644
--- a/libstdc++-v3/include/std/stacktrace
+++ b/libstdc++-v3/include/std/stacktrace
@@ -797,7 +797,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
       size_t
       operator()(const basic_stacktrace<_Allocator>& __st) const noexcept
       {
-       hash __h;
+       hash __h;
        size_t __val = _Hash_impl::hash(__st.size());
        for (const auto& __f : __st)
          __val = _Hash_impl::__hash_combine(__h(__f), __val);
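
The loop in the patch follows a common pattern: hash the size of the sequence, then fold in the hash of every element, where the hash object must be callable on the element type. A generic sketch of that pattern is given below. It is not the libstdc++ code: combine and hash_sequence are hypothetical names, and the mixing formula is a widely used one chosen only for illustration (libstdc++ uses its internal _Hash_impl::__hash_combine).

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical combiner, for illustration only.
inline std::size_t combine(std::size_t seed, std::size_t h) {
  return seed ^ (h + 0x9e3779b9u + (seed << 6) + (seed >> 2));
}

// Hash a sequence by folding element hashes into a running value.
template <typename T>
std::size_t hash_sequence(const std::vector<T>& seq) {
  std::hash<T> h;  // hash object for the element type, not for the container
  std::size_t val = std::hash<std::size_t>{}(seq.size());
  for (const T& e : seq) val = combine(val, h(e));
  return val;
}
```

Equal sequences always produce equal hash values with this scheme, which is the only property a std::hash specialization is required to provide.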
[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #12 from vincenzo Innocente ---
I confirm that the patch solves the issue.

c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o liba.so -ldl;c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L. -la -Wl,-rpath=.; ./a.out
   0# nested_func2(int) at /data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:63
   1# nested_func(int) at /data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:93
   2# func(int) at /data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:101
   3# main at /data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:106
   4# __libc_start_main at :0
   5# _start at :0
   6#

What the last empty entry is, is a different story I suppose (not an issue at the moment).

Thanks again for the fast action
[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263 --- Comment #8 from vincenzo Innocente --- Thanks Ian for the patch. For testing I will need the full git diff (including the makefile itself as my autoconf is not compatible with gcc14). Backports down to gcc12 will be appreciated. Could you please notify here when the patch enters the various main branches?
[Bug libstdc++/112348] New: [C++23] defect in struct hash<basic_stacktrace<_Allocator>>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112348

            Bug ID: 112348
           Summary: [C++23] defect in struct hash<basic_stacktrace<_Allocator>>
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

gcc version 14.0.0 20231028 (experimental) [master r14-4988-g5d2a360f0a5] (GCC)

auto k = std::hash<std::stacktrace>()(std::stacktrace::current());

does not compile for me, with this error:

In instantiation of 'std::size_t std::hash >::operator()(const std::basic_stacktrace<_Allocator>&) const [with _Allocator = std::allocator; std::size_t = long unsigned int]':
testStacktrace.cpp:39:41:   required from here
   39 |auto k = std::hash<std::stacktrace>()(std::stacktrace::current());
      | ^~~~
/afs/cern.ch/work/i/innocent/public/w5/include/c++/14.0.0/stacktrace:803:49: error: no match for call to '(std::hash) (const std::stacktrace_entry&)'
  803 | __val = _Hash_impl::__hash_combine(__h(__f), __val);
      | ~~~^

I changed
  // hash __h;
  hash __h;
and it compiled.
(I suspect __f.native_handle() would work as well)

I am surprised it passed the tests.
[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #6 from vincenzo Innocente ---
Sorry, I have now made the (almost) full exercise: I read the documentation at https://en.cppreference.com/w/cpp/utility/stacktrace_entry and the code in the stacktrace header file and in libstdc++-v3/src/c++23/stacktrace.cc (I have not read the specification in the C++23 standard).

Indeed the entry implementation has just the handle as a data member, and the details are retrieved when the "Query" methods are called.
This appears to happen in
stacktrace_entry::_Info::_M_populate(native_handle_type pc)
which in turn calls
::__glibcxx_backtrace_pcinfo
and, if this fails, calls
::__glibcxx_backtrace_syminfo
so most probably the issue is in this last function, unless there is a problem with the logic in _M_populate that I failed to identify.
[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #5 from vincenzo Innocente ---
So if, in addition to
  std::cout << std::stacktrace::current() << '\n';
I add the following, I get what is needed:

  Dl_info dlinfo;
  for (auto & entry : std::stacktrace::current() ) {
    dladdr((const void*)(entry.native_handle()), &dlinfo);
    std::cout << dlinfo.dli_sname << ' ' << dlinfo.dli_fname <<'\n';
  }

c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o liba.so -ldl ; c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L. -la -Wl,-rpath=. ; ./a.out
   0#  at :0
   1#  at :0
   2# func(int) at /data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:44
   3# main at /data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:49
   4#  at :0
   5# _start at :0
   6#
_Z12nested_func2i ./liba.so
_Z11nested_funci ./liba.so

Of course the names are not de-mangled.

So is it a feature or a defect?
I'm not sure how the implementation works (I did not look at the code).
dladdr can be slow and may "hang" in some situations, so it would be useful to have an option such that the "name" is not immediately resolved, together with a function that returns the name from the native_handle "asynchronously".
[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #4 from vincenzo Innocente ---
Intel x86_64

uname -a
Linux patatrack01 4.18.0-477.13.1.el8_8.x86_64 #1 SMP Thu May 18 10:27:05 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

boost::backtrace works; I can provide an example.
[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

vincenzo Innocente changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |ian at gcc dot gnu.org
          Component|libstdc++                   |libbacktrace

--- Comment #2 from vincenzo Innocente ---
I suspect libbacktrace, even if I have no way to test it outside std::stacktrace.
[Bug libstdc++/112263] New: [C++23] std::stacktrace does not identify symbols in shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

            Bug ID: 112263
           Summary: [C++23] std::stacktrace does not identify symbols in shared library
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

Using
gcc version 14.0.0 20231028 (experimental) [master r14-4988-g5d2a360f0a5] (GCC)
which contains the fix for #111936.

This simple example [1], when run as a single executable, prints all symbols in the stacktrace. When the nested functions are in a shared library, their names are missing.

c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -DINLIB ; ./a.out
   0# nested_func2(int) at /afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:13
   1# nested_func(int) at /afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:18
   2# func(int) at /afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:26
   3# main at /afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:31
   4#  at :0
   5# _start at :0
   6#

c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o liba.so ; c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L. -la -Wl,-rpath=. ; ./a.out
   0#  at :0
   1#  at :0
   2# func(int) at /afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:26
   3# main at /afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:31
   4#  at :0
   5# _start at :0
   6#

[1]
cat testStacktrace.cpp
//compile and run with either
// c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -DINLIB; ./a.out
// or
// c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o liba.so;c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L. -la -Wl,-rpath=.; ./a.out
//
#include <iostream>
#include <stacktrace>

#ifdef INLIB
int nested_func2(int c)
{
  std::cout << std::stacktrace::current() << '\n';
  return c + 1;
}

int nested_func(int c)
{
  return nested_func2(c + 1);
}
#else
int nested_func(int c);
#endif

#ifdef INMAIN
int func(int b)
{
  return nested_func(b + 1);
}

int main()
{
  std::cout << func(777);
  return 0;
}
#endif
[Bug libstdc++/111936] std::stacktrace cannot be used in a shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

--- Comment #9 from vincenzo Innocente ---
Thanks for the second patch. I was indeed struggling with autoconf versions (1.15 vs 1.16).

Any chance of a backport to gcc12 (our current production version)?
[Bug libstdc++/111936] std::stacktrace cannot be used in a shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

--- Comment #7 from vincenzo Innocente ---
Not explicitly in the src tree. I only ran configure in the build directory.
What do I need to run in the src tree?
[Bug libstdc++/111936] std::stacktrace cannot be used in a shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

--- Comment #5 from vincenzo Innocente ---
My bad: I have not used archive libraries for a long time and forgot about the order rule.
The issue is indeed the missing -fPIC. Thanks for the fast action.

I applied the patch but it does not seem sufficient. If I understood correctly, this is where the ar lib is built:

ar rc .libs/libstdc++_libbacktrace.a std_stacktrace-atomic.o std_stacktrace-backtrace.o std_stacktrace-dwarf.o std_stacktrace-fileline.o std_stacktrace-posix.o std_stacktrace-sort.o std_stacktrace-simple.o std_stacktrace-state.o std_stacktrace-cp-demangle.o std_stacktrace-elf.o std_stacktrace-mmapio.o std_stacktrace-mmap.o

but those are the files compiled w/o -fPIC; those compiled with -fPIC are under .libs itself...

So I manually did
```
ar rc .libs/libstdc++_libbacktrace.a .libs/*.o ../c++23/stacktrace.o
```

and then locally

c++ -O3 -pthread -fPIC -shared -std=c++23 getStacktrace.cc /data/user/innocent/gcc_build/x86_64-pc-linux-gnu/libstdc++-v3/src/libbacktrace/.libs/libstdc++_libbacktrace.a -g -o mallocHook.so

and it runs:

setenv LD_PRELOAD ./mallocHook.so ; ./a.out ; unsetenv LD_PRELOAD
asked 4 at ###std::__new_allocator::allocate(unsigned long, void const*)#std::allocator_traits >::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator > >, int const&)#std::vector >::push_back(int const&)#go(int)#main##_start##
asked 8 at ###std::__new_allocator::allocate(unsigned long, void const*)#std::allocator_traits >::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator > >, int const&)#std::vector >::push_back(int const&)#go(int)#main##_start##
asked 16 at ###std::__new_allocator::allocate(unsigned long, void const*)#std::allocator_traits >::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator > >, int const&)#std::vector >::push_back(int const&)#go(int)#main##_start##
asked 32 at ###std::__new_allocator::allocate(unsigned long, void const*)#std::allocator_traits >::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator > >, int const&)#std::vector >::push_back(int const&)#go(int)#main##_start##
asked 64 at ###std::__new_allocator::allocate(unsigned long, void const*)#std::allocator_traits >::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator > >, int const&)#std::vector >::push_back(int const&)#go(int)#main##_start##
asked 128 at ###std::__new_allocator::allocate(unsigned long, void const*)#std::allocator_traits >::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator > >, int const&)#std::vector >::push_back(int const&)#go(int)#main##_start##
asked 256 at ###std::__new_allocator::allocate(unsigned long, void const*)#std::allocator_traits >::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator > >, int const&)#std::vector >::push_back(int const&)#go(int)#main##_start##
asked 512 at ###std::__new_allocator::allocate(unsigned long, void const*)#std::allocator_traits >::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator > >, int const&)#std::vector >::push_back(int const&)#go(int)#main##_start##
[Bug c++/111934] ICE internal compiler error: in discriminator_for_local_entity, at cp/mangle.cc:2065
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111934 --- Comment #3 from vincenzo Innocente --- with gcc version 14.0.0 20231024 (experimental) [master r14-4877-g724badcadf8] (GCC) I get the same ICE. Please note that one needs to include "iostream" (in my test compile with "-DICE") to trigger the ICE. w/o it just emits the syntax error as one would expect.
[Bug libstdc++/111936] std::stacktrace cannot be used in a shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

--- Comment #1 from vincenzo Innocente ---
Here is a minimal malloc hook that I would like to use:

[innocent@patatrack01 ctest]$ cat getStacktrace.cc
#include <stacktrace>
#include <string>

std::string get_stacktrace() {
  std::string trace;
  for (auto & entry : std::stacktrace::current() )
    trace += entry.description() + '#';
  return trace;
}

#include <cstdlib>
#include <iostream>
#include <malloc.h>

extern "C"
void * myMallocHook(size_t size, void const * caller) {
  // disable the hook while inside it to avoid recursion
  __malloc_hook = nullptr;
  auto p = malloc(size);
  std::cout << "asked " << size << " at " << get_stacktrace() << std::endl;
  __malloc_hook = myMallocHook;
  return p;
}

namespace {
  struct Hook {
    Hook() {
      __malloc_hook = myMallocHook;
    }
  };

  Hook hook;
}

compiled as
c++ -O3 -Wall -pthread -fPIC -shared -std=c++23 -lstdc++exp getStacktrace.cc

gives the undefined symbol

setenv LD_PRELOAD ./a.out ; ls ; unsetenv LD_PRELOAD
ls: symbol lookup error: ./a.out: undefined symbol: _ZNSt17__stacktrace_impl10_S_currentEPFiPvmES0_i
[Bug libstdc++/111936] New: std::stacktrace cannot be used in a shared library
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

            Bug ID: 111936
           Summary: std::stacktrace cannot be used in a shared library
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

I would like to use std::stacktrace in a shared library to be preloaded...
When I try to build the library, even for this minimal example

cat getStacktrace.cc
#include <stacktrace>

std::string get_stacktrace() {
  std::string trace;
  for (auto & entry : std::stacktrace::current() )
    trace += entry.description() + '#';
  return trace;
}

it fails:

c++ -O3 -Wall -pthread -fPIC -shared getStacktrace.cc -std=c++23 -lstdc++exp
/usr/bin/ld: /afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-fileline.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-posix.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-simple.o): relocation R_X86_64_32 against `.text' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-elf.o): relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-mmap.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-mmapio.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: /afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-dwarf.o): relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status

It silently compiles with

[innocent@patatrack01 ctest]$ c++ -O3 -Wall -pthread -fPIC -shared -std=c++23 -lstdc++exp getStacktrace.cc

but the symbols are undefined:

[innocent@patatrack01 ctest]$ ldd ./a.out
        linux-vdso.so.1 (0x7ffd50f73000)
        libstdc++.so.6 => /afs/cern.ch/user/i/innocent/w5/lib64/libstdc++.so.6 (0x7fa9437f8000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x7fa943476000)
        libgcc_s.so.1 => /afs/cern.ch/user/i/innocent/w5/lib64/libgcc_s.so.1 (0x7fa94324b000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x7fa94302b000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x7fa942c66000)
        /lib64/ld-linux-x86-64.so.2 (0x7fa943e68000)
[innocent@patatrack01 ctest]$ nm -C ./a.out | grep stack
0db0 T get_stacktrace[abi:cxx11]()
0be0 t get_stacktrace[abi:cxx11]() [clone .cold]
0d20 t std::basic_stacktrace >::current(std::allocator const&) [clone .isra.0]
     U std::stacktrace_entry::_Info::_M_populate(unsigned long)
1430 W std::stacktrace_entry::_Info::_S_set[abi:cxx11](void*, char const*)
     U std::__stacktrace_impl::_S_current(int (*)(void*, unsigned long), void*, int)
1310 W std::basic_stacktrace >::_M_prepare(unsigned short)::{lambda(void*, unsigned long)#1}::_FUN(void*, unsigned long)

and at run time (not this example, but my full application that invokes the stacktrace from a malloc hook) it (obviously) fails:

[innocent@patatrack01 ctest]$ c++ -O3 -Wall -pthread -fPIC -shared -std=c++23 -lstdc++exp mallocWrapper.cc
[innocent@patatrack01 ctest]$ setenv LD_PRELOAD ./a.out ; ls ; unsetenv LD_PRELOAD
Recoding structure constructed in a thread
ls: symbol lookup error: ./a.out: undefined symbol: _ZNSt17__stacktrace_impl10_S_currentEPFiPvmES0_i
[Bug c++/111934] ICE internal compiler error: in discriminator_for_local_entity, at cp/mangle.cc:2065
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111934 --- Comment #1 from vincenzo Innocente --- sorry missed the version gcc version 14.0.0 20231021 (experimental) [master r14-4817-g405a4140fc3] (GCC)
[Bug c++/111934] New: ICE internal compiler error: in discriminator_for_local_entity, at cp/mangle.cc:2065
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111934

            Bug ID: 111934
           Summary: ICE internal compiler error: in discriminator_for_local_entity, at cp/mangle.cc:2065
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

#ifdef ICE
#include <iostream>
#endif

struct Me {

  static Me & me() {
    thread_local auto me = std::make_unique_ptr<Me>();
    return *me;
  }

};

int main() {
  return 0;
}

c++ -O3 -Wall -pthread ice13.cpp
ice13.cpp: In static member function 'static Me& Me::me()':
ice13.cpp:8:33: error: 'make_unique_ptr' is not a member of 'std'
    8 |     thread_local auto me = std::make_unique_ptr<Me>();
      |                                 ^~~
ice13.cpp:8:51: error: expected primary-expression before '>' token
    8 |     thread_local auto me = std::make_unique_ptr<Me>();
      |                                                   ^
ice13.cpp:8:53: error: expected primary-expression before ')' token
    8 |     thread_local auto me = std::make_unique_ptr<Me>();
      |                                                     ^
==
ctest]$ c++ -O3 -Wall -pthread ice13.cpp -DICE
ice13.cpp: In static member function 'static Me& Me::me()':
ice13.cpp:8:33: error: 'make_unique_ptr' is not a member of 'std'
    8 |     thread_local auto me = std::make_unique_ptr<Me>();
      |                                 ^~~
ice13.cpp:8:51: error: expected primary-expression before '>' token
    8 |     thread_local auto me = std::make_unique_ptr<Me>();
      |                                                   ^
ice13.cpp:8:53: error: expected primary-expression before ')' token
    8 |     thread_local auto me = std::make_unique_ptr<Me>();
      |                                                     ^
ice13.cpp: At global scope:
ice13.cpp:8:23: internal compiler error: in discriminator_for_local_entity, at cp/mangle.cc:2065
    8 |     thread_local auto me = std::make_unique_ptr<Me>();
      |                       ^~
0x7de25d discriminator_for_local_entity
        ../../gcc_src/gcc/cp/mangle.cc:2065
0xb92a4a write_local_name
        ../../gcc_src/gcc/cp/mangle.cc:2164
0xb92a4a write_name
        ../../gcc_src/gcc/cp/mangle.cc:1071
0xb94e46 write_encoding
        ../../gcc_src/gcc/cp/mangle.cc:864
0xb94f5b write_mangled_name
        ../../gcc_src/gcc/cp/mangle.cc:810
0xb95740 mangle_decl_string
        ../../gcc_src/gcc/cp/mangle.cc:4092
0xb9592a get_mangled_id
        ../../gcc_src/gcc/cp/mangle.cc:4113
0xb9592a mangle_decl(tree_node*)
        ../../gcc_src/gcc/cp/mangle.cc:4151
0x16512bd decl_assembler_name(tree_node*)
        ../../gcc_src/gcc/tree.cc:715
0xe4a329 symbol_table::insert_to_assembler_name_hash(symtab_node*, bool)
        ../../gcc_src/gcc/symtab.cc:175
0xe4a48c symbol_table::symtab_initialize_asm_name_hash()
        ../../gcc_src/gcc/symtab.cc:267
0xe4ae84 symbol_table::symtab_initialize_asm_name_hash()
        ../../gcc_src/gcc/symtab.cc:1078
0xe4ae84 symtab_node::get_for_asmname(tree_node const*)
        ../../gcc_src/gcc/symtab.cc:1066
0xe5fc61 handle_alias_pairs
        ../../gcc_src/gcc/cgraphunit.cc:1528
0xe64fa7 symbol_table::finalize_compilation_unit()
        ../../gcc_src/gcc/cgraphunit.cc:2541
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
[Bug tree-optimization/109885] New: gcc does not generate movmskps and testps instructions (clang does)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109885 Bug ID: 109885 Summary: gcc does not generate movmskps and testps instructions (clang does) Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in this simple code (on avx2) int sum(float const * x) { int ret = 0; for (int i=0; i<8; ++i) ret +=(0==x[i]); return ret; } int one(float const * x) { int ret = 0; for (int i=0; i<8; ++i) ret |=(0==x[i]); return ret; } int all(float const * x) { int ret = 1; for (int i=0; i<8; ++i) ret &=(0==x[i]); return ret; } clang uses movmskps and testps instructions, gcc does not see for instance https://godbolt.org/z/r11r8xoYz
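The three reductions share one structure, which is what a movmskps-style lowering exploits: compare all eight lanes, pack one bit per lane into an integer mask, then reduce the mask. The following is only a portable scalar sketch of that formulation; the helper names (zero_mask, sum_mask, one_mask, all_mask) are hypothetical, and the code emulates the mask instead of using the actual instructions.

```cpp
#include <bitset>
#include <cstdint>

// Hypothetical helper: bit i is set when x[i] == 0, mimicking what a
// vector compare followed by a movmskps-style instruction would produce.
inline std::uint32_t zero_mask(float const* x) {
  std::uint32_t mask = 0;
  for (int i = 0; i < 8; ++i)
    mask |= static_cast<std::uint32_t>(x[i] == 0.0f) << i;
  return mask;
}

// Number of lanes equal to zero: population count of the mask.
inline int sum_mask(float const* x) {
  return static_cast<int>(std::bitset<8>(zero_mask(x)).count());
}

// At least one lane equal to zero: the mask is non-zero.
inline int one_mask(float const* x) { return zero_mask(x) != 0; }

// All eight lanes equal to zero: every mask bit is set.
inline int all_mask(float const* x) { return zero_mask(x) == 0xFFu; }
```

With this shape, "sum" becomes a popcount, "one" a test against zero and "all" a test against the full mask.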
[Bug c++/109281] New: use std::optional results in suboptimal code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109281 Bug ID: 109281 Summary: use std::optional results in suboptimal code Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- In the following (almost real) code gcc emits suboptimal code if std::optional is used w/r/t home made one and clang see https://godbolt.org/z/Pba51Ye7Y - code #include // #define USE_OPTIONAL #ifdef USE_OPTIONAL struct SubRingCrossings { SubRingCrossings(int ci, int ni, float nd) : closestIndex(ci), nextIndex(ni), nextDistance(nd) {} int closestIndex; int nextIndex; float nextDistance; }; #else struct SubRingCrossings { SubRingCrossings() : valid(false) {} SubRingCrossings(int ci, int ni, float nd) : valid(true), closestIndex(ci), nextIndex(ni), nextDistance(nd) {} bool valid; int closestIndex; int nextIndex; float nextDistance; }; #endif bool condition(); #ifdef USE_OPTIONAL std::optional foo() { if (condition()) { return std::nullopt; } return SubRingCrossings(1, 2, 3.14); } #else SubRingCrossings foo() { if (condition()) { return SubRingCrossings(); } return SubRingCrossings(1, 2, 3.14); } #endif int bar() { auto tmp = foo(); #ifdef USE_OPTIONAL if (tmp) { return tmp->closestIndex; #else if (tmp.valid) { return tmp.closestIndex; #endif } else { return 0; } }
[Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011 Bug ID: 109011 Summary: missed optimization in presence of __builtin_ctz Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in the following code foo does not vectorize, bar does. clang vectorize foo using a pattern that invokes vplzcntd (code made a bit complex to make vectorization "relevant") see https://godbolt.org/z/5fa1zbPeG #include uint32_t x[256]; uint32_t y[256]; uint32_t w[256]; uint32_t z[256]; void foo() { for (int i=0; i<256;i++) { auto p = x[i] ? __builtin_ctz(x[i]) : y[i]; z[i] = w[i]*p; } } void bar() { for (int j=0; j<256;j+=8) for (int i=j; i
[Bug tree-optimization/108804] New: missed vectorization in presence of conversion from uint64_t to float
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108804 Bug ID: 108804 Summary: missed vectorization in presence of conversion from uint64_t to float Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in the following code [1] foo does not vectorize, bar does. Compiled with -march=haswell -Ofast --no-math-errno -Wall; see https://godbolt.org/z/E6xzfavxc clang seems to do better. [1] #include <cstdint> uint64_t d[512]; //uint32_t f[1024]; float f[1024]; void foo() { for (int i=0; i<512; ++i) { uint64_t k = d[i]; auto x = (k & 0x007F) | 0x3F80; k = k >> 23; auto y = (k & 0x007F) | 0x3F80; f[i]=x; f[128+i] = y; } } void bar() { for (int i=0; i<512; ++i) { uint64_t k = d[i]; uint32_t x = (k & 0x007F); x |= 0x3F80; uint32_t y = k >> 23; y = (y & 0x007F) | 0x3F80; f[i]=x; f[128+i] = y; } }
[Bug tree-optimization/108677] wrong vectorization (when copy constructor is present?)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108677 --- Comment #3 from vincenzo Innocente --- Sorry, the original internal bug report was for gcc 7.5 ( https://godbolt.org/z/9crafbqen ), where I think the generated code is indeed wrong (and does not depend on the presence of the constructor!). So, if anything, the bug title should be changed to: does removing the constructor inhibit SLP vectorization?
[Bug tree-optimization/108677] New: wrong vectorization (when copy constructor is present?)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108677 Bug ID: 108677 Summary: wrong vectorization (when copy constructor is present?) Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in this real life code #include <cmath> struct trig_pair { double CosPhi; double SinPhi; trig_pair() : CosPhi(1.), SinPhi(0.) {} trig_pair(const trig_pair & tp) : CosPhi(tp.CosPhi), SinPhi(tp.SinPhi) {} trig_pair(const double C, const double S) : CosPhi(C), SinPhi(S) {} trig_pair(const double phi) : CosPhi(cos(phi)), SinPhi(sin(phi)) {} //Return trig_pair for angle increased by angle of tp. trig_pair Add(const trig_pair & tp) { return trig_pair(this->CosPhi*tp.CosPhi - this->SinPhi*tp.SinPhi, this->SinPhi*tp.CosPhi + this->CosPhi*tp.SinPhi); } }; trig_pair *TrigArr; void FillTrigArr(trig_pair tp, unsigned MaxM) { //Fill TrigArr with trig_pair(jp*phi) if (!TrigArr) return; TrigArr[1] = tp; for (unsigned jp = 2; jp <= MaxM; ++jp) TrigArr[jp] = TrigArr[jp-1].Add(tp); } gcc vectorizes the loop even though a loop-carried dependency is present...[1] It does not if I comment out the copy constructor...[2] [1] https://godbolt.org/z/vhPeh35n5 [2] https://godbolt.org/z/YPjdYdqG8
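To make the dependency explicit: each iteration reads the element written by the previous one, so the loop is a recurrence and cannot be vectorized across jp without being reformulated. The following is a minimal self-contained sketch that checks the sequential result against direct evaluation; the struct Pair and the functions add and fill are simplified, hypothetical stand-ins, not the reporter's class.

```cpp
#include <cmath>
#include <vector>

struct Pair { double c; double s; };

// Angle addition: returns the pair for angle(a) + angle(b).
inline Pair add(Pair a, Pair b) {
  return {a.c * b.c - a.s * b.s, a.s * b.c + a.c * b.s};
}

// Fill arr[1..maxM] so that arr[jp] holds (cos(jp*phi), sin(jp*phi)).
// arr[jp] depends on arr[jp-1], which was written one iteration earlier.
inline std::vector<Pair> fill(double phi, unsigned maxM) {
  std::vector<Pair> arr(maxM + 1, Pair{1.0, 0.0});
  if (maxM == 0) return arr;
  Pair tp{std::cos(phi), std::sin(phi)};
  arr[1] = tp;
  for (unsigned jp = 2; jp <= maxM; ++jp)
    arr[jp] = add(arr[jp - 1], tp);
  return arr;
}
```

Any transformation that computes several arr[jp] at once from stale values of arr[jp-1] would break the agreement with cos(jp*phi) and sin(jp*phi).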
[Bug target/106012] rsqrtps and rcpps instructions generated even if -fno-reciprocal-math specified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106012 --- Comment #6 from vincenzo Innocente --- just to confirm that -Ofast -fno-reciprocal-math -mno-recip seems to inhibit all reciprocals... https://godbolt.org/z/f4bccb9GP
[Bug c++/107933] New: std::sqrt complies in intrinsics for float even if --no-builtin is provided
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107933 Bug ID: 107933 Summary: std::sqrt complies in intrinsics for float even if --no-builtin is provided Product: gcc Version: 12.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- On x86_64, float f(float x) { return std::sqrt(x);} compiles to sqrtss xmm0, xmm0 even if --no-builtin is provided, whereas double d(double x) { return std::sqrt(x);} calls libm, as do float fs(float x) { return sqrtf(x);} and double ds(double x) { return sqrt(x);} see https://godbolt.org/z/Mhf9hv6ns
[Bug tree-optimization/106012] rsqrtps and rcpps instructiona generated even if -fno-reciprocal-math specified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106012 vincenzo Innocente changed:
What       | Removed | Added
Summary    | rsqrtss instruction generated even if -mno-recip specified | rsqrtps and rcpps instructiona generated even if -fno-reciprocal-math specified
Status     | RESOLVED | NEW
Resolution | WONTFIX | ---
--- Comment #3 from vincenzo Innocente --- Thanks for the suggestion. -fno-reciprocal-math does indeed inhibit scalar reciprocal instructions. NOT in vectorized loop though. see https://godbolt.org/z/9eMb4Tjee
[Bug target/106012] New: rsqrtss instruction generated even if -mno-recip specified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106012 Bug ID: 106012 Summary: rsqrtss instruction generated even if -mno-recip specified Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- with option -Ofast -mno-recip rsqrtss instruction is still generated. https://godbolt.org/z/hGxrG7xPh inhibiting rsqrtss and rcpss is critical to obtain identical results when running on INTEL and AMD platforms. Having to inhibit Ofast is clearly a larger performance penalty.
[Bug tree-optimization/104950] New: GCC does not emit branchless code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950 Bug ID: 104950 Summary: GCC does not emit branchless code Product: gcc Version: 12.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- In this example GCC fails to emit branchless code while Clang does. In the actual application, measurements show a slowdown of up to a factor of 2. I managed to force branchless code (-DBL), but the source is pretty unfriendly. Godbolt link (GCC, clang, GCC -DBL): https://godbolt.org/z/KWY1rjhhY and here inlined: #include <vector> const float defaultBaseResponse = 0.5; class DForest { public: //based on FastForest::evaluate() and BDTree::parseTree() DForest() { } float evaluate(const float* features) const; std::vector<int> rootIndices_; //"node" layout: cut, index, left, right struct Node{ float v; int i,l,r; constexpr int eval(float const * f) const { #ifdef BL auto m = f[i] > v; return *(&l + int(m)); #else return f[i] > v ? r : l; #endif } }; std::vector<Node> nodes_; std::vector<float> responses_; std::vector<float> baseResponses_; }; float DForest::evaluate(const float* features) const{ float sum{defaultBaseResponse + baseResponses_[0]}; for(int index : rootIndices_){ do { index = nodes_[index].eval(features); } while (index>0); sum += responses_[-index]; } return sum; }
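A more readable way to request a branch-free selection is to index a two-element table with the comparison result. The following is a sketch only: NodeSketch is a hypothetical stand-in for the Node struct, and whether GCC actually emits branch-free code for it depends on the compiler version and flags, which is not verified here.

```cpp
struct NodeSketch {
  float v;      // cut value
  int i, l, r;  // feature index, left child, right child

  // Reference version using a conditional expression.
  int eval_branchy(float const* f) const {
    return f[i] > v ? r : l;
  }

  // Selection by indexing: the comparison yields 0 or 1, which picks
  // l or r from a small local table.
  int eval_indexed(float const* f) const {
    const int lr[2] = {l, r};
    return lr[static_cast<int>(f[i] > v)];
  }
};
```

Both member functions return the same child index for every input; only the way the choice is expressed differs.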
[Bug tree-optimization/97707] avx512 math function invoked even if -mprefer-vector-width=256 specified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97707 --- Comment #3 from vincenzo Innocente --- The main point of using -mprefer-vector-width=256 is to avoid clock throttling in "mixed" workloads. In small benchmarks like this one, AVX-512 is faster (even on an old Silver) even though it triggers a slower clock (and the test should be performed with the machine fully loaded). Still, if I ask for -mprefer-vector-width=256 I would like to see no 512-bit-wide instructions used. A disturbing feature is also the difference between using int and long long as the loop index.
[Bug tree-optimization/97707] New: avx12 math function invoked even if -mprefer-vector-width=256 specified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97707 Bug ID: 97707 Summary: avx12 math function invoked even if -mprefer-vector-width=256 specified Product: gcc Version: 10.2.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- This code invokes _ZGVeN8v_sin instead of _ZGVdN4v_sin, thus making use of zmm registers: #include <cmath> int main() { double res=0; for (int x=0; x<1024;x++) { double y = x; res += std::sin(y); } return res > 0.5; } NOTE: if I specify for (long long x=0; x<1024;x++) { it will correctly invoke _ZGVdN4v_sin (no zmm). Compiler options: -Ofast -march=skylake-avx512 -mprefer-vector-width=256
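For readers unfamiliar with these names: under the x86-64 vector function ABI used by glibc's libmvec, a name of the form _ZGV<isa><mask><lanes><params>_<name> encodes the ISA (b for SSE, c for AVX, d for AVX2, e for AVX-512), whether the variant is masked (M) or unmasked (N), and the number of lanes. The sketch below decodes the two names from the report; the decode helper and the VecVariant struct are hypothetical and handle only this simple shape.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

struct VecVariant {
  int vector_bits = 0;  // register width implied by the ISA letter
  bool masked = false;  // 'M' variants take a mask, 'N' variants do not
  int lanes = 0;        // number of elements processed per call
  bool ok = false;      // false when the name does not match the shape
};

inline VecVariant decode(const std::string& name) {
  VecVariant out;
  const std::string prefix = "_ZGV";
  if (name.size() < 7 || name.compare(0, prefix.size(), prefix) != 0)
    return out;
  switch (name[4]) {
    case 'b': out.vector_bits = 128; break;  // SSE
    case 'c': out.vector_bits = 256; break;  // AVX
    case 'd': out.vector_bits = 256; break;  // AVX2
    case 'e': out.vector_bits = 512; break;  // AVX-512
    default: return out;
  }
  if (name[5] != 'M' && name[5] != 'N') return out;
  out.masked = (name[5] == 'M');
  std::size_t p = 6;
  while (p < name.size() && std::isdigit(static_cast<unsigned char>(name[p])))
    out.lanes = out.lanes * 10 + (name[p++] - '0');
  out.ok = out.lanes > 0;
  return out;
}
```

Decoded this way, _ZGVeN8v_sin is the unmasked 8-lane AVX-512 variant (eight doubles, hence zmm registers), while _ZGVdN4v_sin is the unmasked 4-lane AVX2 variant.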
[Bug tree-optimization/92335] missed transformation to branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92335 --- Comment #3 from vincenzo Innocente --- Understood for float it seems to me that the transformation does not occur for integer neither (signed or unsigned) as in using T= unsigned int; T bar(T const * __restrict__ x, T const * __restrict__ y) { T ret=0; for (int i=0;i<1024;++i) { auto k = y[i]; if(x[i]>1024) ret += k; } return ret; }
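For unsigned integers the branch and a select are exactly equivalent, since adding zero never changes the sum. A minimal sketch of the two forms follows; the function names bar_branchy and bar_select and the explicit length parameter are hypothetical, and the second function is simply the shape an if-conversion would produce, written by hand.

```cpp
#include <cstddef>

using T = unsigned int;

// Conditional accumulation, as in the report.
inline T bar_branchy(T const* x, T const* y, std::size_t n) {
  T ret = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (x[i] > 1024) ret += y[i];
  }
  return ret;
}

// Same result without a data-dependent branch in the loop body.
inline T bar_select(T const* x, T const* y, std::size_t n) {
  T ret = 0;
  for (std::size_t i = 0; i < n; ++i) {
    // All-ones mask when the condition holds, zero otherwise.
    T mask = T(0) - static_cast<T>(x[i] > 1024);
    ret += y[i] & mask;
  }
  return ret;
}
```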
[Bug tree-optimization/92335] New: missed transformation to branchless
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92335 Bug ID: 92335 Summary: missed transformation to branchless Product: gcc Version: 10.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in the following code (compiled with -O2 or -O3 and even with -march=haswell) gcc will use a branchless construct in foo but not in bar (changing from float to int does not modify the behavior) (see https://godbolt.org/z/0ZWKb5 ) with -Ofast they will compile in the same vectorized branchless loop, still I do not see why the branch shall be retained at -O2 in bar for random "x" the branchless version is 6 times faster on any out-of-order cpu float foo(float const * __restrict__ x, float const * __restrict__ y) { float ret=0.f; for (int i=0;i<1024;++i) { auto k = y[i]; ret += x[i]>0.f ? k : 0.f; } return ret; } float bar(float const * __restrict__ x, float const * __restrict__ y) { float ret=0.f; for (int i=0;i<1024;++i) { auto k = y[i]; if(x[i]>0.f) ret += k; } return ret; }
[Bug tree-optimization/88598] simplification of multiplication by 1 or 0 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88598 --- Comment #3 from vincenzo Innocente --- what I am interested in is NOT a constant array, more a small-size "sparse"-matrix that I can build explicitly at run time from other sources. I have examples using Eigen if of any interest ( https://godbolt.org/z/2L9OBU ) Clang is excellent in optimizing out zeros and ones, gcc in vectorization. I hope to get the best of the two!
[Bug tree-optimization/88598] New: simplification of multiplication by 1 or 0 fails
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88598 Bug ID: 88598 Summary: simplification of multiplication by 1 or 0 fails Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- g++ fails to optimize the code below even with -Ofast https://godbolt.org/z/mYRgVX independently of vectorization options https://godbolt.org/z/XMnCNz clang optimizes (return zero for "foo" and v[1] for "bar") even for just -ffinite-math-only -fno-signed-zeros -O2 https://godbolt.org/z/KU5f-x float foo(float const * __restrict__ v) { float j[5] = {0.,0.,0.,0.,0.}; float ret=0.; for (int i=0; i<5; ++i) ret +=j[i]*v[i]; return ret; } float bar(float const * __restrict__ v) { float j[5] = {0.,1.,0.,0.,0.}; float ret=0.; for (int i=0; i<5; ++i) ret +=j[i]*v[i]; return ret; }
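The reason extra flags are needed at all is that, under IEEE 754, 0 * x is not always +0: it is NaN when x is infinite or NaN, and -0 when x is negative. The sketch below demonstrates this with a copy of the reporter's foo; it is meant to be compiled without -ffast-math or -ffinite-math-only, and the names foo_strict and folds_to_zero are hypothetical.

```cpp
#include <cmath>

// Same loop as the reporter's foo: a dot product with an all-zero vector.
inline float foo_strict(float const* v) {
  float j[5] = {0.f, 0.f, 0.f, 0.f, 0.f};
  float ret = 0.f;
  for (int i = 0; i < 5; ++i) ret += j[i] * v[i];
  return ret;
}

// True when the result is an ordinary zero, i.e. when replacing the whole
// loop by "return 0" would not change the observable value.
inline bool folds_to_zero(float const* v) {
  float r = foo_strict(v);
  return !std::isnan(r) && r == 0.f;
}
```

For finite inputs the result is zero, but a single infinite element turns it into NaN, which is why the fold is only legal once the compiler may assume finite math.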
[Bug tree-optimization/86855] REGRESSON: [8.0] -Ofast optimize away mm_set_ps(0.0f,0.0f,-0.0f,0.0f);
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86855 --- Comment #5 from vincenzo Innocente --- I have indeed worked-around with const __m128i neg = _mm_set_epi32(0,0,0x8000,0); __m128i ret = __m128i(_mm_sub_ps(v5, v3)); return __m128(_mm_xor_si128(ret,neg)); const __m256i neg = _mm256_set_epi64x(0,0,0x8000,0); return __m256d(_mm256_xor_si256(__m256i(ret), neg)); etc
[Bug tree-optimization/86855] REGRESSON: [8.0] -Ofast optimize away mm_set_ps(0.0f,0.0f,-0.0f,0.0f);
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86855 --- Comment #3 from vincenzo Innocente --- looks more undefined behavior as const __m128 neg = _mm_set_ps(0.0f,0.0f,-0.0f,-0.0f); return _mm_xor_ps(_mm_sub_ps(v5, v3), neg); with -O3 compiles in xorps .LC0(%rip), %xmm0 ret .LC0: .long 2147483648 .long 2147483648 .long 0 .long 0 while -Ofast in xorps .LC0(%rip), %xmm0 ret .LC0: .long 2147483648 .long 2147483648 .long 2147483648 .long 2147483648 needless to say that neither clang nor icc display such a behavior...
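For reference, the .long 2147483648 entries are the bit pattern of -0.0f (only the sign bit set, 0x80000000), so XOR-ing a lane with it negates that lane, while XOR-ing with +0.0f leaves it unchanged. A portable scalar sketch of that relationship follows; it assumes IEEE 754 binary32 floats, and the helper names bits_of and flip_sign are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Bit pattern of a float, copied out without aliasing violations.
inline std::uint32_t bits_of(float f) {
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

// Negate by flipping the sign bit: the scalar analogue of an xorps
// against a lane holding -0.0f.
inline float flip_sign(float f) {
  std::uint32_t u = bits_of(f) ^ 0x80000000u;
  float r;
  std::memcpy(&r, &u, sizeof r);
  return r;
}
```

This is why treating -0.0f as interchangeable with 0.0f changes which lanes are negated when the constant is used as an XOR mask.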
[Bug tree-optimization/86855] New: REGRESSON: [8.0] -Ofast optimize away mm_set_ps(0.0f,0.0f,-0.0f,0.0f);
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86855 Bug ID: 86855 Summary: REGRESSON: [8.0] -Ofast optimize away mm_set_ps(0.0f,0.0f,-0.0f,0.0f); Product: gcc Version: 9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- this function _m128 _mm_cross_ps(__m128 v1, __m128 v2) { // same order is _MM_SHUFFLE(3,2,1,0) // x2, z1,z1 __m128 v3 = _mm_shuffle_ps(v2, v1, _MM_SHUFFLE(3, 0, 2, 2)); // y1, x2,y2 __m128 v4 = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(3, 1, 0, 1)); __m128 v5 = _mm_mul_ps(v3, v4); // x1, z2,z2 v3 = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(3, 0, 2, 2)); //y2, x1,y1 v4 = _mm_shuffle_ps(v2, v1, _MM_SHUFFLE(3, 1, 0, 1)); v3 = _mm_mul_ps(v3, v4); const __m128 neg = _mm_set_ps(0.0f,0.0f,-0.0f,0.0f); return _mm_xor_ps(_mm_sub_ps(v5, v3), neg); } compiled more or less in mm_cross_ps(float __vector(4), float __vector(4)): movaps %xmm1, %xmm2 movaps %xmm0, %xmm4 movaps %xmm0, %xmm3 shufps $202, %xmm0, %xmm2 shufps $209, %xmm1, %xmm4 shufps $202, %xmm1, %xmm3 shufps $209, %xmm0, %xmm1 mulps %xmm4, %xmm2 mulps %xmm3, %xmm1 movaps %xmm2, %xmm0 subps %xmm1, %xmm0 xorps .LC0(%rip), %xmm0 ret .LC0: .long 0 .long 2147483648 .long 0 .long 0 according to godbolt since 8.1 the xor is optimized away with -Ofast as mm_cross_ps(float __vector(4), float __vector(4)): movaps %xmm1, %xmm2 movaps %xmm0, %xmm4 movaps %xmm0, %xmm3 shufps $209, %xmm1, %xmm4 shufps $202, %xmm0, %xmm2 mulps %xmm4, %xmm2 shufps $202, %xmm1, %xmm3 shufps $209, %xmm0, %xmm1 mulps %xmm3, %xmm1 movaps %xmm2, %xmm0 subps %xmm1, %xmm0 ret is this intended?
[Bug tree-optimization/83857] [8 Regression] internal compiler error: in exact_div, at poly-int.h:2139
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83857 --- Comment #2 from vincenzo Innocente --- (In reply to Richard Biener from comment #1) > I've seen a similar bug so maybe fixed already. If the similar bug is #83753, it looks "fixed" in the version I tested (at least /gcc/testsuite/gcc.dg/torture/pr83753.c is present).
[Bug tree-optimization/83857] New: [ICE] internal compiler error: in exact_div, at poly-int.h:2139
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83857 Bug ID: 83857 Summary: [ICE] internal compiler error: in exact_div, at poly-int.h:2139 Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- Created attachment 43133 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43133=edit directory with all files needed to reproduce problem (no attempt to reduce to minimum) c++ -v Using built-in specs. COLLECT_GCC=c++ COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/8.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib : (reconfigured) ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib : (reconfigured) ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib Thread model: posix gcc version 8.0.1 20180115 (experimental) [trunk revision 256692] (GCC) [innocent@vinavx3 innocent]$ tar -xzf fastSinCos.tgz [innocent@vinavx3 innocent]$ cd fastSinCos [innocent@vinavx3 fastSinCos]$ c++ -Ofast -fopt-info-vec -Wall testSinCos.cpp testSinCos.cpp:136:21: note: loop vectorized during GIMPLE pass: vect testSinCos.cpp: In function 'int main()': testSinCos.cpp:22:5: internal compiler error: in exact_div, at poly-int.h:2139 int main() { ^~~~ 0x7853bf poly_int<1u, poly_result::is_poly>::type, poly_coeff_pair_traits::is_poly>::type>::result_kind>::type> exact_div<1u, unsigned long, unsigned long>(poly_int_pod<1u, unsigned long> const&, unsigned long) ../../gcc-trunk/gcc/poly-int.h:2139 0x7853bf poly_int<1u, poly_result::result_kind>::type> exact_div<1u, 
unsigned long, unsigned long>(poly_int_pod<1u, unsigned long> const&, poly_int_pod<1u, unsigned long> const&) ../../gcc-trunk/gcc/poly-int.h:2152 0x7853bf vect_get_num_vectors ../../gcc-trunk/gcc/tree-vectorizer.h:1307 0x7853bf vect_get_num_copies ../../gcc-trunk/gcc/tree-vectorizer.h:1318 0x7853bf vectorizable_live_operation(gimple*, gimple_stmt_iterator*, _slp_tree*, int, gimple**) ../../gcc-trunk/gcc/tree-vect-loop.c:8132 0x1102053 vect_analyze_loop_operations ../../gcc-trunk/gcc/tree-vect-loop.c:1855 0x1102053 vect_analyze_loop_2 ../../gcc-trunk/gcc/tree-vect-loop.c:2254 0x1102053 vect_analyze_loop(loop*, _loop_vec_info*) ../../gcc-trunk/gcc/tree-vect-loop.c:2546 0x111b0ad vectorize_loops() ../../gcc-trunk/gcc/tree-vectorizer.c:664 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. See <https://gcc.gnu.org/bugs/> for instructions. btw ls /data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c /data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c c++ -Ofast -c /data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c -fopt-info-vec /data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c:13:14: note: loop vectorized /data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c:19:1: note: basic block vectorized
[Bug target/80566] New: no use of avx vmovups on ymm registry in set and copy
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80566 Bug ID: 80566 Summary: no use of avx vmovups on ymm registry in set and copy Product: gcc Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in this example #include int * foo() { int * p = new int[16]; memset(p,0,16*sizeof(int)); return p; } int * foo(int * q) { int * p = new int[16]; memcpy(q,p,16*sizeof(int)); return p; } gcc does not make use of vmovups on ymm registry ( c++ -O3 -Wall -march=haswell -S) while (according to gcc.godbolt.org) clang 4.0 does https://godbolt.org/g/qnX975
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390 --- Comment #19 from vincenzo Innocente --- Could you please have a look also to c++ and lto: this is what I get on my skylake: for c++ or lto -fno-split-paths pessimizes [innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm ; ./a.out | grep LU LU Mflops: 5920.14(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm -fno-split-paths ; ./a.out | grep LU LU Mflops: 6136.33(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm -flto ; ./a.out | grep LU LU Mflops: 5809.93(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm -flto -fno-split-paths ; ./a.out | grep LU LU Mflops: 5630.24(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm ; ./a.out | grep LU LU Mflops: 6001.47(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm -fno-split-paths ; ./a.out | grep LU LU Mflops: 5920.14(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm -flto; ./a.out | grep LU LU Mflops: 5434.16(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm -flto -fno-split-paths ; ./a.out | grep LU LU Mflops: 5434.16(M=100, N=100)
[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390 --- Comment #17 from vincenzo Innocente --- [innocent@vinavx3 innocent]$ mkdir scimark2TMP [innocent@vinavx3 innocent]$ cd scimark2TMP [innocent@vinavx3 scimark2TMP]$ wget http://math.nist.gov/scimark2/scimark2_1c.zip . . gcc version 7.0.1 20170407 (experimental) [trunk revision 246752] (GCC) [innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm [innocent@vinavx3 scimark2TMP]$ ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. Composite Score: 2783.60 FFT Mflops: 2325.65(N=1024) SOR Mflops: 2260.36(100 x 100) MonteCarlo: Mflops: 829.14 Sparse matmult Mflops: 2582.70(N=1000, nz=5000) LU Mflops: 5920.14(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm -fno-split-paths [innocent@vinavx3 scimark2TMP]$ ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. Composite Score: 2825.86 FFT Mflops: 2333.43(N=1024) SOR Mflops: 2260.36(100 x 100) MonteCarlo: Mflops: 829.14 Sparse matmult Mflops: 2570.04(N=1000, nz=5000) LU Mflops: 6136.33(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm -fsplit-paths [innocent@vinavx3 scimark2TMP]$ ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. 
Composite Score: 2787.46 FFT Mflops: 2325.65(N=1024) SOR Mflops: 2260.36(100 x 100) MonteCarlo: Mflops: 832.36 Sparse matmult Mflops: 2582.70(N=1000, nz=5000) LU Mflops: 5936.23(M=100, N=100) [innocent@vinavx3 scimark2TMP]$ pushd ~/code/s7/C CMSSW_8_0_22/ CMSSW_9_1_0_pre2/ [innocent@vinavx3 scimark2TMP]$ pushd ~/code/s7/CMSSW_9_1_0_pre2/ ~/code/s7/CMSSW_9_1_0_pre2 /tmp/innocent/scimark2TMP [innocent@vinavx3 CMSSW_9_1_0_pre2]$ cmsenv [innocent@vinavx3 CMSSW_9_1_0_pre2]$ popd /tmp/innocent/scimark2TMP [innocent@vinavx3 scimark2TMP]$ gcc -v gcc version 6.3.0 (GCC) [innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm [innocent@vinavx3 scimark2TMP]$ ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. Composite Score: 2820.21 FFT Mflops: 2325.65(N=1024) SOR Mflops: 2260.36(100 x 100) MonteCarlo: Mflops: 810.37 Sparse matmult Mflops: 2427.26(N=1000, nz=5000) LU Mflops: 6277.39(M=100, N=100)
[Bug target/80313] New: -march=znver1 produce worse code than -march=haswell
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80313 Bug ID: 80313 Summary: -march=znver1 produce worse code than -march=haswell Product: gcc Version: 7.0.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- Created attachment 41125 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41125=edit sef contained scimark2 MC benchmark just got hold of a AMD Ryzen 7 1800X Eight-Core Processor and was surprised by the results running with -march=native the point is that the results can be reproduced on a haswell or broadwell as well! I used full scimark2, just the MC benchmark shows at least one problem this is on intel [innocent@vinavx3 fullMC]$ gcc -march=znver1 -O3 fullMC.c -g ; time ./a.out 1.245u 0.000s 0:01.24 100.0%0+0k 0+0io 0pf+0w [innocent@vinavx3 fullMC]$ gcc -O3 fullMC.c -g ; time ./a.out 0.327u 0.000s 0:00.32 100.0%0+0k 0+0io 0pf+0w [innocent@vinavx3 fullMC]$ gcc -march=broadwell -O3 fullMC.c -g ; time ./a.out 0.308u 0.000s 0:00.30 100.0%0+0k 0+0io 0pf+0w this is on ryzen [innocent@vinzen0 fullMC]$ gcc -march=znver1 -O3 fullMC.c -g ; time ./a.out 1.354u 0.001s 0:01.35 100.0%0+0k 0+0io 0pf+0w [innocent@vinzen0 fullMC]$ gcc -O3 fullMC.c -g ; time ./a.out 0.315u 0.000s 0:00.31 100.0%0+0k 0+0io 0pf+0w [innocent@vinzen0 fullMC]$ gcc -march=broadwell -O3 fullMC.c -g ; time ./a.out 0.313u 0.001s 0:00.31 100.0%0+0k 0+0io 0pf+0w
[Bug tree-optimization/80248] sparse access to Array of structures does not vectorize using gathers
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80248 --- Comment #2 from vincenzo Innocente --- Side note: the difference in timing between "aos2" and "soa" seems to be fully accounted for by the integer multiplication "3*k[i]".
[Bug target/80232] Ofast pessimizes Sparse matmult in scimark2 benchmark on avx platforms
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232 --- Comment #5 from vincenzo Innocente --- I confirm that gather is almost twice as fast on Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz w/r/t Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (I used a benchmark version of the PR80248 example), so on Skylake and KNL (and hopefully on skylake-avx512) it is profitable; on Haswell and Broadwell it is not...
[Bug tree-optimization/80248] New: sparse access to Array of structures does not vectorize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80248 Bug ID: 80248 Summary: sparse access to Array of structures does not vectorize Product: gcc Version: 7.0.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in the following example "aos" does not vectorize while the equivalent aos2 does vectorize using vgatherdps instruction On a slight different matter: "soa" vectorizes and produces code that is apparently 20% faster than "aos2": I may open a different PR with a benchmark attached... cat simpleGather.cc struct float3 { float x; float y; float z; }; #define N 1024 float fx[N], g[N]; float fy[N]; float fz[N]; int k[N]; float3 f3[N]; void aos (void) { int i; for (i = 0; i < N; i++) g[i] = f3[k[i]].x+f3[k[i]].y+f3[k[i]].z; } // use gather void aos2 (void) { float * ff = &(f3[0].x); int i; for (i = 0; i < N; i++) g[i] = ff[3*k[i]]+ff[3*k[i]+1]+ff[3*k[i]+2]; } // use gather void soa (void) { int i; for (i = 0; i < N; i++) g[i] = fx[k[i]]+fy[k[i]]+fz[k[i]]; } [innocent@vinavx3 vectorize]$ c++ -Ofast -Wall -march=haswell -S simpleGather.cc -fopt-info-vec simpleGather.cc:31:17: note: loop vectorized simpleGather.cc:41:17: note: loop vectorized [innocent@vinavx3 vectorize]$ c++ -v Using built-in specs. COLLECT_GCC=c++ COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib Thread model: posix gcc version 7.0.1 20170326 (experimental) [trunk revision 246485] (GCC)
[Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796 --- Comment #10 from vincenzo Innocente --- added a self contained "benchmark" on my machine [innocent@vinavx3 ctest]$ c++ -Ofast -Wall SparseOnly.c -march=native ; time ./a.out 0.496u 0.000s 0:00.49 100.0%0+0k 0+0io 0pf+0w [innocent@vinavx3 ctest]$ c++ -O2 -Wall SparseOnly.c -march=native ; time ./a.out 0.411u 0.000s 0:00.41 100.0%0+0k 0+0io 0pf+0w [innocent@vinavx3 ctest]$ c++ -O3 -Wall SparseOnly.c -march=native ; time ./a.out 0.413u 0.000s 0:00.41 100.0%0+0k 0+0io 0pf+0w
[Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796 --- Comment #9 from vincenzo Innocente --- Created attachment 41070 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41070=edit self contained benchmark of scimark2 SparseMat mult. The content is not randomized; parameters must be modified by hand in main.
[Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796 --- Comment #8 from vincenzo Innocente --- My understanding of the gather latency is that it essentially corresponds to one load per cache line: fast if all items are close by, slower than scalar loads if the items are all in different cache lines. I am not sure how this can be turned into a "cost model".
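One crude ingredient for such a cost model, offered only as a sketch, is to count how many distinct cache lines a gather touches. The 64-byte line size, the function name lines_touched, the requirement of non-negative indices and the assumption that the base pointer is line-aligned are all illustrative assumptions; nothing here reflects what GCC actually does.

```cpp
#include <cstddef>
#include <set>

// Number of distinct cache lines touched when loading base[idx[k]] for
// k in [0, n), assuming line_bytes-byte lines, a line-aligned base and
// non-negative indices.
inline std::size_t lines_touched(const int* idx, std::size_t n,
                                 std::size_t elem_bytes = sizeof(float),
                                 std::size_t line_bytes = 64) {
  std::set<std::size_t> lines;
  for (std::size_t k = 0; k < n; ++k)
    lines.insert(static_cast<std::size_t>(idx[k]) * elem_bytes / line_bytes);
  return lines.size();
}
```

Eight neighbouring float indices fall into a single line, while eight indices spaced 16 floats apart touch eight different lines, which matches the fast and slow cases described in the comment.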
[Bug tree-optimization/80232] New: Ofast pessimizes Sparse matmult in scimark2 benchmark on avx platforms
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232 Bug ID: 80232 Summary: Ofast pessimizes Sparse matmult in scimark2 benchmark on avx platforms Product: gcc Version: 7.0.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- on my machine after the usual mkdir scimark2TMP cd scimark2TMP wget http://math.nist.gov/scimark2/scimark2_1c.zip . unzip scimark2_1c.zip gcc -v I get Using built-in specs. COLLECT_GCC=c++ COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib Thread model: posix gcc version 7.0.1 20170326 (experimental) [trunk revision 246485] (GCC) [innocent@vinavx3 scimark2TMP]$ gcc -O2 -march=haswell *.c -lm [innocent@vinavx3 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult" Sparse matmult Mflops: 3271.69(N=1000, nz=5000) [innocent@vinavx3 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 2946.76(N=10, nz=100) [innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=nehalem *.c -lm [innocent@vinavx3 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult" Sparse matmult Mflops: 3281.93(N=1000, nz=5000) [innocent@vinavx3 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 2859.34(N=10, nz=100) [innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=corei7-avx *.c -lm [innocent@vinavx3 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult" Sparse matmult Mflops: 2987.40(N=1000, nz=5000) [innocent@vinavx3 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 2869.35(N=10, nz=100) [innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm [innocent@vinavx3 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult" Sparse matmult Mflops: 
2579.52(N=1000, nz=5000) [innocent@vinavx3 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 2381.40(N=10, nz=100) So O2 and sse4.2 are the fastest, avx is already slower, and avx2 is dramatically slower. Part of the difference can be due to the gather operation as in #57796; I am not sure about the difference w/r/t O2. It is interesting to note that on KNL it makes almost no difference (not sure if this is positive or negative...), with a hint of speedup for the large problem... [innocent@vinknl0 scimark2TMP]$ gcc -Ofast -march=knl *.c -lm [innocent@vinknl0 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult" ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 348.13(N=1000, nz=5000) [innocent@vinknl0 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 358.67(N=10, nz=100) [innocent@vinknl0 scimark2TMP]$ gcc -O2 -march=knl *.c -lm [innocent@vinknl0 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult" Sparse matmult Mflops: 329.33(N=1000, nz=5000) [innocent@vinknl0 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 321.51(N=10, nz=100) [innocent@vinknl0 scimark2TMP]$ gcc -Ofast -march=corei7-avx *.c -lm [innocent@vinknl0 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult" Sparse matmult Mflops: 343.12(N=1000, nz=5000) [innocent@vinknl0 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 323.03(N=10, nz=100) [innocent@vinknl0 scimark2TMP]$ gcc -Ofast -march=nehalem *.c -lm ./a.out 5 | grep "Sparse matmult" [innocent@vinknl0 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult" Sparse matmult Mflops: 343.57(N=1000, nz=5000) [innocent@vinknl0 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult" Sparse matmult Mflops: 321.00(N=10, nz=100)
[Bug rtl-optimization/80197] New: pgo dramatically pessimizes scimark2 MonteCarlo benchmark
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197 Bug ID: 80197 Summary: pgo dramatically pessimizes scimark2 MonteCarlo benchmark Product: gcc Version: 7.0.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- Created attachment 41053 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41053=edit self contained benchmark of scimark2 MC while chasing the regression I then found identified and solved in #79389 I discovered that pgo manages to do much worse than the regression above. The symptom is the same: a huge increase in branch-miss. This is not a regression: it is the same at least since gcc5.3 Attached a self contained single file, copy of scimark2 MC, and a couple of scripts to compile and run it just tar -xzf fullMC.tgz cd fullMC # standard compilation -O2 -O3 ./runit # same with pgo passes ./dopgo or just do [innocent@vinavx3 fullMC]$ rm -rf pgo/* ; c++ -O3 fullMC.c -g -fprofile-generate=pgo ; time ./a.out 1.848u 0.000s 0:01.85 99.4% 0+0k 0+8io 0pf+0w [innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g -fprofile-use=./pgo ; time ./a.out 0.967u 0.001s 0:00.96 100.0%0+0k 0+0io 0pf+0w [innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g; time ./a.out 0.328u 0.000s 0:00.32 100.0%0+0k 0+0io 0pf+0w for reference: cat dopgo cat /proc/cpuinfo | grep name | head -n 1 gcc -v rm -rf pgo/*;gcc -O2 fullMC.c -g -fprofile-generate=pgo; ./a.out gcc -O2 fullMC.c -g -fprofile-use=pgo; ./a.out perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out rm -rf pgo/*;gcc -O3 fullMC.c -g -fprofile-generate=pgo; ./a.out gcc -O3 fullMC.c -g -fprofile-use=pgo; ./a.out perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses ./a.out on my machine the result is # standard compilation [innocent@vinavx3 fullMC]$ ./runit model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz Using built-in specs. 
COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib Thread model: posix gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) gcc -O2 fullMC.c -g real0m0.489s user0m0.485s sys 0m0.002s Performance counter stats for './a.out': 486.303424 task-clock (msec) #0.999 CPUs utilized 1901271534 cycles#3.910 GHz 6403589598 instructions #3.37 insn per cycle 700683389 branches # 1440.836 M/sec 13582 branch-misses #0.00% of all branches 0.486571089 seconds time elapsed gcc -O3 fullMC.c -g real0m0.330s user0m0.330s sys 0m0.000s Performance counter stats for './a.out': 327.385696 task-clock (msec) #0.999 CPUs utilized 1279958668 cycles#3.910 GHz 5009002909 instructions #3.91 insn per cycle 306481761 branches # 936.149 M/sec 10805 branch-misses #0.00% of all branches 0.327637485 seconds time elapsed // pro generation and use (perf after use...) [innocent@vinavx3 fullMC]$ ./dopgo model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz Using built-in specs. 
COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: ../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm -disable-multilib Thread model: posix gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) Performance counter stats for './a.out': 964.399833 task-clock (msec) #1.000 CPUs utilized 3770455888 cycles#3.910 GHz 5007987488 instructions #1.33 insn per cycle 816525627 branches # 846.667 M/sec 88982233 branch-misses # 10.90% of all branches 0.964699603 seconds time elapsed Performance counter stats for './a.out': 964.540691 task-clock (msec) #1.000 CPUs utilized 3771010753 cycles#3.910 GHz 5007957589 instructi
[Bug tree-optimization/79594] New: -Waggressive-loop-optimizations incomplete and/or inconsistentt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79594 Bug ID: 79594 Summary: -Waggressive-loop-optimizations incomplete and/or inconsistentt Product: gcc Version: 7.0.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- given cat aggressiveLoop.cc #include #include float x[1024]; float y[1024]; float w[512]; float z[128]; float c,q; void foo() { for (int i=0; i<1024; ++i) { auto zz=z[i]; auto yy = y[i]; if(x[i] > q) yy = zz; y[i]=yy; } } void foo2() { for (int i=0; i<1024; ++i) { auto zz=z[i]; auto yy = w[i]; if(x[i] > q) yy = zz; x[i]=yy; } } void foo3() { for (int i=0; i<1024; ++i) { auto zz=z[i]; auto yy = w[i]; if(x[i] > q) yy = zz; w[i]=yy; } } gcc version 7.0.1 20170205 (experimental) [trunk revision 245191] (GCC) reports c++ -Wall -O2 aggressiveLoop.cc -S aggressiveLoop.cc: In function 'void foo()': aggressiveLoop.cc:13:9: warning: iteration 128 invokes undefined behavior [-Waggressive-loop-optimizations] auto zz=z[i]; ^~ aggressiveLoop.cc:12:18: note: within this loop for (int i=0; i<1024; ++i) { ~^ aggressiveLoop.cc: In function 'void foo2()': aggressiveLoop.cc:22:9: warning: iteration 128 invokes undefined behavior [-Waggressive-loop-optimizations] auto zz=z[i]; ^~ aggressiveLoop.cc:21:18: note: within this loop for (int i=0; i<1024; ++i) { ~^ aggressiveLoop.cc: In function 'void foo3()': aggressiveLoop.cc:34:8: warning: iteration 512 invokes undefined behavior [-Waggressive-loop-optimizations] w[i]=yy; ^~~ aggressiveLoop.cc:30:18: note: within this loop for (int i=0; i<1024; ++i) { ~^ Only one access per function is reported, while in foo2 there is also "auto yy = w[i];" (out of bounds from iteration 512), and in foo3 both reads "auto zz=z[i]; auto yy = w[i];" also "invoke undefined behavior", at iterations 128 and 512 respectively...
[Bug tree-optimization/77859] Ofast needed to vectorize loop in presence of conditional code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77859 --- Comment #2 from vincenzo Innocente --- Thanks for the fast response. I think I can "survive" with -O3 -fno-trapping-math: in principle it should not change the binary compatibility of the output w/r/t -O2 and, to the best of my understanding, it does not inhibit raising FP exceptions (we already force -fno-math-errno to avoid errno generation in sqrt...)
[Bug tree-optimization/77859] New: Ofast needed in presence of conditional code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77859 Bug ID: 77859 Summary: Ofast needed in presence of conditional code Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- It looks to me that "relaxed floating point math" is not a requirement to vectorize this code. Currently gcc version 7.0.0 20161004 (experimental) [trunk revision 240754] (GCC) requires Ofast, and not O3, to vectorize it: #include float x[1024]; float y[1024]; void needOfast() { for (int i=0; i<1024; ++i) { constexpr float pi4 = M_PI/4.; constexpr float pi2 = M_PI/2.; auto g1 = x[i] > pi4; auto xx = x[i]; xx = g1 ? xx-pi2 : xx; auto g2 = xx > pi4; xx = g2 ? xx-pi2 : xx; y[i] = xx; } } In case anyone wonders, this alternative formulation needs Ofast as well: void needOfastAsWell() { for (int i=0; i<1024; ++i) { constexpr float pi = M_PI; constexpr float pi4 = M_PI/4.; constexpr float pi34 = 3.*M_PI/4.; constexpr float pi2 = M_PI/2.; auto g1 = x[i] > pi4; auto xx = x[i]; xx = g1 ? xx-pi2 : xx; auto g2 = x[i] > pi34; xx = g2 ? x[i]-pi : xx; y[i] = xx; } }
[Bug middle-end/71666] profile-generate not documented
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71666 --- Comment #2 from vincenzo Innocente --- OK, so it is just the sentence "See Optimize Options" which needs to be changed...
[Bug web/71666] New: profile-generate not documented
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71666 Bug ID: 71666 Summary: profile-generate not documented Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: web Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- As of today -fprofile-generate does not seem to be documented in https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html. It is quoted 4 times, including a self-referencing "See Optimize Options, for information about the -fprofile-generate option" (btw -fprofile-dir is quoted and not documented as well)
[Bug gcov-profile/70993] New: ICE with gcov and lto
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70993 Bug ID: 70993 Summary: ICE with gcov and lto Product: gcc Version: 7.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: gcov-profile Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- with gcc version 7.0.0 20160506 (experimental) [trunk revision 235977] (GCC) cat main.cpp int main() { return 0;} c++ -O2 main.cpp perf record -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=49/ -o perf.data ./a.out [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.016 MB perf.data (1 samples) ] create_gcov --binary=./a.out --profile=perf.data --gcov=fbdata.afdo -gcov_version 1 c++ -O2 main.cpp -fauto-profile -flto lto1: internal compiler error: in compute_working_sets, at gcov-io.c:1006 0x6fee03 compute_working_sets(gcov_ctr_summary const*, gcov_working_set_info*) ../../gcc-trunk/gcc/gcov-io.c:1006 0xa1a72d get_working_sets() ../../gcc-trunk/gcc/profile.c:226 0x97cd1a input_symtab() ../../gcc-trunk/gcc/lto-cgraph.c:1869 0x6634a7 read_cgraph_and_symbols ../../gcc-trunk/gcc/lto/lto.c:2856 0x6634a7 lto_main() ../../gcc-trunk/gcc/lto/lto.c:3305 Please submit a full bug report, with preprocessed source if appropriate. Please include the complete backtrace with any bug report. See <http://gcc.gnu.org/bugs.html> for instructions. lto-wrapper: fatal error: c++ returned 1 exit status compilation terminated. /usr/bin/ld: lto-wrapper failed collect2: error: ld returned 1 exit status
[Bug c++/69564] [5/6 Regression] lto and/or C++ make scimark2 LU slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564 --- Comment #19 from vincenzo Innocente --- With the patch applied to gcc version 6.0.0 20160324 (experimental) [trunk revision 234461] (GCC) I confirm the improvement in timing for c++ and lto. The timing difference between gcc and c++ seems to be inside "errors". I am satisfied. Thanks Patrick! (btw I suppose no hope for a back port to 5.4?)
[Bug c++/69564] lto and/or C++ make scimark2 LU slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564 --- Comment #5 from vincenzo Innocente --- it is a regression gcc version 4.9.3 (GCC) c++ -Ofast *.c; ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. gcc -Ofast *.c; ./a.out c++ -v Composite Score: 2449.06 FFT Mflops: 2046.03(N=1024) SOR Mflops: 1654.04(100 x 100) MonteCarlo: Mflops: 813.44 Sparse matmult Mflops: 2962.08(N=1000, nz=5000) LU Mflops: 4769.72(M=100, N=100) --- gcc -Ofast *.c -lm; ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. Composite Score: 2475.22 FFT Mflops: 2064.19(N=1024) SOR Mflops: 1633.01(100 x 100) MonteCarlo: Mflops: 810.37 Sparse matmult Mflops: 2970.47(N=1000, nz=5000) LU Mflops: 4898.06(M=100, N=100)
[Bug c++/69564] lto and/or C++ make scimark2 LU slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564 --- Comment #3 from vincenzo Innocente --- > Any reason you are using the c++ driver here? Because I am interested in C++ performance. I never imagined that the c++ front-end could make a difference on such code... From my point of view it is an even more severe regression than just "lto"
[Bug lto/69564] New: lto makes scimark2 LU slower
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564 Bug ID: 69564 Summary: lto makes scimark2 LU slower Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: lto Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- mkdir scimark2; cd scimark2 wget http://math.nist.gov/scimark2/scimark2_1c.zip unzip scimark2_1c.zip c++ -Ofast *.c; ./a.out c++ -Ofast *.c -flto; ./a.out with gcc 4.9.3 gcc version 4.9.3 (GCC) c++ -Ofast *.c; ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. Composite Score: 2462.90 FFT Mflops: 2070.32(N=1024) SOR Mflops: 1661.17(100 x 100) MonteCarlo: Mflops: 813.44 Sparse matmult Mflops: 2978.91(N=1000, nz=5000) LU Mflops: 4790.64(M=100, N=100) [innocent@vinavx3 scimark2]$ c++ -Ofast *.c -flto; ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. Composite Score: 2582.94 FFT Mflops: 2064.19(N=1024) SOR Mflops: 1654.04(100 x 100) MonteCarlo: Mflops: 1426.90 Sparse matmult Mflops: 2978.91(N=1000, nz=5000) LU Mflops: 4790.64(M=100, N=100) with latest build gcc version 6.0.0 20160129 (experimental) (GCC) [innocent@vinavx3 scimark2]$ c++ -Ofast *.c; ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. Composite Score: 2377.18 FFT Mflops: 1970.89(N=1024) SOR Mflops: 1654.04(100 x 100) MonteCarlo: Mflops: 810.37 Sparse matmult Mflops: 3328.81(N=1000, nz=5000) LU Mflops: 4121.76(M=100, N=100) [innocent@vinavx3 scimark2]$ c++ -Ofast *.c -flto; ./a.out ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. 
(Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kenel. Composite Score: 2136.23 FFT Mflops: 2076.48(N=1024) SOR Mflops: 1654.04(100 x 100) MonteCarlo: Mflops: 1533.92 Sparse matmult Mflops: 3266.59(N=1000, nz=5000) LU Mflops: 2150.13(M=100, N=100)
[Bug c++/68180] New: [ICE] at cp/constexpr.c:2768 in initializing __vector in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68180 Bug ID: 68180 Summary: [ICE] at cp/constexpr.c:2768 in initializing __vector in a loop Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t; constexpr float32x4_t fill(float x) { float32x4_t v{0}; constexpr auto vs = sizeof(v)/sizeof(v[0]); for (auto i=0U; i<vs; ++i) v[i]=i; return v+x; } float32x4_t foo(float32x4_t x) { constexpr float32x4_t v = fill(1.f); return x+v; } gcc version 6.0.0 20151028 (experimental) [trunk revision 229474] (GCC) ICE in c++ -O2 avxconst.cc -std=c++17 -S avxconst.cc: In function ‘float32x4_t foo(float32x4_t)’: avxconst.cc:10:33: in constexpr expansion of ‘fill(1.0e+0f)’ avxconst.cc:10:37: internal compiler error: tree check: expected constructor, have vector_cst in cxx_eval_store_expression, at cp/constexpr.c:2768 constexpr float32x4_t v = fill(1.f); ^ avxconst.cc:10:37: internal compiler error: Abort trap: 6 c++: internal compiler error: Abort trap: 6 (program cc1plus)
[Bug c++/68125] New: std::sqrt prevent use of associative math
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68125 Bug ID: 68125 Summary: std::sqrt prevent use of associative math Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- with -Ofast the code generated differs float rsqrt1(float a, float x, float y) { return a/std::sqrt(x)/std::sqrt(y); } float rsqrt2(float a, float x, float y) { return a/sqrtf(x)/sqrtf(y); } rsqrt1(float, float, float): sqrtss %xmm2, %xmm2 sqrtss %xmm1, %xmm1 mulss %xmm2, %xmm1 divss %xmm1, %xmm0 ret rsqrt2(float, float, float): mulss %xmm1, %xmm2 rsqrtss %xmm2, %xmm1 mulss %xmm1, %xmm2 mulss %xmm1, %xmm2 mulss .LC9(%rip), %xmm1 addss .LC8(%rip), %xmm2 mulss %xmm1, %xmm2 mulss %xmm2, %xmm0 ret
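As a sanity check that the two spellings are meant to compute the same value, here is a small sketch. rsqrt1 and rsqrt2 follow the report; rsqrt_ref is a hypothetical helper added only for illustration. It spells out the reassociated form a/sqrt(x*y) that the rsqrt2 assembly above appears to use (one multiply followed by a single rsqrtss), a rewrite that is only permitted under relaxed floating point rules such as -Ofast.

```cpp
#include <cmath>
#include <math.h>

// As in the report: the C++ overload std::sqrt(float) ...
float rsqrt1(float a, float x, float y) { return a / std::sqrt(x) / std::sqrt(y); }

// ... versus the C function sqrtf.
float rsqrt2(float a, float x, float y) { return a / sqrtf(x) / sqrtf(y); }

// Hypothetical reference: the algebraically equivalent single-sqrt form.
float rsqrt_ref(float a, float x, float y) { return a / std::sqrt(x * y); }
```

Without fast-math the three functions may differ in the last bits; mathematically they agree for positive x and y.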
[Bug c++/68125] std::sqrt prevent use of associative math
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68125 --- Comment #2 from vincenzo Innocente --- Thanks Marc for the fast check. I am still with gcc version 6.0.0 20150801 (experimental) [trunk revision 226463] (GCC); I will update and verify
[Bug c++/68125] std::sqrt prevent use of associative math
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68125 vincenzo Innocente changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #3 from vincenzo Innocente --- confirmed fixed in gcc version 6.0.0 20151028 (experimental) [trunk revision 229474] (GCC) still generated code is NOT identical __Z6rsqrt1fff: LFB230: mulss %xmm1, %xmm2 rsqrtss %xmm2, %xmm3 mulss %xmm3, %xmm2 movaps %xmm2, %xmm1 mulss %xmm3, %xmm1 addss LC0(%rip), %xmm1 mulss LC1(%rip), %xmm3 mulss %xmm3, %xmm1 mulss %xmm1, %xmm0 ret LFE230: .align 4,0x90 .globl __Z6rsqrt2fff __Z6rsqrt2fff: LFB228: mulss %xmm2, %xmm1 rsqrtss %xmm1, %xmm2 mulss %xmm2, %xmm1 mulss %xmm2, %xmm1 addss LC0(%rip), %xmm1 mulss LC1(%rip), %xmm2 mulss %xmm2, %xmm1 mulss %xmm1, %xmm0 ret LF
[Bug libgomp/67406] OMP SIMD cloning does not generate fma instruction for AVX2 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406 --- Comment #5 from vincenzo Innocente --- does not work... #pragma omp declare simd notinbranch float __attribute__ ((__target__ ("default"))) fma(float x,float y, float z); #pragma omp declare simd notinbranch float __attribute__ ((__target__ ("arch=haswell"))) fma(float x,float y, float z); void foo() { #pragma omp simd for (int i=0; i<1024; ++i) v0[i] = fma(v1[i],v2[i],v3[i]); } generates .L11: vmovss v3(%rbx), %xmm2 addq $4, %rbx vmovss v2-4(%rbx), %xmm1 vmovss v1-4(%rbx), %xmm0 call _Z15_Z3fmafff.ifuncfff vmovss %xmm0, v0-4(%rbx) cmpq $4096, %rbx jne .L11 i.e. dispatching, no vectorization...
[Bug libgomp/67406] OMP SIMD cloning does not generate fma instruction for AVX2 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406 --- Comment #4 from vincenzo Innocente --- #pragma omp declare simd notinbranch float __attribute__ ((__target__ ("default"))) fma(float x,float y, float z) { return x+y*z; } #pragma omp declare simd notinbranch float __attribute__ ((__target__ ("arch=haswell"))) fma(float x,float y, float z) { return x+y*z; } #pragma omp declare simd notinbranch float __attribute__ ((__target__ ("arch=bdver1"))) fma(float x,float y, float z) { return x+y*z; } seems to generate a real fat library c++ -Ofast -fopenmp -S simdCloning.cc; grep fmafff simdCloning.s .globl __Z3fmafff __Z3fmafff: .globl __Z3fmafff.arch_haswell __Z3fmafff.arch_haswell: .globl __Z3fmafff.arch_bdver1 __Z3fmafff.arch_bdver1: .globl __ZGVbN4vvv__Z3fmafff.arch_bdver1 __ZGVbN4vvv__Z3fmafff.arch_bdver1: .globl __ZGVcN8vvv__Z3fmafff.arch_bdver1 __ZGVcN8vvv__Z3fmafff.arch_bdver1: .globl __ZGVdN8vvv__Z3fmafff.arch_bdver1 __ZGVdN8vvv__Z3fmafff.arch_bdver1: .globl __ZGVbN4vvv__Z3fmafff.arch_haswell __ZGVbN4vvv__Z3fmafff.arch_haswell: .globl __ZGVcN8vvv__Z3fmafff.arch_haswell __ZGVcN8vvv__Z3fmafff.arch_haswell: .globl __ZGVdN8vvv__Z3fmafff.arch_haswell __ZGVdN8vvv__Z3fmafff.arch_haswell: .globl __ZGVbN4vvv__Z3fmafff __ZGVbN4vvv__Z3fmafff: .globl __ZGVcN8vvv__Z3fmafff __ZGVcN8vvv__Z3fmafff: .globl __ZGVdN8vvv__Z3fmafff __ZGVdN8vvv__Z3fmafff: have now to test that it uses the correct one!
[Bug libgomp/67406] New: OMP SIMD cloning does not generate fma instruction for AVX2 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406 Bug ID: 67406 Summary: OMP SIMD cloning does not generate fma instruction for AVX2 target Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libgomp Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch CC: jakub at gcc dot gnu.org Target Milestone: --- given cat simdCloning.cc #pragma omp declare simd notinbranch float fma(float x,float y, float z) { return x+y*z; } compiling with c++ -S -fopenmp -Ofast -Wall simdCloning.cc; cat simdCloning.s generates the same code for the AVX and AVX2 clones: __ZGVdN8vvv__Z3fmafff: LFB3: leaq 8(%rsp), %r10 LCFI5: andq $-32, %rsp vmulps %ymm2, %ymm1, %ymm1 pushq -8(%r10) pushq %rbp vaddps %ymm0, %ymm1, %ymm0 while I would have expected __ZGVdN8vvv__Z3fmafff: LFB3: leaq 8(%rsp), %r10 LCFI5: andq $-32, %rsp vfmadd231ps %ymm2, %ymm1, %ymm0 pushq -8(%r10) pushq %rbp This last code has been obtained compiling with -mfma. Unfortunately in this case ALL clones use avx2 instructions (so again the AVX and AVX2 clones are identical). btw: is there any reason why the AVX512 clone is not generated? I am using gcc version 6.0.0 20150801 (experimental) [trunk revision 226463] (GCC)
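A minimal sketch of how such a clone is meant to be used from a loop. The names muladd and apply are hypothetical (the report calls the function fma, which is renamed here to avoid clashing with the standard library's fma), and the caller takes plain pointers instead of global arrays. Without -fopenmp or -fopenmp-simd the pragmas are simply ignored and the code still computes the same values.

```cpp
// Ask the compiler to also emit SIMD clones of this function
// (only effective with -fopenmp or -fopenmp-simd).
#pragma omp declare simd notinbranch
float muladd(float x, float y, float z) { return x + y * z; }

// A simd loop calling the function; a vectorizing compiler may call
// one of the generated clones instead of the scalar version.
void apply(const float* a, const float* b, const float* c, float* out, int n) {
  #pragma omp simd
  for (int i = 0; i < n; ++i) out[i] = muladd(a[i], b[i], c[i]);
}
```

Inspecting the assembly (for example with -S -fopenmp -Ofast, as in the report) shows which clones were emitted and whether they use fused multiply-add.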
[Bug libgomp/67406] OMP SIMD cloning does not generate fma instruction for AVX2 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406 --- Comment #2 from vincenzo Innocente --- Is there any mechanism to tell gcc to generate the AVX2 clone using fma? I understand it reduces portability; still, at the moment I have to support mostly Intel platforms. For AMD, gcc suggests using avx128, so it would anyhow require a different library to exploit fma4.
[Bug c++/67335] New: [ICE] in compiling omp simd function with unused argument
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67335 Bug ID: 67335 Summary: [ICE] in compiling omp simd function with unused argument Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- cat ompsimd_t.cc #pragma omp declare simd notinbranch uniform(q) float bar(float x, float * q, int){ return q[0]+q[1]*x; } c++ -fopenmp -Wall -S ompsimd_t.cc ompsimd_t.cc: In function ‘__vector(4) float _Z3barfPfi.simdclone.0(float, float*, int)’: ompsimd_t.cc:4:1: internal compiler error: Segmentation fault: 11 } ^ ompsimd_t.cc:4:1: internal compiler error: Abort trap: 6 c++: internal compiler error: Abort trap: 6 (program cc1plus) gcc version 6.0.0 20150801 (experimental) [trunk revision 226463] (GCC)
[Bug tree-optimization/67326] New: [5.2/6.0 regression] -ftree-loop-if-convert-stores does not vectorize conditional assignment (anymore)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67326 Bug ID: 67326 Summary: [5.2/6.0 regression] -ftree-loop-if-convert-stores does not vectorize conditional assignment (anymore) Product: gcc Version: 6.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch Target Milestone: --- in 5.1 looks ok (according to http://gcc.godbolt.org) cat condBug.cc float v0[1024]; float v1[1024]; float v2[1024]; float v3[1024]; void condAssign1() { for(int i=0; i1024; ++i) v0[i] = (v2[i]v1[i]) ? v2[i]*v3[i] : v0[i]; } void condAssign2() { for(int i=0; i1024; ++i) v0[i] = (v2[i]v1[i]) ? v2[i]*v1[i] : v0[i]; } c++ -Ofast -fopt-info-vec -ftree-loop-if-convert-stores -S condBug.cc condBug.cc:7:3: note: loop vectorized condBug.cc:13:3: note: loop vectorized gcc version 4.9.3 (GCC) c++ -Ofast -fopt-info-vec -ftree-loop-if-convert-stores -S condBug.cc condBug.cc:13:17: note: loop vectorized with gcc version 6.0.0 20150801 (experimental) [trunk revision 226463] (GCC)
[Bug tree-optimization/63644] New: Kahan Summation with fast-math, pattern not always recognized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63644 Bug ID: 63644 Summary: Kahan Summation with fast-math, pattern not always recognized Product: gcc Version: 4.9.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch In the following example (compiled with -Ofast -std=c++11) the Kahan summation pattern is recognized in sum, not in counter; see http://goo.gl/aJn61B #include <cstdio> template<typename T> struct KahanSum { KahanSum(T is=0) : sum(is){} KahanSum<T> & operator+=(T a) { add(a); return *this;} void add(T a) { float x = a - eps; float s = sum + x; eps = (s-sum) - x; sum = s; } T result() const { return sum;} T sum; T eps=0; }; float a[1204]; float sum() { KahanSum<float> res; for (int i=0; i<1024; ++i) res+= a[i]; return res.result(); } float counter(int maxl) { float tenth=0.1f; KahanSum<float> sum = tenth; int n=0; while(n<maxl) { sum += tenth; ++n; // if (n<21 || n%36000==0) printf("%d %f %a\n",n,sum.result(),sum.result()); } // use eps to avoid optimization out float count = float(60*60*100*10); printf("\n\n%f %f %a\n\n",count,float(count*tenth),float(count*tenth)); return sum.result(); }
[Bug tree-optimization/63599] New: wrong branch optimization with Ofast in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63599 Bug ID: 63599 Summary: wrong branch optimization with Ofast in a loop Product: gcc Version: 5.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: vincenzo.innocente at cern dot ch given this code #include <x86intrin.h> typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t; inline float32x4_t atan(float32x4_t t) { constexpr float PIO4F = 0.7853981633974483096f; float32x4_t high = t > 0.4142135623730950f; auto z = t; float32x4_t ret={0.f,0.f,0.f,0.f}; // if all low no need to blend if ( _mm_movemask_ps(high) != 0) { z = ( t > 0.4142135623730950f ) ? (t-1.0f)/(t+1.0f) : t; ret = ( t > 0.4142135623730950f ) ? ret+PIO4F : ret; } /* polynomial removed */ return ret += z; } float32x4_t doAtan(float32x4_t z) { return atan(z);} float32x4_t va[1024]; float32x4_t vb[1024]; void computeV() { for (int i=0;i!=1024;++i) vb[i]=atan(va[i]); } compiling with -Ofast c++ -S -std=c++1y -Ofast bugmvmk.cc -march=nehalem; cat bugmvmk.s produces the following code, where the movmskps %xmm8, %edx does not protect the code in the if block... 
__Z8computeVv: LFB2512: movaps LC0(%rip), %xmm4 xorl %eax, %eax movaps LC1(%rip), %xmm7 leaq _va(%rip), %rcx movaps LC2(%rip), %xmm6 movaps LC3(%rip), %xmm5 .align 4,0x90 L10: movaps (%rcx,%rax), %xmm2 movaps %xmm4, %xmm8 movaps %xmm2, %xmm3 cmpltps %xmm2, %xmm8 movaps %xmm2, %xmm1 addps %xmm6, %xmm3 addps %xmm7, %xmm1 movmskps %xmm8, %edx andps %xmm5, %xmm8 rcpps %xmm3, %xmm0 mulps %xmm0, %xmm3 mulps %xmm0, %xmm3 addps %xmm0, %xmm0 subps %xmm3, %xmm0 mulps %xmm0, %xmm1 movaps %xmm2, %xmm0 cmpleps %xmm4, %xmm0 blendvps %xmm0, %xmm2, %xmm1 pxor %xmm0, %xmm0 testl %edx, %edx je L7 movaps %xmm8, %xmm0 L7: testl %edx, %edx je L9 movaps %xmm1, %xmm2 L9: addps %xmm0, %xmm2 leaq _vb(%rip), %rdx movaps %xmm2, (%rdx,%rax) addq $16, %rax cmpq $16384, %rax jne L10 ret while with O2 it is ok: __Z8computeVv: LFB2512: movaps LC0(%rip), %xmm4 xorl %eax, %eax movaps LC1(%rip), %xmm7 leaq _va(%rip), %rsi movaps LC2(%rip), %xmm6 leaq _vb(%rip), %rcx movaps LC3(%rip), %xmm5 .align 4,0x90 L7: movaps (%rsi,%rax), %xmm1 movaps %xmm4, %xmm0 pxor %xmm2, %xmm2 cmpltps %xmm1, %xmm0 movmskps %xmm0, %edx testl %edx, %edx je L6 movaps %xmm1, %xmm3 movaps %xmm1, %xmm2 addps %xmm6, %xmm2 addps %xmm7, %xmm3 divps %xmm2, %xmm3 movaps %xmm0, %xmm2 andps %xmm5, %xmm2 blendvps %xmm0, %xmm3, %xmm1 L6: addps %xmm2, %xmm1 movaps %xmm1, (%rcx,%rax) addq $16, %rax cmpq $16384, %rax jne L7 ret Note that the function not in the loop (doAtan) is ok with both O2 and Ofast
[Bug target/63599] wrong branch optimization with Ofast in a loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63599 --- Comment #2 from vincenzo Innocente <vincenzo.innocente at cern dot ch> --- I agree that the code produces correct results. It looks sub-optimal to me. I understand that with Ofast the sequence below will always be executed: andps %xmm5, %xmm8 rcpps %xmm3, %xmm0 mulps %xmm0, %xmm3 mulps %xmm0, %xmm3 addps %xmm0, %xmm0 subps %xmm3, %xmm0 mulps %xmm0, %xmm1 movaps %xmm2, %xmm0 cmpleps %xmm4, %xmm0 blendvps %xmm0, %xmm2, %xmm1 while with O2 it will not, and this generates a performance penalty for samples where the test is often false. (I tried to add __builtin_expect(x, false) with no effect.)
[Bug tree-optimization/56829] Feature request: generic builtin to support control flow in vectorized code (movemask, vec_any/all_*)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56829 --- Comment #2 from vincenzo Innocente <vincenzo.innocente at cern dot ch> --- just to add the OpenCL syntax and doc: https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/any.html
[Bug tree-optimization/50374] Support vectorization of min/max location pattern
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50374 vincenzo Innocente <vincenzo.innocente at cern dot ch> changed: What|Removed |Added Known to fail||4.9.1 --- Comment #27 from vincenzo Innocente <vincenzo.innocente at cern dot ch> --- Coming back to this old issue: any chance to see it implemented in the auto-vectorizer soon? Using extended vectors I manage to vectorize min_element as below. In principle the auto-vectorizer should be able to do the same starting from the loop in comment 3. typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t; typedef float __attribute__( ( vector_size( 16 ) , aligned(4) ) ) float32x4a4_t; typedef int __attribute__( ( vector_size( 16 ) ) ) int32x4_t; inline float32x4_t load(float const * x) { return *(float32x4a4_t const *)(x); } int minloc(float const * x, int N) { float32x4_t v0; int32x4_t index; auto M = 4*(N/4); for (int i=M; i<N; ++i) { v0[i-M] = x[i]; index[i]=i; } for (int i=N; i<M+4;++i) { v0[i-M] = x[0]; index[i]=0; } int32x4_t j = {0,1,2,3}; for (int i=0; i<M; i+=4) { decltype(auto) v = load(x+i); index = (v<v0) ? j : index; v0 = (v<v0) ? v : v0; j+=4; } auto k = 0; for (int i=1;i<4; ++i) if (v0[i]<v0[k]) k=i; return index[k]; } #include <iostream> #include <algorithm> #include <x86intrin.h> unsigned int taux=0; inline unsigned long long rdtscp() { return __rdtscp(&taux); } int main() { float x[1024]; for (int i=0; i<1024; ++i) x[i]= i%2 ? i : -i; for (int i = 0; i<10; ++i) { std::random_shuffle(x,x+1024); long long ts = -rdtscp(); int l1 = std::min_element(x+i,x+1024) - (x+i); ts +=rdtscp(); long long tv = -rdtscp(); int l2 = minloc(x+i,1024-i); tv +=rdtscp(); std::cout << "min is at " << l1 << ' ' << ts << std::endl; std::cout << "minloc " << l2 << ' ' << tv << std::endl; } return 0; } which results in a pretty good speed up: c++ -std=c++1y -Ofast minloc.cc -march=nehalem ./a.out ./a.out min is at 959 13780 minloc 959 2380 min is at 536 13680 minloc 536 4972 min is at 513 13648 minloc 513 1848 min is at 825 13640 minloc 825 1924 min is at 885 13628 minloc 885 1644 min is at 636 11252 minloc 636 1536 min is at 982 11240 minloc 982 1416 min is at 382 11228 minloc 382 1392 min is at 271 11216 minloc 271 1340 min is at 50 11204 minloc 50 1384
[Bug web/61744] New: misleading documentation about cast of extended vectors
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61744

    Bug ID: 61744
    Summary: misleading documentation about cast of extended vectors
    Product: gcc
    Version: 4.9.0
    Status: UNCONFIRMED
    Severity: normal
    Priority: P3
    Component: web
    Assignee: unassigned at gcc dot gnu.org
    Reporter: vincenzo.innocente at cern dot ch

At the very bottom of https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html one reads:

    It is possible to cast from one vector type to another, provided they are of the same size (in fact, you can also cast vectors to and from other datatypes of the same size).

I find this misleading, as the reader may think that the result of such a cast is similar to a C-style cast of each element, whereas it is a simple reinterpretation of the bit content (as with memcpy). I suggest adding:

    The result is a vector of the new type with the same bit content as the original.

One could even add ", not what one would expect from a C-style cast". Of course, adding a proper conversion builtin (see PR61731) would definitively solve the issue ;-)
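The distinction drawn in this report can be illustrated with a minimal sketch. The helper names (reinterpret_bits, convert_elements, same_bits, cast_demo) are made up for illustration and are not part of GCC; the sketch assumes a compiler that supports the vector_size extension.

```cpp
#include <cassert>
#include <cstring>

typedef int   __attribute__((vector_size(16))) int32x4_t;
typedef float __attribute__((vector_size(16))) float32x4_t;

// A vector cast keeps the 128 bits unchanged; only the type changes.
inline float32x4_t reinterpret_bits(int32x4_t v) { return (float32x4_t)v; }

// What a reader might wrongly expect from the cast: a per-element conversion.
inline float32x4_t convert_elements(int32x4_t v) {
  float32x4_t r = {0.f, 0.f, 0.f, 0.f};
  for (int i = 0; i < 4; ++i) r[i] = static_cast<float>(v[i]);
  return r;
}

// True when a and b hold exactly the same bytes.
template <typename A, typename B>
inline bool same_bits(A const& a, B const& b) {
  static_assert(sizeof(A) == sizeof(B), "size mismatch");
  return std::memcmp(&a, &b, sizeof(A)) == 0;
}

inline void cast_demo() {
  int32x4_t v = {1, 2, 3, 4};
  assert(same_bits(reinterpret_bits(v), v));   // cast behaves like memcpy
  assert(!same_bits(convert_elements(v), v));  // conversion changes the bits
  assert(convert_elements(v)[2] == 3.0f);      // but preserves the values
}
```

Calling cast_demo() checks that the cast is bit-preserving while the element-wise loop is value-preserving, which is the difference the proposed documentation wording is meant to spell out.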
[Bug tree-optimization/61747] New: min,max pattern not always properly optimized (for sse4 targets)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61747

    Bug ID: 61747
    Summary: min,max pattern not always properly optimized (for sse4 targets)
    Product: gcc
    Version: 4.9.0
    Status: UNCONFIRMED
    Severity: normal
    Priority: P3
    Component: tree-optimization
    Assignee: unassigned at gcc dot gnu.org
    Reporter: vincenzo.innocente at cern dot ch

I was expecting gcc to substitute a min/max instruction for (a </> b) ? a : b; even at -O2. This is not always the case: only -Ofast provides consistently optimized code (even if sometimes with a redundant move). -ffinite-math-only makes the code worse for vector arguments...

cat vmin.cc

    typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;

    template<typename V1> V1 vmax(V1 a, V1 b) { return (a>b) ? a : b; }
    template<typename V1> V1 vmin(V1 a, V1 b) { return (a<b) ? a : b; }

    float foo(float a, float b, float c) { return vmin(vmax(a,b),c); }
    float32x4_t foo(float32x4_t a, float32x4_t b, float32x4_t c) { return vmin(vmax(a,b),c); }

    template<typename Float> Float bart(Float a) {
      constexpr Float zero{0.f};
      constexpr Float it = zero+4.f;
      constexpr Float zt = zero-3.f;
      return vmin(vmax(a,zt),it);
    }

    float bar(float a) { return bart(a); }
    float32x4_t bar(float32x4_t a) { return bart(a); }

I see

c++ -std=c++11 -O2 -msse4.2 -s vmin.cc -S; cat vmin.s

    __Z3foofff:
    LFB2:
            maxss    %xmm1, %xmm0
            minss    %xmm2, %xmm0
            ret
    __Z3fooDv4_fS_S_:
    LFB3:
            maxps    %xmm1, %xmm0
            minps    %xmm2, %xmm0
            ret
    __Z3barf:
    LFB5:
            ucomiss  LC3(%rip), %xmm0
            jbe      L12
            minss    LC2(%rip), %xmm0
            ret
            .align 4,0x90
    L12:
            movss    LC3(%rip), %xmm0
            ret
    __Z3barDv4_f:
    LFB6:
            movaps   LC5(%rip), %xmm1
            movaps   %xmm0, %xmm2
            movaps   %xmm1, %xmm0
            cmpltps  %xmm2, %xmm0
            blendvps %xmm0, %xmm2, %xmm1
            movaps   LC6(%rip), %xmm2
            movaps   %xmm1, %xmm0
            cmpltps  %xmm2, %xmm0
            blendvps %xmm0, %xmm1, %xmm2
            movaps   %xmm2, %xmm0
            ret

-
c++ -std=c++11 -O2 -msse4.2 -s vmin.cc -S -ffinite-math-only; cat vmin.s

    __Z3foofff:
    LFB2:
            maxss    %xmm0, %xmm1
            minss    %xmm2, %xmm1
            movaps   %xmm1, %xmm0
            ret
    __Z3fooDv4_fS_S_:
    LFB3:
            maxps    %xmm1, %xmm0
            movaps   %xmm0, %xmm1
            movaps   %xmm2, %xmm0
            cmpleps  %xmm1, %xmm0
            blendvps %xmm0, %xmm2, %xmm1
            movaps   %xmm1, %xmm0
            ret
    __Z3barf:
    LFB5:
            maxss    LC2(%rip), %xmm0
            minss    LC3(%rip), %xmm0
            ret
    __Z3barDv4_f:
    LFB6:
            movaps   LC5(%rip), %xmm1
            movaps   %xmm0, %xmm2
            movaps   %xmm1, %xmm0
            cmpltps  %xmm2, %xmm0
            blendvps %xmm0, %xmm2, %xmm1
            movaps   LC6(%rip), %xmm2
            movaps   %xmm1, %xmm0
            cmpltps  %xmm2, %xmm0
            blendvps %xmm0, %xmm1, %xmm2
            movaps   %xmm2, %xmm0
            ret
    LFE6:

--
eventually

c++ -std=c++11 -Ofast -msse4.2 -s vmin.cc -S; cat vmin.s

    __Z3foofff:
    LFB2:
            maxss    %xmm0, %xmm1
            minss    %xmm2, %xmm1
            movaps   %xmm1, %xmm0
            ret
    __Z3fooDv4_fS_S_:
    LFB3:
            maxps    %xmm0, %xmm1
            minps    %xmm2, %xmm1
            movaps   %xmm1, %xmm0
            ret
    __Z3barf:
    LFB5:
            maxss    LC2(%rip), %xmm0
            minss    LC3(%rip), %xmm0
            ret
    __Z3barDv4_f:
    LFB6:
            maxps    LC5(%rip), %xmm0
            minps    LC6(%rip), %xmm0
            ret
[Bug tree-optimization/61747] min,max pattern not always properly optimized (for sse4 targets)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61747 --- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
> I think you need -fno-signed-zeros for the transformation to be valid.
Possible. But then is it the -O2 code that is wrong? In any case, adding -fno-signed-zeros makes no difference w/r/t -O2 alone.
[Bug tree-optimization/61747] min,max pattern not always properly optimized (for sse4 targets)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61747 --- Comment #4 from vincenzo Innocente vincenzo.innocente at cern dot ch --- I confirm that -ffinite-math-only -fno-signed-zeros is equivalent to -Ofast in this case. So do we conclude that the code generated at -O2 is wrong, and that -ffinite-math-only -fno-signed-zeros is required to trigger min/max?
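As background to the question about -ffinite-math-only and -fno-signed-zeros, here is a small sketch of why the select expression is sensitive to NaN and to the sign of zero. It uses the scalar form of the vmin expression from the report; sel_min and sel_min_demo are illustrative names, and the checks assume default IEEE semantics (no -ffast-math or -Ofast).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Same expression as vmin in the report, written for scalars.
inline float sel_min(float a, float b) { return (a < b) ? a : b; }

inline void sel_min_demo() {
  const float nan = std::numeric_limits<float>::quiet_NaN();
  // Any ordered comparison with NaN is false, so the second operand wins.
  assert(sel_min(nan, 1.0f) == 1.0f);
  assert(std::isnan(sel_min(1.0f, nan)));
  // +0 and -0 compare equal, so again the second operand is returned.
  assert(std::signbit(sel_min(0.0f, -0.0f)));
  assert(!std::signbit(sel_min(-0.0f, 0.0f)));
}
```

Because the result depends on operand order in exactly these two cases, a compiler that wants to reorder or canonicalize the operands needs to know that NaNs and signed zeros do not matter, which is consistent with the two flags discussed in the comments above.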
[Bug target/61731] New: Feature request: generic builtin for conversion operator among vectors
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61731

    Bug ID: 61731
    Summary: Feature request: generic builtin for conversion operator among vectors
    Product: gcc
    Version: 4.9.0
    Status: UNCONFIRMED
    Severity: enhancement
    Priority: P3
    Component: target
    Assignee: unassigned at gcc dot gnu.org
    Reporter: vincenzo.innocente at cern dot ch

gcc lacks a mechanism to efficiently convert (in the sense of a C-style cast of each element) extended vectors between different element types. clang has recently introduced __builtin_convertvector (see a few lines below http://clang.llvm.org/docs/LanguageExtensions.html#langext-builtin-shufflevector). I would like to ask whether it is possible to implement the same feature in gcc. A syntax agreed with clang would be welcome.
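For reference, a minimal sketch of what the requested builtin looks like in use, following the clang syntax (the same form appears in the later report PR114943, so it is assumed here that the compiler accepts __builtin_convertvector). The wrapper names widen, truncate_to_int and convert_demo are illustrative only.

```cpp
#include <cassert>

typedef float  __attribute__((vector_size(16))) float32x4_t;
typedef double __attribute__((vector_size(32))) float64x4_t;
typedef int    __attribute__((vector_size(16))) int32x4_t;

// Element-wise float -> double conversion (same number of lanes, wider type).
inline float64x4_t widen(float32x4_t v) {
  return __builtin_convertvector(v, float64x4_t);
}

// Element-wise float -> int conversion, truncating like a C-style cast.
inline int32x4_t truncate_to_int(float32x4_t v) {
  return __builtin_convertvector(v, int32x4_t);
}

inline void convert_demo() {
  float32x4_t f = {1.5f, -2.5f, 3.0f, 4.25f};
  float64x4_t d = widen(f);
  assert(d[0] == 1.5 && d[1] == -2.5 && d[2] == 3.0 && d[3] == 4.25);
  int32x4_t n = truncate_to_int(f);
  assert(n[0] == 1 && n[1] == -2 && n[2] == 3 && n[3] == 4);
}
```

The source and destination types must have the same number of elements; the element size may differ, which is what distinguishes this from a plain vector cast.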
[Bug tree-optimization/56829] Feature request: generic builtin to support control flow in vectorized code (movemask, vec_any/all_*)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56829 vincenzo Innocente vincenzo.innocente at cern dot ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Feature request: generic    |Feature request: generic
                   |builtin for movemask        |builtin to support control
                   |                            |flow in vectorized code
                   |                            |(movemask,
                   |                            |vec_any/all_*)

--- Comment #1 from vincenzo Innocente vincenzo.innocente at cern dot ch --- As gcc 4.9 is now out, I would like to come back to this request.

In further support of it, I have found this interesting talk http://llvm.org/devmtg/2012-04-12/Slides/Ralf_Karrenberg.pdf which, from slide 17 on, addresses the issue of divergent control flow and its implementation on CPU (in the context of OpenCL, though the argument is fully valid for other types of implementation), including a plea for a way to express predication in IR on slide 25. For a general discussion and implementation see also http://www.mcs.anl.gov/publication/introducing-control-flow-vectorized-code and references therein.

My preference is still for a builtin that converts a mask into an integer (movemask behavior). One can then use __builtin_popcount, __builtin_ctz etc. to turn it into a bool. For altivec, gcc implements the vec_any_* and vec_all_* families of functions, which combine the comparison and the mask-to-int conversion. This is a possible alternative syntax.

My understanding is that neon does not support any form of predication in its instruction set (see http://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon for instance). This is an even more compelling reason for the compiler to provide a generic builtin!
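To make the request concrete, here is a portable sketch of the movemask behavior written with plain vector extensions. The names mask_to_int, any_lt, all_lt and movemask_demo are hypothetical (no such builtin is assumed to exist); the scalar loop only emulates what a generic builtin would be expected to map to a single instruction on targets that have one.

```cpp
#include <cassert>

typedef float __attribute__((vector_size(16))) float32x4_t;
typedef int   __attribute__((vector_size(16))) int32x4_t;

// Hypothetical helper: pack the sign bit of each lane into bits 0..3.
// A vector comparison yields -1 (all bits set) for true lanes and 0 for false.
inline int mask_to_int(int32x4_t m) {
  int r = 0;
  for (int i = 0; i < 4; ++i) r |= (m[i] < 0 ? 1 : 0) << i;
  return r;
}

// Control-flow predicates built on top of the integer mask.
inline bool any_lt(float32x4_t a, float32x4_t b) { return mask_to_int(a < b) != 0; }
inline bool all_lt(float32x4_t a, float32x4_t b) { return mask_to_int(a < b) == 0xF; }

inline void movemask_demo() {
  float32x4_t a = {1.f, 5.f, 3.f, 7.f};
  float32x4_t b = {2.f, 4.f, 4.f, 6.f};
  int m = mask_to_int(a < b);           // lanes 0 and 2 are true
  assert(m == 5);
  assert(__builtin_popcount(m) == 2);   // number of true lanes
  assert(__builtin_ctz(m) == 0);        // index of the first true lane
  assert(any_lt(a, b) && !all_lt(a, b));
}
```

With the mask available as an integer, any/all tests and lane counting reduce to ordinary integer operations, which is the usage pattern described in the comment.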
[Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796 --- Comment #5 from vincenzo Innocente vincenzo.innocente at cern dot ch --- So with the latest 4.9, gcc version 4.10.0 20140611 (experimental) [trunk revision 211467] (GCC), the situation has not changed much (the scalar version is now faster!). I think that the cost of gather instructions is still under-estimated.
[Bug c++/61381] constexpr non captured by template lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61381 --- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch --- I am still at trunk revision 210507; I will update and test again.
[Bug c++/61381] constexpr non captured by template lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61381 vincenzo Innocente vincenzo.innocente at cern dot ch changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #3 from vincenzo Innocente vincenzo.innocente at cern dot ch --- Confirmed: it compiles with [trunk revision 211189]. A backport to 4.9.1 would be appreciated.
[Bug c++/61381] New: constexpr non captured by template lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61381

    Bug ID: 61381
    Summary: constexpr non captured by template lambda
    Product: gcc
    Version: 4.9.0
    Status: UNCONFIRMED
    Severity: normal
    Priority: P3
    Component: c++
    Assignee: unassigned at gcc dot gnu.org
    Reporter: vincenzo.innocente at cern dot ch

cat ceLambda.cc

    struct Bar { constexpr Bar(float i):f(i){}; float f;};

    float foo1(float x) {
      constexpr Bar z{0};
      auto f = [=](auto a, auto b) -> Bar { return z;};

      return f(x,x).f;
    }

    float foo2(float x) {
      const Bar z{0};
      auto f = [=](auto a, auto b) -> Bar { return z;};

      return f(x,x).f;
    }

    float foo3(float x) {
      constexpr Bar z{0};
      auto f = [=](float a, float b) -> Bar { return z;};

      return f(x,x).f;
    }

b-d-128-141-131-42:ctest innocent$ c++ -O2 -std=c++1y -S ceLambda.cc

    ceLambda.cc: In instantiation of ‘foo1(float)::<lambda(auto:1, auto:2)> [with auto:1 = float; auto:2 = float]’:
    ceLambda.cc:7:16:   required from here
    ceLambda.cc:5:49: error: ‘z’ was not declared in this scope
       auto f = [=](auto a, auto b) -> Bar { return z;};
                                                    ^
[Bug tree-optimization/61338] New: too many permutation in a vectorized reverse loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61338

    Bug ID: 61338
    Summary: too many permutation in a vectorized reverse loop
    Product: gcc
    Version: 4.9.0
    Status: UNCONFIRMED
    Severity: normal
    Priority: P3
    Component: tree-optimization
    Assignee: unassigned at gcc dot gnu.org
    Reporter: vincenzo.innocente at cern dot ch

In this example gcc generates 4 permutations for foo (while none is required). On the positive side, the code for bar (which is a more realistic use case) seems optimal.

    float x[1024];
    float y[1024];
    float z[1024];

    void foo() {
      for (int i=0; i<512; ++i)
        x[1023-i] += y[1023-i]*z[512-i];
    }

    void bar() {
      for (int i=0; i<512; ++i)
        x[1023-i] += y[i]*z[i+512];
    }

c++ -Ofast -march=haswell -S revloop.cc; cat revloop.s

    __Z3foov:
    LFB0:
            vmovdqa     LC0(%rip), %ymm2
            xorl        %eax, %eax
            leaq        4064+_x(%rip), %rdx
            leaq        4064+_y(%rip), %rsi
            leaq        2020+_z(%rip), %rcx
            .align 4,0x90
    L2:
            vpermd      (%rdx,%rax), %ymm2, %ymm0
            vpermd      (%rcx,%rax), %ymm2, %ymm1
            vpermd      (%rsi,%rax), %ymm2, %ymm3
            vfmadd231ps %ymm1, %ymm3, %ymm0
            vpermd      %ymm0, %ymm2, %ymm0
            vmovaps     %ymm0, (%rdx,%rax)
            subq        $32, %rax
            cmpq        $-2048, %rax
            jne         L2
            vzeroupper
            ret
    LFE0:
            .section __TEXT,__text_cold,regular,pure_instructions
    LCOLDE1:
            .text
    LHOTE1:
            .section __TEXT,__text_cold,regular,pure_instructions
    LCOLDB2:
            .text
    LHOTB2:
            .align 4,0x90
            .globl __Z3barv
    __Z3barv:
    LFB1:
            vmovdqa     LC0(%rip), %ymm1
            leaq        2048+_z(%rip), %rdx
            leaq        _y(%rip), %rcx
            leaq        4064+_x(%rip), %rax
            leaq        4096+_z(%rip), %rsi
            .align 4,0x90
    L6:
            vmovaps     (%rdx), %ymm2
            addq        $32, %rdx
            vpermd      (%rax), %ymm1, %ymm0
            addq        $32, %rcx
            vfmadd231ps -32(%rcx), %ymm2, %ymm0
            subq        $32, %rax
            vpermd      %ymm0, %ymm1, %ymm0
            vmovaps     %ymm0, 32(%rax)
            cmpq        %rsi, %rdx
            jne         L6
            vzeroupper
            ret
    LFE1:
[Bug tree-optimization/61338] too many permutation in a vectorized reverse loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61338 --- Comment #1 from vincenzo Innocente vincenzo.innocente at cern dot ch --- If I write the loop in reverse

    void foo2() {
      for (int i=511; i>=0; --i)
        x[1023-i] += y[1023-i]*z[512-i];
    }

it is OK:

    __Z4foo2v:
    LFB1:
            leaq        2048+_x(%rip), %rdx
            xorl        %eax, %eax
            leaq        4+_z(%rip), %rsi
            leaq        2048+_y(%rip), %rcx
            .align 4,0x90
    L6:
            vmovaps     (%rdx,%rax), %ymm1
            vmovups     (%rsi,%rax), %ymm0
            vfmadd132ps (%rcx,%rax), %ymm1, %ymm0
            vmovaps     %ymm0, (%rdx,%rax)
            addq        $32, %rax
            cmpq        $2048, %rax
            jne         L6
            vzeroupper
            ret
[Bug middle-end/49363] [feature request] multiple target attribute (and runtime dispatching based on cpuid)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49363 --- Comment #23 from vincenzo Innocente vincenzo.innocente at cern dot ch --- Which syntax? I want to reuse the same code for the various architectures and let gcc deal with the vectorization details. The best I have managed to do to share code is something like this:

    namespace {
      inline float _sum0(float const * x, float const * y, float const * z) {
        float sum=0;
        for (int i=0; i!=1024; ++i)
          sum += z[i]+x[i]*y[i];
        return sum;
      }
    }

    float __attribute__ ((__target__ ("arch=haswell")))
    sum1(float const * x, float const * y, float const * z) {
      return _sum0(x,y,z);
    }

    float __attribute__ ((__target__ ("arch=nehalem")))
    sum1(float const * x, float const * y, float const * z) {
      return _sum0(x,y,z);
    }

    //-- this for instance does not work (produces code only for haswell)
    float __attribute__ ( (__target__("arch=nehalem"), __target__("arch=haswell")) )
    sum0(float const * x, float const * y, float const * z) {
      float sum=0;
      for (int i=0; i!=1024; ++i)
        sum += z[i]+x[i]*y[i];
      return sum;
    }
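For comparison, a sketch of the "write the loop once" form, assuming a later GCC that provides the target_clones attribute and an x86-64 Linux system with ifunc support; neither is available in the setup described in this comment. The macro SUM_CLONES and the names sum_clones and sum_clones_demo are illustrative, and on other platforms the macro expands to nothing so that the code still builds as a single version.

```cpp
#include <cassert>

// Assumption: target_clones is only enabled where it is known to be supported.
#if defined(__x86_64__) && defined(__linux__) && defined(__GNUC__)
#define SUM_CLONES \
  __attribute__((target_clones("arch=haswell", "arch=nehalem", "default")))
#else
#define SUM_CLONES
#endif

// One source body; the compiler emits one clone per listed target
// plus a resolver that picks a clone at load time.
SUM_CLONES
float sum_clones(float const* x, float const* y, float const* z) {
  float sum = 0;
  for (int i = 0; i != 1024; ++i) sum += z[i] + x[i] * y[i];
  return sum;
}

inline void sum_clones_demo() {
  static float x[1024], y[1024], z[1024];
  double ref = 0;
  // Small integer values keep every partial sum exact in float,
  // so all clones must agree bit for bit with the reference.
  for (int i = 0; i != 1024; ++i) {
    x[i] = float(i % 7);
    y[i] = float(i % 5);
    z[i] = 1.0f;
    ref += 1.0 + double(i % 7) * double(i % 5);
  }
  assert(sum_clones(x, y, z) == float(ref));
}
```

This keeps a single function body, as requested in the comment, at the cost of relying on a dispatch mechanism that is platform specific.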