[Bug target/114943] New: X86 AVX2: inefficient code generated to convert SIMD Vectors

2024-05-04 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114943

Bug ID: 114943
   Summary: X86 AVX2: inefficient code generated to convert SIMD
Vectors
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

In the example below (see https://godbolt.org/z/qnfT4fE5G ),
covert and covert3 produce code that looks inefficient to me compared with covert2
(and with clang) for target x86-64-v3.

#define VECTOR_EXT(N) __attribute__((vector_size(N)))
typedef float VECTOR_EXT(16) float32x4_t;
typedef double VECTOR_EXT(32) float64x4_t;

float32x4_t f1,f2,f3,f4,f;
float64x4_t d1,d2,d3,d4,d;


void covert() {
  for (int i=0;i<4;++i) {
    d1[i] = f1[i];
    d2[i] = f2[i];
    d3[i] = f3[i];
    d4[i] = f4[i];
  }
}

void covert2() {
  for (int i=0;i<4;++i)
    d1[i] = f1[i];
  for (int i=0;i<4;++i)
    d2[i] = f2[i];
  for (int i=0;i<4;++i)
    d3[i] = f3[i];
  for (int i=0;i<4;++i)
    d4[i] = f4[i];
}



void covert3() {
  d1 = __builtin_convertvector(f1,float64x4_t);
}

[Bug target/114484] #include <xmmintrin.h> changes ::abs in std::abs

2024-03-26 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

--- Comment #9 from vincenzo Innocente  ---
We observe that including xmmintrin.h changes the behaviour of some code,
notably abs(x) when x is float or double.
This also depends on the platform, as xmmintrin.h is x86_64 specific.
Yes, it has been like that for 20 years: people always wondered why abs(x) was
behaving differently in different parts of the code, and now they ask why it
behaves differently on x86_64 and ARM.
The workaround is obvious: use std::abs.

I personally find it very uncomfortable that including xmmintrin.h (even
indirectly) changes the behaviour of "abs(x)".

If everybody on the GCC side is comfortable with this situation we will just take
note and try to be stricter with visual code inspection.

[Bug target/114484] #include <xmmintrin.h> changes ::abs in std::abs

2024-03-26 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

--- Comment #4 from vincenzo Innocente  ---
in C++ one is supposed to #include <cstdlib>,
not <stdlib.h>

I do not think that there is an explicit version of C++ headers for the
intrinsics that avoids the conflicts between C and C++.

[Bug c++/114484] #include <xmmintrin.h> changes ::abs in std::abs

2024-03-26 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

--- Comment #2 from vincenzo Innocente  ---
*** Bug 114483 has been marked as a duplicate of this bug. ***

[Bug c++/114483] #include <xmmintrin.h> changes ::abs in std::abs

2024-03-26 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114483

vincenzo Innocente  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #1 from vincenzo Innocente  ---
please close this

*** This bug has been marked as a duplicate of bug 114484 ***

[Bug c++/114484] #include <xmmintrin.h> changes ::abs in std::abs

2024-03-26 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

--- Comment #1 from vincenzo Innocente  ---
xmmintrin.h
includes mm_malloc.h
which 
#include <stdlib.h>
which
using std::abs;
(among others)


see
https://godbolt.org/z/cxo65rnr9

or this excerpt from c++ -E dump
```
# 32
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/xmmintrin.h"
2 3 4


# 1
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/mm_malloc.h"
1 3 4
# 27
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/mm_malloc.h"
3 4
# 1
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/stdlib.h"
1 3 4
# 36
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/stdlib.h"
3 4
# 1
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/cstdlib"
1 3 4
# 39
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/cstdlib"
3 4

# 40
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/cstdlib"
3
# 37
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/stdlib.h"
2 3 4

using std::abort;
using std::atexit;
using std::exit;


  using std::at_quick_exit;


  using std::quick_exit;




using std::div_t;
using std::ldiv_t;

using std::abs;
using std::atof;
using std::atoi;
using std::atol;
using std::bsearch;
using std::calloc;
using std::div;
using std::free;
using std::getenv;
using std::labs;
using std::ldiv;
using std::malloc;

using std::mblen;
using std::mbstowcs;
using std::mbtowc;

using std::qsort;
using std::rand;
using std::realloc;
using std::srand;
using std::strtod;
using std::strtol;
using std::strtoul;
using std::system;

using std::wcstombs;
using std::wctomb;
# 28
"/data/cmssw/el9_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/lib/gcc/x86_64-redhat-linux-gnu/12.3.1/include/mm_malloc.h"
2 3 4
```

[Bug c++/114484] New: #include <xmmintrin.h> changes ::abs in std::abs

2024-03-26 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114484

Bug ID: 114484
   Summary: #include <xmmintrin.h> changes ::abs in std::abs
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

[Bug c++/114483] New: #include <xmmintrin.h> changes ::abs in std::abs

2024-03-26 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114483

Bug ID: 114483
   Summary: #include <xmmintrin.h> changes ::abs in std::abs
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

[Bug tree-optimization/114363] inconsistent optimization of pow(x,2)+pow(y,2)

2024-03-16 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114363

--- Comment #4 from vincenzo Innocente  ---
Thanks Harald, I missed the point that float z = pow(double(x),2) and
float z = x*x would indeed produce exactly the same result, while in all other
cases of course not.

[Bug tree-optimization/114363] New: inconsistent optimization of pow(x,2)+pow(y,2)

2024-03-16 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114363

Bug ID: 114363
   Summary: inconsistent optimization of pow(x,2)+pow(y,2)
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

While pow(x,2) is optimized into x*x (float x),
in pow(x,2)+pow(y,2) x and y are first promoted to double,
which I find inconsistent.

see
https://godbolt.org/z/rYfoaxr89

[Bug libstdc++/112649] New: [c++23] in presence of inline functions and debug-info stacktrace reports the deepest callee

2023-11-21 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112649

Bug ID: 112649
   Summary: [c++23] in presence of inline functions and debug-info
stacktrace reports the deepest callee
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

Created attachment 56657
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56657&action=edit
a small demo demonstrating the description.

Feature or defect?
Or a missing feature in std::stacktrace...

What I find disturbing is that the "symbol name" is different for the very same
pc depending on whether the program has been compiled with "-g" or not:
with debug-info it is set to the deepest callee, without it to the outermost
caller.

Maybe it is an issue for the libstd committee?

DEMO:
compile the attached demo program with
c++ -std=c++23 stackTraceDemo.cpp -lstdc++exp -O2 -DINCLUDE='' -g
and run it,
then repeat without -g.
One can also run gdb to compare with the demo output.
```
gdb ./a.out
b instrumentedFunc
run
where
```

Details:
in libstdc++-v3/src/c++23/stacktrace.cc
  bool
  stacktrace_entry::_Info::_M_populate(native_handle_type pc)
  {
    auto cb = [](void* self, uintptr_t, const char* filename, int lineno,
                 const char* function) -> int
    {
      auto& info = *static_cast<_Info*>(self);
      info._M_set_desc(function);
      info._M_set_file(filename);
      if (info._M_line)
        *info._M_line = lineno;
      return function != nullptr;
    };
    const auto state = init();
    if (::__glibcxx_backtrace_pcinfo(state, pc, +cb, err_handler, this))
      return true;

according to the documentation of __glibcxx_backtrace_pcinfo
/* Given PC, a program counter in the current program, call the
   callback function with filename, line number, and function name
   information.  This will normally call the callback function exactly
   once.  However, if the PC happens to describe an inlined call, and
   the debugging information contains the necessary information, then
   this may call the callback function multiple times.  This will make
   at least one call to either CALLBACK or ERROR_CALLBACK.  This
   returns the first non-zero value returned by CALLBACK, or 0.  */

From my tests, the last sentence means that the callback may be called again
only if it returns 0.
So in the current implementation, which returns non-zero as soon as a function
name is available, it is called just once even in the presence
of inline functions, and therefore the stacktrace entry is set to the
deepest callee.
If one waits until the last call (always returning "false") one is able to set
the entry to
the outermost caller or even record the full call chain (as GDB does).
This last option does not seem to fit the std::stacktrace interface.


--
here is the output of the demo (I prefer to print the stacktrace reversed)
# is from the stacktrace entry
>> is from __glibcxx_backtrace_pcinfo returning always "false"

[innocent@patatrack01 demos]$ c++ -std=c++23 stackTraceDemo.cpp -lstdc++exp -O2
-DINCLUDE=''; ./a.out
#0 0x  :0
#1 0x40164d _start :0
#2 0x7f4412f23d84 __libc_start_main :0
#3 0x40159a main :0
#4 0x401eeb func(int) :0
#5 0x401ab0 instrumentedFunc(int) :0
10
[innocent@patatrack01 demos]$ c++ -std=c++23 stackTraceDemo.cpp -lstdc++exp -O2
-DINCLUDE='' -g; ./a.out
#0 0x  :0
#1 0x40164d _start :0
#2 0x7ff80f90ed84 __libc_start_main :0
#3 0x40159a main
/data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:116
>> 1 main /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:116
#4 0x401eeb nestedFunc2(int)
/data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:101
>> 1 _Z11nestedFunc2i 
>> /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:101
>> 2 _Z10nestedFunci 
>> /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:106
>> 3 _Z4funci /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:112
#5 0x401ab0 instrumentedFunc(int)
/data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:91
>> 1 _Z16instrumentedFunci 
>> /data/user/innocent/MallocProfiler/demos/stackTraceDemo.cpp:91
10
[innocent@patatrack01 demos]$ gdb ./a.out
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-19.el8
...
Reading symbols from ./a.out...done.
(gdb) b instrumentedFunc
Breakpoint 1 at 0x401a90: file
/afs/cern.ch/work/i/innocent/public/w5/include/c++/14.0.0/bits/new_allocator.h,
line 88.
(gdb) run
Starting program: /data/user/innocent/MallocProfiler/demos/a.out
Breakpoint 1, instrumentedFunc (c=4) at
/afs/cern.ch/work/i/innocent/public/w5/include/c++/14.0.0/bits/new_allocator.h:88
88__new_allocator() _GLIBCXX_USE_NOEX

[Bug libstdc++/112348] [C++23] defect in struct hash>

2023-11-09 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112348

--- Comment #1 from vincenzo Innocente  ---
This patch works for me

diff --git a/libstdc++-v3/include/std/stacktrace
b/libstdc++-v3/include/std/stacktrace
index da0e48d3532..9a0d0b16068 100644
--- a/libstdc++-v3/include/std/stacktrace
+++ b/libstdc++-v3/include/std/stacktrace
@@ -797,7 +797,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
   size_t
   operator()(const basic_stacktrace<_Allocator>& __st) const noexcept
   {
-   hash __h;
+   hash __h;
size_t __val = _Hash_impl::hash(__st.size());
for (const auto& __f : __st)
  __val = _Hash_impl::__hash_combine(__h(__f), __val);

[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library

2023-11-05 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #12 from vincenzo Innocente  ---
confirm that the patch solves the issue

c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o
liba.so -ldl;c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L. -la
-Wl,-rpath=.; ./a.out
   0# nested_func2(int) at
/data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:63
   1# nested_func(int) at
/data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:93
   2# func(int) at
/data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:101
   3# main at /data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:106
   4# __libc_start_main at :0
   5# _start at :0
   6#

What the last empty entry is is a different story, I suppose (not an issue at
the moment).

Thanks again for the fast action

[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library

2023-11-03 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #8 from vincenzo Innocente  ---
Thanks Ian for the patch.
For testing I will need the full git diff (including the makefile itself as my 
autoconf is not compatible with gcc14).

Backports down to gcc12 will be appreciated.
Could you please notify here when the patch enters the various main branches?

[Bug libstdc++/112348] New: [C++23] defect in struct hash>

2023-11-02 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112348

Bug ID: 112348
   Summary: [C++23] defect in struct
hash>
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

gcc version 14.0.0 20231028 (experimental) [master r14-4988-g5d2a360f0a5] (GCC)

 auto k = std::hash()(std::stacktrace::current());

does not compile for me, with the error

In instantiation of 'std::size_t std::hash
>::operator()(const std::basic_stacktrace<_Allocator>&) const [with _Allocator
= std::allocator; std::size_t = long unsigned int]':
testStacktrace.cpp:39:41:   required from here
   39 |auto k = std::hash()(std::stacktrace::current());
  | ^~~~
/afs/cern.ch/work/i/innocent/public/w5/include/c++/14.0.0/stacktrace:803:49:
error: no match for call to '(std::hash) (const
std::stacktrace_entry&)'
  803 |   __val = _Hash_impl::__hash_combine(__h(__f), __val);
  |  ~~~^


changed
// hash __h;
hash __h;

and it compiled.
(I suspect __f.native_handle() would work as well)

Surprised it passed tests.

[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library

2023-11-01 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #6 from vincenzo Innocente  ---
Sorry, made the (almost) full exercise:
read the doc in 
https://en.cppreference.com/w/cpp/utility/stacktrace_entry
and the code in stacktrace header file and in
libstdc++-v3/src/c++23/stacktrace.cc
(have not read the specs in the C++23 standard)
indeed the entry implementation has just the handle as data member
and the details are retrieved when the "Query" methods are called.
This appears to happen in
stacktrace_entry::_Info::_M_populate(native_handle_type pc)
which in turn calls
::__glibcxx_backtrace_pcinfo
if this fails it calls
::__glibcxx_backtrace_syminfo

so most probably the issue is in this last function unless there is a problem
with the logic in _M_populate that I failed to identify.

[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library

2023-11-01 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #5 from vincenzo Innocente  ---
so if I add the following to
std::cout << std::stacktrace::current() << '\n';
I get what is needed:
   Dl_info dlinfo;
   for (auto & entry : std::stacktrace::current() ) {
     dladdr((const void*)(entry.native_handle()), &dlinfo);
     std::cout << dlinfo.dli_sname << ' ' << dlinfo.dli_fname <<'\n';
   }

 c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o
liba.so -ldl ; c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L.
-la -Wl,-rpath=. ; ./a.out
   0#  at :0
   1#  at :0
   2# func(int) at
/data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:44
   3# main at /data/user/innocent/MallocProfiler/tests/testStacktrace.cpp:49
   4#  at :0
   5# _start at :0
   6#

_Z12nested_func2i ./liba.so
_Z11nested_funci ./liba.so

of course not de-mangled

so is it a feature or a defect?

I'm not sure how the implementation works (I did not look at the code).
dladdr can be slow and may "hang" in some situations,
so it would be useful to have an option so that the "name" is not immediately
resolved,
and a function that returns the name from the native_handle
"asynchronously".

[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library

2023-10-31 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

--- Comment #4 from vincenzo Innocente  ---
intel x86_64 
uname -a
Linux patatrack01 4.18.0-477.13.1.el8_8.x86_64 #1 SMP Thu May 18 10:27:05 EDT
2023 x86_64 x86_64 x86_64 GNU/Linux

boost::backtrace works
can provide example

[Bug libbacktrace/112263] [C++23] std::stacktrace does not identify symbols in shared library

2023-10-30 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

vincenzo Innocente  changed:

   What|Removed |Added

 CC||ian at gcc dot gnu.org
  Component|libstdc++   |libbacktrace

--- Comment #2 from vincenzo Innocente  ---
I suspect libbacktrace, even though I have no way to test it outside
std::stacktrace

[Bug libstdc++/112263] New: [C++23] std::stacktrace does not identify symbols in shared library

2023-10-28 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112263

Bug ID: 112263
   Summary: [C++23] std::stacktrace does not identify symbols in
shared library
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

using 
gcc version 14.0.0 20231028 (experimental) [master r14-4988-g5d2a360f0a5] (GCC)
that contains the fix for #111936

This simple example [1],
when run as a single executable, prints all symbols in the stacktrace;
when the nested functions are in a shared library their names are missing.
c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -DINLIB ; ./a.out
   0# nested_func2(int) at
/afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:13
   1# nested_func(int) at
/afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:18
   2# func(int) at
/afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:26
   3# main at /afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:31
   4#  at :0
   5# _start at :0
   6#


c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o
liba.so ; c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L. -la
-Wl,-rpath=. ; ./a.out
   0#  at :0
   1#  at :0
   2# func(int) at
/afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:26
   3# main at /afs/cern.ch/user/i/innocent/public/ctest/testStacktrace.cpp:31
   4#  at :0
   5# _start at :0
   6#



[1]
cat testStacktrace.cpp
//compile and run with either
// c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -DINLIB; ./a.out
// or
// c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINLIB -fpic -shared -o
liba.so;c++ -std=c++23 testStacktrace.cpp -lstdc++exp -g -DINMAIN -L. -la
-Wl,-rpath=.; ./a.out
//
#include <iostream>
#include <stacktrace>


#ifdef INLIB
int nested_func2(int c)
{
    std::cout << std::stacktrace::current() << '\n';
    return c + 1;
}
int nested_func(int c)
{
    return nested_func2(c + 1);
}
#else
int nested_func(int c);
#endif
#ifdef INMAIN
int func(int b)
{
    return nested_func(b + 1);
}

int main()
{
    std::cout << func(777);
    return 0;
}
#endif

[Bug libstdc++/111936] std::stacktrace cannot be used in a shared library

2023-10-24 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

--- Comment #9 from vincenzo Innocente  ---
Thanks for the second patch.
I was indeed struggling with autoconf versions (1.15 vs 1.16)


Any chance to backport to gcc12 (our current production version)?

[Bug libstdc++/111936] std::stacktrace cannot be used in a shared library

2023-10-24 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

--- Comment #7 from vincenzo Innocente  ---
Not explicitly in the src tree:
I only ran configure in the build directory.
What do I need to run in the src tree?

[Bug libstdc++/111936] std::stacktrace cannot be used in a shared library

2023-10-24 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

--- Comment #5 from vincenzo Innocente  ---
My bad: I have not used archive libraries for a long time and forgot about the
link-order rule.

The issue is indeed missing -fPIC.
Thanks for the fast action.

I applied the patch but it does not seem to be sufficient.

If I understood correctly, this is where the ar lib is built:
ar  rc .libs/libstdc++_libbacktrace.a  std_stacktrace-atomic.o
std_stacktrace-backtrace.o std_stacktrace-dwarf.o std_stacktrace-fileline.o
std_stacktrace-posix.o std_stacktrace-sort.o std_stacktrace-simple.o
std_stacktrace-state.o std_stacktrace-cp-demangle.o std_stacktrace-elf.o
std_stacktrace-mmapio.o std_stacktrace-mmap.o

but those are the files compiled w/o -fPIC;
those compiled with -fPIC are under .libs itself...

so I did manually
```
ar rc .libs/libstdc++_libbacktrace.a .libs/*.o ../c++23/stacktrace.o

```

and then locally
c++ -O3 -pthread -fPIC -shared -std=c++23 getStacktrace.cc
/data/user/innocent/gcc_build/x86_64-pc-linux-gnu/libstdc++-v3/src/libbacktrace/.libs/libstdc++_libbacktrace.a
-g -o mallocHook.so


and it runs
setenv LD_PRELOAD ./mallocHook.so ; ./a.out ; unsetenv LD_PRELOAD
asked 4 at ###std::__new_allocator::allocate(unsigned long, void
const*)#std::allocator_traits
>::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator
> >, int const&)#std::vector >::push_back(int
const&)#go(int)#main##_start##
asked 8 at ###std::__new_allocator::allocate(unsigned long, void
const*)#std::allocator_traits
>::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator
> >, int const&)#std::vector >::push_back(int
const&)#go(int)#main##_start##
asked 16 at ###std::__new_allocator::allocate(unsigned long, void
const*)#std::allocator_traits
>::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator
> >, int const&)#std::vector >::push_back(int
const&)#go(int)#main##_start##
asked 32 at ###std::__new_allocator::allocate(unsigned long, void
const*)#std::allocator_traits
>::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator
> >, int const&)#std::vector >::push_back(int
const&)#go(int)#main##_start##
asked 64 at ###std::__new_allocator::allocate(unsigned long, void
const*)#std::allocator_traits
>::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator
> >, int const&)#std::vector >::push_back(int
const&)#go(int)#main##_start##
asked 128 at ###std::__new_allocator::allocate(unsigned long, void
const*)#std::allocator_traits
>::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator
> >, int const&)#std::vector >::push_back(int
const&)#go(int)#main##_start##
asked 256 at ###std::__new_allocator::allocate(unsigned long, void
const*)#std::allocator_traits
>::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator
> >, int const&)#std::vector >::push_back(int
const&)#go(int)#main##_start##
asked 512 at ###std::__new_allocator::allocate(unsigned long, void
const*)#std::allocator_traits
>::allocate(std::allocator&, unsigned long)#void std::vector >::_M_realloc_insert(__gnu_cxx::__normal_iterator
> >, int const&)#std::vector >::push_back(int
const&)#go(int)#main##_start##

[Bug c++/111934] ICE internal compiler error: in discriminator_for_local_entity, at cp/mangle.cc:2065

2023-10-24 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111934

--- Comment #3 from vincenzo Innocente  ---
with
gcc version 14.0.0 20231024 (experimental) [master r14-4877-g724badcadf8] (GCC)
I get the same ICE.

Please note that one needs to include "iostream"
(in my test compile with "-DICE")
to trigger the ICE.
w/o it just emits the syntax error as one would expect.

[Bug libstdc++/111936] std::stacktrace cannot be used in a shared library

2023-10-23 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

--- Comment #1 from vincenzo Innocente  ---
here is a minimal malloc hook that I would like to use
[innocent@patatrack01 ctest]$ cat getStacktrace.cc
#include <stacktrace>

  std::string get_stacktrace() {
 std::string trace;
 for (auto & entry : std::stacktrace::current() ) trace +=
entry.description() + '#';
 return trace;
  }


#include 
#include 
#include 

extern "C"
void * myMallocHook(size_t size, void const * caller) {
  __malloc_hook = nullptr;
  auto p = malloc(size);
  std::cout << "asked " << size
<< " at " << get_stacktrace()
<< std::endl;
  __malloc_hook = myMallocHook;
  return p;
}

namespace {
struct Hook {
  Hook() {
  __malloc_hook = myMallocHook;
  }
};

  Hook hook;

}

compiled as
c++ -O3 -Wall -pthread -fPIC -shared -std=c++23 -lstdc++exp getStacktrace.cc

gives the undefined symbol

 setenv LD_PRELOAD ./a.out ; ls ; unsetenv LD_PRELOAD
ls: symbol lookup error: ./a.out: undefined symbol:
_ZNSt17__stacktrace_impl10_S_currentEPFiPvmES0_i

[Bug libstdc++/111936] New: std::stacktrace cannot be used in a shared library

2023-10-23 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111936

Bug ID: 111936
   Summary: std::stacktrace cannot be used in a shared library
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

I would like to use std::stacktrace in a shared library to be preloaded...

when I try to build the library even for this minimal example
cat getStacktrace.cc
#include <stacktrace>

  std::string get_stacktrace() {
 std::string trace;
 for (auto & entry : std::stacktrace::current() ) trace +=
entry.description() + '#';
 return trace;
  }

it fails
 c++ -O3 -Wall -pthread -fPIC -shared getStacktrace.cc -std=c++23 -lstdc++exp
/usr/bin/ld:
/afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-fileline.o):
relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a
shared object; recompile with -fPIC
/usr/bin/ld:
/afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-posix.o):
relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a
shared object; recompile with -fPIC
/usr/bin/ld:
/afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-simple.o):
relocation R_X86_64_32 against `.text' can not be used when making a shared
object; recompile with -fPIC
/usr/bin/ld:
/afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-elf.o):
relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a
shared object; recompile with -fPIC
/usr/bin/ld:
/afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-mmap.o):
relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a
shared object; recompile with -fPIC
/usr/bin/ld:
/afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-mmapio.o):
relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a
shared object; recompile with -fPIC
/usr/bin/ld:
/afs/cern.ch/work/i/innocent/public/w5/bin/../lib/gcc/x86_64-pc-linux-gnu/14.0.0/../../../../lib64/libstdc++exp.a(std_stacktrace-dwarf.o):
relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a
shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status


it silently compiles with
[innocent@patatrack01 ctest]$ c++ -O3 -Wall -pthread -fPIC -shared -std=c++23
-lstdc++exp getStacktrace.cc

but the symbols are undefined

[innocent@patatrack01 ctest]$ ldd ./a.out
linux-vdso.so.1 (0x7ffd50f73000)
libstdc++.so.6 => /afs/cern.ch/user/i/innocent/w5/lib64/libstdc++.so.6
(0x7fa9437f8000)
libm.so.6 => /usr/lib64/libm.so.6 (0x7fa943476000)
libgcc_s.so.1 => /afs/cern.ch/user/i/innocent/w5/lib64/libgcc_s.so.1
(0x7fa94324b000)
libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x7fa94302b000)
libc.so.6 => /usr/lib64/libc.so.6 (0x7fa942c66000)
/lib64/ld-linux-x86-64.so.2 (0x7fa943e68000)
[innocent@patatrack01 ctest]$ nm -C ./a.out | grep stack
0db0 T get_stacktrace[abi:cxx11]()
0be0 t get_stacktrace[abi:cxx11]() [clone .cold]
0d20 t std::basic_stacktrace
>::current(std::allocator const&) [clone .isra.0]
 U std::stacktrace_entry::_Info::_M_populate(unsigned long)
1430 W std::stacktrace_entry::_Info::_S_set[abi:cxx11](void*, char
const*)
 U std::__stacktrace_impl::_S_current(int (*)(void*, unsigned
long), void*, int)
1310 W std::basic_stacktrace
>::_M_prepare(unsigned short)::{lambda(void*, unsigned long)#1}::_FUN(void*,
unsigned long)


and at run time (not this example, but my full application that invokes the
stacktrace from a malloc hook) it (obviously) fails

[innocent@patatrack01 ctest]$ c++ -O3 -Wall -pthread -fPIC -shared -std=c++23
-lstdc++exp mallocWrapper.cc
[innocent@patatrack01 ctest]$ setenv LD_PRELOAD ./a.out ; ls ; unsetenv
LD_PRELOAD
Recoding structure constructed in a thread
ls: symbol lookup error: ./a.out: undefined symbol:
_ZNSt17__stacktrace_impl10_S_currentEPFiPvmES0_i

[Bug c++/111934] ICE internal compiler error: in discriminator_for_local_entity, at cp/mangle.cc:2065

2023-10-23 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111934

--- Comment #1 from vincenzo Innocente  ---
Sorry, I missed the version:

gcc version 14.0.0 20231021 (experimental) [master r14-4817-g405a4140fc3] (GCC)

[Bug c++/111934] New: ICE internal compiler error: in discriminator_for_local_entity, at cp/mangle.cc:2065

2023-10-23 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111934

Bug ID: 111934
   Summary: ICE  internal compiler error: in
discriminator_for_local_entity, at cp/mangle.cc:2065
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

#ifdef ICE
#include 
#endif

struct  Me {

  static Me & me() {
    thread_local auto me = std::make_unique_ptr<Me>();
    return *me;
  }

};


int main() {
  return 0;
}

c++ -O3 -Wall -pthread ice13.cpp
ice13.cpp: In static member function 'static Me& Me::me()':
ice13.cpp:8:33: error: 'make_unique_ptr' is not a member of 'std'
8 | thread_local auto me = std::make_unique_ptr();
  | ^~~
ice13.cpp:8:51: error: expected primary-expression before '>' token
8 | thread_local auto me = std::make_unique_ptr();
  |   ^
ice13.cpp:8:53: error: expected primary-expression before ')' token
8 | thread_local auto me = std::make_unique_ptr();
  | ^


==


ctest]$ c++ -O3 -Wall -pthread ice13.cpp -DICE
ice13.cpp: In static member function 'static Me& Me::me()':
ice13.cpp:8:33: error: 'make_unique_ptr' is not a member of 'std'
8 | thread_local auto me = std::make_unique_ptr();
  | ^~~
ice13.cpp:8:51: error: expected primary-expression before '>' token
8 | thread_local auto me = std::make_unique_ptr();
  |   ^
ice13.cpp:8:53: error: expected primary-expression before ')' token
8 | thread_local auto me = std::make_unique_ptr();
  | ^
ice13.cpp: At global scope:
ice13.cpp:8:23: internal compiler error: in discriminator_for_local_entity, at
cp/mangle.cc:2065
8 | thread_local auto me = std::make_unique_ptr();
  |   ^~
0x7de25d discriminator_for_local_entity
../../gcc_src/gcc/cp/mangle.cc:2065
0xb92a4a write_local_name
../../gcc_src/gcc/cp/mangle.cc:2164
0xb92a4a write_name
../../gcc_src/gcc/cp/mangle.cc:1071
0xb94e46 write_encoding
../../gcc_src/gcc/cp/mangle.cc:864
0xb94f5b write_mangled_name
../../gcc_src/gcc/cp/mangle.cc:810
0xb95740 mangle_decl_string
../../gcc_src/gcc/cp/mangle.cc:4092
0xb9592a get_mangled_id
../../gcc_src/gcc/cp/mangle.cc:4113
0xb9592a mangle_decl(tree_node*)
../../gcc_src/gcc/cp/mangle.cc:4151
0x16512bd decl_assembler_name(tree_node*)
../../gcc_src/gcc/tree.cc:715
0xe4a329 symbol_table::insert_to_assembler_name_hash(symtab_node*, bool)
../../gcc_src/gcc/symtab.cc:175
0xe4a48c symbol_table::symtab_initialize_asm_name_hash()
../../gcc_src/gcc/symtab.cc:267
0xe4ae84 symbol_table::symtab_initialize_asm_name_hash()
../../gcc_src/gcc/symtab.cc:1078
0xe4ae84 symtab_node::get_for_asmname(tree_node const*)
../../gcc_src/gcc/symtab.cc:1066
0xe5fc61 handle_alias_pairs
../../gcc_src/gcc/cgraphunit.cc:1528
0xe64fa7 symbol_table::finalize_compilation_unit()
../../gcc_src/gcc/cgraphunit.cc:2541
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug tree-optimization/109885] New: gcc does not generate movmskps and testps instructions (clang does)

2023-05-17 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109885

Bug ID: 109885
   Summary: gcc does not generate movmskps and testps instructions
 (clang does)
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

in this simple code (on avx2)

int sum(float const * x) {
   int ret = 0;
   for (int i=0; i<8; ++i) ret +=(0==x[i]);
   return ret;
}

int one(float const * x) {
   int ret = 0;
   for (int i=0; i<8; ++i) ret |=(0==x[i]);
   return ret;
}

int all(float const * x) {
   int ret = 1;
   for (int i=0; i<8; ++i) ret &=(0==x[i]);
   return ret;
}

clang uses movmskps and testps instructions, gcc does not

see for instance

https://godbolt.org/z/r11r8xoYz
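
The three loops above all reduce over the same per-lane comparison. What
follows is a minimal scalar sketch of the mask-based formulation that a vector
compare followed by movmskps makes possible. It assumes GCC or Clang (for
__builtin_popcount), and the helper names zero_mask8, sum_mask, one_mask and
all_mask are hypothetical. It only illustrates the idea and does not guarantee
which instructions a given compiler emits.

```cpp
#include <cstdint>

// Build an 8-bit mask: bit i is set when x[i] == 0.
// A vector compare followed by (v)movmskps yields such a mask directly.
inline std::uint32_t zero_mask8(float const* x) {
    std::uint32_t mask = 0;
    for (int i = 0; i < 8; ++i)
        mask |= static_cast<std::uint32_t>(x[i] == 0.0f) << i;
    return mask;
}

// Hypothetical helpers, not part of the original report.
inline int sum_mask(float const* x) { return __builtin_popcount(zero_mask8(x)); } // count of zeros
inline int one_mask(float const* x) { return zero_mask8(x) != 0; }                // any lane zero
inline int all_mask(float const* x) { return zero_mask8(x) == 0xFFu; }            // all lanes zero
```

With the mask in hand, "sum" becomes a population count, "one" a test against
zero, and "all" a comparison with 0xFF.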

[Bug c++/109281] New: use std::optional results in suboptimal code

2023-03-25 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109281

Bug ID: 109281
   Summary: use std::optional results in suboptimal code
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

In the following (almost real) code, gcc emits suboptimal code when std::optional
is used, compared with both a home-made equivalent and clang.

see https://godbolt.org/z/Pba51Ye7Y


-

code


#include <optional>

// #define USE_OPTIONAL

#ifdef USE_OPTIONAL
struct SubRingCrossings {
  SubRingCrossings(int ci, int ni, float nd) : closestIndex(ci), nextIndex(ni),
nextDistance(nd) {}

  int closestIndex;
  int nextIndex;
  float nextDistance;
};
#else
struct SubRingCrossings {
  SubRingCrossings() : valid(false) {}
  SubRingCrossings(int ci, int ni, float nd) : valid(true), closestIndex(ci),
nextIndex(ni), nextDistance(nd) {}

  bool valid;
  int closestIndex;
  int nextIndex;
  float nextDistance;
};
#endif

bool condition();

#ifdef USE_OPTIONAL
std::optional<SubRingCrossings> foo() {
if (condition()) {
return std::nullopt;
}
return SubRingCrossings(1, 2, 3.14);
}
#else
SubRingCrossings foo() {
if (condition()) {
return SubRingCrossings();
}
return SubRingCrossings(1, 2, 3.14);
}
#endif

int bar() {
auto tmp = foo();
#ifdef USE_OPTIONAL
if (tmp) {
return tmp->closestIndex;
#else
if (tmp.valid) {
return tmp.closestIndex;
#endif
} else {
return 0;
}
}

[Bug tree-optimization/109011] New: missed optimization in presence of __builtin_ctz

2023-03-03 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109011

Bug ID: 109011
   Summary: missed optimization in presence of __builtin_ctz
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

in the following code foo does not vectorize, bar does.
clang vectorizes foo using a pattern that invokes vplzcntd

(code made a bit complex to make vectorization "relevant") 

see https://godbolt.org/z/5fa1zbPeG

#include 
uint32_t x[256];
uint32_t y[256];
uint32_t w[256];
uint32_t z[256];



void foo() {
  for (int i=0; i<256;i++) {
auto p = x[i] ?  __builtin_ctz(x[i]) : y[i];
   z[i] = w[i]*p;
 }  
}


void bar() {
  for (int j=0; j<256;j+=8)
  for (int i=j; i

[Bug tree-optimization/108804] New: missed vectorization in presence of conversion from uint64_t to float

2023-02-15 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108804

Bug ID: 108804
   Summary: missed vectorization in presence of conversion from
uint64_t to float
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

in the following code [1] foo does not vectorize, bar does
compiled with -march=haswell -Ofast --no-math-errno -Wall
see
https://godbolt.org/z/E6xzfavxc

clang seems to do better

[1]
#include



uint64_t d[512];
//uint32_t f[1024];
float f[1024];

void foo() {
for (int i=0; i<512; ++i) {
uint64_t k = d[i];
auto x  = (k & 0x007F) |  0x3F80;
k = k >> 23;
auto y  = (k & 0x007F) |  0x3F80;
f[i]=x; f[128+i] = y;

}
}

void bar() {
for (int i=0; i<512; ++i) {
uint64_t k = d[i];
uint32_t x  = (k & 0x007F);
x |= 0x3F80;
uint32_t y  = k >> 23;
y  = (y & 0x007F) |  0x3F80;
f[i]=x; f[128+i] = y;

}  
}

[Bug tree-optimization/108677] wrong vectorization (when copy constructor is present?)

2023-02-06 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108677

--- Comment #3 from vincenzo Innocente  ---
sorry. the original internal bug report was for gcc 7.5
https://godbolt.org/z/9crafbqen

where I think the generated code is indeed wrong (and does not depend on the
presence of the constructor!)

So, if anything, the bug title should be changed to: removing the constructor
inhibits SLP vectorization?

[Bug tree-optimization/108677] New: wrong vectorization (when copy constructor is present?)

2023-02-05 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108677

Bug ID: 108677
   Summary: wrong vectorization (when copy constructor is
present?)
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

in this real life code

#include

struct trig_pair {
   double CosPhi;
   double SinPhi;

   trig_pair() : CosPhi(1.), SinPhi(0.) {}
   trig_pair(const trig_pair &tp) : CosPhi(tp.CosPhi), SinPhi(tp.SinPhi) {}
   trig_pair(const double C, const double S) : CosPhi(C), SinPhi(S) {}
   trig_pair(const double phi) : CosPhi(cos(phi)), SinPhi(sin(phi)) {}

   //Return trig_pair for angle increased by angle of tp.
   trig_pair Add(const trig_pair &tp) {
  return trig_pair(this->CosPhi*tp.CosPhi - this->SinPhi*tp.SinPhi,
   this->SinPhi*tp.CosPhi + this->CosPhi*tp.SinPhi);
   }
};

trig_pair *TrigArr;

void FillTrigArr(trig_pair tp, unsigned MaxM)
{
//Fill TrigArr with trig_pair(jp*phi)
   if (!TrigArr) return;;
   TrigArr[1] = tp;
   for (unsigned jp = 2; jp <= MaxM; ++jp) TrigArr[jp] = TrigArr[jp-1].Add(tp);
}


gcc vectorizes the loop even though a dependency is present...[1]
It does not if I comment out the copy constructor...[2]


[1]
https://godbolt.org/z/vhPeh35n5

[2]
https://godbolt.org/z/YPjdYdqG8
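
The dependency in question is that TrigArr[jp] is computed from TrigArr[jp-1],
so each iteration needs the result of the previous one. One way to detect a
miscompiled loop is to compare the recurrence against directly computed
values. What follows is a minimal sketch of such a check. The names TrigPair,
add, fill_serial and max_error are hypothetical, and a std::vector stands in
for the global array of the report.

```cpp
#include <cmath>
#include <vector>

struct TrigPair { double c, s; };

// Angle addition, the same formula as trig_pair::Add in the report.
inline TrigPair add(TrigPair a, TrigPair b) {
    return {a.c * b.c - a.s * b.s, a.s * b.c + a.c * b.s};
}

// Serial reference: element jp is computed from element jp-1.
inline std::vector<TrigPair> fill_serial(double phi, unsigned maxM) {
    std::vector<TrigPair> arr(maxM + 1, TrigPair{1.0, 0.0});
    TrigPair tp{std::cos(phi), std::sin(phi)};
    if (maxM >= 1) arr[1] = tp;
    for (unsigned jp = 2; jp <= maxM; ++jp) arr[jp] = add(arr[jp - 1], tp);
    return arr;
}

// Largest deviation from the directly computed cos(jp*phi), sin(jp*phi).
inline double max_error(std::vector<TrigPair> const& arr, double phi) {
    double err = 0.0;
    for (unsigned jp = 1; jp < arr.size(); ++jp) {
        err = std::fmax(err, std::fabs(arr[jp].c - std::cos(jp * phi)));
        err = std::fmax(err, std::fabs(arr[jp].s - std::sin(jp * phi)));
    }
    return err;
}
```

If the loop were executed as if its iterations were independent, elements would
be built from stale neighbours and max_error would be far above rounding noise.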

[Bug target/106012] rsqrtps and rcpps instructions generated even if -fno-reciprocal-math specified

2022-12-20 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106012

--- Comment #6 from vincenzo Innocente  ---
just to confirm that
-Ofast  -fno-reciprocal-math -mno-recip
seems to inhibit all reciprocals...
https://godbolt.org/z/f4bccb9GP

[Bug c++/107933] New: std::sqrt complies in intrinsics for float even if --no-builtin is provided

2022-11-30 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107933

Bug ID: 107933
   Summary: std::sqrt complies in intrinsics for float even if
--no-builtin  is provided
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

on x86_64

float f(float x) { return std::sqrt(x);}
compiles in
sqrtss  xmm0, xmm0
even if --no-builtin is provided
double d(double x) { return std::sqrt(x);}
calls libm as well as

float  fs(float x) { return sqrtf(x);}
double ds(double x) { return sqrt(x);}


see
https://godbolt.org/z/Mhf9hv6ns

[Bug tree-optimization/106012] rsqrtps and rcpps instructiona generated even if -fno-reciprocal-math specified

2022-06-19 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106012

vincenzo Innocente  changed:

   What|Removed |Added

Summary|rsqrtss instruction |rsqrtps and rcpps
   |generated even if   |instructiona generated even
   |-mno-recip specified|if -fno-reciprocal-math
   ||specified
 Status|RESOLVED|NEW
 Resolution|WONTFIX |---

--- Comment #3 from vincenzo Innocente  ---
Thanks for the suggestion.

-fno-reciprocal-math does indeed inhibit scalar reciprocal instructions.

NOT in vectorized loop though.

see

https://godbolt.org/z/9eMb4Tjee

[Bug target/106012] New: rsqrtss instruction generated even if -mno-recip specified

2022-06-17 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106012

Bug ID: 106012
   Summary: rsqrtss instruction generated even if -mno-recip
specified
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

with option -Ofast -mno-recip rsqrtss instruction is still generated.

https://godbolt.org/z/hGxrG7xPh

inhibiting rsqrtss and rcpss is critical to obtain identical results when
running on INTEL and AMD platforms. Having to inhibit Ofast is clearly a larger
performance penalty.
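
To illustrate why approximate reciprocal instructions can make results
platform dependent: the hardware estimate is only specified up to an error
bound, and a Newton-Raphson refinement step reduces the error but does not
necessarily remove the dependence on the starting value. What follows is a
minimal scalar sketch. fake_rsqrt_estimate is a hypothetical stand-in for the
hardware estimate (the exact value perturbed by a chosen relative error), not
a model of any particular CPU.

```cpp
#include <cmath>

// One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
inline float refine_rsqrt(float x, float y) {
    return y * (1.5f - 0.5f * x * y * y);
}

// Hypothetical stand-in for a hardware estimate: exact value times (1 + rel_err).
// Assumption: two CPUs may return different estimates within the same error bound.
inline float fake_rsqrt_estimate(float x, float rel_err) {
    return (1.0f / std::sqrt(x)) * (1.0f + rel_err);
}
```

Two different estimates refined this way both land close to 1/sqrt(x), but
nothing forces them to round to the same last bit, which is the portability
concern raised above.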

[Bug tree-optimization/104950] New: GCC does not emit branchless code

2022-03-16 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104950

Bug ID: 104950
   Summary: GCC does not emit branchless code
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

In this example GCC fails to emit branchless code while CLANG does.
In the actual application, measurements show a slowdown of up to a factor of 2.
I managed to force branchless code (-DBL), but the code is pretty unfriendly.
godbolt link (GCC, clang, GCC -DBL):

https://godbolt.org/z/KWY1rjhhY



and here inlined

include 
const float defaultBaseResponse = 0.5;
class DForest {
public:
//based on FastForest::evaluate() and BDTree::parseTree()
DForest() {
}
float evaluate(const float* features) const;

std::vector rootIndices_;
//"node" layout: cut, index, left, right
struct Node{
float v; int i,l,r;
constexpr int eval(float const * f) const {
#ifdef BL 
  auto m = f[i] > v;
  return *(() + int(m));
#else
  return f[i] > v ? r : l;
#endif
}
};
std::vector nodes_;
std::vector responses_;
std::vector baseResponses_;
};

float DForest::evaluate(const float* features) const{
float sum{defaultBaseResponse + baseResponses_[0]};
for(int index : rootIndices_){
do {
index = nodes_[index].eval(features);
} while (index>0);
sum += responses_[-index];
}
return sum;
}
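
A friendlier way to express the same selection is to let the comparison result
index a two-element table. This is a minimal sketch under the assumption that
the node layout is as above. The member names eval_branch and eval_table are
hypothetical, and whether a compiler turns either form into branch-free code
is not guaranteed.

```cpp
struct Node {
    float v;      // cut value
    int i, l, r;  // feature index, left child, right child

    // Branchy form, as in the non-BL path of the report.
    int eval_branch(float const* f) const { return f[i] > v ? r : l; }

    // Table form: the comparison result (0 or 1) indexes a two-element array.
    int eval_table(float const* f) const {
        int const next[2] = {l, r};
        return next[f[i] > v];
    }
};
```

Both members return the same child index for any input; only the shape of the
source differs.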

[Bug tree-optimization/97707] avx512 math function invoked even if -mprefer-vector-width=256 specified

2020-11-04 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97707

--- Comment #3 from vincenzo Innocente  ---
the main point in using -mprefer-vector-width=256 is to avoid clock throttling
in "mixed" workloads.
In small benchmarks like this one avx512 is faster (even on an old Silver), even
if it triggers a slower clock (and the test should be performed with the machine
fully loaded). Still, if I ask for -mprefer-vector-width=256 I would like to see no
512-bit-wide instructions used.

A disturbing feature is also the difference between using int or long long as
loop index.

[Bug tree-optimization/97707] New: avx12 math function invoked even if -mprefer-vector-width=256 specified

2020-11-03 Thread vincenzo.innocente at cern dot ch via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97707

Bug ID: 97707
   Summary: avx12 math function invoked even if
-mprefer-vector-width=256 specified
   Product: gcc
   Version: 10.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

this code will invoke _ZGVeN8v_sin instead of _ZGVdN4v_sin making use of zmm
registers
#include

int main() {

  double res=0;

  for (int x=0; x<1024;x++) {
double y = x; 
res += std::sin(y);
  }


 return res > 0.5;

}

NOTE if I specify
for (long long x=0; x<1024;x++) {

it will correctly invoke _ZGVdN4v_sin (no zmm)


compiler options
-Ofast -march=skylake-avx512 -mprefer-vector-width=256
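
For readers unfamiliar with the two symbol names: they follow the x86 vector
function ABI naming scheme, where the letter after _ZGV selects the ISA
('b' SSE, 'c' AVX, 'd' AVX2, 'e' AVX-512), 'N' or 'M' means unmasked or masked,
and the number is the lane count. What follows is a minimal sketch of a decoder
for that prefix. The struct VecFn and the functions decode and width_bits_f64
are hypothetical helpers, and the parameter letters are skipped rather than
interpreted.

```cpp
#include <cstddef>
#include <string>

struct VecFn {
    bool ok = false;
    char isa = 0;         // 'b' SSE, 'c' AVX, 'd' AVX2, 'e' AVX-512
    bool masked = false;  // 'M' masked, 'N' unmasked
    unsigned lanes = 0;   // vector length in elements
    std::string scalar;   // name of the scalar function
};

// Parse names of the form _ZGV<isa><mask><lanes><params>_<name>.
inline VecFn decode(std::string const& s) {
    VecFn r;
    if (s.size() < 8 || s.compare(0, 4, "_ZGV") != 0) return r;
    r.isa = s[4];
    if (s[5] != 'M' && s[5] != 'N') return r;
    r.masked = (s[5] == 'M');
    std::size_t p = 6;
    while (p < s.size() && s[p] >= '0' && s[p] <= '9')
        r.lanes = r.lanes * 10 + static_cast<unsigned>(s[p++] - '0');
    std::size_t us = s.find('_', p);  // parameter letters are skipped
    if (r.lanes == 0 || us == std::string::npos) return r;
    r.scalar = s.substr(us + 1);
    r.ok = true;
    return r;
}

// Register width in bits when every lane holds a double.
inline unsigned width_bits_f64(VecFn const& f) { return f.lanes * 64u; }
```

Decoding _ZGVeN8v_sin gives eight double lanes, i.e. a 512-bit register, while
_ZGVdN4v_sin gives four lanes, i.e. 256 bits.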

[Bug tree-optimization/92335] missed transformation to branchless

2019-11-07 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92335

--- Comment #3 from vincenzo Innocente  ---
Understood for float.
It seems to me that the transformation does not occur for integers either
(signed or unsigned),

as in
using T= unsigned int;
T bar(T const * __restrict__ x, 
 T const * __restrict__ y) {
  T ret=0;
  for (int i=0;i<1024;++i) {
auto k = y[i];
if(x[i]>1024) ret += k;
  }
return ret;
}
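
For unsigned integers the conditional add can be written without a branch and
with exactly the same result, since adding zero is a no-op. What follows is a
minimal sketch of that rewrite. The function names bar_branch and bar_mask and
the explicit length parameter n are hypothetical additions for illustration.

```cpp
using T = unsigned int;

// Branchy form, as in the comment above.
inline T bar_branch(T const* x, T const* y, int n) {
    T ret = 0;
    for (int i = 0; i < n; ++i)
        if (x[i] > 1024) ret += y[i];
    return ret;
}

// Mask form: T(0) - T(cond) is all ones when cond holds and zero otherwise.
inline T bar_mask(T const* x, T const* y, int n) {
    T ret = 0;
    for (int i = 0; i < n; ++i)
        ret += y[i] & (T(0) - T(x[i] > 1024));
    return ret;
}
```

The two functions are equivalent for every input, so the choice between them
is purely a code-generation matter.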

[Bug tree-optimization/92335] New: missed transformation to branchless

2019-11-03 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92335

Bug ID: 92335
   Summary: missed transformation to branchless
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

in the following code (compiled with -O2 or -O3 and even with -march=haswell)
gcc will use a branchless construct in foo but not in bar (changing from float
to int does not modify the behavior)
(see https://godbolt.org/z/0ZWKb5 )

with -Ofast they will compile in the same vectorized branchless loop, still I
do not see why the branch shall be retained at -O2 in bar

for random "x" the branchless version is 6 times faster on any out-of-order cpu

float foo(float const * __restrict__ x, 
float const * __restrict__ y) {
  float ret=0.f;
  for (int i=0;i<1024;++i) {
auto k = y[i];
ret += x[i]>0.f ? k : 0.f;
  }
return ret;
}



float bar(float const * __restrict__ x, 
float const * __restrict__ y) {
  float ret=0.f;
  for (int i=0;i<1024;++i) {
auto k = y[i];
if(x[i]>0.f) ret += k;
  }
return ret;
}

[Bug tree-optimization/88598] simplification of multiplication by 1 or 0 fails

2018-12-27 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88598

--- Comment #3 from vincenzo Innocente  ---
what I am interested in is NOT a constant array, more a small-size
"sparse"-matrix that I can build explicitly at run time from other sources.
I have examples using Eigen if of any interest ( https://godbolt.org/z/2L9OBU )
Clang is excellent in optimizing out zeros and ones, gcc in vectorization.
I hope to get the best of the two!

[Bug tree-optimization/88598] New: simplification of multiplication by 1 or 0 fails

2018-12-26 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88598

Bug ID: 88598
   Summary: simplification of multiplication by 1 or 0 fails
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

g++ fails to optimize the code below
even with -Ofast https://godbolt.org/z/mYRgVX
independently of vectorization options https://godbolt.org/z/XMnCNz
clang optimizes (return zero for "foo" and v[1] for "bar") even for just
-ffinite-math-only -fno-signed-zeros -O2
  https://godbolt.org/z/KU5f-x

float foo(float const * __restrict__ v) {
  float j[5] = {0.,0.,0.,0.,0.};
  float ret=0.;
  for (int i=0; i<5; ++i) ret +=j[i]*v[i];
  return ret;
}


float bar(float const * __restrict__ v) {
  float j[5] = {0.,1.,0.,0.,0.};
  float ret=0.;
  for (int i=0; i<5; ++i) ret +=j[i]*v[i];
  return ret;
}
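
The reason the flags matter is that, under strict IEEE-754 rules, a product
with zero is not always +0, so the multiplication cannot simply be dropped.
What follows is a minimal sketch of the two cases that -ffinite-math-only and
-fno-signed-zeros rule out. The function names are hypothetical, and the
volatile qualifiers only keep the operands out of compile-time folding.

```cpp
#include <cmath>
#include <limits>

// 0 * infinity is NaN, so 0 * v[i] is not 0 unless v[i] is known to be finite.
inline bool zero_times_inf_is_nan() {
    volatile float inf = std::numeric_limits<float>::infinity();
    return std::isnan(0.0f * inf);
}

// 0 * (negative) is -0, so the sign of the zero depends on v[i].
inline bool zero_times_negative_is_minus_zero() {
    volatile float m = -1.0f;
    return std::signbit(0.0f * m);
}
```

Both functions return true when compiled without fast-math flags.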

[Bug tree-optimization/86855] REGRESSON: [8.0] -Ofast optimize away mm_set_ps(0.0f,0.0f,-0.0f,0.0f);

2018-08-04 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86855

--- Comment #5 from vincenzo Innocente  ---
I have indeed worked-around with

const __m128i neg = _mm_set_epi32(0,0,0x8000,0);
__m128i ret = __m128i(_mm_sub_ps(v5, v3));
return __m128(_mm_xor_si128(ret,neg));

const  __m256i neg = _mm256_set_epi64x(0,0,0x8000,0);
return __m256d(_mm256_xor_si256(__m256i(ret), neg));

etc
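
The workaround keeps the sign mask in the integer domain, where the optimizer
has no licence to treat -0.0 as 0.0. What follows is a scalar sketch of the
same idea, assuming IEEE-754 binary32 floats. flip_sign is a hypothetical
helper, and 0x80000000 is the binary32 sign bit.

```cpp
#include <cstdint>
#include <cstring>

// Flip the sign of a float by XOR-ing the IEEE-754 binary32 sign bit.
inline float flip_sign(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // well-defined way to read the bits
    bits ^= 0x80000000u;                  // toggle the sign bit only
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}
```

Because the constant is an integer, it does not depend on how floating-point
zeros are treated by the optimization flags.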

[Bug tree-optimization/86855] REGRESSON: [8.0] -Ofast optimize away mm_set_ps(0.0f,0.0f,-0.0f,0.0f);

2018-08-04 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86855

--- Comment #3 from vincenzo Innocente  ---
looks more like undefined behavior, as
const  __m128 neg = _mm_set_ps(0.0f,0.0f,-0.0f,-0.0f);
 return _mm_xor_ps(_mm_sub_ps(v5, v3), neg);
with -O3 compiles in
xorps .LC0(%rip), %xmm0
  ret
.LC0:
  .long 2147483648
  .long 2147483648
  .long 0
  .long 0
while -Ofast in
xorps .LC0(%rip), %xmm0
  ret
.LC0:
  .long 2147483648
  .long 2147483648
  .long 2147483648
  .long 2147483648

needless to say that neither clang nor icc display such a behavior...

[Bug tree-optimization/86855] New: REGRESSON: [8.0] -Ofast optimize away mm_set_ps(0.0f,0.0f,-0.0f,0.0f);

2018-08-04 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86855

Bug ID: 86855
   Summary: REGRESSON: [8.0] -Ofast optimize away
mm_set_ps(0.0f,0.0f,-0.0f,0.0f);
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

this function
__m128 _mm_cross_ps(__m128 v1, __m128 v2) {
 // same order is  _MM_SHUFFLE(3,2,1,0)
 //   x2, z1,z1
 __m128 v3 = _mm_shuffle_ps(v2, v1, _MM_SHUFFLE(3, 0, 2, 2));
 //   y1, x2,y2
 __m128 v4 = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(3, 1, 0, 1));

 __m128 v5 = _mm_mul_ps(v3, v4);

 // x1, z2,z2
 v3 = _mm_shuffle_ps(v1, v2, _MM_SHUFFLE(3, 0, 2, 2));
 //y2, x1,y1
 v4 = _mm_shuffle_ps(v2, v1, _MM_SHUFFLE(3, 1, 0, 1));

 v3 = _mm_mul_ps(v3, v4);
 const  __m128 neg = _mm_set_ps(0.0f,0.0f,-0.0f,0.0f);
 return _mm_xor_ps(_mm_sub_ps(v5, v3), neg);
   }

compiles more or less into
mm_cross_ps(float __vector(4), float __vector(4)):
  movaps %xmm1, %xmm2
  movaps %xmm0, %xmm4
  movaps %xmm0, %xmm3
  shufps $202, %xmm0, %xmm2
  shufps $209, %xmm1, %xmm4
  shufps $202, %xmm1, %xmm3
  shufps $209, %xmm0, %xmm1
  mulps %xmm4, %xmm2
  mulps %xmm3, %xmm1
  movaps %xmm2, %xmm0
  subps %xmm1, %xmm0
  xorps .LC0(%rip), %xmm0
  ret
.LC0:
  .long 0
  .long 2147483648
  .long 0
  .long 0

according to godbolt since 8.1 the xor is optimized away with -Ofast as
mm_cross_ps(float __vector(4), float __vector(4)):
  movaps %xmm1, %xmm2
  movaps %xmm0, %xmm4
  movaps %xmm0, %xmm3
  shufps $209, %xmm1, %xmm4
  shufps $202, %xmm0, %xmm2
  mulps %xmm4, %xmm2
  shufps $202, %xmm1, %xmm3
  shufps $209, %xmm0, %xmm1
  mulps %xmm3, %xmm1
  movaps %xmm2, %xmm0
  subps %xmm1, %xmm0
  ret

is this intended?

[Bug tree-optimization/83857] [8 Regression] internal compiler error: in exact_div, at poly-int.h:2139

2018-01-15 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83857

--- Comment #2 from vincenzo Innocente  ---
(In reply to Richard Biener from comment #1)
> I've seen a similar bug so maybe fixed already.
if the similar bug is #83753, it looks "fixed" in the version I tested
(at least /gcc/testsuite/gcc.dg/torture/pr83753.c is present)

[Bug tree-optimization/83857] New: [ICE] internal compiler error: in exact_div, at poly-int.h:2139

2018-01-15 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83857

Bug ID: 83857
   Summary: [ICE] internal compiler error: in exact_div, at
poly-int.h:2139
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

Created attachment 43133
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43133&action=edit
directory with all files needed to reproduce problem (no attempt to reduce to
minimum)

c++ -v
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/8.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure
--prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran
--enable-lto -enable-libitm -disable-multilib : (reconfigured)
../gcc-trunk//configure --prefix=/afs/cern.ch/user/i/innocent/w5
-enable-languages=c,c++,lto,fortran --enable-lto -enable-libitm
-disable-multilib : (reconfigured) ../gcc-trunk//configure
--prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran
--enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 8.0.1 20180115 (experimental) [trunk revision 256692] (GCC) 


[innocent@vinavx3 innocent]$ tar -xzf fastSinCos.tgz 
[innocent@vinavx3 innocent]$ cd fastSinCos
[innocent@vinavx3 fastSinCos]$ c++ -Ofast -fopt-info-vec -Wall testSinCos.cpp
testSinCos.cpp:136:21: note: loop vectorized
during GIMPLE pass: vect
testSinCos.cpp: In function 'int main()':
testSinCos.cpp:22:5: internal compiler error: in exact_div, at poly-int.h:2139
 int main() {
 ^~~~
0x7853bf poly_int<1u, poly_result::is_poly>::type,
poly_coeff_pair_traits::is_poly>::type>::result_kind>::type>
exact_div<1u, unsigned long, unsigned long>(poly_int_pod<1u, unsigned long>
const&, unsigned long)
../../gcc-trunk/gcc/poly-int.h:2139
0x7853bf poly_int<1u, poly_result::result_kind>::type>
exact_div<1u, unsigned long, unsigned long>(poly_int_pod<1u, unsigned long>
const&, poly_int_pod<1u, unsigned long> const&)
../../gcc-trunk/gcc/poly-int.h:2152
0x7853bf vect_get_num_vectors
../../gcc-trunk/gcc/tree-vectorizer.h:1307
0x7853bf vect_get_num_copies
../../gcc-trunk/gcc/tree-vectorizer.h:1318
0x7853bf vectorizable_live_operation(gimple*, gimple_stmt_iterator*,
_slp_tree*, int, gimple**)
../../gcc-trunk/gcc/tree-vect-loop.c:8132
0x1102053 vect_analyze_loop_operations
../../gcc-trunk/gcc/tree-vect-loop.c:1855
0x1102053 vect_analyze_loop_2
../../gcc-trunk/gcc/tree-vect-loop.c:2254
0x1102053 vect_analyze_loop(loop*, _loop_vec_info*)
../../gcc-trunk/gcc/tree-vect-loop.c:2546
0x111b0ad vectorize_loops()
../../gcc-trunk/gcc/tree-vectorizer.c:664
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

btw 
ls /data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c
/data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c
c++ -Ofast -c
/data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c 
-fopt-info-vec 
/data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c:13:14:
note: loop vectorized
/data/data/vin/build/gcc-trunk//gcc/testsuite/gcc.dg/torture/pr83753.c:19:1:
note: basic block vectorized

[Bug target/80566] New: no use of avx vmovups on ymm registry in set and copy

2017-04-29 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80566

Bug ID: 80566
   Summary: no use of avx vmovups on ymm registry in set and copy
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

in this example
#include 
int * foo() {
  int * p = new int[16];
  memset(p,0,16*sizeof(int));
  return p;
}
int * foo(int * q) {
  int * p = new int[16];
  memcpy(q,p,16*sizeof(int));
  return p;
}

gcc does not make use of vmovups on ymm registers
( c++ -O3 -Wall -march=haswell -S)
while (according to gcc.godbolt.org) clang 4.0 does
https://godbolt.org/g/qnX975

[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550

2017-04-07 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #19 from vincenzo Innocente  ---
Could you please also have a look at c++ and lto? This is what I get on my
skylake:
for c++ or lto, -fno-split-paths pessimizes
[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm ;
./a.out | grep LU
LU  Mflops:  5920.14(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm
-fno-split-paths ; ./a.out | grep LU
LU  Mflops:  6136.33(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm -flto ;
./a.out | grep LU
LU  Mflops:  5809.93(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -march=native -Wall -Ofast *.c -lm -flto
-fno-split-paths ; ./a.out | grep LU
LU  Mflops:  5630.24(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm ;
./a.out | grep LU
LU  Mflops:  6001.47(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm
-fno-split-paths ; ./a.out | grep LU
LU  Mflops:  5920.14(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm -flto;
./a.out | grep LU
LU  Mflops:  5434.16(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ c++ -march=native -Wall -Ofast *.c -lm -flto
-fno-split-paths ; ./a.out | grep LU
LU  Mflops:  5434.16(M=100, N=100)

[Bug tree-optimization/79390] [7 Regression] 10% performance drop in SciMark2 LU after r242550

2017-04-07 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79390

--- Comment #17 from vincenzo Innocente  ---
[innocent@vinavx3 innocent]$ mkdir scimark2TMP
[innocent@vinavx3 innocent]$ cd scimark2TMP
[innocent@vinavx3 scimark2TMP]$ wget
http://math.nist.gov/scimark2/scimark2_1c.zip .
.
gcc version 7.0.1 20170407 (experimental) [trunk revision 246752] (GCC) 
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm
[innocent@vinavx3 scimark2TMP]$ ./a.out 
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2783.60
FFT Mflops:  2325.65(N=1024)
SOR Mflops:  2260.36(100 x 100)
MonteCarlo: Mflops:   829.14
Sparse matmult  Mflops:  2582.70(N=1000, nz=5000)
LU  Mflops:  5920.14(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm
-fno-split-paths 
[innocent@vinavx3 scimark2TMP]$ ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2825.86
FFT Mflops:  2333.43(N=1024)
SOR Mflops:  2260.36(100 x 100)
MonteCarlo: Mflops:   829.14
Sparse matmult  Mflops:  2570.04(N=1000, nz=5000)
LU  Mflops:  6136.33(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm -fsplit-paths
[innocent@vinavx3 scimark2TMP]$ ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2787.46
FFT Mflops:  2325.65(N=1024)
SOR Mflops:  2260.36(100 x 100)
MonteCarlo: Mflops:   832.36
Sparse matmult  Mflops:  2582.70(N=1000, nz=5000)
LU  Mflops:  5936.23(M=100, N=100)
[innocent@vinavx3 scimark2TMP]$ pushd ~/code/s7/C
CMSSW_8_0_22/ CMSSW_9_1_0_pre2/ 
[innocent@vinavx3 scimark2TMP]$ pushd ~/code/s7/CMSSW_9_1_0_pre2/
~/code/s7/CMSSW_9_1_0_pre2 /tmp/innocent/scimark2TMP 
[innocent@vinavx3 CMSSW_9_1_0_pre2]$ cmsenv
[innocent@vinavx3 CMSSW_9_1_0_pre2]$ popd
/tmp/innocent/scimark2TMP 
[innocent@vinavx3 scimark2TMP]$ gcc -v
gcc version 6.3.0 (GCC) 
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm 
[innocent@vinavx3 scimark2TMP]$ ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2820.21
FFT Mflops:  2325.65(N=1024)
SOR Mflops:  2260.36(100 x 100)
MonteCarlo: Mflops:   810.37
Sparse matmult  Mflops:  2427.26(N=1000, nz=5000)
LU  Mflops:  6277.39(M=100, N=100)

[Bug target/80313] New: -march=znver1 produce worse code than -march=haswell

2017-04-04 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80313

Bug ID: 80313
   Summary: -march=znver1 produce worse code than -march=haswell
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

Created attachment 41125
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41125=edit
self-contained scimark2 MC benchmark

I just got hold of an AMD Ryzen 7 1800X Eight-Core Processor and was surprised by
the results when running with -march=native.
The point is that the results can be reproduced on a Haswell or Broadwell as
well!

I used the full scimark2 suite;

the MC benchmark alone shows at least one problem.

This is on Intel:
[innocent@vinavx3 fullMC]$ gcc -march=znver1 -O3 fullMC.c -g ; time ./a.out
1.245u 0.000s 0:01.24 100.0%0+0k 0+0io 0pf+0w
[innocent@vinavx3 fullMC]$ gcc -O3 fullMC.c -g ; time ./a.out
0.327u 0.000s 0:00.32 100.0%0+0k 0+0io 0pf+0w
[innocent@vinavx3 fullMC]$ gcc -march=broadwell -O3 fullMC.c -g ; time ./a.out
0.308u 0.000s 0:00.30 100.0%0+0k 0+0io 0pf+0w

This is on Ryzen:
[innocent@vinzen0 fullMC]$ gcc -march=znver1 -O3 fullMC.c -g ; time ./a.out
1.354u 0.001s 0:01.35 100.0%0+0k 0+0io 0pf+0w
[innocent@vinzen0 fullMC]$ gcc -O3 fullMC.c -g ; time ./a.out
0.315u 0.000s 0:00.31 100.0%0+0k 0+0io 0pf+0w
[innocent@vinzen0 fullMC]$ gcc -march=broadwell -O3 fullMC.c -g ; time ./a.out
0.313u 0.001s 0:00.31 100.0%0+0k 0+0io 0pf+0w

[Bug tree-optimization/80248] sparse access to Array of structures does not vectorize using gathers

2017-03-31 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80248

--- Comment #2 from vincenzo Innocente  ---
Side note: the difference in timing between "aos2" and "soa" seems to be fully
accounted for by the integer multiplication "3*k[i]".
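
One way to probe this observation, sketched here under assumptions that are not in the report: scale the indices once, outside the timed loop, so that the hot loop contains no integer multiply. The array k3 and the names prescale_indices and aos2_prescaled are hypothetical; the sizes and the other arrays follow the PR80248 example.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 1024;

struct float3 { float x, y, z; };

float3 f3[N];
float g[N];
int k[N];   // indices into f3, as in the PR example
int k3[N];  // hypothetical: pre-scaled copy of k, filled once

// Fill k3 outside the timed region so the multiply by 3 is paid only once.
void prescale_indices() {
  for (std::size_t i = 0; i < N; ++i) {
    assert(k[i] >= 0 && static_cast<std::size_t>(k[i]) < N);
    k3[i] = 3 * k[i];
  }
}

// Same arithmetic as "aos2", but indexing with the pre-scaled k3.
void aos2_prescaled() {
  const float* ff = &f3[0].x;
  for (std::size_t i = 0; i < N; ++i)
    g[i] = ff[k3[i]] + ff[k3[i] + 1] + ff[k3[i] + 2];
}
```

If the timing of aos2_prescaled matched "soa", that would support attributing the gap to the multiplication; this has not been measured here.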

[Bug target/80232] Ofast pessimizes Sparse matmult in scimark2 benchmark on avx platforms

2017-03-30 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232

--- Comment #5 from vincenzo Innocente  ---
I confirm that gather is almost twice as fast on 
Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
w/r/t
Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
(used a benchmark version of the PR80248 example),
so on Skylake and KNL (and hopefully on skylake-avx512) it is profitable;
on Haswell and Broadwell it is not...

[Bug tree-optimization/80248] New: sparse access to Array of structures does not vectorize

2017-03-29 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80248

Bug ID: 80248
   Summary: sparse access to Array of structures does not
vectorize
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

In the following example "aos" does not vectorize, while the equivalent "aos2"
does vectorize using the vgatherdps instruction.

On a slightly different matter:
"soa" vectorizes and produces code that is apparently 20% faster than "aos2".
I may open a different PR with a benchmark attached...


cat simpleGather.cc
struct float3 {
  float x;
  float y;
  float z;
};

#define N 1024
float fx[N], g[N];
float fy[N];
float fz[N]; 
int k[N];

float3 f3[N];


void
aos (void)
{
  int i;
  for (i = 0; i < N; i++)
    g[i] = f3[k[i]].x+f3[k[i]].y+f3[k[i]].z;
}


// use gather
void
aos2 (void)
{
  float * ff = &(f3[0].x);
  int i;
  for (i = 0; i < N; i++)
    g[i] = ff[3*k[i]]+ff[3*k[i]+1]+ff[3*k[i]+2];
}


// use gather
void
soa (void)
{
  int i;
  for (i = 0; i < N; i++)
    g[i] = fx[k[i]]+fy[k[i]]+fz[k[i]];
}

[innocent@vinavx3 vectorize]$ c++ -Ofast -Wall -march=haswell -S
simpleGather.cc -fopt-info-vec
simpleGather.cc:31:17: note: loop vectorized
simpleGather.cc:41:17: note: loop vectorized
[innocent@vinavx3 vectorize]$ c++ -v
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure
--prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran
--enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246485] (GCC)

[Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance

2017-03-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

--- Comment #10 from vincenzo Innocente  ---
added a self contained "benchmark"

on my machine
[innocent@vinavx3 ctest]$ c++ -Ofast -Wall SparseOnly.c -march=native ; time
./a.out
0.496u 0.000s 0:00.49 100.0%0+0k 0+0io 0pf+0w
[innocent@vinavx3 ctest]$ c++ -O2 -Wall SparseOnly.c -march=native ; time
./a.out
0.411u 0.000s 0:00.41 100.0%0+0k 0+0io 0pf+0w
[innocent@vinavx3 ctest]$ c++ -O3 -Wall SparseOnly.c -march=native ; time
./a.out
0.413u 0.000s 0:00.41 100.0%0+0k 0+0io 0pf+0w

[Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance

2017-03-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

--- Comment #9 from vincenzo Innocente  ---
Created attachment 41070
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41070=edit
self contained benchmark of scimark2 SparseMat mult

The content is not randomized;
the parameters must be modified by hand in main.

[Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance

2017-03-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

--- Comment #8 from vincenzo Innocente  ---
My understanding of the gather latency is that it essentially corresponds to one
load per cache line: fast if all items are close together, slower than scalar loads if
the items are all in different cache lines. I am not sure how this can be turned into a
"cost model".

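As a rough illustration of the "one load per cache line" picture, the sketch below counts the distinct cache lines touched by the indices of a single gather. It is only a hypothetical cost proxy, not anything GCC implements: the function name lines_touched, the 64-byte line size, the line-aligned base address and the non-negative indices are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical cost proxy: number of distinct cache lines touched by one
// gather of `lanes` elements of `elem_bytes` bytes each.
// Assumes non-negative indices and a base address aligned to a cache line.
std::size_t lines_touched(const int* idx, std::size_t lanes,
                          std::size_t elem_bytes,
                          std::size_t line_bytes = 64) {
  assert(elem_bytes > 0 && line_bytes > 0);
  std::set<std::int64_t> lines;
  for (std::size_t i = 0; i < lanes; ++i) {
    const std::int64_t byte_off =
        static_cast<std::int64_t>(idx[i]) * static_cast<std::int64_t>(elem_bytes);
    lines.insert(byte_off / static_cast<std::int64_t>(line_bytes));
  }
  return lines.size();
}
```

With eight 4-byte elements at consecutive indices the count is 1; with indices 16 elements apart it is 8, which is the case described above as slower than scalar loads.
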
[Bug tree-optimization/80232] New: Ofast pessimizes Sparse matmult in scimark2 benchmark on avx platforms

2017-03-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80232

Bug ID: 80232
   Summary: Ofast pessimizes Sparse matmult in scimark2 benchmark
on avx platforms
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

on my machine
after the usual
mkdir scimark2TMP
cd scimark2TMP
wget http://math.nist.gov/scimark2/scimark2_1c.zip .
unzip scimark2_1c.zip
gcc -v

I get 
Using built-in specs.
COLLECT_GCC=c++
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure
--prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran
--enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246485] (GCC) 

[innocent@vinavx3 scimark2TMP]$ gcc -O2 -march=haswell *.c -lm
[innocent@vinavx3 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult"
Sparse matmult  Mflops:  3271.69(N=1000, nz=5000)
[innocent@vinavx3 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:  2946.76(N=10, nz=100)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=nehalem *.c -lm
[innocent@vinavx3 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult"
Sparse matmult  Mflops:  3281.93(N=1000, nz=5000)
[innocent@vinavx3 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:  2859.34(N=10, nz=100)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=corei7-avx *.c -lm
[innocent@vinavx3 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult"
Sparse matmult  Mflops:  2987.40(N=1000, nz=5000)
[innocent@vinavx3 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:  2869.35(N=10, nz=100)
[innocent@vinavx3 scimark2TMP]$ gcc -Ofast -march=haswell *.c -lm
[innocent@vinavx3 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult"
Sparse matmult  Mflops:  2579.52(N=1000, nz=5000)
[innocent@vinavx3 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:  2381.40(N=10, nz=100)

So O2 and sse4.2 are the fastest, avx is already slower, and avx2 is dramatically
slower.
Part of the difference may be due to the gather operation, as in #57796; I am not sure
what explains the difference w/r/t O2.
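
For orientation, here is a sketch of a compressed-row sparse matrix-vector product, the kind of kernel that the "Sparse matmult" line measures. It is a generic loop written for illustration (the name sparse_matvec and the use of std::vector are assumptions), not the scimark2 source. The load x[col[i]] is the indexed access that a vectorizer may turn into a gather, and a floating-point sum like this can normally be vectorized only when reassociation is allowed, which -Ofast enables.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = A*x for A in compressed-row form: row[r]..row[r+1] delimit the
// nonzeros of row r, col[i] is the column of the i-th nonzero, val[i] its value.
void sparse_matvec(std::vector<double>& y, const std::vector<double>& val,
                   const std::vector<int>& row, const std::vector<int>& col,
                   const std::vector<double>& x) {
  assert(row.size() == y.size() + 1);
  for (std::size_t r = 0; r < y.size(); ++r) {
    double sum = 0.0;
    for (int i = row[r]; i < row[r + 1]; ++i)
      sum += x[col[i]] * val[i];  // x[col[i]] is the indexed (gather) load
    y[r] = sum;
  }
}
```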


It is interesting to note that on KNL it makes almost no difference (not sure if
this is positive or negative...), with a hint of speedup for the large
problem...

[innocent@vinknl0 scimark2TMP]$ gcc -Ofast -march=knl *.c -lm
[innocent@vinknl0 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult"
./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:   348.13(N=1000, nz=5000)
[innocent@vinknl0 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:   358.67(N=10, nz=100)
[innocent@vinknl0 scimark2TMP]$ gcc -O2 -march=knl *.c -lm
[innocent@vinknl0 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult"
Sparse matmult  Mflops:   329.33(N=1000, nz=5000)
[innocent@vinknl0 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:   321.51(N=10, nz=100)
[innocent@vinknl0 scimark2TMP]$  gcc -Ofast -march=corei7-avx *.c -lm
[innocent@vinknl0 scimark2TMP]$ ./a.out 5 | grep "Sparse matmult"
Sparse matmult  Mflops:   343.12(N=1000, nz=5000)
[innocent@vinknl0 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:   323.03(N=10, nz=100)
[innocent@vinknl0 scimark2TMP]$ gcc -Ofast -march=nehalem *.c -lm
 ./a.out 5 | grep "Sparse matmult"
[innocent@vinknl0 scimark2TMP]$  ./a.out 5 | grep "Sparse matmult"
Sparse matmult  Mflops:   343.57(N=1000, nz=5000)
[innocent@vinknl0 scimark2TMP]$ ./a.out -large 5 | grep "Sparse matmult"
Sparse matmult  Mflops:   321.00(N=10, nz=100)

[Bug rtl-optimization/80197] New: pgo dramatically pessimizes scimark2 MonteCarlo benchmark

2017-03-26 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80197

Bug ID: 80197
   Summary: pgo dramatically pessimizes scimark2 MonteCarlo
benchmark
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

Created attachment 41053
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41053=edit
self contained benchmark of scimark2 MC

While chasing the regression that I found, and that was then identified and solved in #79389,
I discovered that pgo manages to do much worse than the regression above.
The symptom is the same: a huge increase in branch misses.
This is not a regression: it has been the same at least since gcc 5.3.
Attached is a self-contained single file, a copy of scimark2 MC, and a couple of
scripts to compile and run it.
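
To show where a data-dependent branch can appear in a Monte Carlo integration loop, here is a minimal sketch with a branchy and a branch-free form of the hit counter. It is not the attached fullMC.c: the Lcg generator is a stand-in, and the names count_branchy and count_branchless are hypothetical. The compiler remains free to emit either form for either function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in RNG (64-bit LCG), NOT the scimark2 generator.
struct Lcg {
  std::uint64_t s;
  double next() {
    s = s * 6364136223846793005ULL + 1442695040888963407ULL;
    // Keep the top 53 bits and scale by 2^-53 to get a value in [0,1).
    return static_cast<double>(s >> 11) * (1.0 / 9007199254740992.0);
  }
};

// Branchy form: the outcome of the `if` depends on random data.
long count_branchy(Lcg rng, long n) {
  long under = 0;
  for (long i = 0; i < n; ++i) {
    const double x = rng.next();
    const double y = rng.next();
    if (x * x + y * y <= 1.0) ++under;
  }
  return under;
}

// Branch-free form: the comparison result is added as 0 or 1.
long count_branchless(Lcg rng, long n) {
  long under = 0;
  for (long i = 0; i < n; ++i) {
    const double x = rng.next();
    const double y = rng.next();
    under += (x * x + y * y <= 1.0);
  }
  return under;
}
```

Both functions take the generator by value, so calling them with the same seed yields the same count.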

just
tar -xzf fullMC.tgz
cd fullMC
# standard compilation -O2 -O3
./runit 
# same with pgo passes
./dopgo

or just do
[innocent@vinavx3 fullMC]$ rm -rf pgo/* ; c++ -O3 fullMC.c -g
-fprofile-generate=pgo ; time ./a.out
1.848u 0.000s 0:01.85 99.4% 0+0k 0+8io 0pf+0w
[innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g -fprofile-use=./pgo ; time
./a.out
0.967u 0.001s 0:00.96 100.0%0+0k 0+0io 0pf+0w
[innocent@vinavx3 fullMC]$ c++ -O3 fullMC.c -g; time ./a.out
0.328u 0.000s 0:00.32 100.0%0+0k 0+0io 0pf+0w


for reference:
cat dopgo
cat /proc/cpuinfo | grep name | head -n 1
gcc -v
rm -rf pgo/*;gcc -O2 fullMC.c -g -fprofile-generate=pgo; ./a.out
gcc -O2 fullMC.c -g -fprofile-use=pgo; ./a.out
perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses
./a.out
rm -rf pgo/*;gcc -O3 fullMC.c -g -fprofile-generate=pgo; ./a.out
gcc -O3 fullMC.c -g -fprofile-use=pgo; ./a.out
perf stat -e task-clock -e cycles -e instructions -e branches -e branch-misses
./a.out


on my machine the result is
# standard compilation
[innocent@vinavx3 fullMC]$ ./runit 
model name  : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure
--prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran
--enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) 
gcc -O2 fullMC.c -g

real0m0.489s
user0m0.485s
sys 0m0.002s

 Performance counter stats for './a.out':

486.303424  task-clock (msec) #0.999 CPUs utilized  
1901271534  cycles#3.910 GHz
6403589598  instructions  #3.37  insn per cycle 
 700683389  branches  # 1440.836 M/sec  
 13582  branch-misses #0.00% of all branches

   0.486571089 seconds time elapsed

gcc -O3 fullMC.c -g

real0m0.330s
user0m0.330s
sys 0m0.000s

 Performance counter stats for './a.out':

327.385696  task-clock (msec) #0.999 CPUs utilized  
1279958668  cycles#3.910 GHz
5009002909  instructions  #3.91  insn per cycle 
 306481761  branches  #  936.149 M/sec  
 10805  branch-misses #0.00% of all branches

   0.327637485 seconds time elapsed


// pgo generation and use (perf after use...)
[innocent@vinavx3 fullMC]$ ./dopgo 
model name  : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/afs/cern.ch/work/i/innocent/public/w5/bin/../libexec/gcc/x86_64-pc-linux-gnu/7.0.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk//configure
--prefix=/afs/cern.ch/user/i/innocent/w5 -enable-languages=c,c++,lto,fortran
--enable-lto -enable-libitm -disable-multilib
Thread model: posix
gcc version 7.0.1 20170326 (experimental) [trunk revision 246482] (GCC) 

 Performance counter stats for './a.out':

964.399833  task-clock (msec) #1.000 CPUs utilized  
3770455888  cycles#3.910 GHz
5007987488  instructions  #1.33  insn per cycle 
 816525627  branches  #  846.667 M/sec  
  88982233  branch-misses #   10.90% of all branches

   0.964699603 seconds time elapsed


 Performance counter stats for './a.out':

964.540691  task-clock (msec) #1.000 CPUs utilized  
3771010753  cycles#3.910 GHz
5007957589  instructi

[Bug tree-optimization/79594] New: -Waggressive-loop-optimizations incomplete and/or inconsistentt

2017-02-19 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79594

Bug ID: 79594
   Summary: -Waggressive-loop-optimizations incomplete and/or
inconsistentt
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

given
 cat aggressiveLoop.cc
#include 
#include

float x[1024];
float y[1024];
float w[512];
float z[128];

float c,q;

void foo() {
  for (int i=0; i<1024; ++i) {
   auto zz=z[i];
   auto yy = y[i];
   if(x[i] > q)  yy = zz;
   y[i]=yy;
  }
}

void foo2() {
  for (int i=0; i<1024; ++i) {
   auto zz=z[i];
   auto yy = w[i];
   if(x[i] > q)  yy = zz;
   x[i]=yy;
  }
}

void foo3() {
  for (int i=0; i<1024; ++i) {
   auto zz=z[i];
   auto yy = w[i];
   if(x[i] > q)  yy = zz;
   w[i]=yy;
  }
}

gcc version 7.0.1 20170205 (experimental) [trunk revision 245191] (GCC) 
reports
c++ -Wall -O2 aggressiveLoop.cc -S
aggressiveLoop.cc: In function 'void foo()':
aggressiveLoop.cc:13:9: warning: iteration 128 invokes undefined behavior
[-Waggressive-loop-optimizations]
auto zz=z[i];
 ^~
aggressiveLoop.cc:12:18: note: within this loop
   for (int i=0; i<1024; ++i) {
 ~^
aggressiveLoop.cc: In function 'void foo2()':
aggressiveLoop.cc:22:9: warning: iteration 128 invokes undefined behavior
[-Waggressive-loop-optimizations]
auto zz=z[i];
 ^~
aggressiveLoop.cc:21:18: note: within this loop
   for (int i=0; i<1024; ++i) {
 ~^
aggressiveLoop.cc: In function 'void foo3()':
aggressiveLoop.cc:34:8: warning: iteration 512 invokes undefined behavior
[-Waggressive-loop-optimizations]
w[i]=yy;
^~~
aggressiveLoop.cc:30:18: note: within this loop
   for (int i=0; i<1024; ++i) {
 ~^

However, in foo2 there is also "auto yy = w[i];", which is not reported,
and in foo3 both assignments
  auto zz=z[i];
  auto yy = w[i];
invoke undefined behavior, at iterations 128 and 512 respectively, yet only the store to w[i] is reported...
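
The iteration numbers above follow from the array extents alone: the first out-of-bounds iteration of a loop is the smallest extent among the arrays it indexes with i. A small sketch of that arithmetic (the helper name first_oob_iteration is hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Returns the first iteration at which a loop `for (i = 0; i < trip; ++i)`
// indexes past the end of any array whose extent is listed, or `trip`
// if every access stays in bounds.
inline std::size_t first_oob_iteration(std::initializer_list<std::size_t> extents,
                                       std::size_t trip) {
  std::size_t first = trip;
  for (std::size_t e : extents) first = std::min(first, e);
  return first;
}
```

For foo3 the arrays z[128], w[512] and x[1024] give 128 as the first offending iteration, while the reported warning names iteration 512.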

[Bug tree-optimization/77859] Ofast needed to vectorize loop in presence of conditional code

2016-10-05 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77859

--- Comment #2 from vincenzo Innocente  ---
Thanks for the fast response.
I think I can "survive" with -O3 -fno-trapping-math:
in principle it should not change the binary compatibility of the output w/r/t
-O2,
and to the best of my understanding it does not inhibit raising FP exceptions
(we already force -fno-math-errno to avoid errno generation in sqrt...).

[Bug tree-optimization/77859] New: Ofast needed in presence of conditional code

2016-10-05 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77859

Bug ID: 77859
   Summary: Ofast needed in presence of conditional code
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

It looks to me that to vectorize this code "relaxed floating point math" is not
a requirement

currently gcc version 7.0.0 20161004 (experimental) [trunk revision 240754]
(GCC) 
requires Ofast and not O3 to vectorize it

#include 

float x[1024];
float y[1024];

void needOfast() {
  for (int i=0; i<1024; ++i) {
   constexpr float pi4 = M_PI/4.;
   constexpr float pi2 = M_PI/2.;
   auto g1 = x[i] > pi4;
   auto xx = x[i];
   xx = g1 ? xx-pi2 : xx;
   auto g2 = xx > pi4;
   xx = g2 ? xx-pi2 : xx;
   y[i] = xx;
  }
}

In case anyone wonders, this alternative formulation needs Ofast as well:
void needOfastAsWell() {
  for (int i=0; i<1024; ++i) {
   constexpr float pi  = M_PI;
   constexpr float pi4 = M_PI/4.;
   constexpr float pi34 = 3.*M_PI/4.;
   constexpr float pi2 = M_PI/2.;
   auto g1 = x[i] > pi4;
   auto xx = x[i];
   xx = g1 ? xx-pi2 : xx;
   auto g2 = x[i] > pi34;
   xx = g2 ? x[i]-pi : xx;
   y[i] = xx;
  }
}
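
A scalar sketch of one element of each loop body may help when comparing the two formulations. The names reduce_twice and reduce_select and the constant kPi are hypothetical; kPi replaces M_PI only to keep the snippet self-contained. The two are expected to agree up to rounding for inputs away from the thresholds, since one subtracts pi/2 twice and the other subtracts pi once.

```cpp
#include <cassert>
#include <cmath>

constexpr double kPi = 3.14159265358979323846;  // stands in for M_PI

// One element of needOfast(): conditionally subtract pi/2, twice.
inline float reduce_twice(float x) {
  constexpr float pi4 = kPi / 4.;
  constexpr float pi2 = kPi / 2.;
  float xx = x > pi4 ? x - pi2 : x;
  return xx > pi4 ? xx - pi2 : xx;
}

// One element of needOfastAsWell(): both selects test the original x.
inline float reduce_select(float x) {
  constexpr float pi = kPi;
  constexpr float pi4 = kPi / 4.;
  constexpr float pi34 = 3. * kPi / 4.;
  constexpr float pi2 = kPi / 2.;
  float xx = x > pi4 ? x - pi2 : x;
  return x > pi34 ? x - pi : xx;
}
```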

[Bug middle-end/71666] profile-generate not documented

2016-06-26 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71666

--- Comment #2 from vincenzo Innocente  ---
OK, so it is just the sentence " See Optimize Options" which needs to be
changed...

[Bug web/71666] New: profile-generate not documented

2016-06-26 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71666

Bug ID: 71666
   Summary: profile-generate not documented
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: web
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

as of today
-fprofile-generate does not seem to be documented in
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
it is quoted 4 times including a self-referencing
" See Optimize Options, for information about the -fprofile-generate option"
(btw -fprofile-dir is quoted and not documented as well)

[Bug gcov-profile/70993] New: ICE with gcov and lto

2016-05-07 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70993

Bug ID: 70993
   Summary: ICE with gcov and lto
   Product: gcc
   Version: 7.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: gcov-profile
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

with gcc version 7.0.0 20160506 (experimental) [trunk revision 235977] (GCC) 

cat main.cpp 
int main() { return 0;}
c++ -O2 main.cpp 
perf record -e
cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=49/ -o
perf.data ./a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (1 samples) ]

create_gcov --binary=./a.out --profile=perf.data --gcov=fbdata.afdo
-gcov_version 1

c++ -O2 main.cpp -fauto-profile -flto
lto1: internal compiler error: in compute_working_sets, at gcov-io.c:1006
0x6fee03 compute_working_sets(gcov_ctr_summary const*, gcov_working_set_info*)
../../gcc-trunk/gcc/gcov-io.c:1006
0xa1a72d get_working_sets()
../../gcc-trunk/gcc/profile.c:226
0x97cd1a input_symtab()
../../gcc-trunk/gcc/lto-cgraph.c:1869
0x6634a7 read_cgraph_and_symbols
../../gcc-trunk/gcc/lto/lto.c:2856
0x6634a7 lto_main()
../../gcc-trunk/gcc/lto/lto.c:3305
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <http://gcc.gnu.org/bugs.html> for instructions.
lto-wrapper: fatal error: c++ returned 1 exit status
compilation terminated.
/usr/bin/ld: lto-wrapper failed
collect2: error: ld returned 1 exit status

[Bug c++/69564] [5/6 Regression] lto and/or C++ make scimark2 LU slower

2016-03-25 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564

--- Comment #19 from vincenzo Innocente  ---
patch applied to 
gcc version 6.0.0 20160324 (experimental) [trunk revision 234461] (GCC) 
I confirm the improvement in timing for c++ and lto
timing difference between gcc and c++ seems to be inside "errors"
I am satisfied.

Thanks Patrick!

(btw I suppose no hope for a back port to 5.4?)

[Bug c++/69564] lto and/or C++ make scimark2 LU slower

2016-02-01 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564

--- Comment #5 from vincenzo Innocente  ---
it is a regression 
gcc version 4.9.3 (GCC) 
c++ -Ofast *.c; ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
gcc -Ofast *.c; ./a.out
c++ -v
Composite Score: 2449.06
FFT Mflops:  2046.03(N=1024)
SOR Mflops:  1654.04(100 x 100)
MonteCarlo: Mflops:   813.44
Sparse matmult  Mflops:  2962.08(N=1000, nz=5000)
LU  Mflops:  4769.72(M=100, N=100)
---
gcc -Ofast *.c -lm; ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2475.22
FFT Mflops:  2064.19(N=1024)
SOR Mflops:  1633.01(100 x 100)
MonteCarlo: Mflops:   810.37
Sparse matmult  Mflops:  2970.47(N=1000, nz=5000)
LU  Mflops:  4898.06(M=100, N=100)

[Bug c++/69564] lto and/or C++ make scimark2 LU slower

2016-02-01 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564

--- Comment #3 from vincenzo Innocente  ---
> Any reason you are using the c++ driver here?
Because I am interested in C++ performance.
I never imagined that the c++ front-end could make a difference on such code...
From my point of view it is an even more severe regression than just "lto".

[Bug lto/69564] New: lto makes scimark2 LU slower

2016-01-30 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69564

Bug ID: 69564
   Summary: lto makes scimark2 LU slower
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: lto
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

mkdir scimark2; cd scimark2
wget http://math.nist.gov/scimark2/scimark2_1c.zip
unzip scimark2_1c.zip
c++ -Ofast *.c; ./a.out
c++ -Ofast *.c -flto; ./a.out


with gcc 4.9.3
gcc version 4.9.3 (GCC) 

c++ -Ofast *.c; ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2462.90
FFT Mflops:  2070.32(N=1024)
SOR Mflops:  1661.17(100 x 100)
MonteCarlo: Mflops:   813.44
Sparse matmult  Mflops:  2978.91(N=1000, nz=5000)
LU  Mflops:  4790.64(M=100, N=100)
[innocent@vinavx3 scimark2]$ c++ -Ofast *.c -flto; ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2582.94
FFT Mflops:  2064.19(N=1024)
SOR Mflops:  1654.04(100 x 100)
MonteCarlo: Mflops:  1426.90
Sparse matmult  Mflops:  2978.91(N=1000, nz=5000)
LU  Mflops:  4790.64(M=100, N=100)


with latest build
gcc version 6.0.0 20160129 (experimental) (GCC) 

[innocent@vinavx3 scimark2]$ c++ -Ofast *.c; ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2377.18
FFT Mflops:  1970.89(N=1024)
SOR Mflops:  1654.04(100 x 100)
MonteCarlo: Mflops:   810.37
Sparse matmult  Mflops:  3328.81(N=1000, nz=5000)
LU  Mflops:  4121.76(M=100, N=100)
[innocent@vinavx3 scimark2]$ c++ -Ofast *.c -flto; ./a.out
**  **
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
**  **
Using   2.00 seconds min time per kenel.
Composite Score: 2136.23
FFT Mflops:  2076.48(N=1024)
SOR Mflops:  1654.04(100 x 100)
MonteCarlo: Mflops:  1533.92
Sparse matmult  Mflops:  3266.59(N=1000, nz=5000)
LU  Mflops:  2150.13(M=100, N=100)

[Bug c++/68180] New: [ICE] at cp/constexpr.c:2768 in initializing __vector in a loop

2015-11-02 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68180

Bug ID: 68180
   Summary: [ICE]  at cp/constexpr.c:2768 in initializing __vector
in a loop
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
constexpr float32x4_t fill(float x) {
  float32x4_t v{0};
  constexpr auto vs = sizeof(v)/sizeof(v[0]);
  for (auto i=0U; i<vs; ++i) v[i]=i;
  return v+x;
}

float32x4_t foo(float32x4_t x) {
  constexpr float32x4_t v = fill(1.f);
  return x+v;
}

gcc version 6.0.0 20151028 (experimental) [trunk revision 229474] (GCC)
ICE in
c++ -O2 avxconst.cc -std=c++17 -S
avxconst.cc: In function ‘float32x4_t foo(float32x4_t)’:
avxconst.cc:10:33:   in constexpr expansion of ‘fill(1.0e+0f)’
avxconst.cc:10:37: internal compiler error: tree check: expected constructor,
have vector_cst in cxx_eval_store_expression, at cp/constexpr.c:2768
   constexpr float32x4_t v = fill(1.f);
 ^

avxconst.cc:10:37: internal compiler error: Abort trap: 6
c++: internal compiler error: Abort trap: 6 (program cc1plus)
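
Until the ICE is fixed, one possible way around it is to drop constexpr from the helper so that the element stores are never evaluated at compile time. This is a sketch, not a verified workaround on the affected revision: the names fill_rt and foo_rt are hypothetical, and whether the optimizer still folds the result into a constant at -O2 is an assumption to be checked in the generated code.

```cpp
#include <cassert>

typedef float __attribute__((vector_size(16))) float32x4_t;

// Same body as fill() in the report, minus constexpr, so the per-element
// stores are ordinary runtime code (which the optimizer may still fold).
inline float32x4_t fill_rt(float x) {
  float32x4_t v{0};
  constexpr auto vs = sizeof(v) / sizeof(v[0]);
  for (auto i = 0U; i < vs; ++i) v[i] = i;
  return v + x;
}

float32x4_t foo_rt(float32x4_t x) {
  const float32x4_t v = fill_rt(1.f);
  return x + v;
}
```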

[Bug c++/68125] New: std::sqrt prevent use of associative math

2015-10-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68125

Bug ID: 68125
   Summary: std::sqrt prevent use of associative math
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

with -Ofast

the code generated differs
float rsqrt1(float a, float x, float y) {
   return a/std::sqrt(x)/std::sqrt(y);
}

float rsqrt2(float a, float x, float y) {
   return a/sqrtf(x)/sqrtf(y);
}

rsqrt1(float, float, float):
sqrtss  %xmm2, %xmm2
sqrtss  %xmm1, %xmm1
mulss   %xmm2, %xmm1
divss   %xmm1, %xmm0
ret
rsqrt2(float, float, float):
mulss   %xmm1, %xmm2
rsqrtss %xmm2, %xmm1
mulss   %xmm1, %xmm2
mulss   %xmm1, %xmm2
mulss   .LC9(%rip), %xmm1
addss   .LC8(%rip), %xmm2
mulss   %xmm1, %xmm2
mulss   %xmm2, %xmm0
ret
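
Reading the rsqrt2 listing: it multiplies x and y, takes a reciprocal square root estimate with rsqrtss, and then runs a multiply/add sequence that looks like one Newton-Raphson refinement step. The sketch below writes that step in portable C++. The constants 3 and 0.5 are assumptions (the contents of .LC8 and .LC9 are not shown in the report), the function names are hypothetical, and the estimate argument stands in for the rsqrtss result.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for 1/sqrt(v), starting from an estimate r0:
//   r1 = 0.5 * r0 * (3 - v * r0 * r0)
inline float refine_rsqrt(float v, float r0) {
  return 0.5f * r0 * (3.0f - v * r0 * r0);
}

// a / sqrt(x) / sqrt(y) rewritten as a * rsqrt(x * y), which is what the
// rsqrt2 listing appears to compute. `estimate` stands in for rsqrtss.
inline float rsqrt_form(float a, float x, float y, float estimate) {
  return a * refine_rsqrt(x * y, estimate);
}
```

Merging the two square roots into one rsqrt of the product is a reassociation, which is why it is only expected under -Ofast.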


[Bug c++/68125] std::sqrt prevent use of associative math

2015-10-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68125

--- Comment #2 from vincenzo Innocente  ---
Thanks Marc for the fast check
I am still with
gcc version 6.0.0 20150801 (experimental) [trunk revision 226463] (GCC) 
will update and verify


[Bug c++/68125] std::sqrt prevent use of associative math

2015-10-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68125

vincenzo Innocente  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED

--- Comment #3 from vincenzo Innocente  ---
confirmed fixed in
gcc version 6.0.0 20151028 (experimental) [trunk revision 229474] (GCC) 
still generated code is NOT identical
__Z6rsqrt1fff:
LFB230:
mulss   %xmm1, %xmm2
rsqrtss %xmm2, %xmm3
mulss   %xmm3, %xmm2
movaps  %xmm2, %xmm1
mulss   %xmm3, %xmm1
addss   LC0(%rip), %xmm1
mulss   LC1(%rip), %xmm3
mulss   %xmm3, %xmm1
mulss   %xmm1, %xmm0
ret
LFE230:
.align 4,0x90
.globl __Z6rsqrt2fff
__Z6rsqrt2fff:
LFB228:
mulss   %xmm2, %xmm1
rsqrtss %xmm1, %xmm2
mulss   %xmm2, %xmm1
mulss   %xmm2, %xmm1
addss   LC0(%rip), %xmm1
mulss   LC1(%rip), %xmm2
mulss   %xmm2, %xmm1
mulss   %xmm1, %xmm0
ret
LF


[Bug libgomp/67406] OMP SIMD cloning does not generate fma instruction for AVX2 target

2015-09-06 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406

--- Comment #5 from vincenzo Innocente  ---
does not work...

#pragma omp declare simd notinbranch
float __attribute__ ((__target__ ("default")))
fma(float x,float y, float z);
#pragma omp declare simd notinbranch
float __attribute__ ((__target__ ("arch=haswell")))
fma(float x,float y, float z);
void foo() {
  #pragma omp simd
  for (int i=0; i<1024; ++i)
   v0[i] = fma(v1[i],v2[i],v3[i]);
}


generates
.L11:
vmovss  v3(%rbx), %xmm2
addq    $4, %rbx
vmovss  v2-4(%rbx), %xmm1
vmovss  v1-4(%rbx), %xmm0
call    _Z15_Z3fmafff.ifuncfff
vmovss  %xmm0, v0-4(%rbx)
cmpq    $4096, %rbx
jne     .L11

dispatching, no vectorization...


[Bug libgomp/67406] OMP SIMD cloning does not generate fma instruction for AVX2 target

2015-09-06 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406

--- Comment #4 from vincenzo Innocente  ---
#pragma omp declare simd notinbranch
float __attribute__ ((__target__ ("default")))
fma(float x,float y, float z) {
   return x+y*z;
}
#pragma omp declare simd notinbranch
float __attribute__ ((__target__ ("arch=haswell")))
fma(float x,float y, float z) {
   return x+y*z;
}
#pragma omp declare simd notinbranch
float __attribute__ ((__target__ ("arch=bdver1")))
fma(float x,float y, float z) {
   return x+y*z;
}

seems to generate a real fat library
c++ -Ofast -fopenmp -S simdCloning.cc; grep fmafff simdCloning.s
.globl __Z3fmafff
__Z3fmafff:
.globl __Z3fmafff.arch_haswell
__Z3fmafff.arch_haswell:
.globl __Z3fmafff.arch_bdver1
__Z3fmafff.arch_bdver1:
.globl __ZGVbN4vvv__Z3fmafff.arch_bdver1
__ZGVbN4vvv__Z3fmafff.arch_bdver1:
.globl __ZGVcN8vvv__Z3fmafff.arch_bdver1
__ZGVcN8vvv__Z3fmafff.arch_bdver1:
.globl __ZGVdN8vvv__Z3fmafff.arch_bdver1
__ZGVdN8vvv__Z3fmafff.arch_bdver1:
.globl __ZGVbN4vvv__Z3fmafff.arch_haswell
__ZGVbN4vvv__Z3fmafff.arch_haswell:
.globl __ZGVcN8vvv__Z3fmafff.arch_haswell
__ZGVcN8vvv__Z3fmafff.arch_haswell:
.globl __ZGVdN8vvv__Z3fmafff.arch_haswell
__ZGVdN8vvv__Z3fmafff.arch_haswell:
.globl __ZGVbN4vvv__Z3fmafff
__ZGVbN4vvv__Z3fmafff:
.globl __ZGVcN8vvv__Z3fmafff
__ZGVcN8vvv__Z3fmafff:
.globl __ZGVdN8vvv__Z3fmafff
__ZGVdN8vvv__Z3fmafff:

I now have to test that it uses the correct one!
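
To read the clone symbols listed above: after the _ZGV prefix come an ISA letter (b, c and d correspond to the SSE, AVX and AVX2 variants), N or M for unmasked or masked, the number of lanes, one letter per parameter (v for a vector argument), and then the name of the scalar function. The decoder below is a small sketch of that reading; the struct and function names are hypothetical, and it handles only the simple forms that appear in this listing.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct SimdClone {
  bool valid = false;
  char isa = 0;             // 'b', 'c', 'd', ...
  bool masked = false;      // true for 'M', false for 'N'
  int lanes = 0;            // vector length in elements
  std::string params;       // e.g. "vvv"
  std::string scalar_name;  // name of the scalar function
};

SimdClone decode_simd_clone(std::string sym) {
  SimdClone out;
  // Mach-O targets (as in the report) prepend one extra underscore.
  if (sym.rfind("__ZGV", 0) == 0) sym.erase(0, 1);
  if (sym.rfind("_ZGV", 0) != 0 || sym.size() < 7) return out;
  std::size_t pos = 4;
  out.isa = sym[pos++];
  const char mask = sym[pos++];
  if (mask != 'N' && mask != 'M') return out;
  out.masked = (mask == 'M');
  while (pos < sym.size() && std::isdigit(static_cast<unsigned char>(sym[pos])))
    out.lanes = out.lanes * 10 + (sym[pos++] - '0');
  const std::size_t sep = sym.find('_', pos);
  if (out.lanes == 0 || sep == std::string::npos) return out;
  out.params = sym.substr(pos, sep - pos);
  out.scalar_name = sym.substr(sep + 1);
  out.valid = true;
  return out;
}
```

For example, __ZGVdN8vvv__Z3fmafff.arch_haswell decodes as the 'd' variant, unmasked, 8 lanes, three vector parameters.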


[Bug libgomp/67406] New: OMP SIMD cloning does not generate fma instruction for AVX2 target

2015-08-31 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406

Bug ID: 67406
   Summary: OMP SIMD cloning does not generate fma instruction for
AVX2 target
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgomp
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
CC: jakub at gcc dot gnu.org
  Target Milestone: ---

given
cat simdCloning.cc
#pragma omp declare simd notinbranch
float fma(float x,float y, float z) {
   return x+y*z;
}

compiled with
c++ -S -fopenmp -Ofast -Wall simdCloning.cc; cat simdCloning.s
will generate the same code for AVX and AVX2 clones
__ZGVdN8vvv__Z3fmafff:
LFB3:
leaq    8(%rsp), %r10
LCFI5:
andq    $-32, %rsp
vmulps  %ymm2, %ymm1, %ymm1
pushq   -8(%r10)
pushq   %rbp
vaddps  %ymm0, %ymm1, %ymm0

while I would have expected
__ZGVdN8vvv__Z3fmafff:
LFB3:
leaq    8(%rsp), %r10
LCFI5:
andq    $-32, %rsp
vfmadd231ps %ymm2, %ymm1, %ymm0
pushq   -8(%r10)
pushq   %rbp


This last code was obtained by compiling with -mfma.
Unfortunately, in this case ALL clones use avx2 instructions
(so again the AVX and AVX2 clones are identical).


btw: is there any reason why the AVX512 clone is not generated?
I am using gcc version 6.0.0 20150801 (experimental) [trunk revision 226463]
(GCC)


[Bug libgomp/67406] OMP SIMD cloning does not generate fma instruction for AVX2 target

2015-08-31 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67406

--- Comment #2 from vincenzo Innocente  ---
Is there any mechanism to tell gcc to generate the AVX2 clone using fma?
I understand it reduces portability; still, at the moment I have to support
mostly Intel platforms.
For AMD, gcc suggests using avx128, so it would anyhow require a different
library to exploit fma4.


[Bug c++/67335] New: [ICE] in compiling mop sims function with unused argument

2015-08-24 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67335

Bug ID: 67335
   Summary: [ICE] in compiling mop sims function with unused
argument
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

cat ompsimd_t.cc
#pragma omp declare simd notinbranch uniform(q)
float bar(float x, float * q, int){
  return q[0]+q[1]*x;
}

c++ -fopenmp -Wall -S  ompsimd_t.cc
ompsimd_t.cc: In function ‘__vector(4) float _Z3barfPfi.simdclone.0(float,
float*, int)’:
ompsimd_t.cc:4:1: internal compiler error: Segmentation fault: 11
 }
 ^

ompsimd_t.cc:4:1: internal compiler error: Abort trap: 6
c++: internal compiler error: Abort trap: 6 (program cc1plus)

gcc version 6.0.0 20150801 (experimental) [trunk revision 226463] (GCC)

[Bug tree-optimization/67326] New: [5.2/6.0 regression] -ftree-loop-if-convert-stores does not vectorize conditional assignment (anymore)

2015-08-23 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67326

Bug ID: 67326
   Summary: [5.2/6.0 regression] -ftree-loop-if-convert-stores
does not vectorize conditional assignment (anymore)
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch
  Target Milestone: ---

in 5.1 looks ok (according to http://gcc.godbolt.org)

cat condBug.cc 
float v0[1024];
float v1[1024];
float v2[1024];
float v3[1024];

void condAssign1() {
  for(int i=0; i<1024; ++i)
    v0[i] = (v2[i]>v1[i]) ? v2[i]*v3[i] : v0[i];
}


void condAssign2() {
  for(int i=0; i<1024; ++i)
    v0[i] = (v2[i]>v1[i]) ? v2[i]*v1[i] : v0[i];
}

c++ -Ofast -fopt-info-vec -ftree-loop-if-convert-stores -S condBug.cc
condBug.cc:7:3: note: loop vectorized
condBug.cc:13:3: note: loop vectorized
gcc version 4.9.3 (GCC) 

c++ -Ofast -fopt-info-vec -ftree-loop-if-convert-stores -S condBug.cc
condBug.cc:13:17: note: loop vectorized
with gcc version 6.0.0 20150801 (experimental) [trunk revision 226463] (GCC)


[Bug tree-optimization/63644] New: Kahan Summation with fast-math, pattern not always recognized

2014-10-25 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63644

Bug ID: 63644
   Summary: Kahan Summation with fast-math, pattern not always
recognized
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch

In the following example (compiled with -Ofast -std=c++11) the Kahan summation
pattern is recognized in sum, but not in counter;
see
http://goo.gl/aJn61B


#include<cstdio>

template<typename T>
struct KahanSum {
  KahanSum(T is=0) : sum(is){}
  KahanSum<T>& operator+=(T a) { add(a); return *this;}
  void add(T a) {
    float x = a - eps;
    float s = sum + x;
    eps = (s-sum) - x;
    sum = s;

  }
  T result() const { return sum;}
  T sum;
  T eps=0;

};

float a[1204];
float sum() {
  KahanSum<float> res;
  for (int i=0; i<1024; ++i) res+= a[i];
  return res.result();
}


float counter(int maxl) {
   float tenth=0.1f;
   KahanSum<float> sum = tenth;
   int n=0;
   while(n<maxl) {
     sum += tenth;
     ++n;
     // if (n<21 || n%36000==0) printf("%d %f %a\n",n,sum.result(),sum.result());
   }
   // use eps to avoid optimization out

   float count = float(60*60*100*10);
   printf("\n\n%f %f %a\n\n",count,float(count*tenth),float(count*tenth));

   return sum.result();
}


[Bug tree-optimization/63599] New: wrong branch optimization with Ofast in a loop

2014-10-20 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63599

Bug ID: 63599
   Summary: wrong branch optimization with Ofast in a loop
   Product: gcc
   Version: 5.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch

given this code

#include <x86intrin.h>

typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;

inline
float32x4_t atan(float32x4_t t) {
  constexpr float PIO4F = 0.7853981633974483096f;
  float32x4_t high = t > 0.4142135623730950f;
  auto z = t;
  float32x4_t ret={0.f,0.f,0.f,0.f};
// if all low no need to blend
  if ( _mm_movemask_ps(high) != 0) {
    z   = ( t > 0.4142135623730950f ) ? (t-1.0f)/(t+1.0f) : t;
    ret = ( t > 0.4142135623730950f ) ? ret+PIO4F : ret;
  }
  /* polynomial removed */
  return  ret += z;
}


float32x4_t doAtan(float32x4_t z) { return atan(z);}

float32x4_t va[1024];
float32x4_t vb[1024];

void computeV() {
  for (int i=0;i!=1024;++i)
vb[i]=atan(va[i]);
}


compiled with -Ofast
c++ -S -std=c++1y -Ofast bugmvmk.cc -march=nehalem; cat bugmvmk.s
produces the following code, where the "movmskps %xmm8, %edx"
does not protect the code in the if block...

__Z8computeVv:
LFB2512:
movaps LC0(%rip), %xmm4
xorl %eax, %eax
movaps LC1(%rip), %xmm7
leaq _va(%rip), %rcx
movaps LC2(%rip), %xmm6
movaps LC3(%rip), %xmm5
.align 4,0x90
L10:
movaps (%rcx,%rax), %xmm2
movaps %xmm4, %xmm8
movaps %xmm2, %xmm3
cmpltps %xmm2, %xmm8
movaps %xmm2, %xmm1
addps %xmm6, %xmm3
addps %xmm7, %xmm1
movmskps %xmm8, %edx
andps %xmm5, %xmm8
rcpps %xmm3, %xmm0
mulps %xmm0, %xmm3
mulps %xmm0, %xmm3
addps %xmm0, %xmm0
subps %xmm3, %xmm0
mulps %xmm0, %xmm1
movaps %xmm2, %xmm0
cmpleps %xmm4, %xmm0
blendvps %xmm0, %xmm2, %xmm1
pxor %xmm0, %xmm0
testl %edx, %edx
je L7
movaps %xmm8, %xmm0
L7:
testl %edx, %edx
je L9
movaps %xmm1, %xmm2
L9:
addps %xmm0, %xmm2
leaq _vb(%rip), %rdx
movaps %xmm2, (%rdx,%rax)
addq $16, %rax
cmpq $16384, %rax
jne L10
ret

while with O2 it is ok
__Z8computeVv:
LFB2512:
movaps LC0(%rip), %xmm4
xorl %eax, %eax
movaps LC1(%rip), %xmm7
leaq _va(%rip), %rsi
movaps LC2(%rip), %xmm6
leaq _vb(%rip), %rcx
movaps LC3(%rip), %xmm5
.align 4,0x90
L7:
movaps (%rsi,%rax), %xmm1
movaps %xmm4, %xmm0
pxor %xmm2, %xmm2
cmpltps %xmm1, %xmm0
movmskps %xmm0, %edx
testl %edx, %edx
je L6
movaps %xmm1, %xmm3
movaps %xmm1, %xmm2
addps %xmm6, %xmm2
addps %xmm7, %xmm3
divps %xmm2, %xmm3
movaps %xmm0, %xmm2
andps %xmm5, %xmm2
blendvps %xmm0, %xmm3, %xmm1
L6:
addps %xmm2, %xmm1
movaps %xmm1, (%rcx,%rax)
addq $16, %rax
cmpq $16384, %rax
jne L7
ret

note that the function not in the loop (doAtan) is ok with both O2 and Ofast


[Bug target/63599] wrong branch optimization with Ofast in a loop

2014-10-20 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63599

--- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
I agree that the code produces correct results, but it looks sub-optimal to me.
I understand that with Ofast the sequence below will always be executed
andps %xmm5, %xmm8
rcpps %xmm3, %xmm0
mulps %xmm0, %xmm3
mulps %xmm0, %xmm3
addps %xmm0, %xmm0
subps %xmm3, %xmm0
mulps %xmm0, %xmm1
movaps %xmm2, %xmm0
cmpleps %xmm4, %xmm0
blendvps %xmm0, %xmm2, %xmm1

while with O2 it will not.
This generates a performance penalty for samples where the test is often
false.
(I tried adding __builtin_expect(x, false), with no effect.)


[Bug tree-optimization/56829] Feature request: generic builtin to support control flow in vectorized code (movemask, vec_any/all_*)

2014-10-05 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56829

--- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
just to add the OpenCL syntax and doc
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/any.html


[Bug tree-optimization/50374] Support vectorization of min/max location pattern

2014-08-23 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50374

vincenzo Innocente vincenzo.innocente at cern dot ch changed:

   What|Removed |Added

  Known to fail||4.9.1

--- Comment #27 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
coming back to this old issue.
Any chance to see it implemented in the auto-vectorizer soon?

Using extended vectors I managed to vectorize min_element as below.
In principle the auto-vectorizer should be able to do the same starting from
the loop in comment 3.


typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
typedef float __attribute__( ( vector_size( 16 ) , aligned(4) ) )
float32x4a4_t;
typedef int __attribute__( ( vector_size( 16 ) ) ) int32x4_t;


inline
float32x4_t load(float const * x) {
   return *(float32x4a4_t const *)(x);
}


int minloc(float const * x, int N) {
  float32x4_t v0;
  int32x4_t index;

  auto M = 4*(N/4);
  // load the tail (N%4 elements) into the first lanes
  for (int i=M; i<N; ++i) {
    v0[i-M] = x[i];
    index[i-M]=i;  // lane is i-M: index[i] would be out of range
  }
  // pad the remaining lanes with x[0]
  for (int i=N; i<M+4;++i) {
    v0[i-M] = x[0];
    index[i-M]=0;
  }
  int32x4_t j = {0,1,2,3};
  for (int i=0; i<M; i+=4) {
    decltype(auto) v = load(x+i);
    index =  (v<v0) ? j : index;
    v0 = (v<v0) ? v : v0;
    j+=4;
  }
  auto k = 0;
  for (int i=1;i<4; ++i) if (v0[i]<v0[k]) k=i;
  return index[k];
}


#include<iostream>
#include<algorithm>
#include <x86intrin.h>
unsigned int taux=0;
inline unsigned long long rdtscp() {
 return __rdtscp(&taux);
}

int main() {

  float x[1024];
  for (int i=0; i<1024; ++i) x[i]= i%2 ? i : -i;
  for (int i = 0; i<10; ++i) {
   std::random_shuffle(x,x+1024);
   long long ts = -rdtscp();
   int l1 = std::min_element(x+i,x+1024) - (x+i);
   ts +=rdtscp();
   long long tv = -rdtscp();
   int l2 = minloc(x+i,1024-i);
   tv +=rdtscp();

   std::cout << "min is at " << l1 << ' ' << ts << std::endl;
   std::cout << "minloc " << l2 << ' ' << tv << std::endl;
  }
  return 0;

}


which results in a pretty good speed-up
c++ -std=c++1y -Ofast minloc.cc -march=nehalem
./a.out
./a.out 
min is at 959 13780
minloc 959 2380
min is at 536 13680
minloc 536 4972
min is at 513 13648
minloc 513 1848
min is at 825 13640
minloc 825 1924
min is at 885 13628
minloc 885 1644
min is at 636 11252
minloc 636 1536
min is at 982 11240
minloc 982 1416
min is at 382 11228
minloc 382 1392
min is at 271 11216
minloc 271 1340
min is at 50 11204
minloc 50 1384
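
For reference, the scalar pattern that this request would like to see auto-vectorized looks roughly like the sketch below. It is a hypothetical stand-in, not the actual loop from comment 3 (which is not reproduced in this thread), and the name minloc_scalar is an assumption.

```cpp
// Hypothetical scalar reference; returns the index of the first minimum.
int minloc_scalar(float const* x, int N) {
  int k = 0;
  for (int i = 1; i < N; ++i)
    if (x[i] < x[k]) k = i;  // strict '<' keeps the earliest minimum
  return k;
}
```

Note that a lane-wise vector version may pick a different index than this loop when the minimum value occurs more than once; the timing test above has no repeated values after the shuffle.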


[Bug web/61744] New: misleading documentation about cast of extended vectors

2014-07-08 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61744

Bug ID: 61744
   Summary: misleading documentation about cast of extended
vectors
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: web
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch

At the very bottom of https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
one reads
"It is possible to cast from one vector type to another, provided they are of
the same size (in fact, you can also cast vectors to and from other datatypes
of the same size)."

I find this misleading, as the reader may think that the result of such a cast
is similar to a C-style cast of each element, while it is instead a simple
reinterpretation of the bit content (as with memcpy).

I suggest adding
"The result is a vector of the new type with the same bit content as the
original."
One could even add
", not what one would expect from a C-style cast."

Of course, adding a proper conversion builtin (see PR61731) would definitely
solve the issue ;-)
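
The difference can be illustrated with a short sketch, assuming GCC or Clang vector extensions and 32-bit IEEE float and int. The helper names as_bits and convert_by_hand are made up for this example.

```cpp
// Assumes GCC/Clang vector extensions, 32-bit IEEE float and 32-bit int.
typedef float float32x4_t __attribute__((vector_size(16)));
typedef int   int32x4_t   __attribute__((vector_size(16)));

// Vector cast: the 128 bits are reinterpreted, as memcpy would do.
int32x4_t as_bits(float32x4_t v) { return (int32x4_t)v; }

// What a reader might expect instead: a C-style cast of each element.
int32x4_t convert_by_hand(float32x4_t v) {
  int32x4_t r;
  for (int i = 0; i < 4; ++i) r[i] = (int)v[i];
  return r;
}
```

For an input lane holding 1.0f, as_bits yields the integer 0x3f800000 (the IEEE bit pattern), whereas convert_by_hand yields 1.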


[Bug tree-optimization/61747] New: min,max pattern not always properly optimized (for sse4 targets)

2014-07-08 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61747

Bug ID: 61747
   Summary: min,max pattern not always properly optimized (for
sse4 targets)
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch

I was expecting gcc to substitute a min/max instruction for (a</>b) ? a : b;
even at O2.
This is not always the case: only Ofast consistently provides optimized code
(even if sometimes with a redundant move). -ffinite-math-only makes the code
worse for vector arguments...

cat vmin.cc 
typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;

  template<typename V1>
  V1 vmax(V1 a, V1 b) {
    return (a>b) ? a : b;
  }
  template<typename V1>
  V1 vmin(V1 a, V1 b) {
    return (a<b) ? a : b;
  }


float foo(float a, float b, float c) {
  return vmin(vmax(a,b),c);
}

float32x4_t foo(float32x4_t a, float32x4_t b, float32x4_t c) {
  return vmin(vmax(a,b),c);
}

template<typename Float>
Float bart(Float a) { 
  constexpr Float zero{0.f};
  constexpr Float it = zero+4.f;
  constexpr Float zt = zero-3.f;
  return vmin(vmax(a,zt),it);
}


float bar(float a) {
   return bart(a);
}
float32x4_t bar(float32x4_t a) {
   return bart(a);
}

I see
c++ -std=c++11 -O2  -msse4.2 -s vmin.cc -S; cat vmin.s

__Z3foofff:
LFB2:
maxss %xmm1, %xmm0
minss %xmm2, %xmm0
ret

__Z3fooDv4_fS_S_:
LFB3:
maxps %xmm1, %xmm0
minps %xmm2, %xmm0
ret

__Z3barf:
LFB5:
ucomiss LC3(%rip), %xmm0
jbe L12
minss LC2(%rip), %xmm0
ret
.align 4,0x90
L12:
movss LC3(%rip), %xmm0
ret

__Z3barDv4_f:
LFB6:
movaps LC5(%rip), %xmm1
movaps %xmm0, %xmm2
movaps %xmm1, %xmm0
cmpltps %xmm2, %xmm0
blendvps %xmm0, %xmm2, %xmm1
movaps LC6(%rip), %xmm2
movaps %xmm1, %xmm0
cmpltps %xmm2, %xmm0
blendvps %xmm0, %xmm1, %xmm2
movaps %xmm2, %xmm0
ret

-
c++ -std=c++11 -O2  -msse4.2 -s vmin.cc -S -ffinite-math-only; cat vmin.s
__Z3foofff:
LFB2:
maxss %xmm0, %xmm1
minss %xmm2, %xmm1
movaps %xmm1, %xmm0
ret
__Z3fooDv4_fS_S_:
LFB3:
maxps %xmm1, %xmm0
movaps %xmm0, %xmm1
movaps %xmm2, %xmm0
cmpleps %xmm1, %xmm0
blendvps %xmm0, %xmm2, %xmm1
movaps %xmm1, %xmm0
ret

__Z3barf:
LFB5:
maxss LC2(%rip), %xmm0
minss LC3(%rip), %xmm0
ret

__Z3barDv4_f:
LFB6:
movaps LC5(%rip), %xmm1
movaps %xmm0, %xmm2
movaps %xmm1, %xmm0
cmpltps %xmm2, %xmm0
blendvps %xmm0, %xmm2, %xmm1
movaps LC6(%rip), %xmm2
movaps %xmm1, %xmm0
cmpltps %xmm2, %xmm0
blendvps %xmm0, %xmm1, %xmm2
movaps %xmm2, %xmm0
ret
LFE6:

--
eventually
c++ -std=c++11 -Ofast  -msse4.2 -s vmin.cc -S; cat vmin.s

__Z3foofff:
LFB2:
maxss %xmm0, %xmm1
minss %xmm2, %xmm1
movaps %xmm1, %xmm0
ret

__Z3fooDv4_fS_S_:
LFB3:
maxps %xmm0, %xmm1
minps %xmm2, %xmm1
movaps %xmm1, %xmm0
ret

__Z3barf:
LFB5:
maxss LC2(%rip), %xmm0
minss LC3(%rip), %xmm0
ret
__Z3barDv4_f:
LFB6:
maxps LC5(%rip), %xmm0
minps LC6(%rip), %xmm0
ret


[Bug tree-optimization/61747] min,max pattern not always properly optimized (for sse4 targets)

2014-07-08 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61747

--- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
> I think you need -fno-signed-zeros for the transformation to be valid.
Possibly.
But then is it the O2 code that is wrong?
In any case, adding -fno-signed-zeros makes no difference w/r/t O2 alone.


[Bug tree-optimization/61747] min,max pattern not always properly optimized (for sse4 targets)

2014-07-08 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61747

--- Comment #4 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
I confirm that
-ffinite-math-only -fno-signed-zeros
is equivalent to Ofast in this case.
So do we conclude that the code generated at O2 is wrong and that
-ffinite-math-only -fno-signed-zeros
is required to trigger min/max?
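
A small sketch of why NaNs and signed zeros matter for this pattern, assuming IEEE arithmetic and a build without -ffast-math. It reuses the scalar form of the vmin ternary from vmin.cc; the three probe functions are hypothetical names added for illustration. Any rewrite of the ternary that swaps its operands, or uses an instruction with different NaN or zero handling, would change these results unless the corresponding flags allow it.

```cpp
#include <cmath>
#include <limits>

// Same ternary as vmin in vmin.cc (scalar instantiation only).
template <typename V>
V vmin(V a, V b) { return (a < b) ? a : b; }

// Every comparison with NaN is false, so the ternary returns b.
bool nan_is_dropped_when_first() {
  float nan = std::numeric_limits<float>::quiet_NaN();
  return vmin(nan, 1.f) == 1.f;
}
bool nan_is_kept_when_second() {
  float nan = std::numeric_limits<float>::quiet_NaN();
  return std::isnan(vmin(1.f, nan));
}
// +0 < -0 and -0 < +0 are both false: the result's sign follows b.
bool zero_sign_follows_second_operand() {
  return std::signbit(vmin(0.f, -0.f)) && !std::signbit(vmin(-0.f, 0.f));
}
```

As far as I know, minss/minps also return their second source operand when either input is a NaN or both are zeros, which is why the ternary can map onto them for one operand order but not necessarily for the other.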


[Bug target/61731] New: Feature request: generic builtin for conversion operator among vectors

2014-07-07 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61731

Bug ID: 61731
   Summary: Feature request: generic builtin for conversion
operator among vectors
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch

gcc is lacking a mechanism to efficiently convert (C-style cast)
extended vectors between different types.
clang has recently introduced a __builtin_convertvector
(see a few lines below
http://clang.llvm.org/docs/LanguageExtensions.html#langext-builtin-shufflevector)

I would like to ask if it is possible to implement the same feature in gcc.
A syntax agreed with clang would be welcome.
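
A minimal sketch of the requested builtin in use, assuming a compiler that provides __builtin_convertvector (Clang, and to my knowledge GCC from version 9 on). The wrapper names to_int and to_float are illustrative only.

```cpp
// Assumes a compiler with __builtin_convertvector (Clang; GCC 9 or later).
typedef float float32x4_t __attribute__((vector_size(16)));
typedef int   int32x4_t   __attribute__((vector_size(16)));

// Element-wise conversion, like a C-style cast applied to each lane.
int32x4_t to_int(float32x4_t v) {
  return __builtin_convertvector(v, int32x4_t);
}

float32x4_t to_float(int32x4_t v) {
  return __builtin_convertvector(v, float32x4_t);
}
```

Source and destination must have the same number of elements; float to int conversion truncates toward zero, as the scalar cast does.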


[Bug tree-optimization/56829] Feature request: generic builtin to support control flow in vectorized code (movemask, vec_any/all_*)

2014-07-07 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56829

vincenzo Innocente vincenzo.innocente at cern dot ch changed:

   What|Removed |Added

Summary|Feature request: generic  |Feature request: generic
   |builtin for movemask  |builtin to support control
   ||flow in vectorized code
   ||(movemask,
   ||vec_any/all_*)

--- Comment #1 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
As gcc 4.9 is now out, I would like to come back to this request.
As further support for it,
I have found this interesting talk
http://llvm.org/devmtg/2012-04-12/Slides/Ralf_Karrenberg.pdf
that from slide 17 addresses the issue of divergent control flow and its
implementation on CPU (in the context of OpenCL; still, the argument is fully
valid for other types of implementation), including a plea for a way to
express predication in IR in slide 25.
For a general discussion and implementation see also
http://www.mcs.anl.gov/publication/introducing-control-flow-vectorized-code
and references therein.

My preference is still for a builtin that converts a mask into an integer
(movemask behavior). One can then use
__builtin_popcount, __builtin_ctz etc. to turn it into a bool.
For altivec, gcc implements the vec_any_* and vec_all_* sets of functions
that combine the comparison and the mask-to-int conversion.
This is a possible alternative syntax.

My understanding is that neon does not support any form of predication in its
instruction set
(see
http://stackoverflow.com/questions/11870910/sse-mm-movemask-epi8-equivalent-method-for-arm-neon
for instance).
This is an even more compelling reason for the compiler to provide a generic
builtin!
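
To illustrate the semantics being requested, here is a portable fallback written with generic vector extensions only. It is a sketch under the assumption that a vector comparison yields all-ones (negative) lanes for true, as GCC and Clang do; movemask, any_gt and all_gt are hypothetical names, not existing builtins.

```cpp
// Assumes GCC/Clang vector extensions; true lanes of a comparison are -1.
typedef float float32x4_t __attribute__((vector_size(16)));
typedef int   int32x4_t   __attribute__((vector_size(16)));

// Hypothetical generic movemask: bit i is set when lane i of the mask is true.
inline int movemask(int32x4_t m) {
  int r = 0;
  for (int i = 0; i < 4; ++i) r |= (m[i] < 0 ? 1 : 0) << i;
  return r;
}

// vec_any/vec_all style helpers built on top of it.
inline bool any_gt(float32x4_t a, float32x4_t b) { return movemask(a > b) != 0; }
inline bool all_gt(float32x4_t a, float32x4_t b) { return movemask(a > b) == 0xF; }
```

A real builtin would let the compiler pick movmskps, an altivec predicate, or a neon reduction as appropriate; the lane loop here only pins down the expected result.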


[Bug target/57796] AVX2 gather vectorization: code bloat and reduction of performance

2014-06-19 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57796

--- Comment #5 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
so with latest 4.9 
gcc version 4.10.0 20140611 (experimental) [trunk revision 211467] (GCC) 
situation has not changed much (the scalar version is now faster!):
I think that the cost of gather instructions is still under-estimated


[Bug c++/61381] constexpr non captured by template lambda

2014-06-03 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61381

--- Comment #2 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
I am still at trunk revision 210507
will update and test again


[Bug c++/61381] constexpr non captured by template lambda

2014-06-03 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61381

vincenzo Innocente vincenzo.innocente at cern dot ch changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #3 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
Confirmed: it compiles with
[trunk revision 211189].
A backport to 4.9.1 would be appreciated.


[Bug c++/61381] New: constexpr non captured by template lambda

2014-06-01 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61381

Bug ID: 61381
   Summary: constexpr non captured by template lambda
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch

cat ceLambda.cc 
struct Bar {  constexpr Bar(float i):f(i){}; float f;};
float foo1(float x) {
   constexpr Bar z{0};

   auto f = [=](auto a, auto b) -> Bar { return z;};

   return f(x,x).f;

}

float foo2(float x) {
   const Bar z{0};

   auto f = [=](auto a, auto b) -> Bar { return z;};

   return f(x,x).f;

}

float foo3(float x) {
   constexpr Bar z{0};

   auto f = [=](float a, float b) -> Bar { return z;};

   return f(x,x).f;

}

b-d-128-141-131-42:ctest innocent$ c++ -O2 -std=c++1y -S ceLambda.cc 
ceLambda.cc: In instantiation of ‘foo1(float)::<lambda(auto:1, auto:2)> [with
auto:1 = float; auto:2 = float]’:
ceLambda.cc:7:16:   required from here
ceLambda.cc:5:49: error: ‘z’ was not declared in this scope
   auto f = [=](auto a, auto b) -> Bar { return z;};
                                                ^

[Bug tree-optimization/61338] New: too many permutation in a vectorized reverse loop

2014-05-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61338

Bug ID: 61338
   Summary: too many permutation in a vectorized reverse loop
   Product: gcc
   Version: 4.9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: vincenzo.innocente at cern dot ch

in this example gcc generates 4 permutations for foo (while none is required)
On the positive side the code for bar (which is a more realistic use case)
seems optimal.

float x[1024];
float y[1024];
float z[1024];

void foo() {
  for (int i=0; i<512; ++i)
    x[1023-i] += y[1023-i]*z[512-i];
}


void bar() {
  for (int i=0; i<512; ++i)
    x[1023-i] += y[i]*z[i+512];
}

c++ -Ofast -march=haswell -S revloop.cc; cat revloop.s

__Z3foov:
LFB0:
vmovdqa LC0(%rip), %ymm2
xorl %eax, %eax
leaq 4064+_x(%rip), %rdx
leaq 4064+_y(%rip), %rsi
leaq 2020+_z(%rip), %rcx
.align 4,0x90
L2:
vpermd (%rdx,%rax), %ymm2, %ymm0
vpermd (%rcx,%rax), %ymm2, %ymm1
vpermd (%rsi,%rax), %ymm2, %ymm3
vfmadd231ps %ymm1, %ymm3, %ymm0
vpermd %ymm0, %ymm2, %ymm0
vmovaps %ymm0, (%rdx,%rax)
subq $32, %rax
cmpq $-2048, %rax
jne L2
vzeroupper
ret
LFE0:
.section __TEXT,__text_cold,regular,pure_instructions
LCOLDE1:
.text
LHOTE1:
.section __TEXT,__text_cold,regular,pure_instructions
LCOLDB2:
.text
LHOTB2:
.align 4,0x90
.globl __Z3barv
__Z3barv:
LFB1:
vmovdqa LC0(%rip), %ymm1
leaq 2048+_z(%rip), %rdx
leaq _y(%rip), %rcx
leaq 4064+_x(%rip), %rax
leaq 4096+_z(%rip), %rsi
.align 4,0x90
L6:
vmovaps (%rdx), %ymm2
addq $32, %rdx
vpermd (%rax), %ymm1, %ymm0
addq $32, %rcx
vfmadd231ps -32(%rcx), %ymm2, %ymm0
subq $32, %rax
vpermd %ymm0, %ymm1, %ymm0
vmovaps %ymm0, 32(%rax)
cmpq %rsi, %rdx
jne L6
vzeroupper
ret
LFE1:


[Bug tree-optimization/61338] too many permutation in a vectorized reverse loop

2014-05-28 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61338

--- Comment #1 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
If I write the loop in reverse
void foo2() {
  for (int i=511; i>=0; --i)
    x[1023-i] += y[1023-i]*z[512-i];
}

it is ok
__Z4foo2v:
LFB1:
leaq 2048+_x(%rip), %rdx
xorl %eax, %eax
leaq 4+_z(%rip), %rsi
leaq 2048+_y(%rip), %rcx
.align 4,0x90
L6:
vmovaps (%rdx,%rax), %ymm1
vmovups (%rsi,%rax), %ymm0
vfmadd132ps (%rcx,%rax), %ymm1, %ymm0
vmovaps %ymm0, (%rdx,%rax)
addq $32, %rax
cmpq $2048, %rax
jne L6
vzeroupper
ret


[Bug middle-end/49363] [feature request] multiple target attribute (and runtime dispatching based on cpuid)

2014-05-26 Thread vincenzo.innocente at cern dot ch
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49363

--- Comment #23 from vincenzo Innocente vincenzo.innocente at cern dot ch ---
Which syntax?
I want to reuse the same code for the various architectures and let gcc deal
with the vectorization details.
The best I have managed to do to share code is something like this

namespace {
inline
float _sum0(float const *  x,
   float const *  y, float const *  z) {
  float sum=0;
  for (int i=0; i!=1024; ++i)
sum += z[i]+x[i]*y[i];
  return sum;
}
}


float  __attribute__ ((__target__ ("arch=haswell")))
sum1(float const *  x,
 float const *  y, float const *  z) {
  return _sum0(x,y,z);
}

float  __attribute__ ((__target__ ("arch=nehalem")))
sum1(float const *  x,
 float const *  y, float const *  z) {
  return _sum0(x,y,z);
}

//--

This, for instance, does not work (it produces code only for haswell):

float  __attribute__ ( (__target__("arch=nehalem"), __target__("arch=haswell")) )
sum0(float const *  x,
  float const *  y, float const *  z) {
 float sum=0;
 for (int i=0; i!=1024; ++i)
   sum += z[i]+x[i]*y[i];
 return sum;
}
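
Until such a multi-target syntax exists, one way to share a single body is to dispatch by hand. The following is only a sketch under stated assumptions: GCC or Clang, the target attribute and __builtin_cpu_supports on x86, and made-up names (sum_impl, sum_haswell, sum_generic, sum_dispatch). Whether sum_impl is actually inlined and vectorized with the wider instruction set depends on the optimization level.

```cpp
namespace {
// Shared body, written once (hypothetical name).
inline float sum_impl(float const* x, float const* y, float const* z, int n) {
  float sum = 0;
  for (int i = 0; i != n; ++i) sum += z[i] + x[i] * y[i];
  return sum;
}
}

#if defined(__x86_64__) || defined(__i386__)
// Wrapper compiled for AVX2+FMA; the shared body may be inlined here.
__attribute__((target("avx2,fma")))
float sum_haswell(float const* x, float const* y, float const* z, int n) {
  return sum_impl(x, y, z, n);
}
#endif

// Wrapper compiled with the default command-line target.
float sum_generic(float const* x, float const* y, float const* z, int n) {
  return sum_impl(x, y, z, n);
}

// Manual runtime dispatch based on cpuid.
float sum_dispatch(float const* x, float const* y, float const* z, int n) {
#if defined(__x86_64__) || defined(__i386__)
  if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
    return sum_haswell(x, y, z, n);
#endif
  return sum_generic(x, y, z, n);
}
```

This is more verbose than the two sum1 overloads above, but it does not rely on the compiler's own multiversioning support and makes the selection logic explicit.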

