[Bug c++/91235] Array size expression is implicitly casted to unsigned long type

2019-08-29 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91235

--- Comment #1 from Daniel Fruzynski  ---
I checked that trunk gcc also accepts this code, both with -std=c++11 and
-std=c++1z. Clang also compiles it without error. Could someone take a look
at this and comment here?

[Bug c++/91235] New: Array size expression is implicitly casted to unsigned long type

2019-07-23 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91235

Bug ID: 91235
   Summary: Array size expression is implicitly casted to unsigned
long type
   Product: gcc
   Version: 9.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
void foo(char*);

inline void bar(int n)
{
    if (__builtin_constant_p(n))
    {
        char a[(int)(n == 2 ? -1 : 0)];
        foo(a);
    }
}

void baz()
{
    bar(2);
}
[/code]

When this is compiled with -O3 -Wall -Wextra -std=c++11 (tested via
godbolt.org), it produces the following code:

[asm]
baz():
  push rbp
  mov rbp, rsp
  mov rdi, rsp
  call foo(char*)
  leave
  ret
[/asm]

During compilation gcc reported the following warning:
[out]
: In function 'void baz()':

:7:14: warning: argument to variable-length array is too large
[-Wvla-larger-than=]

7 | char a[(int)(n == 2 ? -1 : 0)];

  |  ^

:7:14: note: limit is 9223372036854775807 bytes, but argument is
18446744073709551615

Compiler returned: 0
[/out]

This means that gcc saw that n is constant, then evaluated the expression
given as the array size and implicitly converted it to an unsigned type.

When I removed the "foo(a);" line, this warning went away and gcc warned about
an unused variable instead.

When -1 is specified directly as the array size, gcc correctly reports an error
that the array size is negative. It looks like only expressions cause this issue.
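
The two numbers in the note can be reproduced by hand. Below is a minimal sketch, assuming a 64-bit target where the size type is 64 bits wide. The helper name as_size is made up for illustration and is not part of gcc:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: mimics the implicit conversion of the signed array
// size expression from the report to a 64-bit unsigned size type.
constexpr std::uint64_t as_size(int n)
{
    return static_cast<std::uint64_t>(n == 2 ? -1 : 0);
}

// -1 wraps around to 2^64 - 1, the "argument" value quoted in the note.
static_assert(as_size(2) == 18446744073709551615ULL, "wraps to 2^64 - 1");

// The "limit" quoted in the note equals the largest signed 64-bit value.
static_assert(std::numeric_limits<std::int64_t>::max() == 9223372036854775807LL,
              "limit is 2^63 - 1");
```

Both static_asserts hold at compile time, which is consistent with the size expression having been converted from a negative int to an unsigned 64-bit value before the VLA check ran.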

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #21 from Daniel Fruzynski  ---
I increased the stack size on Linux to 800MB, verified that ulimit -s reports
the new value, and ran gcc again. It still crashed.

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #20 from Daniel Fruzynski  ---
gcc 8.2.0 does not crash on this code.

I tried to use sgcheck, but without luck: it exited with an assertion failure.

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #18 from Daniel Fruzynski  ---
Unfortunately the default Valgrind tool (memcheck) does not look for stack issues.
Valgrind has a separate tool (sgcheck) which does this. I will try to use it too and
see if it finds something.

One more thing came to my mind. gcc should use the stack limit of the host, not of
the target. I saw crashes on both Cygwin and Linux hosts, which have stack size
limits of 2032 and 8192, respectively. If gcc uses the default host stack size, this
would mean that the bug is somewhere else.
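
For reference, the host limit can also be queried from a program. This is only a sketch, assuming a POSIX host where getrlimit() is available. The function name host_stack_limit_kb is made up for illustration:

```cpp
#include <sys/resource.h>

// Hypothetical helper: returns the soft stack limit of the calling process
// in kilobytes, 0 if the limit is unlimited, or -1 if the query fails.
long host_stack_limit_kb()
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return -1;
    if (lim.rlim_cur == RLIM_INFINITY)
        return 0;
    return static_cast<long>(lim.rlim_cur / 1024);
}
```

On a typical shell the result should correspond to what ulimit -s prints, but that correspondence is an assumption here and was not checked on the hosts mentioned above.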

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #16 from Daniel Fruzynski  ---
I checked it using Git Bash and got 2048.

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #14 from Daniel Fruzynski  ---
How can I check the default stack size? I found that ld has a --stack option to set it,
but I cannot find a way to check the default. I tried to dump the default linker script
using --verbose when linking, but there was no stack size there.

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #10 from Daniel Fruzynski  ---
If I recall correctly, I tried it on an 8.2 or 8.3 cross-compiler too, and it
worked there. However, I am not sure if I used the same command to run it. I
will check this later after I return home.

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #7 from Daniel Fruzynski  ---
The preprocessed source is in the first attachment here.

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #5 from Daniel Fruzynski  ---
Created attachment 46356
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46356&action=edit
Valgrind log

Here is the Valgrind log. It found multiple cases where an uninitialized value was
used. However, in all cases the call stack for the place where this uninitialized
memory comes from is the same. Here is the log for the first issue:

==18990== Conditional jump or move depends on uninitialised value(s)
==18990==    at 0x86DAA7: sparseset_bit_p (sparseset.h:147)
==18990==    by 0x86DAA7: mark_pseudo_regno_live(int) (ira-lives.c:289)
==18990==    by 0x86EBC0: process_bb_node_lives(ira_loop_tree_node*) (ira-lives.c:1254)
==18990==    by 0x8545C7: ira_traverse_loop_tree(bool, ira_loop_tree_node*, void (*)(ira_loop_tree_node*), void (*)(ira_loop_tree_node*)) (ira-build.c:1806)
==18990==    by 0x86F751: ira_create_allocno_live_ranges() (ira-lives.c:1565)
==18990==    by 0x855DFC: ira_build() (ira-build.c:3422)
==18990==    by 0x84D7BA: ira (ira.c:5308)
==18990==    by 0x84D7BA: (anonymous namespace)::pass_ira::execute(function*) (ira.c:5619)
==18990==    by 0x913B60: execute_one_pass(opt_pass*) (passes.c:2465)
==18990==    by 0x914307: execute_pass_list_1(opt_pass*) (passes.c:2554)
==18990==    by 0x914319: execute_pass_list_1(opt_pass*) (passes.c:2555)
==18990==    by 0x914364: execute_pass_list(function*, opt_pass*) (passes.c:2565)
==18990==    by 0x691118: cgraph_node::expand() (cgraphunit.c:2042)
==18990==    by 0x692633: expand_all_functions (cgraphunit.c:2178)
==18990==    by 0x692633: symbol_table::compile() (cgraphunit.c:2536)
==18990==  Uninitialised value was created by a heap allocation
==18990==    at 0x4C29BC3: malloc (vg_replace_malloc.c:299)
==18990==    by 0x10EF387: xmalloc (xmalloc.c:147)
==18990==    by 0x9B9F34: sparseset_alloc(unsigned long) (sparseset.c:33)
==18990==    by 0x86F6DF: ira_create_allocno_live_ranges() (ira-lives.c:1557)
==18990==    by 0x855DFC: ira_build() (ira-build.c:3422)
==18990==    by 0x84D7BA: ira (ira.c:5308)
==18990==    by 0x84D7BA: (anonymous namespace)::pass_ira::execute(function*) (ira.c:5619)
==18990==    by 0x913B60: execute_one_pass(opt_pass*) (passes.c:2465)
==18990==    by 0x914307: execute_pass_list_1(opt_pass*) (passes.c:2554)
==18990==    by 0x914319: execute_pass_list_1(opt_pass*) (passes.c:2555)
==18990==    by 0x914364: execute_pass_list(function*, opt_pass*) (passes.c:2565)
==18990==    by 0x691118: cgraph_node::expand() (cgraphunit.c:2042)
==18990==    by 0x692633: expand_all_functions (cgraphunit.c:2178)
==18990==    by 0x692633: symbol_table::compile() (cgraphunit.c:2536)

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #3 from Daniel Fruzynski  ---
Created attachment 46355
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46355&action=edit
Source code which triggers crash

I added the code which causes the crash when compiling. Here is the command which I used on
CentOS 7:

../gcc-7.4.0-mingw64/bin/x86_64-w64-mingw32-g++ -O3 -ftree-vectorize -std=c++11
-Wall -pthread -I. -D_BSD_SOURCE -g -c RakeSearchOpenCL.cpp -o
RakeSearchOpenCL.o

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #2 from Daniel Fruzynski  ---
I was able to reproduce the crash using a MinGW cross-compiler built for CentOS 7,
configured in the following way:

../gcc-7.4.0/configure --prefix=/root/gcc-7.4.0-mingw64
--build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --with-gnu-as
--with-gnu-ld --verbose --without-newlib --disable-multilib --disable-plugin
--with-system-zlib --disable-nls --without-included-gettext
--disable-win32-registry --enable-languages=c,c++ --enable-threads=posix
--enable-libgomp --target=x86_64-w64-mingw32
--with-sysroot=/usr/x86_64-w64-mingw32/sys-root
--with-gxx-include-dir=/usr/x86_64-w64-mingw32/sys-root/mingw/include/c++
--with-as=/usr/bin/x86_64-w64-mingw32-as
--with-ld=/usr/bin/x86_64-w64-mingw32-ld

The required prerequisites were downloaded using the script from the contrib dir. I also
installed the MinGW toolchain from the EPEL repository, to get the sysroot and other
things needed for the build.

With this cross-compiler I got the following crash report:

[out]
RakeSearchOpenCL.cpp: In member function 'bool RakeSearchOpenCL::init(int,
char**)':
RakeSearchOpenCL.cpp:99:1: internal compiler error: Segmentation fault
 }
 ^
0x9cd06f crash_signal
../../gcc-7.4.0/gcc/toplev.c:337
0x62ab33 lookup_page_table_entry
../../gcc-7.4.0/gcc/ggc-page.c:630
0x62ab33 ggc_set_mark(void const*)
../../gcc-7.4.0/gcc/ggc-page.c:1527
0x57bf97 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:133
0x57d337 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:523
0x59d6c3 gt_ggc_mx_cxx_binding(void*)
./gt-cp-name-lookup.h:60
0x57c1b7 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:648
0x57cf06 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:275
0x57c920 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:466
0x57c83b gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:440
0x57ce6c gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:264
0x57d337 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:523
0x57d329 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:522
0x57d329 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:522
0x57d353 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:525
0x57cc69 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:356
0x57ce96 gt_ggc_mx_lang_tree_node(void*)
./gt-cp-tree.h:267
0x7d78ce gt_ggc_mx_gimple(void*)
/root/gcc-7.4.0-mingw64-build/obj/gcc/gtype-desc.c:1316
0x7d5909 gt_ggc_mx_basic_block_def(void*)
/root/gcc-7.4.0-mingw64-build/obj/gcc/gtype-desc.c:1440
0x7d7565 gt_ggc_mx_gimple(void*)
/root/gcc-7.4.0-mingw64-build/obj/gcc/gtype-desc.c:1319
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.
[/out]

I tried to debug it using gdb, and got this call stack:

[out]
Program received signal SIGSEGV, Segmentation fault.
[Switching to process 18062]
lookup_page_table_entry (p=p@entry=0x157cf6a4) at
../../gcc-7.4.0/gcc/ggc-page.c:630
630   while (table->high_bits != high_bits)
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-260.el7_6.5.x86_64
(gdb) i lo
L2 = 
high_bits = 0
base = 
L1 = 
table = 0x0
(gdb) bt
#0  lookup_page_table_entry (p=p@entry=0x157cf6a4) at
../../gcc-7.4.0/gcc/ggc-page.c:630
#1  ggc_set_mark (p=p@entry=0x157cf6a4) at ../../gcc-7.4.0/gcc/ggc-page.c:1527
#2  0x0057bf98 in gt_ggc_mx_lang_tree_node (x_p=0x157cf6a4) at
./gt-cp-tree.h:133
#3  0x0057d338 in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:523
#4  0x0059d6c4 in gt_ggc_mx_cxx_binding (x_p=0x7fffe8d61550) at
./gt-cp-name-lookup.h:60
#5  0x0057c1b8 in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:648
#6  0x0057cf07 in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:275
#7  0x0057c921 in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:466
#8  0x0057c83c in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:440
#9  0x0057ce6d in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:264
#10 0x0057d338 in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:523
#11 0x0057d32a in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:522
#12 0x0057d32a in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:522
#13 0x0057d354 in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:525
#14 0x0057cc6a in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:356
#15 0x0057ce97 in gt_ggc_mx_lang_tree_node (x_p=) at
./gt-cp-tree.h:267
#16 0x007d78cf in gt_ggc_mx_gimple (x_p=) at
gtype-desc.c:1316
#17 0x007d590a in gt_ggc_mx_basic_block_def (x_p=) at
gtype-desc.c:1440
#18 0x007d7566 in gt_ggc_mx_gimple (x_p=) at
gtype-desc.c:1319
#19 0x007d98d2 in gt_ggc_mx_cgraph_edge (x_p=) at
gtype-desc.c:2608
#20 0x007d93ee in gt_ggc_mx_symtab_node (x_p=) at
gtype-desc.c:1799
#21 0x0057ccc1 

[Bug debug/90471] ICE Segmentation fault when compiling with debug info

2019-05-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

--- Comment #1 from Daniel Fruzynski  ---
Created attachment 46354
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46354&action=edit
MinGW package versions

[Bug c/90471] New: ICE Segmentation fault when compiling with debug info

2019-05-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90471

Bug ID: 90471
   Summary: ICE Segmentation fault when compiling with debug info
   Product: gcc
   Version: 7.4.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Created attachment 46353
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46353&action=edit
Preprocessed code

I got an ICE (segmentation fault) when trying to build an OpenCL BOINC app which I am
developing. This happens only when I use the -g option; without it the code compiles
fine.

I compiled the code using the MinGW cross-compiler shipped with Cygwin. The exact versions of
all mingw packages are listed in the attachment. I also attached the preprocessed source.
I use 64-bit Cygwin on 64-bit Win 10 Pro with the latest patches.

When I was trying to remove unimportant parts of the source code, I found an
interesting thing: I was able to comment out the boinc_opencl.h include and the crash
still happened. However, when I removed this line completely, gcc did not crash.
This part of the code looks as follows:

[code]
#define __CL_ENABLE_EXCEPTIONS
#define CL_TARGET_OPENCL_VERSION 120
#define CL_USE_DEPRECATED_OPENCL_1_1_APIS
#include "CL/cl.hpp"
//#include "boinc_opencl.h"

class OclException : public std::exception
[/code]

I can attach original files and all relevant headers if you need them too.

$ x86_64-w64-mingw32-g++ -O3 -ftree-vectorize -std=c++11 -Wall -pthread
-I/cygdrive/c/rakesearch/_boinc -I/cygdrive/c/rakesearch/_boinc/lib
-I/cygdrive/c/rakesearch/_boinc/include/boinc -I. -D_BSD_SOURCE -g -c
RakeSearchOpenCL2.cpp -o RakeSearchOpenCL.o

RakeSearchOpenCL2.cpp: In member function ‘bool RakeSearchOpenCL::init(int,
char**)’:
RakeSearchOpenCL2.cpp:99:1: internal compiler error: Segmentation fault
 }
 ^
Please submit a full bug report,
with preprocessed source if appropriate.
See <https://gcc.gnu.org/bugs/> for instructions.



$ x86_64-w64-mingw32-g++ --version
x86_64-w64-mingw32-g++ (GCC) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ x86_64-w64-mingw32-g++ -v
Using built-in specs.
COLLECT_GCC=x86_64-w64-mingw32-g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-w64-mingw32/7.4.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
Configured with:
/cygdrive/i/szsz/tmpp/cygwin64/mingw64-x86_64/mingw64-x86_64-gcc-7.4.0-1.x86_64/src/gcc-7.4.0/configure
--srcdir=/cygdrive/i/szsz/tmpp/cygwin64/mingw64-x86_64/mingw64-x86_64-gcc-7.4.0-1.x86_64/src/gcc-7.4.0
--prefix=/usr --exec-prefix=/usr --localstatedir=/var --sysconfdir=/etc
--docdir=/usr/share/doc/mingw64-x86_64-gcc
--htmldir=/usr/share/doc/mingw64-x86_64-gcc/html -C --build=x86_64-pc-cygwin
--host=x86_64-pc-cygwin --target=x86_64-w64-mingw32 --without-libiconv-prefix
--without-libintl-prefix --with-sysroot=/usr/x86_64-w64-mingw32/sys-root
--with-build-sysroot=/usr/x86_64-w64-mingw32/sys-root --disable-multilib
--disable-win32-registry --enable-languages=c,c++,fortran,lto,objc,obj-c++
--enable-fully-dynamic-string --enable-graphite --enable-libgomp
--enable-libquadmath --enable-libquadmath-support --enable-libssp
--enable-version-specific-runtime-libs --enable-libgomp --enable-libada
--with-dwarf2 --with-gnu-ld --with-gnu-as --with-tune=generic
--with-cloog-include=/usr/include/cloog-isl --with-system-zlib
--enable-threads=posix --libexecdir=/usr/lib
Thread model: posix
gcc version 7.4.0 (GCC)

[Bug c/90293] New function attribute: expect_return

2019-04-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90293

--- Comment #1 from Daniel Fruzynski  ---
One more case: sometimes it may be handier to specify what will usually *not* be
returned, e.g. a special invalid value. For such cases another attribute
would be needed:

__attribute__((expect_not_return(-1)))
int CreateSocket();
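
Until such an attribute exists, a similar hint can be given at the call site with the existing __builtin_expect() builtin. The following is only a sketch: the CreateSocket() body is a stand-in that returns a fixed value, and the wrapper name CreateSocketChecked is made up for illustration.

```cpp
// Stand-in for the CreateSocket() declaration above. It returns a fixed
// descriptor number so that the example is self-contained.
int CreateSocket()
{
    return 3;
}

// Hypothetical wrapper: stores the descriptor in fd and returns true on
// success, hinting to gcc that the -1 error value is the unlikely result.
inline bool CreateSocketChecked(int& fd)
{
    fd = CreateSocket();
    // __builtin_expect(expr, 0) tells gcc that expr is usually false.
    return !__builtin_expect(fd == -1, 0);
}
```

The drawback, compared with the proposed attribute, is that the hint has to be repeated in a wrapper or at every call site instead of being attached to the declaration once.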

[Bug c/90293] New: New function attribute: expect_return

2019-04-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90293

Bug ID: 90293
   Summary: New function attribute: expect_return
   Product: gcc
   Version: 10.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I have an idea for a new function attribute: expect_return. It would allow
specifying the value usually returned from a function, so it could help with
optimization in a similar way to __builtin_expect().

Example use:

__attribute__((expect_return(false)))
bool DebugModeEnabled();

__attribute__((expect_return(false)))
bool IsErrorCode(int code);
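
For comparison, this is roughly what has to be written today to get a similar hint with __builtin_expect(). It is a sketch only: the function body is a stand-in, and the names DebugModeEnabledHinted and work are made up for illustration.

```cpp
// Stand-in definition so that the example links; a real program would
// provide its own implementation.
bool DebugModeEnabled()
{
    return false;
}

// Hypothetical wrapper carrying the hint which the proposed attribute would
// attach to the declaration: the call is expected to return false.
inline bool DebugModeEnabledHinted()
{
    return __builtin_expect(DebugModeEnabled(), false);
}

int work(int x)
{
    if (DebugModeEnabledHinted())
        return -x;   // expected to be the cold path
    return x + 1;    // expected to be the hot path
}
```

With the proposed attribute, the wrapper would not be needed and every caller of DebugModeEnabled() would get the hint automatically.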

[Bug tree-optimization/89317] Ineffective code from std::copy

2019-02-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89317

--- Comment #2 from Daniel Fruzynski  ---
Yes, I mean inefficient.

[Bug c++/89317] New: Ineffective code from std::copy

2019-02-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89317

Bug ID: 89317
   Summary: Ineffective code from std::copy
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

gcc produces ineffective code when std::copy is used to copy data. As a test I
created my own version of std::copy, and this version is optimized properly.

Compiled using g++ (GCC-Explorer-Build) 9.0.1 20190211 (experimental)
Options: -O3 -std=c++11 -march=skylake

[code]
#include <algorithm>
#include <stdint.h>

#define Size 8

class Test
{
public:
    void test1(void*__restrict ptr);
    void test2(void*__restrict ptr);

private:
    int16_t data1[Size];
    int16_t data2[Size];
};

template<typename T1, typename T2>
void mycopy(T1 begin, T1 end, T2 dest)
{
    while (begin != end)
    {
        *dest = *begin;
        ++dest;
        ++begin;
    }
}

void Test::test1(void*__restrict ptr)
{
    uint16_t* p = (uint16_t*)ptr;

    std::copy(data1, data1 + Size, p);
    p += Size;
    std::copy(data2, data2 + Size, p);
}

void Test::test2(void*__restrict ptr)
{
    int16_t* p = (int16_t*)ptr;

    mycopy(data1, data1 + Size, p);
    p += Size;
    mycopy(data2, data2 + Size, p);
}
[/code]

[asm]
Test::test1(void*):
movzx   eax, WORD PTR [rdi]
mov edx, 16
mov WORD PTR [rsi], ax
movzx   eax, WORD PTR [rdi+2]
add rsi, 16
mov WORD PTR [rsi-14], ax
movzx   eax, WORD PTR [rdi+4]
mov WORD PTR [rsi-12], ax
movzx   eax, WORD PTR [rdi+6]
mov WORD PTR [rsi-10], ax
movzx   eax, WORD PTR [rdi+8]
mov WORD PTR [rsi-8], ax
movzx   eax, WORD PTR [rdi+10]
mov WORD PTR [rsi-6], ax
movzx   eax, WORD PTR [rdi+12]
mov WORD PTR [rsi-4], ax
movzx   eax, WORD PTR [rdi+14]
mov WORD PTR [rsi-2], ax
mov rax, rdx
sar rax
test    rdx, rdx
jle .L69
movzx   edx, WORD PTR [rdi+16]
mov WORD PTR [rsi], dx
cmp rax, 1
je  .L69
movzx   edx, WORD PTR [rdi+18]
mov WORD PTR [rsi+2], dx
cmp rax, 2
je  .L69
movzx   edx, WORD PTR [rdi+20]
mov WORD PTR [rsi+4], dx
cmp rax, 3
je  .L69
movzx   edx, WORD PTR [rdi+22]
mov WORD PTR [rsi+6], dx
cmp rax, 4
je  .L69
movzx   edx, WORD PTR [rdi+24]
mov WORD PTR [rsi+8], dx
cmp rax, 5
je  .L69
movzx   edx, WORD PTR [rdi+26]
mov WORD PTR [rsi+10], dx
cmp rax, 6
je  .L69
movzx   edx, WORD PTR [rdi+28]
mov WORD PTR [rsi+12], dx
cmp rax, 7
je  .L69
movzx   edx, WORD PTR [rdi+30]
mov WORD PTR [rsi+14], dx
cmp rax, 8
je  .L69
movzx   edx, WORD PTR [rdi+32]
mov WORD PTR [rsi+16], dx
cmp rax, 9
je  .L69
movzx   edx, WORD PTR [rdi+34]
mov WORD PTR [rsi+18], dx
cmp rax, 10
je  .L69
movzx   edx, WORD PTR [rdi+36]
mov WORD PTR [rsi+20], dx
cmp rax, 11
je  .L69
movzx   edx, WORD PTR [rdi+38]
mov WORD PTR [rsi+22], dx
cmp rax, 12
je  .L69
movzx   edx, WORD PTR [rdi+40]
mov WORD PTR [rsi+24], dx
cmp rax, 13
je  .L69
movzx   edx, WORD PTR [rdi+42]
mov WORD PTR [rsi+26], dx
cmp rax, 14
je  .L69
movzx   eax, WORD PTR [rdi+44]
mov WORD PTR [rsi+28], ax
.L69:
ret
Test::test2(void*):
vmovdqu xmm0, XMMWORD PTR [rdi]
vmovups XMMWORD PTR [rsi], xmm0
vmovdqu xmm1, XMMWORD PTR [rdi+16]
vmovups XMMWORD PTR [rsi+16], xmm1
ret
[/asm]
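
As a possible workaround for fixed-size arrays like these, the copy can be written with memcpy and compile-time-constant sizes. This is only a sketch, and it was not checked against the gcc build used above. The names kSize and copy_both are made up for illustration.

```cpp
#include <cstdint>
#include <cstring>

constexpr std::size_t kSize = 8;  // mirrors Size in the report

// Hypothetical helper: copies two fixed-size int16_t arrays back to back
// into dest, using memcpy with sizes known at compile time.
void copy_both(const std::int16_t (&a)[kSize],
               const std::int16_t (&b)[kSize],
               void* dest)
{
    unsigned char* out = static_cast<unsigned char*>(dest);
    std::memcpy(out, a, sizeof a);
    std::memcpy(out + sizeof a, b, sizeof b);
}
```

Because memcpy copies raw bytes, the destination may be viewed as uint16_t or int16_t afterwards; the bit patterns are the same in both cases.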

[Bug c/88963] New: gcc generates terrible code for vectors of 64+ length which are not natively supported

2019-01-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88963

Bug ID: 88963
   Summary: gcc generates terrible code for vectors of 64+ length
which are not natively supported
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
typedef int VInt __attribute__((vector_size(64)));

void test(VInt*__restrict a, VInt*__restrict b,
          VInt*__restrict c)
{
    *a = *b + *c;
}
[/code]

This code is compiled with -O3 -march=skylake in the following way:

[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*):
  push rbp
  mov rbp, rsp
  and rsp, -64
  sub rsp, 136
  vmovdqa xmm3, XMMWORD PTR [rsi]
  vmovdqa xmm4, XMMWORD PTR [rsi+16]
  vmovdqa xmm5, XMMWORD PTR [rsi+32]
  vmovdqa xmm6, XMMWORD PTR [rsi+48]
  vmovdqa xmm7, XMMWORD PTR [rdx]
  vmovaps XMMWORD PTR [rsp-56], xmm3
  vmovdqa xmm1, XMMWORD PTR [rdx+16]
  vmovaps XMMWORD PTR [rsp-40], xmm4
  vmovdqa ymm4, YMMWORD PTR [rsp-56]
  vmovdqa xmm2, XMMWORD PTR [rdx+32]
  vmovaps XMMWORD PTR [rsp-8], xmm6
  vmovaps XMMWORD PTR [rsp+8], xmm7
  vmovdqa xmm3, XMMWORD PTR [rdx+48]
  vmovaps XMMWORD PTR [rsp-24], xmm5
  vmovaps XMMWORD PTR [rsp+24], xmm1
  vpaddd ymm0, ymm4, YMMWORD PTR [rsp+8]
  vmovdqa ymm5, YMMWORD PTR [rsp-24]
  vmovaps XMMWORD PTR [rsp+40], xmm2
  vmovaps XMMWORD PTR [rsp+56], xmm3
  vmovdqa xmm2, xmm0
  vmovdqa YMMWORD PTR [rsp-120], ymm0
  vpaddd ymm0, ymm5, YMMWORD PTR [rsp+40]
  vmovdqa xmm6, XMMWORD PTR [rsp-104]
  vmovdqa YMMWORD PTR [rsp-88], ymm0
  vmovdqa xmm7, XMMWORD PTR [rsp-72]
  vmovaps XMMWORD PTR [rdi], xmm2
  vmovaps XMMWORD PTR [rdi+16], xmm6
  vmovaps XMMWORD PTR [rdi+32], xmm0
  vmovaps XMMWORD PTR [rdi+48], xmm7
  vzeroupper
  leave
  ret
[/asm]

Other compilers (clang, icc) produce nice code. This is from clang:

[asm]
test(int __vector(16)*, int __vector(16)*, int __vector(16)*): # @test(int
__vector(16)*, int __vector(16)*, int __vector(16)*)
  vmovdqa ymm0, ymmword ptr [rdx]
  vmovdqa ymm1, ymmword ptr [rdx + 32]
  vpaddd ymm0, ymm0, ymmword ptr [rsi]
  vpaddd ymm1, ymm1, ymmword ptr [rsi + 32]
  vmovdqa ymmword ptr [rdi + 32], ymm1
  vmovdqa ymmword ptr [rdi], ymm0
  vzeroupper
  ret
[/asm]

gcc produces clean code with -O3 -march=skylake-avx512. The code is also clean for
vector size 32 with AVX disabled. However, for vector size 128 with -O3
-march=skylake-avx512 the code is ugly again.
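
One way to express the shape of the clang output in source form is to do the addition on two 32-byte halves explicitly. This is only a sketch and it was not verified to produce better code with gcc. The type VHalf and the function test_split are made up for illustration.

```cpp
#include <cstring>

typedef int VInt __attribute__((vector_size(64)));
typedef int VHalf __attribute__((vector_size(32)));  // hypothetical helper type

// Hypothetical variant of test(): performs the 64-byte addition as two
// 32-byte additions. memcpy is used to move data between the two vector
// types without relying on pointer casts.
void test_split(VInt* __restrict a, const VInt* __restrict b,
                const VInt* __restrict c)
{
    VHalf b_half[2], c_half[2], sum[2];
    std::memcpy(b_half, b, sizeof(VInt));
    std::memcpy(c_half, c, sizeof(VInt));
    sum[0] = b_half[0] + c_half[0];
    sum[1] = b_half[1] + c_half[1];
    std::memcpy(a, sum, sizeof(VInt));
}
```

The result is element-wise identical to *a = *b + *c; whether the generated code is actually shorter would have to be checked for each gcc version and -march setting.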

[Bug c/88959] Unnecessary xor before bsf/tzcnt

2019-01-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88959

--- Comment #1 from Daniel Fruzynski  ---
I have found that this extra xor is not added when compiling with -O3
-march=sandybridge or -O3 -march=ivybridge. However, with -O3
-march=sandybridge/ivybridge -mbmi it is added.

[Bug c/88959] New: Unnecessary xor before bsf/tzcnt

2019-01-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88959

Bug ID: 88959
   Summary: Unnecessary xor before bsf/tzcnt
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
int test(int x)
{
    return __builtin_ctz(x);
}
[/code]

gcc 4.9.1 with -O3 produces this:

[asm]
test(int):
  rep bsf eax, edi
  ret
[/asm]

And this with -O3 -mbmi:

[asm]
test(int):
  tzcnt eax, edi
  ret
[/asm]

gcc 4.9.2 and newer (including gcc 9) produce this for both cases:

[asm]
test(int):
  xor eax, eax
  rep bsf eax, edi
  ret
[/asm]

[asm]
test(int):
  xor eax, eax
  tzcnt eax, edi
  ret
[/asm]

This extra xor instruction is not needed here.

[Bug target/71659] _xgetbv intrinsic missing

2019-01-17 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71659

--- Comment #5 from Daniel Fruzynski  ---
I meant pr85684

[Bug target/71659] _xgetbv intrinsic missing

2019-01-17 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71659

Daniel Fruzynski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |bugzilla@poradnik-webmaster
                   |                            |a.com

--- Comment #4 from Daniel Fruzynski  ---
This intrinsic was added in gcc 8. The initial implementation was buggy (see
r85684) and was fixed in 8.2. However, there is one more issue here: the Intel
Intrinsics Guide says that it should be available by including ,
however in gcc you need to include .

Additionally, there are no defines for XFEATURE_ENABLED_MASK and the possible output
values.

[Bug target/88679] SSE2 intrinsics are available by default on x86

2019-01-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88679

--- Comment #2 from Daniel Fruzynski  ---
I used the compiler at https://godbolt.org/. Here is the output for both commands:

$ gcc -v
Using built-in specs.

COLLECT_GCC=/opt/compiler-explorer/gcc-snapshot/bin/g++

Target: x86_64-linux-gnu

Configured with: ../gcc-trunk-20190103/configure
--prefix=/opt/compiler-explorer/gcc-build/staging --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu --disable-bootstrap
--enable-multiarch --with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --enable-clocale=gnu --enable-languages=c,c++,fortran
--enable-ld=yes --enable-gold=yes --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --enable-linker-build-id --enable-lto
--enable-plugins --enable-threads=posix --with-pkgversion=GCC-Explorer-Build

Thread model: posix

gcc version 9.0.0 20190102 (experimental) (GCC-Explorer-Build) 

COLLECT_GCC_OPTIONS='-fdiagnostics-color=always' '-g' '-o'
'/tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s' '-masm=intel'
'-S' '-v' '-shared-libgcc' '-mtune=generic' '-march=x86-64'


/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/9.0.0/cc1plus
-quiet -v -imultiarch x86_64-linux-gnu -iprefix
/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/
-D_GNU_SOURCE  -quiet -dumpbase example.cpp -masm=intel -mtune=generic
-march=x86-64 -auxbase-strip
/tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s -g -version
-fdiagnostics-color=always -o
/tmp/compiler-explorer-compiler11903-60-1nshruf.qczq/output.s

GNU C++14 (GCC-Explorer-Build) version 9.0.0 20190102 (experimental)
(x86_64-linux-gnu)

compiled by GNU C version 7.3.0, GMP version 6.1.0, MPFR version 3.1.4,
MPC version 1.0.3, isl version isl-0.18-GMP

GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096

ignoring nonexistent directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/include"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/x86_64-linux-gnu"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/backward"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/include"

ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"

ignoring duplicate directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/include-fixed"

ignoring nonexistent directory
"/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/include"

#include "..." search starts here:

#include <...> search starts here:


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/x86_64-linux-gnu


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../include/c++/9.0.0/backward


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/include


/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/include-fixed

 /usr/local/include

 /opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/../../include

 /usr/include/x86_64-linux-gnu

 /usr/include

End of search list.

GNU C++14 (GCC-Explorer-Build) version 9.0.0 20190102 (experimental)
(x86_64-linux-gnu)

compiled by GNU C version 7.3.0, GMP version 6.1.0, MPFR version 3.1.4,
MPC version 1.0.3, isl version isl-0.18-GMP

GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096

Compiler executable checksum: f724e483fb841047a948ffa41ca3218a

COMPILER_PATH=/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/9.0.0/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/x86_64-linux-gnu/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../libexec/gcc/:/opt/compiler-explorer/gcc-trunk-20190103/bin/../lib/gcc/x86_64-linux-gnu/9.0.0/../../../../x86_64-linux-gnu/bin/


[Bug c/88679] New: SSE2 intrinsics are available by default on x86

2019-01-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88679

Bug ID: 88679
   Summary: SSE2 intrinsics are available by default on x86
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

SSE2 intrinsics are available by default when compiling code for 32-bit x86.
The code below compiles fine with the options -m32 -O3. I had to add -mno-sse2 to get
an error.

Fortunately __SSE2__ is not defined by default, so code can rely on it.

[code]
#include <emmintrin.h>

void test(__m128i const* m)
{
    __m128i v = _mm_load_si128(m);
}
[/code]
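
Relying on __SSE2__ could look like the sketch below, which uses the intrinsics only when the macro is defined and falls back to plain code otherwise. The function name first_lane is made up for illustration.

```cpp
#ifdef __SSE2__
#include <emmintrin.h>
#endif
#include <cstdint>

// Hypothetical helper: returns the first of four 32-bit values, using SSE2
// intrinsics only when the compiler reports SSE2 support via __SSE2__.
std::int32_t first_lane(const std::int32_t (&data)[4])
{
#ifdef __SSE2__
    // Unaligned load, so the array needs no special alignment.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data));
    return _mm_cvtsi128_si32(v);
#else
    return data[0];
#endif
}
```

Both branches return the same value, so the result does not depend on whether the build enables SSE2.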

[Bug c++/87729] Please include -Woverloaded-virtual in -Wall

2019-01-02 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87729

--- Comment #2 from Daniel Fruzynski  ---
Here you are:

[code]
class Foo
{
public:
    virtual void f(int);
};

class Bar : public Foo
{
public:
    virtual void f(short);
};
[/code]
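
To show why this pattern is easy to get wrong, here is a runnable variant of the same classes. It is only a sketch: the return values and the two call_through_* functions are added for illustration, so that it is visible which function gets called.

```cpp
struct Foo
{
    virtual ~Foo() {}
    virtual int f(int) { return 1; }    // returns 1 so calls can be told apart
};

struct Bar : Foo
{
    // Different parameter type: this hides Foo::f(int) instead of overriding it.
    virtual int f(short) { return 2; }
};

int call_through_base(Foo& obj)
{
    return obj.f(42);    // no override exists, so Foo::f(int) runs
}

int call_through_derived(Bar& obj)
{
    return obj.f(42);    // name lookup finds only Bar::f(short)
}
```

The same object gives different results depending on the static type used for the call. Since C++11, marking Bar::f with override turns this mistake into a compile error, but existing code without override gets no diagnostic unless -Woverloaded-virtual is enabled.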

[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782

--- Comment #5 from Daniel Fruzynski  ---
I got the following link:
https://stackoverflow.com/questions/53733624/is-xmm8-register-value-preserved-across-calls/53733767#53733767

Quote from it: "Any additional registers for newer instruction sets are
volatile by default. This includes the upper parts of YMM0-15 and ZMM0-15 as
well as ?MM16-31 if present.".

So it looks like gcc should not generate .seh_savexmm for xmm16..31 at all.

[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782

--- Comment #4 from Daniel Fruzynski  ---
I have found that I can use the -ffixed-reg option for this. It removes only
one register, so I have to use it 16 times to remove all of the xmm16..31
registers. It would be handy to have another option which disables all
registers from this group at once.

[Bug target/65782] Assembly failure (invalid register for .seh_savexmm) with -O3 -mavx512f on mingw-w64

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65782

Daniel Fruzynski  changed:

           What    |Removed |Added
------------------------------------------------------------
                 CC|        |bugzilla@poradnik-webmastera.com

--- Comment #3 from Daniel Fruzynski  ---
Cygwin (x86_64-pc-cygwin) is also affected. I have encountered this bug on gcc
7.4.0.

Could you add a new option which removes the XMM16+ registers from the pool of
available registers? It could serve as an easy workaround until this is fixed
properly.

[Bug middle-end/88575] gcc got confused by different comparison operators

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575

--- Comment #3 from Daniel Fruzynski  ---
I have tried to compile with -O3 -march=skylake -ffast-math and got this:

[asm]
test(double, double):
vminsd  xmm2, xmm0, xmm1
vcmplesd xmm0, xmm0, xmm1
vxorpd  xmm1, xmm1, xmm1
vblendvpd   xmm0, xmm1, xmm2, xmm0
ret
test2(double, double):
vminsd  xmm2, xmm0, xmm1
vcmpltsd xmm0, xmm0, xmm1
vxorpd  xmm1, xmm1, xmm1
vblendvpd   xmm0, xmm1, xmm2, xmm0
ret
[/asm]

And this is for -O3 -march=skylake -funsafe-math-optimizations. As you can see,
one instruction was eliminated from test2(). For some reason it was not
eliminated from test(). I checked that -ffinite-math-only, which -ffast-math
enables, is what prevented elimination of this extra instruction.

[asm]
test(double, double):
vminsd  xmm2, xmm0, xmm1
vcmplesd xmm0, xmm0, xmm1
vxorpd  xmm1, xmm1, xmm1
vblendvpd   xmm0, xmm1, xmm2, xmm0
ret
test2(double, double):
vcmpnltsd   xmm1, xmm0, xmm1
vxorpd  xmm2, xmm2, xmm2
vblendvpd   xmm0, xmm0, xmm2, xmm1
ret
[/asm]

[Bug middle-end/88575] gcc got confused by different comparison operators

2019-01-01 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575

--- Comment #2 from Daniel Fruzynski  ---
Code was compiled with -O3 -march=skylake.

I have tried to add -fno-signed-zeros and -fsigned-zeros, and got the same
output for both cases.

[Bug middle-end/88575] New: gcc got confused by different comparison operators

2018-12-22 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88575

Bug ID: 88575
   Summary: gcc got confused by different comparison operators
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

In test() gcc is not able to determine that for a == b it does not have to
evaluate the 2nd comparison and can use the value of a whenever the 1st
comparison is true. When the operators are swapped as in test2(), or are the
same, the code is optimized.

[code]
double test(double a, double b)
{
if (a <= b)
return a < b ? a : b;
return 0.0;
}

double test2(double a, double b)
{
if (a < b)
return a <= b ? a : b;
return 0.0;
}
[/code]

[asm]
test(double, double):
  vcomisd xmm1, xmm0
  jnb .L10
  vxorpd xmm0, xmm0, xmm0
  ret
.L10:
  vminsd xmm0, xmm0, xmm1
  ret

test2(double, double):
  vcmpnltsd xmm1, xmm0, xmm1
  vxorpd xmm2, xmm2, xmm2
  vblendvpd xmm0, xmm0, xmm2, xmm1
  ret
[/asm]
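For reference, the simplification being asked for can be written by hand. In test(), when a == b the inner a < b is false and b is returned, which compares equal to a; so, ignoring the sign of zero, each function reduces to a single comparison and select. A sketch, where the *_simplified and *_original names are hypothetical and only used to keep both forms side by side:

```cpp
// Originals from the report, renamed for comparison.
double test_original(double a, double b)
{
    if (a <= b)
        return a < b ? a : b;
    return 0.0;
}

double test2_original(double a, double b)
{
    if (a < b)
        return a <= b ? a : b;
    return 0.0;
}

// test(): a <= b selects a (or b when they compare equal, which has the same
// value apart from a possible +0.0 / -0.0 difference).
double test_simplified(double a, double b)
{
    return a <= b ? a : 0.0;
}

// test2(): a < b already implies a <= b, so the inner check is redundant.
double test2_simplified(double a, double b)
{
    return a < b ? a : 0.0;
}
```

The signed-zero caveat applies only to test(); test2_simplified() returns exactly what test2() returns for every input.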

[Bug target/88570] Missing or ineffective vectorization of scatter load

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570

--- Comment #3 from Daniel Fruzynski  ---
I have checked svn head version (20181221), issue is still there.

[Bug target/88571] AVX512: when calculating logical expression with all values in kN registers, do not use GPRs

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571

--- Comment #3 from Daniel Fruzynski  ---
I have checked svn head version (20181221), issue is still there.

[Bug target/88570] Missing or ineffective vectorization of scatter load

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570

--- Comment #2 from Daniel Fruzynski  ---
In g++ (GCC-Explorer-Build) 9.0.0 20181219 (experimental) this still exists.

[Bug target/88571] AVX512: when calculating logical expression with all values in kN registers, do not use GPRs

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571

--- Comment #2 from Daniel Fruzynski  ---
Yes. Issue still exists in g++ (GCC-Explorer-Build) 9.0.0 20181219
(experimental).

[Bug target/88571] New: AVX512: when calculating logical expression with all values in kN registers, do not use GPRs

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88571

Bug ID: 88571
   Summary: AVX512: when calculating logical expression with all
values in kN registers, do not use GPRs
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

This is a side effect of finding Bug 88570. I have noticed that when gcc has to
generate code for a logical expression with all values already stored in kN
registers, it moves them to GPRs, performs the calculation there and moves the
result back. Such a situation may happen as a side effect of optimizations in
gcc. It is also more convenient to write expressions with C/C++ operators than
with intrinsics, so some people may prefer them. It can probably also happen
as a side effect of interaction between code optimized by gcc and user code.

When the logical expression is written using intrinsics, values stay in kN
registers as expected.

Code below was compiled with -O3 -march=skylake-avx512. test1 and test2 are
examples of code with C/C++ operators. test3 is an example of a "not"
introduced by gcc during optimization. This last example is also in Bug 88570,
which I logged to fix inefficient optimizations.

[code]
#include <immintrin.h>

void test1(int*__restrict n1, int*__restrict n2,
int*__restrict n3, int*__restrict n4)
{
__m256i v = _mm256_loadu_si256((__m256i*)n1);
__mmask8 m = _mm256_cmpgt_epi32_mask(v, _mm256_set1_epi32(1));
m = ~m;
_mm256_mask_storeu_epi32((__m256i*)n2, m, v);
}

void test2(int*__restrict n1, int*__restrict n2,
int*__restrict n3, int*__restrict n4)
{
__m256i v1 = _mm256_loadu_si256((__m256i*)n1);
__m256i v2 = _mm256_loadu_si256((__m256i*)n1);
__m256i v0 = _mm256_set1_epi32(2);
__mmask8 m1 = _mm256_cmpgt_epi32_mask(v1, _mm256_set1_epi32(1));
__mmask8 m2 = _mm256_cmpgt_epi32_mask(v2, _mm256_set1_epi32(2));
__mmask8 m = ~(m1 | m2);
_mm256_mask_storeu_epi32((__m256i*)n2, m, v1);
}

void test3(double*__restrict d1, double*__restrict d2,
double*__restrict d3, double*__restrict d4)
{
for (int n = 0; n < 4; ++n)
{
if (d1[n] > 0.0)
d2[n] = d3[n];
else
d2[n] = d4[n];
}
}
[/code]

[asm]
test1(int*, int*, int*, int*):
vmovdqu64   ymm0, YMMWORD PTR [rdi]
vpcmpgtd k1, ymm0, YMMWORD PTR .LC0[rip]
kmovb   eax, k1
not eax
kmovb   k2, eax
vmovdqu32   YMMWORD PTR [rsi]{k2}, ymm0
vzeroupper
ret
test2(int*, int*, int*, int*):
vmovdqu64   ymm1, YMMWORD PTR [rdi]
vpcmpgtd k1, ymm1, YMMWORD PTR .LC0[rip]
vpcmpgtd k2, ymm1, YMMWORD PTR .LC1[rip]
kmovb   edx, k1
kmovb   eax, k2
or  eax, edx
not eax
kmovb   k3, eax
vmovdqu32   YMMWORD PTR [rsi]{k3}, ymm1
vzeroupper
ret
test3(double*, double*, double*, double*):
vmovupd ymm0, YMMWORD PTR [rdi]
vxorpd  xmm1, xmm1, xmm1
vcmppd  k1, ymm0, ymm1, 14
vcmpltpd ymm1, ymm1, ymm0
kmovb   eax, k1
not eax
vmovupd ymm2{k1}{z}, YMMWORD PTR [rdx]
kmovb   k2, eax
vmovupd ymm0{k2}{z}, YMMWORD PTR [rcx]
vblendvpd   ymm0, ymm0, ymm2, ymm1
vmovupd YMMWORD PTR [rsi], ymm0
vzeroupper
ret
.LC0:
.long   1
.long   1
.long   1
.long   1
.long   1
.long   1
.long   1
.long   1
.LC1:
.long   2
.long   2
.long   2
.long   2
.long   2
.long   2
.long   2
.long   2
[/asm]
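For comparison, this is roughly what the intrinsics-only form of test2 could look like, using _kor_mask8 and _knot_mask8 in place of the C operators. It is a sketch only: the function name store_not_greater and the scalar fallback are assumptions added so that the example also builds on targets without AVX512F/VL/DQ, and the report does not show the exact intrinsic version that was tried.

```cpp
#include <cstdint>

#if defined(__AVX512F__) && defined(__AVX512VL__) && defined(__AVX512DQ__)
#include <immintrin.h>
#define HAVE_MASK_INTRINSICS 1
#endif

// Hypothetical helper: copies src[i] to dst[i] for each of 8 lanes where
// neither src[i] > 1 nor src[i] > 2 holds; other lanes of dst are left untouched.
void store_not_greater(const std::int32_t* src, std::int32_t* dst)
{
#ifdef HAVE_MASK_INTRINSICS
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(src));
    __mmask8 m1 = _mm256_cmpgt_epi32_mask(v, _mm256_set1_epi32(1));
    __mmask8 m2 = _mm256_cmpgt_epi32_mask(v, _mm256_set1_epi32(2));
    // Mask logic expressed with mask intrinsics instead of ~ and |.
    __mmask8 m = _knot_mask8(_kor_mask8(m1, m2));
    _mm256_mask_storeu_epi32(dst, m, v);
#else
    // Scalar model of the same selection.
    for (int i = 0; i < 8; ++i)
        if (!(src[i] > 1 || src[i] > 2))
            dst[i] = src[i];
#endif
}
```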

[Bug middle-end/88570] New: Missing or ineffective vectorization of scatter load

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88570

Bug ID: 88570
   Summary: Missing or ineffective vectorization of scatter load
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
void test1(int*__restrict n1, int*__restrict n2,
int*__restrict n3, int*__restrict n4)
{
for (int n = 0; n < 8; ++n)
{
if (n1[n] > 0)
n2[n] = n3[n];
else
n2[n] = n4[n];
}
}

void test2(double*__restrict d1, double*__restrict d2,
double*__restrict d3, double*__restrict d4)
{
for (int n = 0; n < 4; ++n)
{
if (d1[n] > 0.0)
d2[n] = d3[n];
else
d2[n] = d4[n];
}
}
[/code]

Code like the above is vectorized properly when global variables are used.
However, when the code has to work on pointers passed as function arguments,
vectorization is either not performed or performed inefficiently.
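The global-variable variant is not shown in the report. A plausible reconstruction of it for test2(), assuming fixed-size global arrays (the names g1..g4 and test2_globals are illustrative), would be:

```cpp
// Hypothetical global-array variant of test2(); names are illustrative only.
double g1[4], g2[4], g3[4], g4[4];

void test2_globals()
{
    for (int n = 0; n < 4; ++n)
    {
        if (g1[n] > 0.0)
            g2[n] = g3[n];
        else
            g2[n] = g4[n];
    }
}
```

With globals the compiler knows the arrays are distinct objects of known size, so it can load g3 and g4 unconditionally and blend.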

1. Compilation with -O3 -msse2: no vectorization at all, scalar code is
generated. It is long so I do not paste it here.

2. Compilation with -O3 -msse4.1: no vectorization at all

3. Compilation with -O3 -mavx or -march=sandybridge: code for test1() is still
not vectorized (somewhat expected, as int operations are in AVX2). Output for
test2() is below. As you can see, the generated code performs masked loads for
d3 and d4, and then uses a blend to create the final result. When global vars
are used, masked loads are not used, only the blend. Additionally, the xor mask
is loaded from memory instead of being created with a cmpeq instruction.

[asm]
test2(double*, double*, double*, double*):
vmovupd xmm3, XMMWORD PTR [rdi]
vinsertf128 ymm1, ymm3, XMMWORD PTR [rdi+16], 0x1
vxorpd  xmm0, xmm0, xmm0
vcmpltpd ymm1, ymm0, ymm1
vmaskmovpd  ymm2, ymm1, YMMWORD PTR [rdx]
vxorps  ymm0, ymm1, YMMWORD PTR .LC0[rip]
vmaskmovpd  ymm0, ymm0, YMMWORD PTR [rcx]
vblendvpd   ymm0, ymm0, ymm2, ymm1
vmovups XMMWORD PTR [rsi], xmm0
vextractf128 XMMWORD PTR [rsi+16], ymm0, 0x1
vzeroupper
ret
.LC0:
.quad   -1
.quad   -1
.quad   -1
.quad   -1
[/asm]

4. Compilation with -O3 -march=haswell: code similar to the above, with both
masked loads and a blend. This time the compiler generated vpcmpeqd to create
the xor mask. This also happens when -mavx2 is used instead of -march=haswell.

[asm]
test1(int*, int*, int*, int*):
vmovdqu ymm1, YMMWORD PTR [rdi]
vpxor   xmm0, xmm0, xmm0
vpcmpgtd ymm1, ymm1, ymm0
vpmaskmovd  ymm2, ymm1, YMMWORD PTR [rdx]
vpcmpeqd ymm0, ymm1, ymm0
vpmaskmovd  ymm0, ymm0, YMMWORD PTR [rcx]
vpblendvb   ymm0, ymm0, ymm2, ymm1
vmovdqu YMMWORD PTR [rsi], ymm0
vzeroupper
ret
test2(double*, double*, double*, double*):
vxorpd  xmm0, xmm0, xmm0
vcmpltpd ymm1, ymm0, YMMWORD PTR [rdi]
vpcmpeqd ymm0, ymm0, ymm0
vmaskmovpd  ymm2, ymm1, YMMWORD PTR [rdx]
vpxor   ymm0, ymm0, ymm1
vmaskmovpd  ymm0, ymm0, YMMWORD PTR [rcx]
vblendvpd   ymm0, ymm0, ymm2, ymm1
vmovupd YMMWORD PTR [rsi], ymm0
vzeroupper
ret
[/asm]

5. Compilation with -O3 -march=skylake-avx512: masked loads and a blend are used
again. This time the masked loads use kN registers to store the mask. test1()
performs the comparison twice to get the negated value. test2() uses a single
comparison, but to negate it, it moves the value to eax and then back (I will
log a separate bug for this part, as it has other implications). Code which
uses global variables uses only a blend with the mask in a ymm register.

[asm]
test1(int*, int*, int*, int*):
vmovdqu32   ymm0, YMMWORD PTR [rdi]
vpxor   xmm2, xmm2, xmm2
vpcmpd  k1, ymm0, ymm2, 6
vpcmpgtd ymm3, ymm0, ymm2
vmovdqu32   ymm1{k1}{z}, YMMWORD PTR [rdx]
vpcmpd  k1, ymm0, ymm2, 2
vmovdqu32   ymm0{k1}{z}, YMMWORD PTR [rcx]
vpblendvb   ymm0, ymm0, ymm1, ymm3
vmovdqu32   YMMWORD PTR [rsi], ymm0
vzeroupper
ret
test2(double*, double*, double*, double*):
vmovupd ymm0, YMMWORD PTR [rdi]
vxorpd  xmm1, xmm1, xmm1
vcmppd  k1, ymm0, ymm1, 14
vcmpltpd ymm1, ymm1, ymm0
kmovb   eax, k1
not eax
vmovupd ymm2{k1}{z}, YMMWORD PTR [rdx]
kmovb   k2, eax
vmovupd ymm0{k2}{z}, YMMWORD PTR [rcx]
vblendvpd   ymm0, ymm0, ymm2, ymm1
vmovupd YMMWORD PTR [rsi], ymm0
vzeroupper
ret
[/asm]

6. I tried to compile this code using icc, and got this. As you can see, it
uses masked move instead of blend. I did not check if it o

[Bug middle-end/88569] New: Track relations between variable values

2018-12-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88569

Bug ID: 88569
   Summary: Track relations between variable values
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

This example comes from code which could be compiled for various CPUs, and had
dedicated sections for AVX and SSE2. I left the original ifdefs in comments.
When the 1st loop (for AVX) ends, the following relation holds: (cnt - n <= 3).
Similarly, after the 2nd loop this holds: (cnt - n <= 1). With this knowledge
it is possible to optimize the code of bar() into baz(). This eliminates two
condition checks (after the 2nd and 3rd loop), and one increment (for the 3rd
loop). It would be nice if gcc could perform such a transformation automatically.

[code]
void foo(int n);

void bar(int cnt)
{
int n = 0;
//#ifdef __AVX__
for (; n < cnt - 3; n += 4)
foo(n);
//#endif
//#ifdef __SSE2__
for (; n < cnt - 1; n += 2)
foo(n);
//#endif
for (; n < cnt; n += 1)
foo(n);
}

void baz(int cnt)
{
int n = 0;
for (; n < cnt - 3; n += 4)
foo(n);
if (n < cnt - 1)
{
foo(n);
n += 2;
}
if (n < cnt)
foo(n);
}
[/code]

[asm]
bar(int):
push r13
push r12
mov r12d, edi
push rbp
lea ebp, [rdi-3]
push rbx
xor ebx, ebx
sub rsp, 8
test ebp, ebp
jle .L5
.L2:
mov edi, ebx
add ebx, 4
call foo(int)
cmp ebx, ebp
jl  .L2
lea eax, [r12-4]
shr eax, 2
lea ebx, [4+rax*4]
.L5:
lea ebp, [r12-1]
cmp ebp, ebx
jle .L3
mov edi, ebx
lea r13d, [rbx+2]
call foo(int)
cmp ebp, r13d
jle .L8
mov edi, r13d
call foo(int)
.L8:
lea edi, [r12-2]
sub edi, ebx
mov ebx, edi
and ebx, -2
add ebx, r13d
.L3:
cmp r12d, ebx
jle .L14
mov edi, ebx
call foo(int)
lea edi, [rbx+1]
cmp r12d, edi
jg  .L17
.L14:
add rsp, 8
pop rbx
pop rbp
pop r12
pop r13
ret
.L17:
add rsp, 8
pop rbx
pop rbp
pop r12
pop r13
jmp foo(int)
baz(int):
push r12
mov r12d, edi
push rbp
lea ebp, [rdi-3]
push rbx
xor ebx, ebx
test ebp, ebp
jle .L19
.L20:
mov edi, ebx
add ebx, 4
call foo(int)
cmp ebx, ebp
jl  .L20
lea eax, [r12-4]
shr eax, 2
lea ebx, [4+rax*4]
.L19:
lea eax, [r12-1]
cmp eax, ebx
jg  .L27
cmp ebx, r12d
jl  .L28
.L25:
pop rbx
pop rbp
pop r12
ret
.L27:
mov edi, ebx
add ebx, 2
call foo(int)
cmp ebx, r12d
jge .L25
.L28:
mov edi, ebx
pop rbx
pop rbp
pop r12
jmp foo(int)
[/asm]
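The claim that bar() and baz() are interchangeable can be checked with a small harness. In the sketch below foo() is given a body that records its argument; that body, the calls vector and same_calls() are additions for illustration, while bar() and baz() are copied from the report.

```cpp
#include <vector>

std::vector<int> calls;  // illustration only: arguments passed to foo()

void foo(int n) { calls.push_back(n); }

void bar(int cnt)
{
    int n = 0;
    for (; n < cnt - 3; n += 4)
        foo(n);
    for (; n < cnt - 1; n += 2)
        foo(n);
    for (; n < cnt; n += 1)
        foo(n);
}

void baz(int cnt)
{
    int n = 0;
    for (; n < cnt - 3; n += 4)
        foo(n);
    if (n < cnt - 1)
    {
        foo(n);
        n += 2;
    }
    if (n < cnt)
        foo(n);
}

// Returns true if bar() and baz() call foo() with the same argument sequence.
bool same_calls(int cnt)
{
    calls.clear();
    bar(cnt);
    const std::vector<int> from_bar = calls;
    calls.clear();
    baz(cnt);
    return from_bar == calls;
}
```

After the first loop cnt - n is at most 3, so the second loop body can run at most once; after that cnt - n is at most 1, so the third loop body can also run at most once. That is why each remaining loop can become a plain if.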

[Bug middle-end/88487] union prevents autovectorization

2018-12-20 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

--- Comment #6 from Daniel Fruzynski  ---
Not good. Fortunately I found a workaround. This is probably the best one can
get:

[code]
#include <cstddef>
#include <cstdint>

template<typename Type>
struct TypeHelper
{
constexpr unsigned offset();

operator Type&()
{
uint8_t*__restrict p = (uint8_t*__restrict)this - offset();
Type*__restrict pt =  (Type*__restrict)p;
return *pt;
}
};

struct S
{
struct Union
{
void*__restrict*__restrict ptr;
TypeHelper<double*__restrict*__restrict> d;
} u;
};

template<>
constexpr unsigned TypeHelper<double*__restrict*__restrict>::offset()
{
return offsetof(S::Union, d) - offsetof(S::Union, ptr);
}

void test(S* __restrict s1, S* __restrict s2)
{
for (int n = 0; n < 2; ++n)
{
s1->u.d[n][0] = s2->u.d[n][0];
s1->u.d[n][1] = s2->u.d[n][1];
}
}
[/code]

[Bug tree-optimization/88540] Issues with vectorization of min/max operations

2018-12-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88540

--- Comment #3 from Daniel Fruzynski  ---
It looks like AArch64 is also affected. This is the output from gcc 8.2 for SIZE=2:

[asm]
test(double*, double*, double*):
ldp d1, d0, [x0]
ldp d3, d2, [x1]
fcmpe   d1, d3
fcsel   d1, d1, d3, mi
fcmpe   d0, d2
fcsel   d0, d0, d2, mi
stp d1, d0, [x2]
ret
[/asm]

And this is for SIZE=4:

[asm]
test(double*, double*, double*):
ldr q5, [x0]
ldr q3, [x1]
ldr q4, [x0, 16]
ldr q2, [x1, 16]
fcmgt   v1.2d, v3.2d, v5.2d
fcmgt   v0.2d, v2.2d, v4.2d
bsl v1.16b, v5.16b, v3.16b
bsl v0.16b, v4.16b, v2.16b
str q1, [x2]
str q0, [x2, 16]
ret
[/asm]

[Bug middle-end/88542] Optimize symmetric range check

2018-12-18 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88542

--- Comment #2 from Daniel Fruzynski  ---
No, code with -ffast-math is the same.

BTW, fabs(NaN) is NaN, so result is the same as before (false).

[Bug c/88542] New: Optimize symmetric range check

2018-12-18 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88542

Bug ID: 88542
   Summary: Optimize symmetric range check
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <math.h>

bool test1(double d, double max)
{
return (d < max) && (d > -max);
}

bool test2(double d, double max)
{
return fabs(d) < max;
}
[/code]

When code checks whether some number d is inside (or outside of) a symmetric
range like (-max, max), the code from test1() can be replaced with the one from
test2(). This of course assumes that the expression does not produce any side
effects. This can be done nicely for floating point numbers stored in IEEE
format, which leads to faster code:

[asm]
test1(double, double):
vcomisd xmm1, xmm0
jbe .L6
vxorpd  xmm1, xmm1, XMMWORD PTR .LC0[rip]
vcomisd xmm0, xmm1
seta al
ret
.L6:
xor eax, eax
ret
test2(double, double):
vandpd  xmm0, xmm0, XMMWORD PTR .LC1[rip]
vcomisd xmm1, xmm0
seta al
ret
[/asm]

For integer types stored in two's complement format a similar change gives
slower code. However, on platforms which use a different integer format with a
dedicated sign bit this optimization may be beneficial.
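For the floating point case, the equivalence of the two forms (including for NaN and infinities) can be spot-checked as follows. The function names are hypothetical stand-ins for test1() and test2(); forms_agree() is an added helper.

```cpp
#include <cmath>

// Same logic as test1() from the report: two comparisons.
bool in_range_two_compares(double d, double max)
{
    return (d < max) && (d > -max);
}

// Same logic as test2() from the report: one fabs and one comparison.
bool in_range_fabs(double d, double max)
{
    return std::fabs(d) < max;
}

// Returns true if both forms give the same answer for these inputs.
bool forms_agree(double d, double max)
{
    return in_range_two_compares(d, max) == in_range_fabs(d, max);
}
```

With d = NaN every comparison is false in both forms, and with a non-positive max neither form can be true, so the two agree there as well.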

[Bug c/88540] New: Issues with vectorization of min/max operations

2018-12-18 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88540

Bug ID: 88540
   Summary: Issues with vectorization of min/max operations
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

1st issue:

[code]
#define SIZE 2

void test(double* __restrict d1, double* __restrict d2, double* __restrict d3)
{
for (int n = 0; n < SIZE; ++n)
{
d3[n] = d1[n] < d2[n] ? d1[n] : d2[n];
}
}
[/code]

When this is compiled for SSE2, gcc produces non-vectorized code:

[asm]
test(double*, double*, double*):
vmovsd  xmm0, QWORD PTR [rdi]
vminsd  xmm0, xmm0, QWORD PTR [rsi]
vmovsd  QWORD PTR [rdx], xmm0
vmovsd  xmm0, QWORD PTR [rdi+8]
vminsd  xmm0, xmm0, QWORD PTR [rsi+8]
vmovsd  QWORD PTR [rdx+8], xmm0
ret
[/asm]

When SIZE is changed to 3 or greater, the code gets vectorized properly. I
thought that this might be a workaround for some old CPU where the vector form
was slower, but this also happens when compiling with "-O3 -march=skylake". I
also checked with SIZE 6, and got 1 AVX op and 2 scalar SSE ones. It looks like
this is an off-by-one bug.

The same happens for code with other relational operators (>, <=, >=).

2nd issue: when compiling for AVX512, gcc does not use new instructions which
use ZMM registers, it still generates code for YMM ones.

[Bug middle-end/88487] union prevents autovectorization

2018-12-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

--- Comment #4 from Daniel Fruzynski  ---
OK, I see. Is there any workaround for this? I tried to assign pointer to local
variable directly and with intermediate casting via void*, but it did not help.
Casting S1* to S2* also does not work.

[Bug middle-end/88490] Missed autovectorization when indices are different

2018-12-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490

--- Comment #3 from Daniel Fruzynski  ---
In this case s->d is a pointer to pointer to double, and both pointer levels
have the restrict qualifier. I wonder if you could add some tag saying that
s->d[n] and s->d[k] point to separate memory areas. This tag could later be
used to determine that s->d[n][0] and s->d[k][0] also do not overlap.

[Bug middle-end/88490] Missed autovectorization when indices are different

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490

--- Comment #1 from Daniel Fruzynski  ---
Ehh, small typo. This is correct version, also not vectorized:

[code]
struct S
{
double* __restrict__ * __restrict__ d;
};

void test(S* __restrict__ s, int n, int k)
{
if (n > k)
{
for (int i = 0; i < 2; ++i)
{
s->d[n][0] = s->d[k][0];
s->d[n][1] = s->d[k][1];
}
}
}
[/code]

[asm]
test(S*, int, int):
cmp esi, edx
jle .L3
mov rax, QWORD PTR [rdi]
movsx   rdx, edx
mov rdx, QWORD PTR [rax+rdx*8]
movsx   rsi, esi
vmovsd  xmm0, QWORD PTR [rdx]
mov rax, QWORD PTR [rax+rsi*8]
vmovsd  QWORD PTR [rax], xmm0
vmovsd  xmm0, QWORD PTR [rdx+8]
vmovsd  QWORD PTR [rax+8], xmm0
vmovsd  xmm0, QWORD PTR [rdx]
vmovsd  QWORD PTR [rax], xmm0
vmovsd  xmm0, QWORD PTR [rdx+8]
vmovsd  QWORD PTR [rax+8], xmm0
.L3:
ret
[/asm]

[Bug middle-end/88490] New: Missed autovectorization when indices are different

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88490

Bug ID: 88490
   Summary: Missed autovectorization when indices are different
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

The code below reads and writes data using different indices, which is checked
by the "if" above the loop. This can be autovectorized, as the two memory areas
do not overlap. Code compiled with -O3 -march=skylake-avx512.

[code]
struct S
{
double* __restrict__ * __restrict__ d;
};

void test(S* __restrict__ s, int n, int k)
{
if (n > k)
{
for (int n = 0; n < 2; ++n)
{
s->d[n][0] = s->d[k][0];
s->d[n][1] = s->d[k][1];
}
}
}
[/code]

[asm]
test(S*, int, int):
cmp esi, edx
jle .L3
mov rcx, QWORD PTR [rdi]
movsx   rdx, edx
mov rax, QWORD PTR [rcx+rdx*8]
mov rdx, QWORD PTR [rcx]
vmovsd  xmm0, QWORD PTR [rax]
vmovsd  QWORD PTR [rdx], xmm0
vmovsd  xmm0, QWORD PTR [rax+8]
vmovsd  QWORD PTR [rdx+8], xmm0
vmovsd  xmm0, QWORD PTR [rax]
mov rdx, QWORD PTR [rcx+8]
vmovsd  QWORD PTR [rdx], xmm0
vmovsd  xmm0, QWORD PTR [rax+8]
vmovsd  QWORD PTR [rdx+8], xmm0
.L3:
ret
[/asm]

[Bug middle-end/88487] union prevents autovectorization

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

--- Comment #2 from Daniel Fruzynski  ---
I spotted that test3 in the previous comment uses structure S2, which does not
have a union inside. When I changed it to use S1, I got non-vectorized code. So
this workaround does not work.

[Bug middle-end/88487] union prevents autovectorization

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

--- Comment #1 from Daniel Fruzynski  ---
Update: when pointers to data are copied to local variables like below,
autovectorization starts working again.

[code]
void test3(S2* __restrict__ s1, S2* __restrict__ s2)
{
double* __restrict__ * __restrict__ d1 = s1->d;
double* __restrict__ * __restrict__ d2 = s2->d;
for (int n = 0; n < 2; ++n)
{
d1[n][0] = d2[n][0];
d1[n][1] = d2[n][1];
}
}
[/code]

[Bug middle-end/88487] New: union prevents autovectorization

2018-12-13 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88487

Bug ID: 88487
   Summary: union prevents autovectorization
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

When the pointer to data is inside a union, loops are not autovectorized. This
also happens when I remove the "i" field from the union, so that it has only
one field. Code compiled with -O3 -mavx.

[code]
struct S1
{
union
{
double* __restrict__ * __restrict__ d;
int* __restrict__ * __restrict__ i;
} u;
};

struct S2
{
double* __restrict__ * __restrict__ d;
};

void test1(S1* __restrict__ s1, S1* __restrict__ s2)
{
for (int n = 0; n < 2; ++n)
{
s1->u.d[n][0] = s2->u.d[n][0];
s1->u.d[n][1] = s2->u.d[n][1];
}
}

void test2(S2* __restrict__ s1, S2* __restrict__ s2)
{
for (int n = 0; n < 2; ++n)
{
s1->d[n][0] = s2->d[n][0];
s1->d[n][1] = s2->d[n][1];
}
}
[/code]

[asm]
test1(S1*, S1*):
mov rdx, QWORD PTR [rsi]
mov rax, QWORD PTR [rdi]
mov rsi, QWORD PTR [rdx]
mov rcx, QWORD PTR [rax]
mov rdx, QWORD PTR [rdx+8]
mov rax, QWORD PTR [rax+8]
vmovsd  xmm0, QWORD PTR [rsi]
vmovsd  QWORD PTR [rcx], xmm0
vmovsd  xmm0, QWORD PTR [rsi+8]
vmovsd  QWORD PTR [rcx+8], xmm0
vmovsd  xmm0, QWORD PTR [rdx]
vmovsd  QWORD PTR [rax], xmm0
vmovsd  xmm0, QWORD PTR [rdx+8]
vmovsd  QWORD PTR [rax+8], xmm0
ret
test2(S2*, S2*):
mov rdx, QWORD PTR [rsi]
mov rax, QWORD PTR [rdi]
mov rcx, QWORD PTR [rdx]
mov rdx, QWORD PTR [rdx+8]
vmovupd xmm0, XMMWORD PTR [rcx]
mov rcx, QWORD PTR [rax]
mov rax, QWORD PTR [rax+8]
vmovups XMMWORD PTR [rcx], xmm0
vmovupd xmm0, XMMWORD PTR [rdx]
vmovups XMMWORD PTR [rax], xmm0
ret
[/asm]

[Bug target/88473] AVX512: constant folding on mask does not remove unnecessary instructions

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88473

--- Comment #2 from Daniel Fruzynski  ---
I was playing with Compiler Explorer to see how compilers optimize various
pieces of code. I found that the next clang version (currently trunk) will be
able to analyze expressions which span vectors, masks and GPRs. I logged Bug
88476 to do something similar in gcc, please take a look. I think an approach
like the one in clang would be more beneficial.

In the past I also thought about a template-based library which would wrap
vector operations. One of its unique concepts was to create separate types to
hold a vector of bool values, and another one for int masks. With lazy
instantiation this should lead to faster resulting code. I have not tried to
write it yet, but overall this approach looks promising to me. With it, cases
such as the one in this bug can appear as a side effect of inlining.

[Bug middle-end/88476] New: Optimize expressions which uses vector, mask and general purpose registers

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88476

Bug ID: 88476
   Summary: Optimize expressions which uses vector, mask and
general purpose registers
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I was playing with Compiler Explorer to see how compilers can optimize various
pieces of code. I found that the next version of clang (version 8.0.0 (trunk
348905)) can optimize expressions which use vector, mask and general purpose
registers. Such an approach opens new optimization possibilities. Here are two
example functions which demonstrate this:
[code]
#include <immintrin.h>

void test1(void* data1, void* data2)
{
__m128i v1 = _mm_load_si128((__m128i const*)data1);
__m128i v2 = _mm_load_si128((__m128i const*)data2);
__mmask8 m1 = _mm_testn_epi16_mask(v1, v1);
__mmask8 m2 = _mm_testn_epi16_mask(v2, v2);
__mmask8 m = (m1 | 3) & (m2 | 3);
v1 = _mm_maskz_add_epi16(m, v1, v2);
_mm_store_si128((__m128i*)data2, v1);
}

void test2(void* data1, void* data2)
{
__m128i v1 = _mm_load_si128((__m128i const*)data1);
__m128i v2 = _mm_load_si128((__m128i const*)data2);
__mmask8 m1 = _mm_testn_epi16_mask(v1, v1);
__mmask8 m2 = _mm_testn_epi16_mask(v2, v2);
m1 = _kor_mask8(m1, 3);
m2 = _kor_mask8(m2, 3);
__mmask8 m = _kand_mask8(m1, m2);
v1 = _mm_maskz_add_epi16(m, v1, v2);
_mm_store_si128((__m128i*)data2, v1);
}
[/code]

When compiled using clang with -O3 -march=skylake-avx512, both are optimized to
the same code:

[asm]
test(void*, void*): # @test(void*, void*)
  vmovdqa xmm0, xmmword ptr [rdi]
  vmovdqa xmm1, xmmword ptr [rsi]
  vpor xmm2, xmm1, xmm0
  vptestnmw k0, xmm2, xmm2
  mov al, 3
  kmovd k1, eax
  korb k1, k0, k1
  vpaddw xmm0 {k1} {z}, xmm1, xmm0
  vmovdqa xmmword ptr [rsi], xmm0
  ret
[/asm]

gcc 9.0.0 20181211 (experimental) produces this:

[asm]
test1(void*, void*):
  vmovdqa64 xmm1, XMMWORD PTR [rsi]
  vmovdqa64 xmm0, XMMWORD PTR [rdi]
  vptestnmw k1, xmm1, xmm1
  vptestnmw k2{k1}, xmm0, xmm0
  kmovb eax, k2
  or eax, 3
  kmovb k3, eax
  vpaddw xmm0{k3}{z}, xmm0, xmm1
  vmovaps XMMWORD PTR [rsi], xmm0
  ret
test2(void*, void*):
  vmovdqa64 xmm0, XMMWORD PTR [rdi]
  vmovdqa64 xmm1, XMMWORD PTR [rsi]
  vptestnmw k1, xmm0, xmm0
  vptestnmw k3, xmm1, xmm1
  mov eax, 3
  kmovb k2, eax
  korb k1, k1, k2
  korb k0, k3, k2
  kandb k1, k1, k0
  vpaddw xmm0{k1}{z}, xmm0, xmm1
  vmovaps XMMWORD PTR [rsi], xmm0
  ret
[/asm]
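The clang output relies on the identity (m1 | 3) & (m2 | 3) == (m1 & m2) | 3: it tests v1 | v2 for zero lanes once (which gives m1 & m2) and then applies a single korb with 3. The identity itself can be confirmed exhaustively for 8-bit masks with a scalar model; mask_identity_holds() is a hypothetical helper written for this check.

```cpp
#include <cstdint>

// Exhaustively checks (m1 | 3) & (m2 | 3) == (m1 & m2) | 3 over all 8-bit masks.
bool mask_identity_holds()
{
    for (unsigned m1 = 0; m1 < 256; ++m1)
    {
        for (unsigned m2 = 0; m2 < 256; ++m2)
        {
            const std::uint8_t lhs = static_cast<std::uint8_t>((m1 | 3u) & (m2 | 3u));
            const std::uint8_t rhs = static_cast<std::uint8_t>((m1 & m2) | 3u);
            if (lhs != rhs)
                return false;
        }
    }
    return true;
}
```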

[Bug target/88473] New: AVX512: constant folding on mask does not remove unnecessary instructions

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88473

Bug ID: 88473
   Summary: AVX512: constant folding on mask does not remove
unnecessary instructions
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <immintrin.h>

void test(void* data, void* data2)
{
__m128i v = _mm_load_si128((__m128i const*)data);
__mmask8 m = _mm_testn_epi16_mask(v, v);
m = _kor_mask8(m, 0x0f);
m = _kor_mask8(m, 0xf0);
v = _mm_maskz_add_epi16(m, v, v);
_mm_store_si128((__m128i*)data2, v);
}
[/code]

Code compiled using gcc 8.2 with -O3 -march=skylake-avx512. gcc was able to
fold the constant expressions and simplify the masked vector add to a
non-masked one. However, the original version of the folded expression is still
present in the output:

[asm]
test(void*, void*):
  vmovdqa64 xmm0, XMMWORD PTR [rdi]
  mov eax, 15
  vptestnmw k1, xmm0, xmm0
  kmovb k2, eax
  vpaddw xmm0, xmm0, xmm0
  mov eax, -16
  kmovb k3, eax
  vmovaps XMMWORD PTR [rsi], xmm0
  korb k0, k1, k2
  korb k0, k0, k3
  ret
[/asm]

clang properly cleaned it up:

[asm]
test(void*, void*): # @test(void*, void*)
  vmovdqa xmm0, xmmword ptr [rdi]
  vpaddw xmm0, xmm0, xmm0
  vmovdqa xmmword ptr [rsi], xmm0
  ret
[/asm]
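The reason the masked add can become an unmasked one is that the two korb operations always produce an all-ones mask, whatever vptestnmw returned. A scalar model of that folding, with folded_mask() and always_all_ones() as hypothetical names:

```cpp
#include <cstdint>

// Scalar model of the two _kor_mask8 calls in test(): (m | 0x0f) | 0xf0.
constexpr std::uint8_t folded_mask(std::uint8_t m)
{
    return static_cast<std::uint8_t>((m | 0x0f) | 0xf0);
}

// Returns true when every possible input mask folds to all-ones (0xff).
bool always_all_ones()
{
    for (unsigned m = 0; m < 256; ++m)
        if (folded_mask(static_cast<std::uint8_t>(m)) != 0xff)
            return false;
    return true;
}
```

Since the result does not depend on m, the vptestnmw, kmovb and korb instructions in the gcc output are dead code once the add has been simplified.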

[Bug target/88465] AVX512: optimize loading of constant values to kN registers

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465

--- Comment #3 from Daniel Fruzynski  ---
This "null" is an icc bug. Matt Godbolt from Compiler Explorer filed a bug with
Intel: ref 03997020

[Bug target/88465] AVX512: optimize loading of constant values to kN registers

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465

--- Comment #2 from Daniel Fruzynski  ---
I have logged an issue for Compiler Explorer to clarify this null instruction:
https://github.com/mattgodbolt/compiler-explorer/issues/1220

[Bug c/88465] New: AVX512: optimize loading of constant values to kN registers

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88465

Bug ID: 88465
   Summary: AVX512: optimize loading of constant values to kN
registers
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

When a constant value is loaded into a kN register, gcc puts it into eax first
and then moves it to the kN register:

[code]
#include <immintrin.h>

__mmask8 test(__mmask8 m)
{
__mmask8 m2 = _kand_mask8(m, 3);
return m2;
}
[/code]

[asm]
test(unsigned char):
mov eax, 3
kmovb   k1, eax
kmovb   k2, edi
kandb   k0, k1, k2
kmovb   eax, k0
ret
[/asm]

icc uses one instruction for this. https://godbolt.org/ displayed it as "null",
but this is most probably a wrong name:

[asm]
test(unsigned char):
vkmovb k0, edi    #6.19
null   k1, 3      #6.19
kandb  k2, k0, k1 #6.19
vkmovb eax, k2    #6.19
ret               #7.12
[/asm]

You can also use instructions kxor and kxnor to load 0 and -1.
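
The kxor/kxnor remark relies on two bit identities: a value XOR-ed with itself is always zero, and the complement of that is all ones, whatever the register held before. A minimal plain C++ sketch of those identities follows. It models an 8-bit mask with `uint8_t` and does not use the real intrinsics; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models "kxorb k, k, k": XOR of a value with itself is always 0.
std::uint8_t mask_zero(std::uint8_t k) {
    return static_cast<std::uint8_t>(k ^ k);
}

// Models "kxnorb k, k, k": the complement of (k ^ k) has every bit set.
std::uint8_t mask_all_ones(std::uint8_t k) {
    return static_cast<std::uint8_t>(~(k ^ k));
}

// Checks both identities for every possible 8-bit input.
void check_mask_identities() {
    for (int k = 0; k < 256; ++k) {
        assert(mask_zero(static_cast<std::uint8_t>(k)) == 0x00);
        assert(mask_all_ones(static_cast<std::uint8_t>(k)) == 0xff);
    }
}
```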

[Bug target/88461] AVX512: gcc should keep value in kN registers if possible

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461

--- Comment #3 from Daniel Fruzynski  ---
Good catch, mask should be 16-bit. Here is fixed version:

[code]
#include <immintrin.h>
#include <stdint.h>

int test(uint16_t* data, int a)
{
__m128i v = _mm_load_si128((const __m128i*)data);
__mmask16 m = _mm_testn_epi16_mask(v, v);
m = _kshiftli_mask16(m, 1);
m = _kandn_mask16(m, a);
return m;
}
[/code]

[asm]
test(unsigned short*, int):
vmovdqa64   xmm0, XMMWORD PTR [rdi]
kmovw   k4, esi
vptestnmw   k1, xmm0, xmm0
kmovb   eax, k1
kmovw   k2, eax
kshiftlw k0, k2, 1
kandnw  k3, k0, k4
kmovw   eax, k3
ret
[/asm]

This can still be optimized: there is no need to move the value from k1 to eax
and then to k2, because vptestnmw zeroes the upper bits of the k register.

[Bug c/81665] Please introduce flags attribute for enums which will mimic one from C#

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81665

--- Comment #4 from Daniel Fruzynski  ---
@Jonathan Wakely: constexpr requires C++11. When I reported this bug, we were
still at C++98 with most of our codebase.

[Bug target/88461] AVX512: gcc should keep value in kN registers if possible

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461

--- Comment #1 from Daniel Fruzynski  ---
For comparison, this is code generated by icc 19.0.1:

[asm]
test(unsigned short*, int):
vmovdqu   xmm0, XMMWORD PTR [rdi]   #6.48
vptestnmw k0, xmm0, xmm0            #7.18
kmovw     k2, esi                   #11.9
kshiftlw  k1, k0, 1                 #9.9
kandnw    k3, k1, k2                #11.9
kmovb     k4, k3                    #13.12
kmovw     eax, k4                   #13.12
ret                                 #13.12
[/asm]

[Bug c/88461] New: AVX512: gcc should keep value in kN registers if possible

2018-12-12 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88461

Bug ID: 88461
   Summary: AVX512: gcc should keep value in kN registers if
possible
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I tried to write a piece of code which used the new AVX512 logic instructions
that work on kN registers. It turned out that gcc was moving intermediate values
back and forth between kN and eax, which resulted in very poor code.

The example was compiled using gcc 8.2 with -O3 -march=skylake-avx512.

[code]
#include <immintrin.h>
#include <stdint.h>

int test(uint16_t* data, int a)
{
__m128i v = _mm_load_si128((const __m128i*)data);
__mmask8 m = _mm_testn_epi16_mask(v, v);
m = _kshiftli_mask16(m, 1);
m = _kandn_mask16(m, a);
return m;
}
[/code]

[asm]
test(unsigned short*, int):
vmovdqa64   xmm0, XMMWORD PTR [rdi]
kmovw   k5, esi
vptestnmw   k1, xmm0, xmm0
kmovb   eax, k1
kmovw   k2, eax
kshiftlw k0, k2, 1
kmovw   eax, k0
movzx   eax, al
kmovw   k4, eax
kandnw  k3, k4, k5
kmovw   eax, k3
movzx   eax, al
ret
[/asm]

[Bug target/88271] Omit test instruction after add

2018-12-10 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #10 from Daniel Fruzynski  ---
Here is a possible transformation of the code into an equivalent form to which
this optimization can be applied easily. The change also has a slightly
surprising side effect: the second nested while loop is unrolled.

[code]
void test2()
{
int level = 0;
int val = 1;
while (1)
{
while(1)
{
val = data[level] << 1;
++level;
if (val)
continue;
else
break;
}

while(1)
{
--level;
val = data[level];
if (!val)
continue;
else
break;

}
}
}
[/code]
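
One way to gain confidence that the two shapes are equivalent is to run both for a bounded number of steps and compare the sequence of (level, val) pairs they produce. The sketch below does that. The step limit, the `sample_data` contents (chosen so that level stays inside the array), and all function names are assumptions added for illustration; the original loops run forever and have no such limit.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical input; with these values level only moves between 1 and 3.
static const int sample_data[8] = {1, 1, 0, 0, 0, 0, 0, 0};

// One (level, val) pair is recorded after every executed step.
using Trace = std::vector<std::pair<int, int>>;

// Original shape: a single loop with an if/else on val.
Trace run_original(int steps) {
    Trace t;
    int level = 0, val = 1;
    while (static_cast<int>(t.size()) < steps) {
        if (val) {
            val = sample_data[level] << 1;
            ++level;
        } else {
            --level;
            val = sample_data[level];
        }
        t.emplace_back(level, val);
    }
    return t;
}

// Transformed shape: two inner loops, each testing the value it just computed.
Trace run_transformed(int steps) {
    Trace t;
    int level = 0, val = 1;
    while (static_cast<int>(t.size()) < steps) {
        while (static_cast<int>(t.size()) < steps) {
            val = sample_data[level] << 1;
            ++level;
            t.emplace_back(level, val);
            if (!val) break;   // leave the "going up" loop once val is zero
        }
        while (static_cast<int>(t.size()) < steps) {
            --level;
            val = sample_data[level];
            t.emplace_back(level, val);
            if (val) break;    // leave the "going down" loop once val is nonzero
        }
    }
    return t;
}
```

Matching traces for one input do not prove equivalence in general; they only show that the branch tested after each update is the same in both forms.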

[Bug target/88271] Omit test instruction after add

2018-12-07 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #9 from Daniel Fruzynski  ---
I have an idea for an alternative approach to this. gcc could look for
relations between a loop control statement and other statements which modify
variables used in that control statement. With such knowledge it could try to
reorganize the code to optimize it better. This approach would eliminate the
randomness here.

[Bug target/88271] Omit test instruction after add

2018-12-06 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #8 from Daniel Fruzynski  ---
I have results from Callgrind. The cycle estimation for the MoveRows function
(without children) is 58.29%. This is for the app without the test instruction.
So in a synthetic benchmark for this function only, the speed change would be
about 2%.

[Bug target/88271] Omit test instruction after add

2018-12-06 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #7 from Daniel Fruzynski  ---
One more note: this particular function creates matrices with all possible
permutations of the row order of the original matrix which satisfy some
additional criteria. So this optimization may be applicable to other algorithms
which generate permutations.

[Bug target/88271] Omit test instruction after add

2018-12-06 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #6 from Daniel Fruzynski  ---
Average for version with test is 246.313ms, I deleted too many digits.

[Bug target/88271] Omit test instruction after add

2018-12-06 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #5 from Daniel Fruzynski  ---
How do I use perf? I have not had a chance to use it yet; I usually use the
time command or callgrind.

I ran my app, compiled with AVX2 instructions, on a Xeon E5-2683 v3 under
CentOS 7.6 on an idle CPU. I ran it 3 times for each version (with and without
the test instruction). Here are the results:

With test: Average 246,3 ms, StdDev 0,198
W/o test: Average 244,013ms, StdDev 0,043

The version with test is 0,94% slower, which is the result I expected.

[Bug target/88271] Omit test instruction after add

2018-12-06 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #3 from Daniel Fruzynski  ---
What about adding a new pass at the end? It would look for various possible
optimizations which were missed earlier because they cross basic block
boundaries.

In my case this example code is part of a tight loop. From previous experience
with it I expect that this optimization could improve speed by something like
0.5%-1%. If you want to look at the real code, it is at the link below. The CPU
spends about 60% of its time in this one function. This app runs on the BOINC
platform, so such a micro-optimization would be worthwhile there.

https://github.com/sirzooro/RakeSearch/blob/optimizations2/RakeDiagSearch/RakeDiagSearch/MovePairSearch.cpp#L583

[Bug target/88271] Omit test instruction after add

2018-12-06 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

--- Comment #1 from Daniel Fruzynski  ---
I checked that in simple case when bit shift is used in "if", it is optimized:

[code]
void f();
void g();

void test(int n)
{
if (n << 1)
f();
else
g();
}
[/code]

[asm]
test(int):
add edi, edi
je  .L2
jmp f()
.L2:
jmp g()
[/asm]

[Bug middle-end/88387] New: Possible code optimization when right shift count >= width of type

2018-12-06 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88387

Bug ID: 88387
   Summary: Possible code optimization when right shift count >=
width of type
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

When a signed int is shifted right by more than its width, the result will be
either 0 or -1. This can be used to simplify conditions like the one in test()
to a sign check as in test2(). The example code below was compiled using gcc 8.2
with -O3 -march=skylake-avx512.

[code]
void f();
void g();

void test(int n)
{
if (n >> )
f();
else
g();
}

void test2(int n)
{
if (n < 0)
f();
else
g();
}
[/code]

[asm]
test(int):
mov eax, -82
sarx edi, edi, eax
test edi, edi
je  .L2
jmp f()
.L2:
jmp g()
test2(int):
test edi, edi
js  .L6
jmp g()
.L6:
jmp f()
[/asm]
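
The sign-check idea can be illustrated portably with the largest shift count that C++ defines for int, namely width - 1. The sketch below uses that count rather than a larger one, because shifting by the full width or more is undefined behaviour in the language itself. It also assumes an arithmetic right shift for negative values, which is what gcc and clang implement (and what C++20 requires). The function names are hypothetical.

```cpp
#include <cassert>
#include <climits>

// Shifts by width - 1, so only the sign bit survives: the result is 0 or -1.
bool sign_via_shift(int n) {
    const int top = static_cast<int>(sizeof(int) * CHAR_BIT) - 1;
    return (n >> top) != 0;
}

// The simplified condition the report asks for.
bool sign_via_compare(int n) {
    return n < 0;
}
```

For every int both functions return the same value, which is the simplification from test() to test2() applied at the boundary where the shift is still well defined.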

[Bug middle-end/88361] gcc does not unroll loop

2018-12-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88361

--- Comment #1 from Daniel Fruzynski  ---
For reference, this is NEON code which I used on AARCH64:

[code]
void test2()
{
int n = 0;
for (; n < SIZE*SIZE-3; n += 4)
{
// Copy data
uint32x4_t v = vld1q_u32((uint32_t*)(&src[0][0] + n));
vst1q_u32((uint32_t*)(&dst1[0][0] + n), v);

// Calculate bitmasks
v = vshlq_u32(vdupq_n_u32(1), vreinterpretq_s32_u32(v));
vst1q_u32((uint32_t*)(&dst2[0][0] + n), v);
}

for (; n < SIZE*SIZE; n++)
{
int x = *(&src[0][0] + n);
*((&dst1[0][0] + n)) = x;
*((&dst2[0][0] + n)) = 1 << x;
}
}
}
[/code]

[Bug middle-end/88361] New: gcc does not unroll loop

2018-12-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88361

Bug ID: 88361
   Summary: gcc does not unroll loop
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include "immintrin.h"

#define SIZE 9

int src[SIZE][SIZE] __attribute__((aligned(16)));
int dst1[SIZE][SIZE] __attribute__((aligned(16)));
int dst2[SIZE][SIZE] __attribute__((aligned(16)));

void test1()
{
for (int i = 0; i < SIZE; ++i)
{
for (int j = 0; j < SIZE; ++j)
{
dst1[i][j] = src[i][j];
dst2[i][j] = 1u << src[i][j];
}
}
}

#pragma GCC push_options
#pragma GCC optimize ("unroll-loops")
void test2()
{
int n = 0;
for (; n < SIZE*SIZE-3; n += 4)
{
// Copy data
__m128i v = _mm_load_si128((const __m128i*)(&src[0][0] + n));
_mm_store_si128((__m128i*)(&dst1[0][0] + n), v);

// Calculate bitmasks
v = _mm_sllv_epi32(_mm_set1_epi32(1), v);
_mm_store_si128((__m128i*)(&dst2[0][0] + n), v);
}

for (; n < SIZE*SIZE; n++)
{
int x = *(&src[0][0] + n);
*((&dst1[0][0] + n)) = x;
*((&dst2[0][0] + n)) = 1 << x;
}
}
}
#pragma GCC pop_options
[/code]

When the code above is compiled using gcc 8.2 with -O3 -mavx2 -mprefer-avx128,
the loops in test1() are unrolled and vectorized as expected. However, in
test2() the loops are not unrolled completely, even with the unroll pragma:

[asm]
test2():
  mov eax, OFFSET FLAT:dst1
  mov esi, OFFSET FLAT:src
  mov ecx, 40
  xor edx, edx
  mov rdi, rax
  vmovdqa xmm1, XMMWORD PTR .LC0[rip]
  rep movsq
.L4:
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rdx]
  lea rax, [rdx+16]
  vmovaps XMMWORD PTR dst2[rdx], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rdx+16]
  vmovaps XMMWORD PTR dst2[rax], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rdx+32]
  vmovaps XMMWORD PTR dst2[rdx+32], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rax+32]
  lea rdx, [rax+144]
  vmovaps XMMWORD PTR dst2[rax+32], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rax+48]
  vmovaps XMMWORD PTR dst2[rax+48], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rax+64]
  vmovaps XMMWORD PTR dst2[rax+64], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rax+80]
  vmovaps XMMWORD PTR dst2[rax+80], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rax+96]
  vmovaps XMMWORD PTR dst2[rax+96], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rax+112]
  vmovaps XMMWORD PTR dst2[rax+112], xmm0
  vpsllvd xmm0, xmm1, XMMWORD PTR src[rax+128]
  vmovaps XMMWORD PTR dst2[rax+128], xmm0
  cmp rax, 176
  jne .L4
  mov ecx, DWORD PTR src[rip+320]
  mov eax, 1
  sal eax, cl
  mov DWORD PTR dst1[rip+320], ecx
  mov DWORD PTR dst2[rip+320], eax
  ret
[/asm]

This issue also exists in gcc 8.2 for AARCH64. I found it there first, and then
checked that it is also present on x86_64.

[Bug bootstrap/88321] Crosscompiled gcc does not use preinstalled as

2018-12-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88321

--- Comment #1 from Daniel Fruzynski  ---
Update: there is a workaround for this: pass
"--with-ld=/bin/x86_64-w64-mingw32-ld --with-as=/bin/x86_64-w64-mingw32-as" to
the configure script.

I also tried to use "--with-ld=x86_64-w64-mingw32-ld
--with-as=x86_64-w64-mingw32-as". With these options the initial configure
completed successfully, but the build failed because the command could not be
found. Probably the PATH environment variable was cleared during the build, and
the full path to the specified as/ld was not saved.

[Bug bootstrap/88321] New: Crosscompiled gcc does not use precompiled as

2018-12-03 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88321

Bug ID: 88321
   Summary: Crosscompiled gcc does not use precompiled as
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I have built gcc 8.2.0 as a crosscompiler for CentOS 7 x86_64 -> MinGW x86_64.
Before starting I installed the gcc 4.9.3 MinGW crosscompiler from the EPEL
repository.

My new crosscompiler was configured in the following way:

[root@localhost build]# ../gcc-8.2.0/configure --prefix=/gcc-8.2-mingw/
--enable-languages=c,c++ --disable-multilib --build=x86_64-redhat-linux-gnu
--host=x86_64-redhat-linux-gnu --target=x86_64-w64-mingw32 --without-newlib
--disable-multilib --disable-plugin --with-system-zlib --disable-nls
--without-included-gettext --disable-win32-registry --enable-threads=posix
--enable-libgomp --with-sysroot=/usr/x86_64-w64-mingw32/sys-root
--with-gxx-include-dir=/usr/x86_64-w64-mingw32/sys-root/mingw/include/c++
checking build system type... x86_64-redhat-linux-gnu
checking host system type... x86_64-redhat-linux-gnu
checking target system type... x86_64-w64-mingw32
checking for a BSD-compatible install... /bin/install -c
checking whether ln works... yes
checking whether ln -s works... yes 
checking for a sed that does not truncate output... /bin/sed
checking for gawk... gawk   
checking for libatomic support... yes
checking for libitm support... no
checking for libsanitizer support... no
checking for libvtv support... yes
checking for libmpx support... no
checking for libhsail-rt support... no
checking for x86_64-redhat-linux-gnu-gcc... no
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for x86_64-redhat-linux-gnu-g++... no
checking for x86_64-redhat-linux-gnu-c++... no
checking for x86_64-redhat-linux-gnu-gpp... no
checking for x86_64-redhat-linux-gnu-aCC... no
checking for x86_64-redhat-linux-gnu-CC... no
checking for x86_64-redhat-linux-gnu-cxx... no
checking for x86_64-redhat-linux-gnu-cc++... no
checking for x86_64-redhat-linux-gnu-cl.exe... no
checking for x86_64-redhat-linux-gnu-FCC... no
checking for x86_64-redhat-linux-gnu-KCC... no
checking for x86_64-redhat-linux-gnu-RCC... no
checking for x86_64-redhat-linux-gnu-xlC_r... no
checking for x86_64-redhat-linux-gnu-xlC... no
checking for g++... g++
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking whether g++ accepts -static-libstdc++ -static-libgcc... no
checking for x86_64-redhat-linux-gnu-gnatbind... no
checking for gnatbind... no
checking for x86_64-redhat-linux-gnu-gnatmake... no
checking for gnatmake... no
checking whether compiler driver understands Ada... no
checking how to compare bootstrapped objects... cmp --ignore-initial=16 $$f1
$$f2
checking for objdir... .libs
configure: WARNING: using in-tree isl, disabling version check
*** This configuration is not supported in the following subdirectories:
 zlib target-libitm target-libsanitizer target-libmpx target-libffi
target-libgo gnattools gotools target-libada target-libhsail-rt
target-libgfortran target-libbacktrace target-libobjc target-liboffloadmic
(Any other directories should still work fine.)
checking for default BUILD_CONFIG... 
checking for --enable-vtable-verify... no
checking for bison... bison -y
checking for bison... bison
checking for gm4... no
checking for gnum4... no
checking for m4... m4
checking for flex... flex
checking for flex... flex
checking for makeinfo... makeinfo
checking for expect... no
checking for runtest... no
checking for x86_64-redhat-linux-gnu-ar... no
checking for ar... ar
checking for x86_64-redhat-linux-gnu-as... no
checking for as... as
checking for x86_64-redhat-linux-gnu-dlltool... no
checking for dlltool... no
checking for x86_64-redhat-linux-gnu-ld... no
checking for ld... ld
checking for x86_64-redhat-linux-gnu-lipo... no
checking for lipo... no
checking for x86_64-redhat-linux-gnu-nm... no
checking for nm... nm
checking for x86_64-redhat-linux-gnu-ranlib... no
checking for ranlib... ranlib
checking for x86_64-redhat-linux-gnu-strip... no
checking for strip... strip
checking for x86_64-redhat-linux-gnu-windres... no
checking for windres... no
checking for x86_64-redhat-linux-gnu-windmc... no
checking for windmc... no
checking for x86

[Bug c/88276] New: AVX512: reorder bit ops to get free and operation

2018-11-30 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88276

Bug ID: 88276
   Summary: AVX512: reorder bit ops to get free and operation
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <immintrin.h>

int test1(const __m128i* src, int mask)
{
__m128i v = _mm_load_si128(src);
int cmp = _mm_cmpeq_epi16_mask(v, _mm_setzero_si128());
return (cmp << 1) & mask;
}

int test2(const __m128i* src, int mask)
{
__m128i v = _mm_load_si128(src);
int cmp = _mm_cmpeq_epi16_mask(v, _mm_setzero_si128());
return (cmp & (mask >> 1)) << 1;
}
[/code]

test1() shifts the result of _mm_cmpeq_epi16_mask() first, then ANDs it with
mask. In test2() mask is shifted first, then ANDed with the cmp result, and the
result is shifted back. In this case the result of _mm_cmpeq_epi16_mask() uses
8 bits only, so both code versions are equivalent.
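
The equivalence claim can be checked exhaustively for the 8-bit range of the compare result. The following sketch does so in plain C++, without the intrinsics; the function names are hypothetical, and the right shift of a negative mask is assumed to be arithmetic, as it is on gcc and clang.

```cpp
#include <cassert>

// Shape of test1(): shift the compare result, then AND with the mask.
int shift_then_and(int cmp, int mask) {
    return (cmp << 1) & mask;
}

// Shape of test2(): shift the mask, AND, then shift the result back.
int and_then_shift(int cmp, int mask) {
    return (cmp & (mask >> 1)) << 1;
}

// Returns true if both forms agree for every 8-bit cmp value and this mask.
bool forms_agree(int mask) {
    for (int cmp = 0; cmp < 256; ++cmp) {
        if (shift_then_and(cmp, mask) != and_then_shift(cmp, mask)) return false;
    }
    return true;
}
```

Both forms compute, for each result bit j from 1 to 8, bit j - 1 of cmp AND bit j of mask, which is why the 8-bit bound on cmp matters: a wider cmp could lose its top bit in the first form only.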

This compiles to the following asm code, using gcc 8.2 with -O3
-march=skylake-avx512:

[asm]
test1(long long __vector(2) const*, int):
vpxor   xmm0, xmm0, xmm0
vpcmpeqw k1, xmm0, XMMWORD PTR [rdi]
kmovb   edx, k1
lea eax, [rdx+rdx]
and eax, esi
ret
test2(long long __vector(2) const*, int):
mov eax, esi
sar eax
vpxor   xmm0, xmm0, xmm0
kmovb   k2, eax
vpcmpeqw k1{k2}, xmm0, XMMWORD PTR [rdi]
kmovb   eax, k1
add eax, eax
ret
[/asm]

Such a change may lead to more efficient code, as with AVX512 this AND
operation can be merged into the vpcmpeqw instruction. In my case this was part
of a bigger function which performed a series of such calculations on an array,
and after this change it ran faster.

[Bug c/88271] New: Omit test instruction after add

2018-11-29 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88271

Bug ID: 88271
   Summary: Omit test instruction after add
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
int data[8];

void test(int k)
{
int level = 0;
int val = 1;
while (1)
{
if (val)
{
val = data[level] << 1;
++level;
}
else
{
--level;
val = data[level];
}
}
}
[/code]

This code, compiled using gcc 8.2 with options -O3 -march=skylake-avx512,
produces this:

[asm]
test(int):
mov edx, 1
xor eax, eax
.L2:
test edx, edx
je  .L3
.L6:
movsx   rdx, eax
mov edx, DWORD PTR data[0+rdx*4]
inc eax
add edx, edx
test edx, edx
jne .L6
.L3:
dec eax
movsx   rdx, eax
mov edx, DWORD PTR data[0+rdx*4]
jmp .L2
data:
.zero   32
[/asm]

I checked that the add instruction updates the CPU flags, so the test
instruction before "jne .L6" could be omitted.

[Bug tree-optimization/88153] sqrt() is not vectorized

2018-11-26 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88153

Daniel Fruzynski  changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |---

--- Comment #4 from Daniel Fruzynski  ---
I checked the man page for errno and it has the following sentence:

"Valid error numbers are all nonzero; errno is never set to zero by any system
call or library function."

This means that code like mine from Comment 0 should do the trick: it checks
for negative values for all processed values, stores status in temporary
variable, and calls sqrt(-1) once at the end if one of these values was
negative.
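
The deferred reporting scheme described above can be sketched in scalar C++ as follows. This is only an illustration of the control flow, not the SSE code from Comment 0: the per-element std::sqrt stands in for the vector instruction, the function names are hypothetical, and whether the final call really sets errno depends on the C library (see math_errhandling).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Computes dst[i] = sqrt(src[i]) and returns true if any input was negative.
// No error is reported here; the status is only accumulated.
bool sqrt_all(const double* src, double* dst, std::size_t n) {
    bool any_negative = false;
    for (std::size_t i = 0; i < n; ++i) {
        any_negative |= (src[i] < 0.0);
        dst[i] = std::sqrt(src[i]);
    }
    return any_negative;
}

// Reports the domain error once, after the whole array has been processed.
void sqrt_all_with_errno(const double* src, double* dst, std::size_t n) {
    if (sqrt_all(src, dst, n)) {
        volatile double minus_one = -1.0;  // volatile keeps the argument from being constant-folded
        (void)std::sqrt(minus_one);        // single library call made only for its error side effect
    }
}
```

Because errno is never reset to zero by library functions, one call at the end leaves it in the same state as a call made in the middle of the loop would.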

I have created a small benchmark:

[code]
#include <benchmark/benchmark.h>
#include <math.h>
#include <emmintrin.h>

#define SIZE 160

double src[SIZE];
double dest[SIZE];

static void BM_sqrt(benchmark::State& state)
{
for (auto _ : state)
{
for (int n = 0; n < SIZE; ++n)
dest[n] = sqrt(src[n]);
benchmark::ClobberMemory();
}
}
// Register the function as a benchmark
BENCHMARK(BM_sqrt);

static void BM_sse_sqrt_errno(benchmark::State& state)
{
for (auto _ : state)
{
int m = 0;
for (int n = 0; n < SIZE; n += 2)
{
__m128d v = _mm_load_pd(&src[n]);
__m128d vs = _mm_sqrt_pd(v);
__m128d vn = _mm_cmplt_pd(v, _mm_setzero_pd());
m |= _mm_movemask_pd(vn);
_mm_store_pd(&dest[n], vs);
}
if (m)
sqrt(-1.0);
benchmark::ClobberMemory();
}
}
// Register the function as a benchmark
BENCHMARK(BM_sse_sqrt_errno);

static void BM_sse_sqrt(benchmark::State& state)
{
for (auto _ : state)
{
for (int n = 0; n < SIZE; n += 2)
{
__m128d v = _mm_load_pd(&src[n]);
__m128d vs = _mm_sqrt_pd(v);
_mm_store_pd(&dest[n], vs);
}
benchmark::ClobberMemory();
}
}
// Register the function as a benchmark
BENCHMARK(BM_sse_sqrt);

BENCHMARK_MAIN();
[/code]

This code was compiled using gcc 4.8.5 with the following options:
g++ -std=c++11 -o test test.cc -O3 -I/benchmark/include/ -L/benchmark/lib/
-lbenchmark

Results for SIZE = 16 (loops unrolled):

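------------------------------------------------------------
Benchmark                  Time           CPU     Iterations
------------------------------------------------------------
BM_sqrt                   86 ns         86 ns        7188074
BM_sse_sqrt_errno         15 ns         15 ns       48084834
BM_sse_sqrt               15 ns         15 ns       47797778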

Results for SIZE = 160 (loops not unrolled):

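------------------------------------------------------------
Benchmark                  Time           CPU     Iterations
------------------------------------------------------------
BM_sqrt                  995 ns        995 ns         839866
BM_sse_sqrt_errno        156 ns        156 ns        4348870
BM_sse_sqrt              144 ns        144 ns        4549107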

As you can see, results for BM_sse_sqrt_errno are much better than BM_sqrt and
close to BM_sse_sqrt. If optimization implemented in BM_sse_sqrt_errno
satisfies error handling requirements for sqrt(), it is definitely worth
implementing in gcc.

[Bug tree-optimization/88153] sqrt() is not vectorized

2018-11-22 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88153

--- Comment #2 from Daniel Fruzynski  ---
I checked that godbolt.org uses g++ (GCC-Explorer-Build) 9.0.0 20181110
(experimental). This version does not have such patch merged.

Anyway, code compiled with -fmath-errno enabled would benefit from
vectorization if it can be done.

[Bug c/88153] New: sqrt() is not vectorized

2018-11-22 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88153

Bug ID: 88153
   Summary: sqrt() is not vectorized
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Sequence of calls to sqrt() is not vectorized.

I found Bug 21466, which claims that this was fixed in GCC 4.3, but it looks
like the change was reverted: at least in 4.4.7 it is also not vectorized. I
suspect that after that change errors were not reported correctly. Non-vectorized
code uses sqrtsd, and for negative numbers it also calls sqrt for its side
effects.

I wrote the following code snippet as a possible solution for SSE instructions.
I did not check all the details of how errors should be reported for a sequence
of sqrt calls, so it may need some changes.

#include <math.h>
#include <emmintrin.h>

#define SIZE 8
double d1[SIZE];
double d2[SIZE];

void test()
{
int m = 0;
for (int n = 0; n < SIZE; n += 2)
{
__m128d v = _mm_load_pd(&d1[n]);
__m128d vs = _mm_sqrt_pd(v);
__m128d vn = _mm_cmplt_pd(v, _mm_setzero_pd());
m |= _mm_movemask_pd(vn);
_mm_store_pd(&d2[n], vs);
}

if (m)
sqrt(-1.0);
}

[Bug middle-end/88097] Missing optimization of endian conversion

2018-11-21 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88097

--- Comment #6 from Daniel Fruzynski  ---
Thanks Joseph for info. godbolt.org now uses glibc 2.27, so no wonder that I
got results which I posted here.

[Bug middle-end/88097] Missing optimization of endian conversion

2018-11-20 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88097

Daniel Fruzynski  changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |---

--- Comment #4 from Daniel Fruzynski  ---
It looks like there is one more issue here: ntohs is implemented with inline
assembly instead of __builtin_bswap16. When I tried to use this builtin, gcc
started using the movbe instruction when compiling with -O3 -mmovbe.

uint32_t test2(Test* ip)
{
return ((__builtin_bswap16(ip->Word1) << 16) | 
__builtin_bswap16(ip->Word2));
}

test2(Test*):
movbe   ax, WORD PTR [rdi]
movbe   dx, WORD PTR [rdi+2]
sal eax, 16
movzx   edx, dx
or  eax, edx
ret

When I was logging this issue yesterday, bugzilla showed Bug 54733 as a
possible duplicate. It looks like gcc already has a similar kind of
optimization implemented. I suspect that after fixing system headers to use
__builtin_bswap* instead of inline assembly it would be possible to improve
this optimization further. I am reopening this issue.

[Bug middle-end/88097] Missing optimization of endian conversion

2018-11-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88097

--- Comment #2 from Daniel Fruzynski  ---
Please also take a look at code which performs the opposite conversion. gcc
also does not use movbe here. Neither gcc nor clang is able to optimize this
into one 32-bit store; this is another possible optimization here.

void test2(Test* s, uint32_t ip)
{
s->Word1 = htons(ip >> 16);
s->Word2 = htons(ip);
}
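
The single 32-bit store is valid because two adjacent big-endian 16-bit stores write exactly the same four bytes as one big-endian 32-bit store. The sketch below checks this byte by byte in portable C++. It writes into a byte buffer instead of the Test struct and does not call htons; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Writes a 16-bit value in network (big-endian) byte order.
void store_be16(unsigned char* p, std::uint16_t v) {
    p[0] = static_cast<unsigned char>(v >> 8);
    p[1] = static_cast<unsigned char>(v & 0xff);
}

// Two 16-bit stores: high half first, then low half.
void store_two_halves(unsigned char* p, std::uint32_t ip) {
    store_be16(p, static_cast<std::uint16_t>(ip >> 16));
    store_be16(p + 2, static_cast<std::uint16_t>(ip & 0xffff));
}

// One 32-bit big-endian store, which is what a byte-swapping store performs.
void store_be32(unsigned char* p, std::uint32_t ip) {
    p[0] = static_cast<unsigned char>(ip >> 24);
    p[1] = static_cast<unsigned char>((ip >> 16) & 0xff);
    p[2] = static_cast<unsigned char>((ip >> 8) & 0xff);
    p[3] = static_cast<unsigned char>(ip & 0xff);
}

// Returns true if both strategies produce identical bytes for this value.
bool stores_match(std::uint32_t ip) {
    unsigned char a[4];
    unsigned char b[4];
    store_two_halves(a, ip);
    store_be32(b, ip);
    return std::memcmp(a, b, 4) == 0;
}
```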

[Bug middle-end/88097] Missing optimization of endian conversion

2018-11-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88097

--- Comment #1 from Daniel Fruzynski  ---
I also tried swapping the Word1 and Word2 fields in the structure to see what
would happen. It turned out that gcc with -O3 -mmovbe generates code without
movbe:
[asm]
test(Test*):
movzx   eax, WORD PTR [rdi+2]
movzx   edx, WORD PTR [rdi]
rorw $8, ax
rorw $8, dx
sal eax, 16
movzx   edx, dx
or  eax, edx
ret
[/asm]

clang generates code with movbe:

[asm]
test(Test*):  # @test(Test*)
movbe   cx, word ptr [rdi + 2]
shl ecx, 16
movbe   ax, word ptr [rdi]
movzx   eax, ax
or  eax, ecx
ret
[/asm]

[Bug middle-end/88097] New: Missing optimization of endian conversion

2018-11-19 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88097

Bug ID: 88097
   Summary: Missing optimization of endian conversion
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I have found some old network code which looked like this:

[code]
#include <stdint.h>
#include <arpa/inet.h>

struct Test
{
uint16_t Word1;
uint16_t Word2;
};

uint32_t test(Test* ip)
{
return ((ntohs(ip->Word1) << 16) | ntohs(ip->Word2));
}
[/code]

gcc 8.2 compiles it in the following way (with -O3):

[asm]
test(Test*):
movzx   eax, WORD PTR [rdi]
movzx   edx, WORD PTR [rdi+2]
rorw $8, ax
rorw $8, dx
sal eax, 16
movzx   edx, dx
or  eax, edx
ret
[/asm]

clang 7.0.0 recognizes that both 16-bit fields are next to each other, so
32-bit byte swap can be used:

[asm]
test(Test*):  # @test(Test*)
mov eax, dword ptr [rdi]
bswap   eax
ret
[/asm]

And this is with -mmovbe added:

[asm]
test(Test*):  # @test(Test*)
movbe   eax, dword ptr [rdi]
ret
[/asm]
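
The merge that clang performs is sound because combining two adjacent big-endian 16-bit loads gives the same value as one big-endian 32-bit load. A minimal portable sketch of that identity follows. It reads from a byte buffer rather than the Test struct and spells out the byte order by hand instead of calling ntohs; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Reads a 16-bit value stored in network (big-endian) byte order.
std::uint16_t load_be16(const unsigned char* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Shape of test(): two 16-bit conversions combined with a shift and an OR.
std::uint32_t load_two_halves(const unsigned char* p) {
    return (static_cast<std::uint32_t>(load_be16(p)) << 16) | load_be16(p + 2);
}

// One 32-bit big-endian load, which is what mov + bswap (or movbe) computes.
std::uint32_t load_be32(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}
```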

[Bug c++/87731] Detection of mismatched alloc/free pairs

2018-10-24 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87731

--- Comment #3 from Daniel Fruzynski  ---
Logged Bug 87736 for new proposed attributes.

[Bug c++/87736] New: New attributes to mark custom alloc/free function pair

2018-10-24 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87736

Bug ID: 87736
   Summary: New attributes to mark custom alloc/free function pair
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

This is split from Bug 87731, as recommended by Jonathan Wakely:

Valgrind provides a set of macros which allow it to track custom alloc/free
functions. It would be nice if you added new attributes which could be attached
to custom alloc and free functions, so gcc could check pairing for them too. I
am thinking of something like this:

__attribute__((malloc("MyAllocType")))
void* MyAlloc(size_t);

__attribute__((free("MyAllocType")))
void MyFree(void*);

[Bug c++/87732] Detect and eliminate unnecessary alloc/free pairs

2018-10-24 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87732

--- Comment #1 from Daniel Fruzynski  ---
New warning for this also would be welcome.

[Bug c++/87732] New: Detect and eliminate unnecessary alloc/free pairs

2018-10-24 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87732

Bug ID: 87732
   Summary: Detect and eliminate unnecessary alloc/free pairs
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
void foo()
{
char* c = new char[4];
delete[] c;
}
[/code]

gcc with -O3 generates this:

[asm]
foo():
sub rsp, 8
mov edi, 4
call operator new[](unsigned long)
add rsp, 8
mov rdi, rax
jmp operator delete[](void*)
[/asm]

clang 7.0.0 is able to remove unnecessary alloc/free pair:

[asm]
foo():# @foo()
ret
[/asm]

Please do a similar thing in gcc too.

[Bug c++/87731] New: Detection of mismatched alloc/free pairs

2018-10-24 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87731

Bug ID: 87731
   Summary: Detection of mismatched alloc/free pairs
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

The following code compiles cleanly with gcc:

void foo()
{
char* c = new char[4];
delete c;
}

When it is compiled using clang 7.0.0, it generates the following warning.
Please do the same in gcc.

:4:5: warning: 'delete' applied to a pointer that was allocated with
'new[]'; did you mean 'delete[]'? [-Wmismatched-new-delete]
delete c;
^
  []
:3:15: note: allocated with 'new[]' here
char* c = new char[4];
  ^
1 warning generated.
Compiler returned: 0

Valgrind also has similar diagnostics. It checks the following pairs by
default: malloc/free, new/delete, new[]/delete[]. Please implement something
similar in gcc.
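
Until the compiler diagnoses such mismatches, one way to rule out the
new[]/delete mix-up is to let the type choose the deallocation form. A minimal
sketch, assuming C++11 and the array specialization of std::unique_ptr; the
function name sum_of_squares is made up for illustration:

```cpp
#include <cstddef>
#include <memory>

// std::unique_ptr<int[]> always releases its buffer with delete[],
// so a new[] allocation cannot end up paired with plain delete.
int sum_of_squares(std::size_t n)
{
    std::unique_ptr<int[]> buf(new int[n]);

    for (std::size_t i = 0; i < n; ++i)
        buf[i] = static_cast<int>(i * i);

    int total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += buf[i];

    return total;  // buf goes out of scope here and is freed with delete[]
}
```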

Valgrind also provides a set of macros that allow it to track custom alloc/free
functions. It would be nice to have new attributes that could be attached
to custom alloc and free functions, so gcc could check pairing for them too. I
am thinking of something like this:

__attribute__((malloc("MyAllocType")))
void* MyAlloc(size_t);

__attribute__((free("MyAllocType")))
void MyFree(void*);

[Bug c++/87729] New: Please include -Woverloaded-virtual in -Wall

2018-10-24 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87729

Bug ID: 87729
   Summary: Please include -Woverloaded-virtual in -Wall
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

clang includes -Woverloaded-virtual in -Wall. Please do the same for gcc.
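
For context, a minimal sketch of the kind of code this warning is aimed at. The
class and function names are made up for illustration. The derived class
declares a function with the same name as a base-class virtual function but a
different parameter type, so it hides the base function instead of overriding
it:

```cpp
struct Shape
{
    virtual ~Shape() = default;
    virtual int scale(int factor) { return factor; }
};

struct Circle : Shape
{
    // Not an override: the parameter type differs, so this declaration
    // hides Shape::scale(int). -Woverloaded-virtual reports this case.
    int scale(double factor) { return static_cast<int>(factor * 2); }
};

// Virtual dispatch still reaches Shape::scale(int), because Circle does
// not override it.
int call_through_base(Shape& s) { return s.scale(3); }

// Name lookup in Circle finds only scale(double); 3 is converted to 3.0.
int call_direct(Circle& c) { return c.scale(3); }
```

The two calls look identical at the call site but run different functions,
which is why a warning is useful here.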

[Bug web/87684] -Woverloaded-virtual is not documented

2018-10-22 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87684

--- Comment #4 from Daniel Fruzynski  ---
Thanks for the link. I tried googling for "gcc Woverloaded-virtual" and it did
not show up near the top, so I assumed that the option was undocumented.

I will open a new issue to add it to -Wall.

[Bug web/87684] -Woverloaded-virtual is not documented

2018-10-22 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87684

--- Comment #1 from Daniel Fruzynski  ---
The last paragraph should read "clang includes -Woverloaded-virtual in -Wall".
I noticed this too late to correct it.

[Bug web/87684] New: -Woverloaded-virtual is not documented

2018-10-22 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87684

Bug ID: 87684
   Summary: -Woverloaded-virtual is not documented
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: web
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

-Woverloaded-virtual is not documented at
https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html . I found this option
when clang reported this kind of issue in my code, and then I tried to use it
with gcc to get a similar warning.

BTW, clang includes -Woverloaded-virtual in -Werror. Consider doing the same
for gcc.

[Bug c/87323] New: More complicated assembly for code with custom copy constructor

2018-09-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87323

Bug ID: 87323
   Summary: More complicated assembly for code with custom copy
constructor
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <stdint.h>

typedef int32_t VInt __attribute((vector_size(32)));

class V1
{
VInt v;
public:
constexpr V1(const V1& v) = default;
//constexpr V1(const V1& v) : v(v.v) {}

constexpr V1(const VInt& v) : v(v) {}

constexpr V1 operator+(const V1& v2) const
{ return V1(v + v2.v); }

constexpr V1 operator*(const V1& v2) const
{ return V1(v * v2.v); }

constexpr operator VInt() const
{ return v; }
};

V1 test1(V1 a, V1 b, V1 c)
{
return a * c + b * c;
}
[/code]

When the code above is compiled, gcc produces the following assembly:

[out]
test1(V1, V1, V1):
  vpaddd ymm0, ymm1, ymm0
  vpmulld ymm0, ymm0, ymm2
  ret
[/out]

However, when I comment out the default copy constructor and uncomment the
custom one (which should be equivalent), the generated assembly is as follows:

[out]
test1(V1, V1, V1):
  vmovdqa ymm0, YMMWORD PTR [rcx]
  mov rax, rdi
  vpmulld ymm1, ymm0, YMMWORD PTR [rsi]
  vpmulld ymm0, ymm0, YMMWORD PTR [rdx]
  vpaddd ymm0, ymm1, ymm0
  vmovdqa YMMWORD PTR [rdi], ymm0
  vzeroupper
  ret
[/out]

[Bug c/87319] New: When vector is wrapped, expression is not optimized.

2018-09-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87319

Bug ID: 87319
   Summary: When vector is wrapped, expression is not optimized.
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

I was playing with vector extensions and intrinsics, checking whether gcc would
be able to optimize the vector expression a*c+b*c to (a+b)*c. It turned out that
this works for intrinsics (both wrapped in a class and non-wrapped) and for
vector extensions (non-wrapped only). When the built-in operators for vector
extensions were used inside a wrapper class, the code was not optimized (test3).
The code was compiled with "-O3 -mavx2 -std=c++11".

[code]
#include <stdint.h>
#include <immintrin.h>

typedef int32_t VInt __attribute((vector_size(32)));

class V1
{
VInt v;
public:
constexpr V1(const V1& v) : v(v.v) {}
constexpr V1(const VInt& v) : v(v) {}

constexpr V1 operator+(const V1& v2) const
{ return V1(v + v2.v); }

constexpr V1 operator*(const V1& v2) const
{ return V1(v * v2.v); }

constexpr operator VInt() const
{ return v; }
};

class V2
{
__m256i v;
public:
constexpr V2(const V2& v) : v(v.v) {}
constexpr V2(const __m256i& v) : v(v) {}

V2 operator+(const V2& v2) const
{ return V2(_mm256_add_epi32(v, v2.v)); }

V2 operator*(const V2& v2) const
{ return V2(_mm256_mullo_epi32(v, v2.v)); }

constexpr operator __m256i() const
{ return v; }
};

void test1(const int* a, const int* b, const int* c, int* d)
{
const VInt va = *(VInt*)a;
const VInt vb = *(VInt*)b;
const VInt vc = *(VInt*)c;
*(VInt*)d = va * vc + vb * vc;
}

void test2(const int* a, const int* b, const int* c, int* d)
{
const __m256i va = *(__m256i*)a;
const __m256i vb = *(__m256i*)b;
const __m256i vc = *(__m256i*)c;
const __m256i vd =_mm256_add_epi32(
_mm256_mullo_epi32(va, vc),
_mm256_mullo_epi32(vb, vc)
);
*(__m256i*)d = vd;
}

void test3(const int* a, const int* b, const int* c, int* d)
{
const V1 va = V1(*(VInt*)a);
const V1 vb = V1(*(VInt*)b);
const V1 vc = V1(*(VInt*)c);
*(VInt*)d = va * vc + vb * vc;
}

void test4(const int* a, const int* b, const int* c, int* d)
{
const V2 va(*(__m256i*)a);
const V2 vb(*(__m256i*)b);
const V2 vc(*(__m256i*)c);
*(__m256i*)d = va * vc + vb * vc;
}
[/code]

[out]
test1(int const*, int const*, int const*, int*):
  vmovdqa ymm0, YMMWORD PTR [rdi]
  vpaddd ymm0, ymm0, YMMWORD PTR [rsi]
  vpmulld ymm0, ymm0, YMMWORD PTR [rdx]
  vmovdqa YMMWORD PTR [rcx], ymm0
  vzeroupper
  ret
test2(int const*, int const*, int const*, int*):
  vmovdqa ymm0, YMMWORD PTR [rdi]
  vpaddd ymm0, ymm0, YMMWORD PTR [rsi]
  vpmulld ymm0, ymm0, YMMWORD PTR [rdx]
  vmovdqa YMMWORD PTR [rcx], ymm0
  vzeroupper
  ret
test3(int const*, int const*, int const*, int*):
  vmovdqa ymm0, YMMWORD PTR [rdx]
  vpmulld ymm1, ymm0, YMMWORD PTR [rdi]
  vpmulld ymm0, ymm0, YMMWORD PTR [rsi]
  vpaddd ymm0, ymm1, ymm0
  vmovdqa YMMWORD PTR [rcx], ymm0
  vzeroupper
  ret
test4(int const*, int const*, int const*, int*):
  vmovdqa ymm0, YMMWORD PTR [rdi]
  vpaddd ymm0, ymm0, YMMWORD PTR [rsi]
  vpmulld ymm0, ymm0, YMMWORD PTR [rdx]
  vmovdqa YMMWORD PTR [rcx], ymm0
  vzeroupper
  ret
[/out]
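
The rewrite that test1, test2 and test4 receive relies on the distributive
identity a*c + b*c = (a+b)*c applied lane by lane. A minimal sketch that checks
the identity for the same vector type; the helper names (mul_then_add,
add_then_mul, same_lanes) are made up for illustration, and results are written
through a reference so the sketch does not depend on how 32-byte vectors are
passed when AVX is not enabled:

```cpp
#include <stdint.h>

typedef int32_t VInt __attribute__((vector_size(32)));

// The form written in the test functions: two multiplies and one add.
void mul_then_add(const VInt& a, const VInt& b, const VInt& c, VInt& out)
{
    out = a * c + b * c;
}

// The factored form the optimizer is expected to reach: one add, one multiply.
void add_then_mul(const VInt& a, const VInt& b, const VInt& c, VInt& out)
{
    out = (a + b) * c;
}

// Compares all 8 lanes of two vectors.
bool same_lanes(const VInt& x, const VInt& y)
{
    for (int i = 0; i < 8; ++i)
        if (x[i] != y[i])
            return false;
    return true;
}
```

Small inputs are enough to confirm that both forms agree; the report is about
whether gcc performs the rewrite itself, not about the identity.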

[Bug c/87307] New: Implicit conversion from int to vector works, explicit is an error

2018-09-14 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87307

Bug ID: 87307
   Summary: Implicit conversion from int to vector works, explicit
is an error
   Product: gcc
   Version: 8.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Vector types defined using __attribute((vector_size(N))) support implicit
conversion from int to the vector type for arithmetic operators. However, when
you try to perform an explicit conversion, you will get an error. Please add
support for explicit conversion from int to a vector type; it would be handy.

typedef int VInt __attribute((vector_size(32)));

VInt test1(VInt v)
{
return v + 2; // OK
}

VInt test2(VInt v)
{
// error: can't convert a value of type 'int' to vector type 'VInt'
// {aka '__vector(8) int'} which has different size
return v + (VInt)2;
}

VInt test3(VInt v)
{
// error: cannot convert 'int' to 'VInt' {aka '__vector(8) int'} in
// initialization
VInt v2(2);
return v + v2;
}
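
Until an explicit conversion is available, the implicit one can be wrapped in a
small helper. A minimal sketch, assuming gcc's vector extensions; the helper
names splat and add_two are made up for illustration, and results are written
through a reference so the sketch does not depend on how 32-byte vectors are
passed when AVX is not enabled:

```cpp
typedef int VInt __attribute__((vector_size(32)));

// Broadcasts a scalar to every lane by adding it to a zero vector,
// which uses the implicit int-to-vector conversion that already works.
void splat(int x, VInt& out)
{
    VInt zero = {0, 0, 0, 0, 0, 0, 0, 0};
    out = zero + x;
}

// Equivalent of the rejected "v + (VInt)2", written with the helper.
void add_two(const VInt& v, VInt& out)
{
    VInt two;
    splat(2, two);
    out = v + two;
}
```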

[Bug bootstrap/58828] Problem compiling gcc 4.8.2 using gcc 4.4.6

2018-09-07 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58828

--- Comment #5 from Daniel Fruzynski  ---
(In reply to Eric Gallager from comment #4)
> (In reply to Daniel Fruzynski from comment #3)
> > OK, I found it. I used script symlink-tree (distributed with binutils) to
> > create symlinks to binutils in gcc source dir. This script removed some gcc
> > source files and replaced them with symlinks to corresponding files in
> > binutils dir. I assumed that it will help me, but it created more problems.
> > 
> > I am building gcc without binutils symlinked, and build is on stage 2 now.
> > Look that it will complete successfully.
> > 
> > I think that dedicated script to symlink all binutils into gcc dir would be
> > useful. Could you create one?
> 
> "You" as in Richard or Ian?

I meant someone from the gcc or binutils team who is responsible for this area.

BTW, gcc now ships with the contrib/download_prerequisites script, which
handles all this stuff, so you may close this issue if you want.

[Bug c++/84403] New: Possible further extension of constexpr: allow to use them as template parameters

2018-02-15 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84403

Bug ID: 84403
   Summary: Possible further extension of constexpr: allow to use
them as template parameters
   Product: gcc
   Version: 8.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Values in constexpr functions are known at compile time, so theoretically they
could be used as template parameters, as in the example below. Please consider
proposing and implementing this for a future version of the C++ standard.


#include <type_traits>

constexpr int test(int n)
{
return std::integral_constant<int, n>::value;
}
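
Current C++ rejects the function above because a function parameter is not a
constant expression inside the function body, even in a constexpr function. A
minimal sketch of the usual workaround, assuming the value can be supplied as a
template argument instead of a function argument (the name test_t is made up
for illustration):

```cpp
#include <type_traits>

// N is a template parameter, so it is a constant expression and can be
// forwarded to another template.
template <int N>
constexpr int test_t()
{
    return std::integral_constant<int, N>::value;
}

static_assert(test_t<42>() == 42, "value is available at compile time");
```

The call site changes from test(42) to test_t<42>(), which is the syntactic
cost the request above is trying to avoid.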

[Bug tree-optimization/84106] loop distribution cost-model needs work

2018-02-05 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84106

--- Comment #6 from Daniel Fruzynski <bugzi...@poradnik-webmastera.com> ---
When you revisit your cost model for loops, please also take a look at this
code. test2 has one assignment moved to a separate loop nest, and it is about
twice as fast as test1 (with gcc 4.8.5).

[code]
#include 
#include 

#define N 9

int a1[N][N];
int a2[N][N];
int a3[N][N];
uint16_t a4[N][N-1];

void test1()
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            a2[i][j] = a1[i][j];
            a3[i][j] = 1u << a1[i][j];
            if (i > 0)
                a4[j][i-1] = a3[i][j];
        }
    }
}

void test2()
{
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            a2[i][j] = a1[i][j];
            a3[i][j] = 1u << a1[i][j];
        }
    }
    for (int i = 1; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            a4[j][i-1] = a3[i][j];
        }
    }
}
[/code]

[Bug bootstrap/84199] New: Error building gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu): cannot load liblto_plugin.so

2018-02-04 Thread bugzi...@poradnik-webmastera.com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84199

Bug ID: 84199
   Summary: Error building gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu):
cannot load liblto_plugin.so
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: bootstrap
  Assignee: unassigned at gcc dot gnu.org
  Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Created attachment 43337
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43337&action=edit
Full build log

I was trying to build gcc 7.3.0 on Odroid XU4 (ARM, Ubuntu), but the build
failed with the following error:

/gcc/build/./gcc/xgcc -B/gcc/build/./gcc/
-B/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/bin/
-B/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/lib/ -isystem
/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/include -isystem
/gcc-7.3.0/armv7l-unknown-linux-gnueabihf/sys-include -O2 -g -O2 -DIN_GCC
 -W -Wall -Wno-narrowing -Wwrite-strings -Wcast-qual -Wstrict-prototypes
-Wmissing-prototypes -Wold-style-definition  -isystem ./include   -fPIC
-fno-inline -g -DIN_LIBGCC2 -fbuilding-libgcc -fno-stack-protector  -shared
-nodefaultlibs -Wl,--soname=libgcc_s.so.1 -Wl,--version-script=libgcc.map -o
./libgcc_s.so.1.tmp -g -O2 -B./ _thumb1_case_sqi_s.o _thumb1_case_uqi_s.o
_thumb1_case_shi_s.o _thumb1_case_uhi_s.o _thumb1_case_si_s.o _udivsi3_s.o
_divsi3_s.o _umodsi3_s.o _modsi3_s.o _bb_init_func_s.o _call_via_rX_s.o
_interwork_call_via_rX_s.o _lshrdi3_s.o _ashrdi3_s.o _ashldi3_s.o
_arm_negdf2_s.o _arm_addsubdf3_s.o 
[cut cut cut]
eqdf2_s.o gedf2_s.o ledf2_s.o muldf3_s.o negdf2_s.o subdf3_s.o unorddf2_s.o
fixdfsi_s.o floatsidf_s.o floatunsidf_s.o extendsfdf2_s.o truncdfsf2_s.o
enable-execute-stack_s.o unwind-arm_s.o libunwind_s.o pr-support_s.o
unwind-c_s.o emutls_s.o libgcc.a -lc && rm -f ./libgcc_s.so && if [ -f
./libgcc_s.so.1 ]; then mv -f ./libgcc_s.so.1 ./libgcc_s.so.1.backup; else
true; fi && mv ./libgcc_s.so.1.tmp ./libgcc_s.so.1 && (echo "/* GNU ld script";
echo "   Use the shared library, but some functions are only in"; echo "   the
static library.  */"; echo "GROUP ( libgcc_s.so.1 -lgcc )" ) > ./libgcc_s.so
/usr/bin/ld: /gcc/build/./gcc/liblto_plugin.so: error loading plugin:
/gcc/build/./gcc/liblto_plugin.so: cannot open shared object file: No such file
or directory
collect2: error: ld returned 1 exit status
Makefile:977: recipe for target 'libgcc_s.so' failed
make[3]: *** [libgcc_s.so] Error 1
make[3]: Leaving directory '/gcc/build/armv7l-unknown-linux-gnueabihf/libgcc'
Makefile:21293: recipe for target 'all-stage2-target-libgcc' failed
make[2]: *** [all-stage2-target-libgcc] Error 2
make[2]: Leaving directory '/gcc/build'
Makefile:26191: recipe for target 'stage2-bubble' failed
make[1]: *** [stage2-bubble] Error 2
make[1]: Leaving directory '/gcc/build'
Makefile:939: recipe for target 'all' failed
make: *** [all] Error 2


odroid@odroid-linux-1:~$ gcc --version
gcc (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.6) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

odroid@odroid-linux-1:~$ uname -a
Linux odroid-linux-1 3.10.105-138 #1 SMP PREEMPT Fri Apr 7 12:40:29 UTC 2017
armv7l armv7l armv7l GNU/Linux

odroid@odroid-linux-1:~$ cat /etc/*release*
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.3 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
odroid@odroid-linux-1:~$
