[Bug tree-optimization/80874] gcc does not emit cmov for minmax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80874 --- Comment #1 from denis.campredon at gmail dot com --- Sorry, minmax3 should not produce the same asm, since minmax returns a pair of const references. But the code is still less than optimal. Part of it might be because gcc is not able to optimize the two functions the same way:
---
struct pair {
    const int &x, y;
};

pair minmax(int x) {
    return {x, x};
}

const std::pair minmax2(int x) {
    return std::minmax(x, x);
}
--
[Bug tree-optimization/80876] [8 Regression] ICE in verify_loop_structure, at cfgloop.c:1644 (error: loop 1's latch does not have an edge to its header)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80876 Markus Trippelsdorf changed:

           What             |Removed     |Added
           Status           |UNCONFIRMED |NEW
           Last reconfirmed |            |2017-05-25
           CC               |            |trippels at gcc dot gnu.org
           Ever confirmed   |0           |1

--- Comment #1 from Markus Trippelsdorf --- Started with r247879.
[Bug debug/80877] New: Derived template class can access base class's private constexpr/const static fields
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80877

            Bug ID: 80877
           Summary: Derived template class can access base class's private
                    constexpr/const static fields
           Product: gcc
           Version: 6.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: debug
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tomasz.jankowski at nokia dot com
  Target Milestone: ---

The following example shows that in some cases GCC incorrectly grants access to a base class's private static members (both constexpr and const). The issue occurs when the derived class is a template class.

class Base
{
private:
    constexpr static int value1 {4};
    const static int value2 {5};
};

template <typename T>
class Derived : public Base
{
public:
    Derived() : x{value1 + value2} { x = value1 + value2; }

    T getX() const { return x + value1 + value2; }

private:
    T x;
};

int main()
{
    Derived<int> temp;
    return temp.getX();
}

The code was tested on a recent x86-64 Linux machine using GCC v6.2.0 and v5.3.0. The sample was compiled with the following flags: -std=c++11 -Wall -Wextra
[Bug tree-optimization/80876] New: [8 Regression] ICE in verify_loop_structure, at cfgloop.c:1644 (error: loop 1's latch does not have an edge to its header)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80876

            Bug ID: 80876
           Summary: [8 Regression] ICE in verify_loop_structure, at
                    cfgloop.c:1644 (error: loop 1's latch does not have an
                    edge to its header)
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: ice-on-valid-code
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: asolokha at gmx dot com
  Target Milestone: ---

gcc-8.0.0-alpha20170521 snapshot ICEs when compiling the following snippet w/ -O2:

int sy;

void
fo (char o5)
{
  char yh = 0;
  if (o5 == 0)
    return;
  while (o5 != 0)
    if (0)
      {
        while (yh != 0)
          {
            o5 = 0;
            while (o5 < 2)
              {
                sy &= yh;
                if (sy != 0)
                  {
                  km:
                    sy = yh;
                  }
              }
            ++yh;
          }
      }
    else
      {
        o5 = sy;
        goto km;
      }
}

void
on (void)
{
  fo (sy);
}

% x86_64-pc-linux-gnu-gcc-8.0.0-alpha20170521 -O2 -c a0nuylan.c
a0nuylan.c: In function 'fo.part.0':
a0nuylan.c:34:1: error: loop 1's latch does not have an edge to its header
 }
 ^
a0nuylan.c:34:1: internal compiler error: in verify_loop_structure, at cfgloop.c:1644
[Bug libgomp/80822] libgomp incorrect affinity when OMP_PLACES=threads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80822 --- Comment #3 from Nathan Weeks --- Setting OMP_DISPLAY_ENV=verbose results in the following output with Intel 17.0.2: OPENMP DISPLAY ENVIRONMENT BEGIN _OPENMP='201511' [host] KMP_ABORT_DELAY='0' [host] KMP_ADAPTIVE_LOCK_PROPS='1,1024' [host] KMP_ALIGN_ALLOC='64' [host] KMP_ALL_THREADPRIVATE='256' [host] KMP_ALL_THREADS='2147483647' [host] KMP_ATOMIC_MODE='2' [host] KMP_BLOCKTIME='200' [host] KMP_CPUINFO_FILE: value is not defined [host] KMP_DETERMINISTIC_REDUCTION='FALSE' [host] KMP_DISP_NUM_BUFFERS='7' [host] KMP_DUPLICATE_LIB_OK='FALSE' [host] KMP_FORCE_REDUCTION: value is not defined [host] KMP_FOREIGN_THREADS_THREADPRIVATE='TRUE' [host] KMP_FORKJOIN_BARRIER='2,2' [host] KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper' [host] KMP_FORKJOIN_FRAMES='TRUE' [host] KMP_FORKJOIN_FRAMES_MODE='3' [host] KMP_GTID_MODE='3' [host] KMP_HANDLE_SIGNALS='FALSE' [host] KMP_HOT_TEAMS_MAX_LEVEL='1' [host] KMP_HOT_TEAMS_MODE='0' [host] KMP_INIT_AT_FORK='TRUE' [host] KMP_INIT_WAIT='2048' [host] KMP_ITT_PREPARE_DELAY='0' [host] KMP_LIBRARY='throughput' [host] KMP_LOCK_KIND='queuing' [host] KMP_MALLOC_POOL_INCR='1M' [host] KMP_NEXT_WAIT='1024' [host] KMP_NUM_LOCKS_IN_BLOCK='1' [host] KMP_PLAIN_BARRIER='2,2' [host] KMP_PLAIN_BARRIER_PATTERN='hyper,hyper' [host] KMP_REDUCTION_BARRIER='1,1' [host] KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper' [host] KMP_SCHEDULE='static,balanced;guided,iterative' [host] KMP_SETTINGS='FALSE' [host] KMP_SPIN_BACKOFF_PARAMS='4096,100' [host] KMP_STACKOFFSET='64' [host] KMP_STACKPAD='0' [host] KMP_STACKSIZE='4M' [host] KMP_STORAGE_MAP='FALSE' [host] KMP_TASKING='2' [host] KMP_TASK_STEALING_CONSTRAINT='1' [host] KMP_USER_LEVEL_MWAIT='FALSE' [host] KMP_VERSION='FALSE' [host] KMP_WARNINGS='TRUE' [host] OMP_CANCELLATION='FALSE' [host] OMP_DEFAULT_DEVICE='0' [host] OMP_DISPLAY_ENV='VERBOSE' [host] OMP_DYNAMIC='FALSE' [host] OMP_MAX_ACTIVE_LEVELS='2147483647' [host] OMP_MAX_TASK_PRIORITY='0' [host] OMP_NESTED='FALSE' 
[host] OMP_NUM_THREADS='32' [host] OMP_PLACES='threads' [host] OMP_PROC_BIND='spread' [host] OMP_SCHEDULE='static' [host] OMP_STACKSIZE='4M' [host] OMP_THREAD_LIMIT='2147483647' [host] OMP_WAIT_POLICY='PASSIVE' [host] KMP_AFFINITY='noverbose,warnings,respect,granularity=thread,noduplicates,compact,0,0' OPENMP DISPLAY ENVIRONMENT END For comparison, the Cray 8.5.4 OpenMP runtime (which produces the same thread affinity as the Intel 17.0.2 OpenMP runtime in the aforementioned example) outputs the following when OMP_DISPLAY_ENV=verbose: OPENMP DISPLAY ENVIRONMENT BEGIN _OPENMP='201307' OMP_SCHEDULE='static,0' OMP_NUM_THREADS='32' OMP_DYNAMIC='TRUE' OMP_NESTED='FALSE' OMP_STACKSIZE='128MB' OMP_WAIT_POLICY='ACTIVE' OMP_MAX_ACTIVE_LEVELS='1023' OMP_THREAD_LIMIT='256' CRAY_OMP_CHECK_AFFINITY='FALSE' OMP_PROC_BIND='spread' OMP_PLACES='threads' OMP_CANCELLATION='FALSE' OMP_DISPLAY_ENV='VERBOSE' OMP_DEFAULT_DEVICE='0' CRAY_OMP_GUARD_SIZE='0B' CRAY_OMP_TASK_Q_LIMIT='256' CRAY_OMP_CONTENTION_POLICY='Automatic' OPENMP DISPLAY ENVIRONMENT END Also, in this environment, with OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close, the libgomp affinity results in both threads being pinned to different sockets: $ OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close ./xthi-omp.gnu | sort -k 4n,4n Hello from thread 0, on nid00015. (core affinity = 0) Hello from thread 1, on nid00015. (core affinity = 1) Both the Intel and Cray OpenMP runtimes pin the threads to the same physical core: $ OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close ./xthi-omp.intel | sort -k 4n,4n Hello from thread 0, on nid00015. (core affinity = 0) Hello from thread 1, on nid00015. (core affinity = 32) It does seem that the OpenMP 4.5 specification can be interpreted to support the libgomp behavior (e.g., p. 52 lines 33-38), though it at least seems counterintuitive.
[Bug rtl-optimization/79801] Disable ira.c:add_store_equivs for some targets?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79801 Alan Modra changed:

           What       |Removed     |Added
           Status     |UNCONFIRMED |RESOLVED
           Resolution |---         |WONTFIX

--- Comment #2 from Alan Modra --- Given the spec result (thanks Pat!) I think this issue can be closed.
[Bug c++/80544] result of const_cast should be cv-unqualified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80544 --- Comment #6 from Jonathan Wakely --- GCC now accepts the original testcase, and with -Wignored-qualifiers (which is included in -Wextra) prints:

q.cc: In function ‘int main()’:
q.cc:8:30: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
   f(const_cast(&i));
                              ^

I couldn't figure out how to get the caret to point to the ignored qualifier.
[Bug c/80868] "Duplicate const" warning emitted in `const typeof(foo) bar;`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80868 --- Comment #3 from George Burgess IV --- Thanks for the response! From the standpoint of consistency, I agree. My point is more that GCC isn't bound by the standard to be as strict with `typeof`, and making an exception for `typeof` here would make it easier to use in macros. I believe the gain in usability here outweighs the cost of having this inconsistency. (I also feel that this warning in general isn't useful when the only "duplicate" const has been inferred from an expression, but it seems that __auto_type has the same "duplicate const" behavior as typeof, so...)
[Bug c++/80544] result of const_cast should be cv-unqualified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80544 Jonathan Wakely changed:

           What             |Removed |Added
           Status           |NEW     |RESOLVED
           Resolution       |---     |FIXED
           Target Milestone |---     |8.0

--- Comment #5 from Jonathan Wakely --- Fixed for GCC 8.
[Bug c++/80544] result of const_cast should be cv-unqualified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80544 --- Comment #4 from Jonathan Wakely --- Author: redi Date: Wed May 24 22:16:59 2017 New Revision: 248432 URL: https://gcc.gnu.org/viewcvs?rev=248432&root=gcc&view=rev Log: PR c++/80544 strip cv-quals from cast results gcc/cp: PR c++/80544 * tree.c (reshape_init): Use unqualified type for direct enum init. * typeck.c (maybe_warn_about_cast_ignoring_quals): New. (build_static_cast_1, build_reinterpret_cast_1): Strip cv-quals from non-class destination types. (build_const_cast_1): Strip cv-quals from destination types. (build_static_cast, build_reinterpret_cast, build_const_cast) (cp_build_c_cast): Add calls to maybe_warn_about_cast_ignoring_quals. gcc/testsuite: PR c++/80544 * g++.dg/expr/cast11.C: New test. Added: trunk/gcc/testsuite/g++.dg/expr/cast11.C Modified: trunk/gcc/cp/ChangeLog trunk/gcc/cp/decl.c trunk/gcc/cp/typeck.c trunk/gcc/testsuite/ChangeLog
[Bug bootstrap/80867] gnat bootstrap broken on powerpc64le-linux-gnu with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80867 Matthias Klose changed:

           What           |Removed                        |Added
           Status         |WAITING                        |UNCONFIRMED
           CC             |                               |nicolas.boulenguez at free dot fr
           Summary        |[7 Regression] gnat bootstrap  |gnat bootstrap broken on
                          |broken on powerpc64le-linux-gnu|powerpc64le-linux-gnu with -O3
           Ever confirmed |1                              |0

--- Comment #3 from Matthias Klose --- There was no backtrace. And it is not a regression: it was not introduced by the above revision, but by building libada with -O3. Reverting to -O2 lets the build succeed.
[Bug c/80731] poor -Woverflow warnings, missing detail
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80731 Martin Sebor changed:

           What       |Removed  |Added
           Status     |ASSIGNED |RESOLVED
           Resolution |---      |FIXED

--- Comment #4 from Martin Sebor --- Implemented in r248431.
[Bug c/80731] poor -Woverflow warnings, missing detail
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80731 --- Comment #3 from Martin Sebor --- Author: msebor Date: Wed May 24 22:07:21 2017 New Revision: 248431 URL: https://gcc.gnu.org/viewcvs?rev=248431&root=gcc&view=rev Log: PR c/80731 - poor -Woverflow warnings gcc/c-family/ChangeLog: PR c/80731 * c-common.h (unsafe_conversion_p): Add a function argument. * c-common.c (unsafe_conversion_p): Same. Add type names and values to diagnostics. (scalar_to_vector): Adjust. * c-warn.c (constant_expression_error): Add a function argument. Add type names and values to diagnostics. (conversion_warning): Add a function argument. Add type names and values to diagnostics. (warnings_for_convert_and_check): Same. gcc/c/ChangeLog: PR c/80731 * c-fold.c (c_fully_fold_internal): Adjust. * c-typeck.c (parser_build_unary_op): Adjust. gcc/cp/ChangeLog: PR c/80731 * call.c (fully_fold_internal): Adjust. gcc/testsuite/ChangeLog: PR c/80731 * c-c++-common/Wfloat-conversion.c: Adjust. * c-c++-common/dfp/convert-int-saturate.c: Same. * c-c++-common/pr68657-1.c: Same. * g++.dg/ext/utf-cvt.C: Same. * g++.dg/ext/utf16-4.C: Same. * g++.dg/warn/Wconversion-real-integer-3.C: Same. * g++.dg/warn/Wconversion-real-integer2.C: Same. * g++.dg/warn/Wconversion3.C: Same. * g++.dg/warn/Wconversion4.C: Same. * g++.dg/warn/Wsign-conversion.C: Same. * g++.dg/warn/overflow-warn-1.C: Same. * g++.dg/warn/overflow-warn-3.C: Same. * g++.dg/warn/overflow-warn-4.C: Same. * g++.dg/warn/pr35635.C: Same. * g++.old-deja/g++.mike/enum1.C: Same. * gcc.dg/Wconversion-3.c: Same. * gcc.dg/Wconversion-5.c: Same. * gcc.dg/Wconversion-complex-c99.c: Same. * gcc.dg/Wconversion-complex-gnu.c: Same. * gcc.dg/Wconversion-integer.c: Same. * gcc.dg/Wsign-conversion.c: Same. * gcc.dg/bitfld-2.c: Same. * gcc.dg/c90-const-expr-11.c: Same. * gcc.dg/c90-const-expr-7.c: Same. * gcc.dg/c99-const-expr-7.c: Same. * gcc.dg/overflow-warn-1.c: Same. * gcc.dg/overflow-warn-2.c: Same. * gcc.dg/overflow-warn-3.c: Same. 
* gcc.dg/overflow-warn-4.c: Same. * gcc.dg/overflow-warn-5.c: Same. * gcc.dg/overflow-warn-8.c: Same. * gcc.dg/overflow-warn-9.c: New test. * gcc.dg/pr35635.c: Adjust. * gcc.dg/pr59940.c: Same. * gcc.dg/pr59963-2.c: Same. * gcc.dg/pr60114.c: Same. * gcc.dg/switch-warn-2.c: Same. * gcc.dg/utf-cvt.c: Same. * gcc.dg/utf16-4.c: Same. Added: trunk/gcc/testsuite/gcc.dg/overflow-warn-9.c Modified: trunk/gcc/c-family/ChangeLog trunk/gcc/c-family/c-common.c trunk/gcc/c-family/c-common.h trunk/gcc/c-family/c-warn.c trunk/gcc/c/ChangeLog trunk/gcc/c/c-fold.c trunk/gcc/c/c-typeck.c trunk/gcc/cp/ChangeLog trunk/gcc/cp/call.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/c-c++-common/Wfloat-conversion.c trunk/gcc/testsuite/c-c++-common/dfp/convert-int-saturate.c trunk/gcc/testsuite/c-c++-common/pr68657-1.c trunk/gcc/testsuite/g++.dg/ext/utf-cvt.C trunk/gcc/testsuite/g++.dg/ext/utf16-4.C trunk/gcc/testsuite/g++.dg/warn/Wconversion-real-integer-3.C trunk/gcc/testsuite/g++.dg/warn/Wconversion-real-integer2.C trunk/gcc/testsuite/g++.dg/warn/Wconversion3.C trunk/gcc/testsuite/g++.dg/warn/Wconversion4.C trunk/gcc/testsuite/g++.dg/warn/Wsign-conversion.C trunk/gcc/testsuite/g++.dg/warn/overflow-warn-1.C trunk/gcc/testsuite/g++.dg/warn/overflow-warn-3.C trunk/gcc/testsuite/g++.dg/warn/overflow-warn-4.C trunk/gcc/testsuite/g++.dg/warn/pr35635.C trunk/gcc/testsuite/g++.old-deja/g++.mike/enum1.C trunk/gcc/testsuite/gcc.dg/Wconversion-3.c trunk/gcc/testsuite/gcc.dg/Wconversion-5.c trunk/gcc/testsuite/gcc.dg/Wconversion-complex-c99.c trunk/gcc/testsuite/gcc.dg/Wconversion-complex-gnu.c trunk/gcc/testsuite/gcc.dg/Wconversion-integer.c trunk/gcc/testsuite/gcc.dg/Wsign-conversion.c trunk/gcc/testsuite/gcc.dg/bitfld-2.c trunk/gcc/testsuite/gcc.dg/c90-const-expr-11.c trunk/gcc/testsuite/gcc.dg/c90-const-expr-7.c trunk/gcc/testsuite/gcc.dg/c99-const-expr-7.c trunk/gcc/testsuite/gcc.dg/overflow-warn-1.c trunk/gcc/testsuite/gcc.dg/overflow-warn-2.c trunk/gcc/testsuite/gcc.dg/overflow-warn-3.c 
trunk/gcc/testsuite/gcc.dg/overflow-warn-4.c trunk/gcc/testsuite/gcc.dg/overflow-warn-5.c trunk/gcc/testsuite/gcc.dg/overflow-warn-8.c trunk/gcc/testsuite/gcc.dg/pr35635.c trunk/gcc/testsuite/gcc.dg/pr59940.c trunk/gcc/testsuite/gcc.dg/pr59963-2.c trunk/gcc/testsuite/gcc.dg/pr60114.c trunk/gcc/testsuite/gcc.dg/switch-warn-2.c trunk/gcc/testsuite/gcc.dg/utf-cvt.c trunk/gcc/testsuite/
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #2 from Peter Cordes --- (In reply to Richard Biener from comment #1)
> That is, it was supposed to end up using pslldq

I think you mean PSRLDQ. Byte zero is the right-most when drawn in a way that makes bit/byte shift directions all match up with the diagram. Opposite of array-initializer order.

PSRLDQ is sub-optimal without AVX. It needs a MOVDQA to copy-and-shuffle. For integer shuffles, PSHUFD is what you want (and PSHUFLW for a final step with 16-bit elements). None of the x86 copy-and-shuffle instructions can zero an element, only copy from one of the source elements.

PSHUFD can easily swap the high and low halves, but maybe some other targets can't do that as efficiently as just duplicating the high half into the low half or something. (I only really know x86 SIMD).

Ideally we could tell the back-end that the high half values are actually don't-care and can be anything, so it can choose the best shuffles to extract the high half for each successive narrowing step. I think clang does this: its shuffle-optimizer makes different choices in a function that returns __m128 vs. returning the low element of a vector as a scalar float. (e.g. for hand-coded horizontal sum using Intel _mm_ intrinsics).

---

To get truly optimal code, the backend needs more choice in what order to do the shuffles. e.g. with SSE3, the optimal sequence for FP hsum is probably:

movshdup %xmm0, %xmm1   # DCBA -> DDBB
addps    %xmm1, %xmm0   # D+D C+D B+B A+B (elements 0 and 2 are valuable)
movhlps  %xmm0, %xmm1   # Do this second so a dependency on xmm1 can't be a problem
addss    %xmm1, %xmm0   # addps saves 1 byte of code.

We could use MOVHLPS first to maintain the usual successive-narrowing pattern, but (to avoid a MOVAPS) only if we have a scratch register that was ready earlier in xmm0's dep chain (so introducing a dependency on it can't hurt the critical path).
It also needs to be holding FP values that won't cause a slowdown from denormals in the high two elements for the first addps. (NaN / infinity are ok for all current SSE hardware). Adding a number to itself is safe enough, so a shuffle that duplicates the high-half values is good. However, when auto-vectorizing an FP reduction with -fassociative-math but without the rest of -ffast-math, I guess we need to avoid spurious exceptions from values in elements that are otherwise don't-care. Swapping high/low halves is always safe, e.g. using movaps + shufps for both steps: DCBA -> BADC and get D+B C+A B+D A+C. WXYZ -> XWZY and get D+B+C+A repeated four times With AVX, the MOVAPS instructions go away, but vshufps's immediate byte still makes it 1 byte larger than vmovhlps or vunpckhpd. x86 has very few FP copy-and-shuffle instructions, so it's a trickier problem than for integer code where you can always just use PSHUFD unless tuning for SlowShuffle CPUs like first-gen Core2, or K8. With AVX, VPERMILPS with an immediate operand is pointless, except I guess with a memory source. It always needs a 3-byte VEX, but VSHUFPS can use a 2-byte VEX prefix and do the same copy+in-lane-shuffle just as fast on all CPUs (using the same register as both sources), except KNL where single-source shuffles are faster. Moreover, 3-operand AVX makes it possible to use VMOVHLPS or VUNPCKHPD as a copy+shuffle. -- > demote this to first add the two halves and then continue with the reduction > scheme. Sounds good. With x86 AVX512, it takes two successive narrowing steps to get down to 128bit vectors. Narrowing to 256b allows shorter instructions (VEX instead of EVEX). Even with 512b or 256b execution units, narrowing to 128b is about as efficient as possible. Doing the lower-latency in-lane shuffles first would let more instructions retire earlier, but only by a couple cycles. 
I don't think it makes any sense to special-case for AVX512 and do in-lane 256b ops ending with vextractf128, especially since gcc's current design does most of the reduction strategy in target-independent code. IDK if any AVX512 CPU will ever handle wide vectors as two or four 128b uops. It seems unlikely since 512b lane-crossing shuffles would have to decode to *many* uops, especially stuff like VPERMI2PS (select elements from two 512b sources). Still, AMD seems to like the half-width idea, so I wouldn't be surprised to see an AVX512 CPU with 256b execution units. Even 128b is plausible (especially for low-power / small-core CPUs like Jaguar or Silvermont).
[Bug other/80803] libgo appears to be miscompiled on powerpc64le since r247848
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80803 Bill Schmidt changed:

           What    |Removed                        |Added
           Summary |libgo appears to be miscompiled|libgo appears to be miscompiled
                   |on powerpc64le since r247923   |on powerpc64le since r247848

--- Comment #13 from Bill Schmidt --- I had a bad bisection due to revisions that broke bootstrap in between. Building just c,c++,go I was able to determine that the bug started happening with r247848, which is just the big merge of changes to the go frontend and libraries. (Unfortunately that doesn't provide any clues.) Nathan, sorry for the noise!
[Bug other/80803] libgo appears to be miscompiled on powerpc64le since r247923
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80803 --- Comment #12 from Ian Lance Taylor --- A global variable that cannot be statically initialized would be initialized by a function named "net..import", invoked before the Go main function starts. Since the net.IPv4 function is trivial, it is probably being inlined into net..import.
[Bug other/80803] libgo appears to be miscompiled on powerpc64le since r247923
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80803 --- Comment #11 from boger at us dot ibm.com --- The first failure happens in TestParseIP from ip_test.go because the "out" entries in the var parseIPTests are not initialized correctly. This causes the failures because the actual value (which is correct) doesn't match the expected value (which is incorrect).

var parseIPTests = []struct {
	in  string
	out IP
}{
	{"127.0.1.2", IPv4(127, 0, 1, 2)},
	{"127.0.0.1", IPv4(127, 0, 0, 1)},
	{"127.001.002.003", IPv4(127, 1, 2, 3)},
	{":::127.1.2.3", IPv4(127, 1, 2, 3)},
	{":::127.001.002.003", IPv4(127, 1, 2, 3)},
	{":::7f01:0203", IPv4(127, 1, 2, 3)},
	{"0:0:0:0:::127.1.2.3", IPv4(127, 1, 2, 3)},
	{"0:0:0:0:00::127.1.2.3", IPv4(127, 1, 2, 3)},
	{"0:0:0:0:::127.1.2.3", IPv4(127, 1, 2, 3)},
	.

I believe this is a static var, and the initialization of "out" is done through a call to IPv4 (which does a make), but I'm not sure where and when this initialization is supposed to occur. I tried to use gdb and set a watch where I thought it should be initialized, and it didn't trigger. Another oddity: when the entire net.test is run, the output from the call to UnmarshalText is extremely long, and that is what causes the output file to get so large; but if the test ParseIP is run by itself, it fails without generating the excessive output. In both cases, though, the initial failure is due to bad initialization of the "out" entries.
[Bug fortran/37131] inline matmul for small matrix sizes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131 Bug 37131 depends on bug 66094, which changed state.

Bug 66094 Summary: Handle transpose(A) in inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66094

           What       |Removed |Added
           Status     |NEW     |RESOLVED
           Resolution |---     |FIXED
[Bug fortran/66094] Handle transpose(A) in inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66094 Thomas Koenig changed:

           What       |Removed |Added
           Status     |NEW     |RESOLVED
           Resolution |---     |FIXED

--- Comment #11 from Thomas Koenig --- All significant use cases are handled now. Closing.
[Bug fortran/66094] Handle transpose(A) in inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66094 --- Comment #10 from Thomas Koenig --- Author: tkoenig Date: Wed May 24 18:44:35 2017 New Revision: 248425 URL: https://gcc.gnu.org/viewcvs?rev=248425&root=gcc&view=rev Log: 2017-05-24 Thomas Koenig PR fortran/66094 * frontend-passes.c (matrix_case): Add A2TB2. (inline_limit_check): Handle MATMUL(TRANSPOSE(A),B) (inline_matmul_assign): Likewise. 2017-05-24 Thomas Koenig PR fortran/66094 * gfortran.dg/inline_matmul_16.f90: New test. Added: trunk/gcc/testsuite/gfortran.dg/inline_matmul_16.f90 Modified: trunk/gcc/fortran/ChangeLog trunk/gcc/fortran/frontend-passes.c trunk/gcc/testsuite/ChangeLog
[Bug sanitizer/80875] [7/8 Regression] UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875 Marek Polacek changed:

           What     |Removed                       |Added
           Status   |NEW                           |ASSIGNED
           Assignee |unassigned at gcc dot gnu.org |mpolacek at gcc dot gnu.org

--- Comment #3 from Marek Polacek --- I'll look.
[Bug sanitizer/80875] [7/8 Regression] UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875 Marek Polacek changed:

           What             |Removed                    |Added
           Keywords         |                           |ice-on-valid-code
           Target Milestone |---                        |7.2
           Summary          |UBSAN: compile time crash  |[7/8 Regression] UBSAN:
                            |in fold_binary_loc at      |compile time crash in
                            |fold-const.c:9817          |fold_binary_loc at
                            |                           |fold-const.c:9817
[Bug sanitizer/80875] UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875 --- Comment #2 from Marek Polacek ---

commit 0123775a88c6cf1035e4633fde7823a3e9889809
Author: rguenth
Date:   Wed Oct 28 13:41:25 2015 +

    2015-10-28  Richard Biener

    	* fold-const.c (negate_expr_p): Adjust the division case to
    	properly avoid introducing undefined overflow.
    	(fold_negate_expr): Likewise.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@229484 138bc75d-0d04-0410-961f-82ee72b054a4
[Bug sanitizer/80875] UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875 Marek Polacek changed:

           What             |Removed     |Added
           Status           |UNCONFIRMED |NEW
           Last reconfirmed |            |2017-05-24
           Ever confirmed   |0           |1

--- Comment #1 from Marek Polacek --- Confirmed.
[Bug sanitizer/80875] New: UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875

            Bug ID: 80875
           Summary: UBSAN: compile time crash in fold_binary_loc at
                    fold-const.c:9817
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: sanitizer
          Assignee: unassigned at gcc dot gnu.org
          Reporter: babokin at gmail dot com
                CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
                    jakub at gcc dot gnu.org, kcc at gcc dot gnu.org,
                    marxin at gcc dot gnu.org
  Target Milestone: ---

gcc rev248384, x86_64.

> cat f.cpp
void foo() {
  ~2147483647 * (0 / 0);
}

> g++ -fsanitize=undefined -w -c f.cpp
f.cpp: In function ‘void foo()’:
f.cpp:3:1: internal compiler error: tree check: expected class ‘constant’, have ‘unary’ (negate_expr) in fold_binary_loc, at fold-const.c:9817
 }
 ^
0x10384a7 tree_class_check_failed(tree_node const*, tree_code_class, char const*, int, char const*)
        ../../gcc_svn/gcc/tree.c:9909
<...>
[Bug c++/78591] [c++1z] ICE when using decomposition identifier from closure object
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78591 --- Comment #1 from Paolo Carlini --- The released 7.1.0 doesn't ICE.
[Bug rtl-optimization/80754] [8 Regression][LRA] Invalid smull instruction generated in lra-remat
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80754 --- Comment #5 from Wilco --- Author: wilco Date: Wed May 24 17:06:55 2017 New Revision: 248424 URL: https://gcc.gnu.org/viewcvs?rev=248424&root=gcc&view=rev Log: When lra-remat rematerializes an instruction with a clobber, it checks that the clobber does not kill live registers. However it fails to check that the clobber also doesn't overlap with the destination register of the final rematerialized instruction. As a result it is possible to generate illegal instructions with the same hard register as the destination and a clobber. Fix this by also checking for overlaps with the destination register. gcc/ PR rtl-optimization/80754 * lra-remat.c (do_remat): Add overlap checks for dst_regno. Modified: trunk/gcc/ChangeLog trunk/gcc/lra-remat.c
[Bug c++/71451] [5/6/7/8 Regression] ICE on invalid C++11 code on x86_64-linux-gnu: in dependent_type_p, at cp/pt.c:22599
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71451 Paolo Carlini changed:

           What |Removed |Added
           CC   |        |paolo.carlini at oracle dot com

--- Comment #4 from Paolo Carlini --- This seems already fixed in trunk. I guess we may as well add the testcase there, and of course keep the bug open.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #19 from Thorsten Kurth --- Thank you very much. I am sorry that I do not have a simpler test case. The kernel which is executed is in the same directory as ABecLaplacian and is called MG_3D_cpp.cpp. We have seen similar problems with the fortran kernels (they are scattered across multiple files), but the fortran kernels and our C++ ports give the same performance with the original OpenMP parallelization. In any case, I wonder why the compiler honors the target region even if -march=knl is specified. Please let me know if you have further questions; I can guide you through the code. The code is big, but the relevant files are really only 2 or 3, and the relevant lines of code are not many either.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #18 from Jakub Jelinek --- Ok, I'll grab your git code and will have a look tomorrow what's going on.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #17 from Thorsten Kurth --- The result, though, is correct; I verified that both codes generate the correct output.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #16 from Thorsten Kurth --- FYI, the code is: https://github.com/zronaghi/BoxLib.git in branch cpp_kernels_openmp4dot5, and then in Src/LinearSolvers/C_CellMG the file ABecLaplacian.cpp. For example, lines 542 and 543 can be commented out and commented in, and when the test case is run you get a significant slowdown when the code is compiled with that stuff commented in. I did not map all the scalar stuff, so it might be that this is a problem. But in any case, it should not create copies of that stuff at all in my opinion. Please don't look at that code too closely right now because it is a bit convoluted; I just wanted to show that this issue appears. So when I have the target section I mentioned above commented in, running:

#!/bin/bash
export OMP_NESTED=false
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_MAX_ACTIVE_LEVELS=1

execpath="/project/projectdirs/mpccc/tkurth/Portability/BoxLib/Tutorials/MultiGrid_C"
exec=`ls -latr ${execpath}/main3d.*.MPI.OMP.ex | awk '{print $9}'`

#execute
${exec} inputs

gives the following:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size     : 1 (length unit)
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS      : -2.68882138776405e-17
Solving with BoxLib C++ solver
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs= 135.516568492921
MultiGrid: Initial residual = 135.516568492921
MultiGrid: Iteration 1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration 2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration 3 resid/bnorm = 0.000551321916982188
MultiGrid: Iteration 4 resid/bnorm = 3.55014555643671e-05
MultiGrid: Iteration 5 resid/bnorm = 2.57082340920002e-06
MultiGrid: Iteration 6 resid/bnorm = 1.90970439886018e-07
MultiGrid: Iteration 7 resid/bnorm = 1.44525222814178e-08
MultiGrid: Iteration 8 resid/bnorm = 1.10675190626368e-09
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11
, Solve time: 5.84898591041565, CG time: 0.162226438522339
Converged res < eps_rel*max(bnorm,res_norm)
Run time : 5.98936820030212
Unused ParmParse Variables:
  [TOP]::hypre.solver_flag(nvals = 1)  :: [1]
  [TOP]::hypre.pfmg_rap_type(nvals = 1)  :: [1]
  [TOP]::hypre.pfmg_relax_type(nvals = 1)  :: [2]
  [TOP]::hypre.num_pre_relax(nvals = 1)  :: [2]
  [TOP]::hypre.num_post_relax(nvals = 1)  :: [2]
  [TOP]::hypre.skip_relax(nvals = 1)  :: [1]
  [TOP]::hypre.print_level(nvals = 1)  :: [1]
done.

When I comment it out and recompile, I get:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size     : 1 (length unit)
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS      : -2.68882138776405e-17
Solving with BoxLib C++ solver
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs      = 135.516568492921
MultiGrid: Initial residual = 135.516568492921
MultiGrid: Iteration 1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration 2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration 3 resid/bnorm = 0.000551321916981978
MultiGrid: Iteration 4 resid/bnorm = 3.5501455563633e-05
MultiGrid: Iteration 5 resid/bnorm = 2.5708234090034e-06
MultiGrid: Iteration 6 resid/bnorm = 1.90970439781153e-07
MultiGrid: Iteration 7 resid/bnorm = 1.44525225042545e-08
MultiGrid: Iteration 8 resid/bnorm = 1.10675108045705e-09
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11, Solve time: 0.759385108947754, CG time: 0.14183521270752
Converged res < eps_rel*max(bnorm,res_norm)
Run time : 0.879786014556885
Unused ParmParse Variables:
[TOP]::hypre.solver_flag(nvals = 1) :: [1]
[TOP]::hypre.pfmg_rap_type(nvals = 1) :: [1]
[TOP]::hypre.pfmg_relax_type(nvals = 1) :: [2]
[TOP]::hypre.num_pre_relax(nvals = 1) :: [2]
[TOP]::hypre.num_post_relax(nvals = 1) :: [2]
[TOP]::hypre.skip_relax(nvals = 1) :: [1]
[TOP]::hypre.print_level(nvals = 1) :: [1]
done.

That is about a 7.3x slowdown. The smoothing kernel (Gauss-Seidel red-black)
is the most expensive kernel in the multigrid code, so I see the biggest
effect there. But the other kernels (prolongation, restriction, dot products
etc.) have slowdowns as well, amounting to a total of more than 10x for the
whole app.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #15 from Thorsten Kurth ---
The code I care about definitely has optimization enabled. For the Fortran
stuff it does (for example):

ftn -g -O3 -ffree-line-length-none -fno-range-check -fno-second-underscore -Jo/3d.gnu.MPI.OMP.EXE -I o/3d.gnu.MPI.OMP.EXE -fimplicit-none -fopenmp -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/LinearSolvers/F_MG/itsol.f90 -o o/3d.gnu.MPI.OMP.EXE/itsol.o
Compiling cc_mg_tower_smoother.f90 ...

and for the C++ stuff it does:

CC -g -O3 -std=c++14 -fopenmp -g -DCG_USE_OLD_CONVERGENCE_CRITERIA -DBL_OMP_FABS -DDEVID=0 -DNUM_TEAMS=1 -DNUM_THREADS_PER_BOX=1 -march=knl -DNDEBUG -DBL_USE_MPI -DBL_USE_OMP -DBL_GCC_VERSION='6.3.0' -DBL_GCC_MAJOR_VERSION=6 -DBL_GCC_MINOR_VERSION=3 -DBL_SPACEDIM=3 -DBL_FORT_USE_UNDERSCORE -DBL_Linux -DMG_USE_FBOXLIB -DBL_USE_F_BASELIB -DBL_USE_FORTRAN_MPI -DUSE_F90_SOLVERS -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/C_BaseLib/FPC.cpp -o o/3d.gnu.MPI.OMP.EXE/FPC.o
Compiling Box.cpp ...

But the kernels I care about are in C++.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #14 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #13)
> the compiler options are just -fopenmp. I am sure it does not have to do
> anything with vectorization as I compare the code runtime with and without
> the target directives and thus vectorization should be the same between
> them. The remaining OpenMP sections are the same. In our work we have not
> seen 10x because of insufficient vectorization, it is usually because of
> cache locality but that is the same for OMP 4.5 and OMP 3 because the loops
> are not touched.
> I do not specify an ISA choice, but I will try specifying KNL now and will
> tell you what the compiler is going to do.

The compiler doesn't optimize by default (i.e. the default is -O0), so if you
are measuring -O0 -fopenmp performance or code size, that is something that
is completely uninteresting. For -O0 the most important thing is compilation
speed, not quality of generated code. For runtime performance of generated
code, only -O2, -O3 or -Ofast are optimization levels that make sense.
[Bug fortran/28004] Warn if intent(out) dummy variable is used before being defined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28004

Thomas Koenig changed:

           What    |Removed             |Added
   Last reconfirmed|2007-07-03 21:06:36 |2017-05-24

--- Comment #12 from Thomas Koenig ---
Still current with current trunk.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #13 from Thorsten Kurth ---
Hello Jakub,

the compiler options are just -fopenmp. I am sure it does not have anything
to do with vectorization, as I compare the code runtime with and without the
target directives, so vectorization should be the same between them. The
remaining OpenMP sections are the same. In our work we have not seen 10x
slowdowns because of insufficient vectorization; it is usually because of
cache locality, but that is the same for OMP 4.5 and OMP 3 because the loops
are not touched.

I do not specify an ISA choice, but I will try specifying KNL now and will
tell you what the compiler is going to do.

Best
Thorsten
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #12 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #11)
> yes, you are right. I thought that map(tofrom:) is the default mapping
> but I might be wrong. In any case, teams is always 1. So this code is

Variables that aren't pointers nor scalars are still implicitly map(tofrom:),
scalars are implicitly firstprivate(), pointers are map(alloc:ptr[0:0]).

> basically just data streaming so there is no need for a detailed
> performance analysis. When I timed the code (not profiling it) the OpenMP
> 4.5 code had a tiny bit more overhead, but not significant.
> However, we might nevertheless learn from that.

What kind of compiler options do you use? -O2 -fopenmp, -O3 -fopenmp,
-Ofast -fopenmp, something different? What ISA choice? -march=native,
-mavx2, ...? The 10x slowdown could most likely be explained by the inner
loop being vectorized in one case and not the other. You aren't using
#pragma omp parallel for simd, with which you'd explicitly ask for
vectorization even at -O2 -fopenmp.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #11 from Thorsten Kurth --- Hello Jakub, yes, you are right. I thought that map(tofrom:) is the default mapping but I might be wrong. In any case, teams is always 1. So this code is basically just data streaming so there is no need for a detailed performance analysis. When I timed the code (not profiling it) the OpenMP 4.5 code had a tiny bit more overhead, but not significant. However, we might nevertheless learn from that. Best Thorsten
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #10 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #7)
> Hello Jakub,
>
> thanks for your comment but I think the parallel for is not racey. Every
> thread is working a block of i-indices so that is fine. The dotprod kernel
> is actually a kernel from the OpenMP standard documentation and I am sure
> that this is not racey.

I was not talking about the parallel for, but about the parallel I've cited.
Even if you write the same value from all threads, at least pedantically it
is racy, even when you might get away with it. Which is why you should
assign it just once, e.g. through #pragma omp master or single.

> The example with the regions you mentioned I do not see a problem with that
> either. By default, everything is shared so the variable is updated by all
> the threads/teams with the same value.

The omp target I've cited above is by default handled in OpenMP 4.0 as
#pragma omp target teams map(tofrom:num_teams)
and will work that way, although it is again pedantically racy, as multiple
teams write the same value. In OpenMP 4.5 it is
#pragma omp target teams firstprivate(num_teams)
and you will always end up with 1, even if there is an accelerator that has
say 1024 teams by default. So you really need an explicit map(from:num_teams)
or similar to get the value back. And to be pedantically correct, also
assign it only once, e.g. by doing the assignment only
if (omp_get_team_num () == 0).

> Concerning splitting distribute and parallel: I tried both combinations and
> found that they behave the same. But in the end I split it so that I could
> comment out the distribute section to see if that makes a performance
> difference (and it does).

I was just asking why you are doing it; I haven't yet analyzed the code to
see if there is something that could be easily improved.
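The fix described above — an explicit map(from:) plus assigning from a single team — can be sketched as follows. This is an illustrative snippet, not code from the testcase; `count_teams` is a made-up name, and the snippet falls back to one team when built without OpenMP support:

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Under OpenMP 4.5 a scalar on "omp target teams" is implicitly
// firstprivate, so the host would never see the update.  map(from:)
// gets the value back, and the team-0 guard avoids the (pedantic)
// race of every team writing the same variable.
int count_teams()
{
    int num_teams = 1;
#ifdef _OPENMP
    #pragma omp target teams map(from: num_teams)
    {
        if (omp_get_team_num() == 0)
            num_teams = omp_get_num_teams();
    }
#endif
    return num_teams;
}
```

On a host-only build this simply reports one team; with offloading enabled, the accelerator's team count comes back through the map clause.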
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #9 from Thorsten Kurth --- Sorry, in the second run I set the number of threads to 12. I think the code works as expected.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #8 from Thorsten Kurth ---
Here is the output of the get_num_threads section:

[tkurth@cori02 omp_3_vs_45_test]$ export OMP_NUM_THREADS=32
[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 32 threads.

and:

[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 12 threads.

I think the code is OK.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #7 from Thorsten Kurth ---
Hello Jakub,

thanks for your comment, but I think the parallel for is not racey. Every
thread is working on a block of i-indices, so that is fine. The dotprod
kernel is actually a kernel from the OpenMP standard documentation and I am
sure that it is not racey.

The example with the regions you mentioned: I do not see a problem with that
either. By default, everything is shared, so the variable is updated by all
the threads/teams with the same value. The issue is that num_teams=1 is only
true for CPU; for GPU it is OS, driver, architecture and whatever dependent.

Concerning splitting distribute and parallel: I tried both combinations and
found that they behave the same. But in the end I split it so that I could
comment out the distribute section to see if that makes a performance
difference (and it does).

I believe that the overhead instructions are responsible for the bad
performance, because that is the only thing which distinguishes the target
annotated code from the plain OpenMP code. I used VTune to look at thread
utilization and they look similar; L1 and L2 hit rates are very close (100%
vs. 99% and 92% vs. 89%) for the plain OpenMP and the target annotated code.
BUT the performance of the target annotated code can be up to 10x worse. So
I think there might be register spilling due to copying a large amount of
variables.

If you like, I can point you to the GitHub repo code (BoxLib) which clearly
exhibits this issue. This small test case only shows minor overhead of
OpenMP 4.5 vs., say, OpenMP 3, but it clearly generates some additional
overhead.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #6 from Jakub Jelinek ---
movq/pushq etc. aren't that expensive; if it affects performance, it must be
something in the inner loops.

A compiler switch that ignores omp target, teams and distribute would
basically create a new OpenMP version, because it would ignore the
requirements on those constructs. You can achieve it yourself by using those
in _Pragma in some macro and defining it conditionally based on whether you
want offloading or not; then the "you can ignore all side effects" part is
decided by you. For OpenMP 5.0 there is some work on prescriptive vs.
descriptive clauses/constructs, where in your case you could just describe
that the loop could be parallelized, simdized and/or offloaded and leave it
up to the implementation what it does with that.

What we perhaps could do when not offloading is try to simplify omp
distribute (if we know omp_get_num_teams () will always be 1), either just
by folding the library calls in that case to 1 or 0, or perhaps doing some
more.

#pragma omp target teams
{
  num_teams = omp_get_num_teams();
}
#pragma omp parallel
{
  num_threads = omp_get_num_threads();
}

in your testcase is just wrong. The target would be ok in OpenMP 4.0, but it
is not in 4.5: num_teams, being a scalar variable, is firstprivate, so you
won't get the value back. The parallel is racy; to avoid races you'd need
#pragma omp single or #pragma omp master.

Why are you using separate distribute and parallel for constructs and
prescribing what they handle, instead of just using

#pragma omp distribute parallel for
for (int i = 0; i < N; ++i)
  D[i] += B[i] * C[i];

? Do you expect or see any gains from that?
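A minimal sketch of the _Pragma-in-a-macro idea from this comment; `OMP_FOR`, `USE_OFFLOAD` and `daxpy` are made-up names, and a real offload build would also want explicit map clauses on the directive:

```cpp
#include <cassert>

// Offload build: full target construct.  Host build: plain parallel for,
// so target/teams/distribute disappear entirely at preprocessing time.
#ifdef USE_OFFLOAD
#  define OMP_FOR _Pragma("omp target teams distribute parallel for")
#else
#  define OMP_FOR _Pragma("omp parallel for")
#endif

void daxpy(int n, double *d, const double *b, const double *c)
{
    OMP_FOR
    for (int i = 0; i < n; ++i)
        d[i] += b[i] * c[i];
}
```

Either expansion computes the same result; only the directive the compiler sees changes, which is exactly the "decided by you" part above.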
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #5 from Thorsten Kurth ---
To clarify the problem: I think that the additional movq, pushq and other
instructions generated when using the target directive can cause a big hit
on performance. I understand that these instructions are necessary when
offloading is used, but when I compile for the native architecture they
should not be there. So maybe I am just missing a GNU compiler flag which
disables offloading and lets the compiler ignore the target, teams and
distribute directives at compile time while still honoring all the other
OpenMP constructs. Is there a way to do that right now, and if not, could
such a flag be added?
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #4 from Thorsten Kurth --- Created attachment 41415 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41415&action=edit Testcase This is the test case. The files ending on .as contain the assembly code with and without target region
[Bug libfortran/78379] Processor-specific versions for matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78379

--- Comment #36 from Jerry DeLisle ---
Results look very good. Gfortran 7, no patch, gives:

$ gfc7 -static -Ofast -ftree-vectorize compare.f90
$ ./a.out
 =                MEASURED GIGAFLOPS                 =
 =        Matmul     Matmul fixed     Matmul variable
  Size  Loops  explicit  refMatmul   assumed  explicit
     2   2000     4.706      0.046     0.094     0.162
     4   2000     1.246      0.246     0.305     0.351
     8   2000     1.410      0.605     0.958     1.791
    16   2000     5.413      2.787     2.228     2.615
    32   2000     4.676      3.416     4.622     4.618
    64   2000     6.368      2.652     6.339     6.167
   128   2000     8.165      2.998     8.118     8.260
   256    477     9.334      3.202     9.248     9.355
   512     59     8.730      2.239     8.596     8.730
  1024      7     8.805      1.378     8.673     8.812
  2048      1     8.781      1.728     8.649     8.789

Latest gfortran trunk with patch gives:

$ gfc -static -Ofast -ftree-vectorize compare.f90
$ ./a.out
 =                MEASURED GIGAFLOPS                 =
 =        Matmul     Matmul fixed     Matmul variable
  Size  Loops  explicit  refMatmul   assumed  explicit
     2   2000     4.738      0.048     0.092     0.172
     4   2000     1.438      0.248     0.305     0.378
     8   2000     1.511      0.617     1.177     1.955
    16   2000     5.426      2.810     1.854     2.881
    32   2000     4.688      3.314     4.357     5.091
    64   2000     6.669      2.674     6.629     7.110
   128   2000     9.139      3.000     9.076     9.131
   256    477    10.495      3.184    10.466    10.516
   512     59     9.577      2.189     9.477     9.635
  1024      7     9.593      1.381     9.519     9.658
  2048      1     9.722      1.709     9.625     9.785
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #3 from Thorsten Kurth --- Created attachment 41414 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41414&action=edit OpenMP 4.5 Testcase This is the source code
[Bug bootstrap/80843] [8 Regression] bootstrap fails in stage1 on powerpc-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80843

Segher Boessenkool changed:

           What    |Removed |Added
                 CC|        |segher at gcc dot gnu.org

--- Comment #2 from Segher Boessenkool ---
I suspect my patch for PR80860 has fixed this as well; Matthias, can you
check please?
[Bug bootstrap/80860] AIX Bootstrap failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80860

Segher Boessenkool changed:

           What    |Removed  |Added
             Status|ASSIGNED |RESOLVED
         Resolution|---      |FIXED

--- Comment #6 from Segher Boessenkool ---
Should be fixed now. Please reopen if not.
[Bug bootstrap/80843] [8 Regression] bootstrap fails in stage1 on powerpc-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80843

--- Comment #1 from Segher Boessenkool ---
Author: segher
Date: Wed May 24 14:33:11 2017
New Revision: 248421

URL: https://gcc.gnu.org/viewcvs?rev=248421&root=gcc&view=rev
Log:
rs6000: Fix for separate shrink-wrapping for fp (PR80860, PR80843)

After my r248256, rs6000_components_for_bb allocates an sbitmap of size only
32 while it can use up to 64. This patch fixes it. It moves the n_components
variable into the machine_function struct so that other hooks can use it.

	PR bootstrap/80860
	PR bootstrap/80843
	* config/rs6000/rs6000.c (struct machine_function): Add new field
	n_components.
	(rs6000_get_separate_components): Init that field, use it.
	(rs6000_components_for_bb): Use the field.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/rs6000.c
[Bug bootstrap/80860] AIX Bootstrap failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80860

--- Comment #5 from Segher Boessenkool ---
Author: segher
Date: Wed May 24 14:33:11 2017
New Revision: 248421

URL: https://gcc.gnu.org/viewcvs?rev=248421&root=gcc&view=rev
Log:
rs6000: Fix for separate shrink-wrapping for fp (PR80860, PR80843)

After my r248256, rs6000_components_for_bb allocates an sbitmap of size only
32 while it can use up to 64. This patch fixes it. It moves the n_components
variable into the machine_function struct so that other hooks can use it.

	PR bootstrap/80860
	PR bootstrap/80843
	* config/rs6000/rs6000.c (struct machine_function): Add new field
	n_components.
	(rs6000_get_separate_components): Init that field, use it.
	(rs6000_components_for_bb): Use the field.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/rs6000.c
[Bug libfortran/78379] Processor-specific versions for matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78379

--- Comment #35 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #34)
> Created attachment 41410 [details]
> Patch which has all the files
>
> Well, I suspect my way of splitting the previous patch into
> one real patch and one *.tar.gz file was not really the best way
> to go :-)
>
> Here is a patch which should include all the new files.
>
> At least it fits into the 1000 kb limit.

I am finishing a build in maintainer mode, so I will try the first approach
and, if that fails, will try the new patch. Everything looks reasonable; I
just think we should test on my AMD boxes.
[Bug tree-optimization/80874] New: gcc does not emit cmov for minmax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80874

            Bug ID: 80874
           Summary: gcc does not emit cmov for minmax
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: denis.campredon at gmail dot com
  Target Milestone: ---

Hello,

Considering the following code:

--
struct pair {
    int min, max;
};

pair minmax1(int x, int y) {
    if (x > y)
        return {y, x};
    else
        return {x, y};
}

#include <algorithm>

std::pair<int, int> minmax2(int x, int y) {
    return std::minmax(x, y);
}

auto minmax3(int x, int y) {
    return std::minmax(x, y);
}
---

I've found that for minmax1 and minmax2, gcc fails to emit cmov at -O3.
Instead it produces the following:

minmax1(int, int):
        cmp     edi, esi
        jle     .L2
        mov     eax, edi
        mov     edi, esi
        mov     esi, eax
.L2:
        mov     eax, edi
        sal     rsi, 32
        or      rax, rsi
        ret

For minmax3, the asm should be the same (I think), but it produces more
complex code.
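For comparison, a formulation that typically does compile to two independent conditional moves is to compute min and max separately instead of swapping in a branch. This is a sketch of a workaround, not part of the report, and `minmax_branchless` is an illustrative name:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// std::min and std::max usually each lower to a compare + cmov at -O3,
// and the two results have no control-flow dependence on each other.
std::pair<int, int> minmax_branchless(int x, int y)
{
    return { std::min(x, y), std::max(x, y) };
}
```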
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #12 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #4)
> MMX is also a saving in code-size: one fewer prefix byte vs. SSE2 integer
> instructions. It's also another set of 8 registers for 32-bit mode.

After touching an MMX register, the compiler needs to emit an emms insn, so
MMX moves are practically unusable as generic moves.
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #11 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #0)
> A lower-latency xmm->int strategy would be:
>
>     movd    %xmm0, %eax
>     pextrd  $1, %xmm0, %edx

The proposed patch implements the above for generic moves.

> Or without SSE4, -mtune=sandybridge (anything that excludes Nehalem and
> other CPUs where an FP shuffle has bypass delay between integer ops):
>
>     movd     %xmm0, %eax
>     movshdup %xmm0, %xmm0   # saves 1B of code-size vs. psrldq, I think.
>     movd     %xmm0, %edx
>
> Or without SSE3:
>
>     movd    %xmm0, %eax
>     psrldq  $4, %xmm0       # 1 m-op cheaper than pshufd on K8
>     movd    %xmm0, %edx

The above two proposals are not suitable for generic moves. We should not
clobber the input value, and we are not allowed to use a temporary.
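The movd/psrldq sequence quoted above can be exercised from C++ with SSE2 intrinsics. `split64` is a hypothetical helper for illustration, and the pure-integer fallback keeps the sketch portable off x86-64:

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>
#endif

// Split a 64-bit value held in an xmm register into its two 32-bit
// halves: movd for the low half, psrldq $4 + movd for the high half.
void split64(uint64_t v, uint32_t *lo, uint32_t *hi)
{
#if defined(__SSE2__) && defined(__x86_64__)
    __m128i x = _mm_cvtsi64_si128((int64_t)v);  // movq  %rdi, %xmm0
    *lo = (uint32_t)_mm_cvtsi128_si32(x);       // movd  %xmm0, %eax
    x = _mm_srli_si128(x, 4);                   // psrldq $4, %xmm0
    *hi = (uint32_t)_mm_cvtsi128_si32(x);       // movd  %xmm0, %edx
#else
    *lo = (uint32_t)v;                          // plain shifts elsewhere
    *hi = (uint32_t)(v >> 32);
#endif
}
```

Note this version destroys the value in the register, which is exactly why the comment rules the sequence out for generic moves.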
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #10 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #0)
> Scalar 64-bit integer ops in vector regs may be useful in general in 32-bit
> code in some cases, especially if it helps with register pressure.

We have a scalar-to-vector pass (-mstv) that does the above, but it chooses
not to convert the above code due to costs.
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #9 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #8)
> movq    %xmm0, (%esp)   <<-- unneeded store due to RA problem

For some reason, reload "fixes" direct DImode register moves and passes the
value via memory. Later passes partially merge these moves, but leave the
above insn.
[Bug c++/80873] ICE in tsubst_copy when trying to use an overloaded function without a definition in a lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80873 --- Comment #2 from Morris Hafner --- Created attachment 41413 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41413&action=edit Minimal example code (valid)
[Bug c++/80873] ICE in tsubst_copy when trying to use an overloaded function without a definition in a lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80873

--- Comment #1 from Morris Hafner ---
I managed to create an example that is a valid program:

struct Buffer {};

auto parse(Buffer b);

template <typename T>
void parse(T target);

template <typename T>
auto field(T target) {
    return [&] { parse(target); };
}

template <typename T>
void parse(T target) {}

auto parse(Buffer b) { field(0); }

int main() { }
[Bug other/80803] libgo appears to be miscompiled on powerpc64le since r247923
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80803

boger at us dot ibm.com changed:

           What    |Removed |Added
                 CC|        |boger at us dot ibm.com

--- Comment #10 from boger at us dot ibm.com ---
Bill, I've been out for a few days but can help debug this now that I'm back.

It looks like the test case gets the correct answer, but the string it thinks
is correct is wrong (always nil). Either the initialization of the expected
values is wrong to begin with, or they are being corrupted during the run.
We should be able to figure that out with gdb.
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #8 from Uroš Bizjak ---
The patch from comment #7 generates:

a) DImode move for 32-bit targets:

--cut here--
long long test (long long a)
{
  asm ("" : "+x" (a));
  return a;
}
--cut here--

gcc -O2 -msse4.1 -mtune=intel -mregparm=2:

        movd    %eax, %xmm0
        pinsrd  $1, %edx, %xmm0
        movq    %xmm0, (%esp)   <<-- unneeded store due to RA problem
        movd    %xmm0, %eax
        pextrd  $1, %xmm0, %edx
        leal    12(%esp), %esp

b) TImode move for 64-bit targets:

--cut here--
__int128 test (__int128 a)
{
  asm ("" : "+x" (a));
  return a;
}
--cut here--

gcc -O2 -msse4.1 -mtune=intel:

        movq    %rdi, %xmm0
        pinsrq  $1, %rsi, %xmm0
        pextrq  $1, %xmm0, %rdx
        movq    %xmm0, %rax
[Bug tree-optimization/46186] Clang creates code running 1600 times faster than gcc's
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Raphael C changed:

           What    |Removed |Added
                 CC|        |drraph at gmail dot com

--- Comment #26 from Raphael C ---
If I understood this PR correctly, this simpler code shows the same issue:

unsigned long f(unsigned long a)
{
    unsigned long sum = 0;
    for (; a < 10; a++)
        sum += a;
    return sum;
}

With gcc 7.1 and -O3 -march=native I get:

f:
        cmp     rdi, 9
        ja      .L7
        mov     eax, 9
        mov     ecx, 10
        sub     rax, rdi
        sub     rcx, rdi
        cmp     rax, 7
        jbe     .L8
        vmovq   xmm3, rdi
        mov     rdx, rcx
        vpxor   xmm0, xmm0, xmm0
        xor     eax, eax
        vpbroadcastq ymm1, xmm3
        vmovdqa ymm2, YMMWORD PTR .LC1[rip]
        vpaddq  ymm1, ymm1, YMMWORD PTR .LC0[rip]
        shr     rdx, 2
.L4:
        add     rax, 1
        vpaddq  ymm0, ymm0, ymm1
        vpaddq  ymm1, ymm1, ymm2
        cmp     rax, rdx
        jb      .L4
        vpxor   xmm1, xmm1, xmm1
        mov     rdx, rcx
        vperm2i128 ymm2, ymm0, ymm1, 33
        and     rdx, -4
        vpaddq  ymm0, ymm0, ymm2
        add     rdi, rdx
        vperm2i128 ymm1, ymm0, ymm1, 33
        vpalignr ymm1, ymm1, ymm0, 8
        vpaddq  ymm0, ymm0, ymm1
        vmovq   rax, xmm0
        cmp     rcx, rdx
        je      .L33
        vzeroupper
.L3:
        lea     rdx, [rdi+1]
        add     rax, rdi
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+2]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+3]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+4]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+5]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+6]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        add     rdi, 7
        lea     rdx, [rax+rdi]
        cmp     rdi, 10
        cmovne  rax, rdx
        ret
.L7:
        xor     eax, eax
.L31:
        ret
.L33:
        vzeroupper
        ret
.L8:
        xor     eax, eax
        jmp     .L3

However with clang I get:

f:                              # @f
        cmp     rdi, 9
        ja      .LBB0_1
        mov     eax, 9
        sub     rax, rdi
        lea     rcx, [rdi + 1]
        imul    rcx, rax
        mov     edx, 8
        sub     rdx, rdi
        mul     rdx
        shl     rdx, 63
        shr     rax
        or      rax, rdx
        add     rcx, rdi
        add     rcx, rax
        mov     rax, rcx
        ret
.LBB0_1:
        xor     ecx, ecx
        mov     rax, rcx
        ret

which is much simpler and avoids looping altogether.

What is the current status of this (very old) PR? Do people think it is
worth addressing?
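What clang emits here is just the closed form of the arithmetic series a + (a+1) + ... + 9. A scalar sketch of that transformation, checked against the original loop (`sum_loop` and `sum_closed` are illustrative names):

```cpp
#include <cassert>

// The loop clang eliminates.
unsigned long sum_loop(unsigned long a)
{
    unsigned long sum = 0;
    for (; a < 10; a++)
        sum += a;
    return sum;
}

// Closed form: the n terms from a to 9 sum to n * (first + last) / 2.
// One of n = 10 - a and (a + 9) is always even, so the division is exact.
unsigned long sum_closed(unsigned long a)
{
    if (a >= 10)
        return 0;
    unsigned long n = 10 - a;
    return n * (a + 9) / 2;
}
```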
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #7 from Uroš Bizjak --- Created attachment 41412 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41412&action=edit Prototype patch Patch that emits mov/pinsr or mov/pextr pairs for DImode (x86_32) and TImode (x86_64) moves.
[Bug tree-optimization/80844] OpenMP SIMD doesn't know how to efficiently zero a vector (its stores zeros and reloads)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844

--- Comment #5 from Richard Biener ---
(In reply to Jakub Jelinek from comment #2)
> (In reply to Richard Biener from comment #1)
> > If OMP SIMD always zeros the vector then it could also emit the maybe
> > easier to optimize
> >
> > WITH_SIZE_EXPR<_3, D.2841> = {};
>
> It doesn't always zero, it can be pretty arbitrary.

Ah, the memset gets exposed by loop distribution. Before that we have

  [5.67%]:
  # _28 = PHI <_27(13), 0(10)>
  D.2357[_28] = 0.0;
  _27 = _28 + 1;
  if (_15 > _27)
    goto ; [85.00%]
  else
    goto ; [15.00%]

  [4.82%]:
  goto ; [100.00%]

so indeed the other cases will be more "interesting". For your latest idea
to work we have to make sure the prologue/epilogue loop doesn't get unrolled
or pattern matched. I'll still look at enhancing memset folding (it's pretty
conservative in the cases it handles).

> For the default reductions on integral/floating point types it does zero
> for +/-/|/^/|| reductions, but e.g. 1 for */&&, or ~0 for &, or maximum or
> minimum for min or max. For user defined reductions it can be whatever the
> user requests, constructor for some class type, function call, set to
> arbitrary value etc. For other privatization clauses it is again something
> different (uninitialized for private/lastprivate, some other var + some
> bias for linear, ...).
> And then after the simd loop there is again a reduction or something
> similar, but again can be quite complex in the general case. If it helps,
> we could mark the pre-simd and post-simd loops somehow in the loop
> structure or something, but the actual work needs to be done later,
> especially after inlining, including the vectorizer and other passes.
> E.g. for the typical reduction where the vectorizer computes the "simd
> array" in a vector temporary (or collection of them), it would be nice if
> we were able to pattern recognize simple cases and turn those into vector
> reduction patterns.
[Bug libstdc++/80826] Compilation Time for many of std::map insertions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80826

Jan Hubicka changed:

           What    |Removed                        |Added
             Status|NEW                            |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot gnu.org

--- Comment #9 from Jan Hubicka ---
Will take a look at the cgraph issue.
[Bug c++/80864] Brace-initialization of a constexpr variable of an array in a POD triggers ICE from templates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80864

Richard Biener changed:

           What    |Removed     |Added
           Keywords|            |ice-on-valid-code
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2017-05-24
     Ever confirmed|0           |1
      Known to fail|            |7.1.0
[Bug bootstrap/80867] [7 Regression] gnat bootstrap broken on powerpc64le-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80867

Richard Biener changed:

           What    |Removed |Added
   Target Milestone|---     |7.2
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

Richard Biener changed:

           What    |Removed     |Added
           Keywords|            |missed-optimization, openmp
             Status|UNCONFIRMED |WAITING
   Last reconfirmed|            |2017-05-24
     Ever confirmed|0           |1
[Bug c++/80856] [7/8 Regression] ICE from template local overload resolution
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80856

Richard Biener changed:

           What    |Removed |Added
           Priority|P3      |P2
   Target Milestone|---     |7.2
[Bug middle-end/80853] [6/7/8 Regression] OpenMP ICE in build_outer_var_ref with array reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80853

Richard Biener changed:

           What    |Removed |Added
           Priority|P3      |P2
   Target Milestone|---     |6.4
[Bug middle-end/80823] [8 Regression] ICE: verify_flow_info failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80823

Peter Bergner changed:

           What    |Removed  |Added
             Status|RESOLVED |CLOSED

--- Comment #7 from Peter Bergner ---
Closing as fixed.
[Bug middle-end/80823] [8 Regression] ICE: verify_flow_info failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80823

Peter Bergner changed:

           What    |Removed |Added
             Status|NEW     |RESOLVED
                URL|        |https://gcc.gnu.org/ml/gcc-patches/2017-05/msg01791.html
         Resolution|---     |FIXED

--- Comment #6 from Peter Bergner ---
Fixed in revision r248408.
[Bug middle-end/80823] [8 Regression] ICE: verify_flow_info failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80823

--- Comment #5 from Peter Bergner ---
Author: bergner
Date: Wed May 24 12:10:54 2017
New Revision: 248408

URL: https://gcc.gnu.org/viewcvs?rev=248408&root=gcc&view=rev
Log:
gcc/
	PR middle-end/80823
	* tree-cfg.c (group_case_labels_stmt): Delete increment of "i";

gcc/testsuite/
	PR middle-end/80823
	* gcc.dg/pr80823.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr80823.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-cfg.c
[Bug target/80725] [7/8 Regression] s390x ICE on alsa-lib
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80725

--- Comment #5 from Andreas Krebbel ---
Author: krebbel
Date: Wed May 24 11:36:54 2017
New Revision: 248407

URL: https://gcc.gnu.org/viewcvs?rev=248407&root=gcc&view=rev
Log:
S/390: Fix PR80725.

gcc/ChangeLog:

2017-05-24  Andreas Krebbel

	PR target/80725
	* config/s390/s390.c (s390_check_qrst_address): Check incoming
	address against address_operand predicate.
	* config/s390/s390.md ("*indirect_jump"): Swap alternatives.

gcc/testsuite/ChangeLog:

2017-05-24  Andreas Krebbel

	* gcc.target/s390/pr80725.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/s390/pr80725.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/s390/s390.c
    trunk/gcc/config/s390/s390.md
    trunk/gcc/testsuite/ChangeLog
[Bug c++/80851] All versions that support C++11 are confused by combination of inherited constructors with member initializer that captures this
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80851

Richard Biener changed:

           What            |Removed     |Added
------------------------------------------------
           Keywords        |            |rejects-valid
             Status        |UNCONFIRMED |NEW
   Last reconfirmed        |            |2017-05-24
     Ever confirmed        |0           |1

--- Comment #2 from Richard Biener ---
Confirmed. clang accepts this.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

Richard Biener changed:

           What            |Removed                       |Added
------------------------------------------------------------------
             Status        |UNCONFIRMED                   |ASSIGNED
   Last reconfirmed        |                              |2017-05-24
                 CC        |                              |uros at gcc dot gnu.org
             Blocks        |                              |53947
           Assignee        |unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed        |0                             |1

--- Comment #1 from Richard Biener ---
So the vectorizer uses "whole vector shift" to do the final reduction:

  vect_sum_11.8_5 = VEC_PERM_EXPR ;
  vect_sum_11.8_20 = vect_sum_11.8_5 + vect_sum_11.6_6;
  vect_sum_11.8_19 = VEC_PERM_EXPR ;
  vect_sum_11.8_18 = vect_sum_11.8_19 + vect_sum_11.8_20;
  vect_sum_11.8_13 = VEC_PERM_EXPR ;
  vect_sum_11.8_26 = vect_sum_11.8_13 + vect_sum_11.8_18;
  stmp_sum_11.7_27 = BIT_FIELD_REF ;

I can see that for Zen this is bad (and eventually for avx256 in general,
because it crosses lanes). That is, it was supposed to end up using pslldq,
not the vperm + palign combos.

That said, the vectorizer could "easily" demote this to first add the two
halves and then continue with the reduction scheme. The GIMPLE
representation of this is BIT_FIELD_REFs, which I hope would end up being
expanded in a way the x86 backend can handle (hi/lo subregs?).

I'll see to handle this better in the vectorizer.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
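For context, the reduction epilogue above comes from an auto-vectorized integer sum; a minimal sketch of the kind of source loop that produces it (my own reconstruction, not the attached testcase) is:

```c
#include <stddef.h>

/* With -O3 -mavx2 GCC vectorizes this loop with a 256-bit accumulator;
   the final horizontal sum of that accumulator is the reduction
   epilogue discussed above.  Narrowing to 128 bits first
   (vextracti128 + paddd) avoids the cross-lane vperm sequence. */
int sum_array(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

The scalar semantics are unaffected by how the backend sequences the horizontal reduction; only the epilogue instruction choice differs.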
[Bug libstdc++/71579] type_traits miss checks for type completeness in some traits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71579 --- Comment #6 from Antony Polukhin --- C++ LWG related issue: http://cplusplus.github.io/LWG/lwg-active.html#2797
[Bug c++/80873] New: ICE in tsubst_copy when trying to use an overloaded function without a definition in a lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80873

Bug ID: 80873
Summary: ICE in tsubst_copy when trying to use an overloaded function without a definition in a lambda
Product: gcc
Version: 7.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: hafnermorris at gmail dot com
Target Milestone: ---

Created attachment 41411
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41411&action=edit
Minimal example code

The following invalid code causes an ICE:

struct S {};

auto overloaded(S &);
template <typename T> int overloaded(T &) { return 0; }

template <typename T> auto returns_lambda(T &param)
{
    return [&] { overloaded(param); };
}

int main()
{
    S s;
    returns_lambda(s);
}

On Wandbox: https://wandbox.org/permlink/bU36doHcn0MoXWrK

Only gcc versions 7.1 and up seem to be affected. No compiler flags are
required.
[Bug c++/79583] ICE (internal compiler error) upon instantiation of class template with `auto` template parameter containing inner class template
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79583 --- Comment #2 from Paolo Carlini --- The released 7.1.0, current gcc-7-branch and trunk are fine. I'm adding the testcase and closing the bug.
[Bug c/80868] "Duplicate const" warning emitted in `const typeof(foo) bar;`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80868

Marek Polacek changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |mpolacek at gcc dot gnu.org

--- Comment #2 from Marek Polacek ---
We're supposed to complain for

  const const int x;

and

  typedef const int t;
  const t x;

and I think we should thus also warn for this (-std=gnu89 -pedantic only):

  const int a;
  const __typeof(a) x;

because __typeof() doesn't strip outermost type qualifications. There were
discussions about adding __nonqual_typeof() but that hasn't been added yet.
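To illustrate the point being argued (my own sketch, not from the report): `__typeof__` carries the `const` qualifier along, so spelling `const` again in the declaration is redundant, just like the `const const` and typedef cases:

```c
/* Illustration only.  __typeof__(a) is already 'const int', so the
   extra 'const' below is redundant -- the case the comment argues
   should get the duplicate-const warning with -std=gnu89 -pedantic. */
const int a = 1;
const __typeof__(a) x = 2;   /* type is still const int */

int get_sum(void)
{
    /* "x = 3;" would be rejected here: x is const through
       __typeof__ alone, without the explicit qualifier. */
    return x + a;
}
```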
[Bug c++/68578] [5 Regression] ICE on invalid template declaration and instantiation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68578

Paolo Carlini changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |paolo.carlini at oracle dot com

--- Comment #6 from Paolo Carlini ---
I have just confirmed that we don't ICE in 6.x and 7.x. Frankly, at this
point it seems highly unlikely that we are going to fix this ICE-on-invalid
in the gcc-5-branch, thus I'm adding a testcase and in a few hours I will
resolve the bug.
[Bug c/80872] New: There is no warning on accidental infinite loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80872

Bug ID: 80872
Summary: There is no warning on accidental infinite loops
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: david at westcontrol dot com
Target Milestone: ---

Would it be possible to add warnings on accidental infinite loops, such as:

void foo(void)
{
    for (int i = 0; i <= 0x7fffffff; i++) {
        // ...
    }
}

The compiler (correctly) translates this to an infinite loop, in all
versions of gcc that I tested, in both C and C++, with optimisation
enabled. But no warning is given, even with -Wall -Wextra. That includes
the -Wtype-limits warning, which I thought should trigger here. Perhaps
the order of passes is such that the code is simplified to an infinite
loop before the type-limits checking is done?

Replacing "<= 0x7fffffff" with "< 0x80000000" triggers -Wsign-compare but
not -Wtype-limits, which is relevant because in C -Wsign-compare is in
-Wextra but not -Wall. Exceeding the limits of unsigned int in the literal
here correctly triggers -Wtype-limits.
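For reference, a loop that genuinely needs to visit every value up to its type's maximum can be written without the signed-overflow trap (my own sketch, not from the report) by testing before the increment:

```c
#include <limits.h>

/* Counts the iterations of an inclusive-bound loop.  Because the exit
   test runs before the increment, calling this with hi == INT_MAX is
   well defined, unlike "for (int i = 0; i <= INT_MAX; i++)", which
   increments past INT_MAX and becomes the accidental infinite loop
   described in the report. */
long count_inclusive(int lo, int hi)
{
    long n = 0;
    for (int i = lo; ; i++) {
        n++;
        if (i == hi)      /* exit before incrementing past hi */
            break;
    }
    return n;
}
```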
[Bug c++/80396] New builtin to make std::make_integer_sequence efficient and scalable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80396

Christophe Lyon changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |clyon at gcc dot gnu.org

--- Comment #3 from Christophe Lyon ---
Hi Jason,

One of the new tests (integer-pack2.C) fails on arm* targets. The log says:

Excess errors:
/testsuite/g++.dg/ext/integer-pack2.C:10:48: error: overflow in constant expression [-fpermissive]
/testsuite/g++.dg/ext/integer-pack2.C:10:48: error: overflow in constant expression [-fpermissive]
[Bug c++/80812] [8 Regression] ICE: in build_value_init_noctor, at cp/init.c:483
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80812

Ville Voutilainen changed:

           What            |Removed                       |Added
------------------------------------------------------------------
             Status        |NEW                           |ASSIGNED
                 CC        |                              |ville.voutilainen at gmail dot com
           Assignee        |unassigned at gcc dot gnu.org |ville.voutilainen at gmail dot com

--- Comment #3 from Ville Voutilainen ---
Mine.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #2 from Jakub Jelinek ---
Also, even for host fallback there is a separate set of ICVs and many other
properties; the target region can't just be ignored, for many reasons, even
if there is no data sharing. Of course, if you provide small testcases, we
can discuss in detail what can and can't be optimized.
[Bug tree-optimization/80844] OpenMP SIMD doesn't know how to efficiently zero a vector (its stores zeros and reloads)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844 --- Comment #4 from Jakub Jelinek --- What we should do is first vectorize the main simd loop and then, once we've determined the vectorization factor thereof etc., see if there is any related preparation and finalization loop around it and try to vectorize those with the same vectorization factor.
[Bug bootstrap/80867] [7 Regression] gnat bootstrap broken on powerpc64le-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80867

Eric Botcazou changed:

           What            |Removed     |Added
------------------------------------------------
             Status        |UNCONFIRMED |WAITING
   Last reconfirmed        |            |2017-05-24
     Ever confirmed        |0           |1

--- Comment #1 from Eric Botcazou ---
Do you build only powerpc64le with "-gnatn -g -O3", or other platforms too?
If the latter, do they still build correctly at the same revision? I'm
trying to find out whether this was introduced by the only Ada patch in the
range:

2017-05-22  Eric Botcazou

	* gcc-interface/decl.c (gnat_to_gnu_entity): Skip regular
	processing for Itypes that are E_Access_Subtype.
	: Use the DECL of the base type directly.

--- Comment #2 from Eric Botcazou ---
Btw, can you post a backtrace?
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

Jakub Jelinek changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek ---
I don't see any attachments.

The target directives can't be ignored; some clauses have a significant
role even when doing host fallback (data sharing). Plus the question is
what you mean by "no offloading": a compiler configured without any
offloading support, a compiler configured with offloading support but with
offloading not selected, or a compiler with offloading support and
offloading code generated, but deciding at runtime that it has to use host
fallback? In the second case, e.g., the decision whether offloading will be
supported is made not at compile time but at link time, where one can
choose whether to emit offloading code and for which subset of the
configured offloading targets.
[Bug tree-optimization/78969] bogus snprintf truncation warning due to missing range info
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78969 --- Comment #8 from Jakub Jelinek --- idx_10 addition is a consequence of TODO_update_ssa in vrp1's todo_flags, triggered by jump threading creating the bb6.
[Bug tree-optimization/78969] bogus snprintf truncation warning due to missing range info
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78969

Jakub Jelinek changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek ---
This is because the PHI at that point is only created during CFG changes at
the end of the VRP1 pass. After creating ASSERT_EXPRs, we still have:

  [1.00%]:
  goto ; [100.00%]

  [99.00%]:
  # RANGE [0, 1000] NONZERO 1023
  idx_7 = ASSERT_EXPR ;
  __builtin_snprintf (p_4(D), 4, "%d", idx_7);
  # RANGE [1, 1000] NONZERO 1023
  idx_6 = idx_7 + 1;

  [100.00%]:
  # RANGE [0, 1000] NONZERO 1023
  # idx_1 = PHI <0(2), idx_6(3)>
  if (idx_1 <= 999)
    goto ; [99.00%]
  else
    goto ; [1.00%]

  [1.00%]:
  return;

Then VRP1 correctly determines:

  idx_1: [0, 1000]
  .MEM_2: VARYING
  idx_6: [1, 1000]
  idx_7: [0, 999]

  EQUIVALENCES: { idx_1 } (1 elements)

Then the ASSERT_EXPRs are removed, which means that wherever idx_7 was used
we now use idx_1, and finally the loop is changed and idx_8 and idx_10 are
created:

  [1.00%]:
  goto ; [100.00%]

  [99.00%]:
  # RANGE [0, 1000] NONZERO 1023
  # idx_10 = PHI
  __builtin_snprintf (p_4(D), 4, "%d", idx_10);
  # RANGE [1, 1000] NONZERO 1023
  idx_6 = idx_10 + 1;

  [99.00%]:
  # RANGE [0, 1000] NONZERO 1023
  # idx_1 = PHI
  if (idx_1 != 1000)
    goto ; [98.99%]
  else
    goto ; [1.01%]

  [1.00%]:
  return;

  [1.00%]:
  # RANGE [0, 1000] NONZERO 1023
  # idx_8 = PHI <0(2)>
  goto ; [100.00%]

But there is no obvious connection between idx_10 (which indeed could
nicely hold the [0, 999] range) and idx_7 (the already removed
ASSERT_EXPR); similarly, idx_8 could very well have RANGE [0, 0] but
doesn't (though in that case it isn't that big a deal, because it is going
to be removed immediately afterwards).
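The IL above appears to come from a loop of roughly this shape (a guess reconstructed from the dump, not copied from the PR):

```c
#include <stdio.h>

/* Hypothetical reconstruction: idx stays in [0, 999], so "%d" prints
   at most 3 digits and the 4-byte destination is never truncated.
   The truncation warning the PR calls bogus fires anyway, because the
   [0, 999] range on the snprintf argument is lost when the
   ASSERT_EXPRs are removed and the loop is restructured, as the
   comment explains. */
void fill(char *p)
{
    for (int idx = 0; idx < 1000; idx++)
        snprintf(p, 4, "%d", idx);
}
```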