[Bug tree-optimization/80874] gcc does not emit cmov for minmax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80874 --- Comment #1 from denis.campredon at gmail dot com --- Sorry, minmax3 should not produce the same asm, since minmax returns a pair of const references. But the code is still less than optimal. Part of it might be because gcc is not able to optimize the two functions the same way:
---
struct pair {
    const int &x, y;
};

pair minmax(int x) {
    return {x, x};
}

const std::pair minmax2(int x) {
    return std::minmax(x, x);
}
--
[Bug tree-optimization/80876] [8 Regression] ICE in verify_loop_structure, at cfgloop.c:1644 (error: loop 1's latch does not have an edge to its header)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80876 Markus Trippelsdorf changed:

           What             |Removed     |Added
           Status           |UNCONFIRMED |NEW
           Last reconfirmed |            |2017-05-25
           CC               |            |trippels at gcc dot gnu.org
           Ever confirmed   |0           |1

--- Comment #1 from Markus Trippelsdorf --- Started with r247879.
[Bug debug/80877] New: Derived template class can access base class's private constexpr/const static fields
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80877

            Bug ID: 80877
           Summary: Derived template class can access base class's private
                    constexpr/const static fields
           Product: gcc
           Version: 6.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: debug
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tomasz.jankowski at nokia dot com
  Target Milestone: ---

The following example shows that in some cases GCC incorrectly grants access to a base class's private static members (both constexpr and const). The issue occurs when the derived class is a template class.

class Base
{
private:
    constexpr static int value1 {4};
    const static int value2 {5};
};

template <typename T>
class Derived : public Base
{
public:
    Derived() : x{value1 + value2} { x = value1 + value2; }

    T getX() const { return x + value1 + value2; }

private:
    T x;
};

int main()
{
    Derived<int> temp;
    return temp.getX();
}

The code was tested on a recent x86-64 Linux machine using GCC v6.2.0 and v5.3.0. The sample was compiled with the following flags: -std=c++11 -Wall -Wextra
[Bug tree-optimization/80876] New: [8 Regression] ICE in verify_loop_structure, at cfgloop.c:1644 (error: loop 1's latch does not have an edge to its header)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80876

            Bug ID: 80876
           Summary: [8 Regression] ICE in verify_loop_structure, at
                    cfgloop.c:1644 (error: loop 1's latch does not have an
                    edge to its header)
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: ice-on-valid-code
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: asolokha at gmx dot com
  Target Milestone: ---

gcc-8.0.0-alpha20170521 snapshot ICEs when compiling the following snippet w/ -O2:

int sy;

void
fo (char o5)
{
  char yh = 0;
  if (o5 == 0)
    return;
  while (o5 != 0)
    if (0)
      {
        while (yh != 0)
          {
            o5 = 0;
            while (o5 < 2)
              {
                sy &= yh;
                if (sy != 0)
                  {
                  km:
                    sy = yh;
                  }
              }
            ++yh;
          }
      }
    else
      {
        o5 = sy;
        goto km;
      }
}

void
on (void)
{
  fo (sy);
}

% x86_64-pc-linux-gnu-gcc-8.0.0-alpha20170521 -O2 -c a0nuylan.c
a0nuylan.c: In function 'fo.part.0':
a0nuylan.c:34:1: error: loop 1's latch does not have an edge to its header
 }
 ^
a0nuylan.c:34:1: internal compiler error: in verify_loop_structure, at cfgloop.c:1644
[Bug libgomp/80822] libgomp incorrect affinity when OMP_PLACES=threads
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80822 --- Comment #3 from Nathan Weeks --- Setting OMP_DISPLAY_ENV=verbose results in the following output with Intel 17.0.2: OPENMP DISPLAY ENVIRONMENT BEGIN _OPENMP='201511' [host] KMP_ABORT_DELAY='0' [host] KMP_ADAPTIVE_LOCK_PROPS='1,1024' [host] KMP_ALIGN_ALLOC='64' [host] KMP_ALL_THREADPRIVATE='256' [host] KMP_ALL_THREADS='2147483647' [host] KMP_ATOMIC_MODE='2' [host] KMP_BLOCKTIME='200' [host] KMP_CPUINFO_FILE: value is not defined [host] KMP_DETERMINISTIC_REDUCTION='FALSE' [host] KMP_DISP_NUM_BUFFERS='7' [host] KMP_DUPLICATE_LIB_OK='FALSE' [host] KMP_FORCE_REDUCTION: value is not defined [host] KMP_FOREIGN_THREADS_THREADPRIVATE='TRUE' [host] KMP_FORKJOIN_BARRIER='2,2' [host] KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper' [host] KMP_FORKJOIN_FRAMES='TRUE' [host] KMP_FORKJOIN_FRAMES_MODE='3' [host] KMP_GTID_MODE='3' [host] KMP_HANDLE_SIGNALS='FALSE' [host] KMP_HOT_TEAMS_MAX_LEVEL='1' [host] KMP_HOT_TEAMS_MODE='0' [host] KMP_INIT_AT_FORK='TRUE' [host] KMP_INIT_WAIT='2048' [host] KMP_ITT_PREPARE_DELAY='0' [host] KMP_LIBRARY='throughput' [host] KMP_LOCK_KIND='queuing' [host] KMP_MALLOC_POOL_INCR='1M' [host] KMP_NEXT_WAIT='1024' [host] KMP_NUM_LOCKS_IN_BLOCK='1' [host] KMP_PLAIN_BARRIER='2,2' [host] KMP_PLAIN_BARRIER_PATTERN='hyper,hyper' [host] KMP_REDUCTION_BARRIER='1,1' [host] KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper' [host] KMP_SCHEDULE='static,balanced;guided,iterative' [host] KMP_SETTINGS='FALSE' [host] KMP_SPIN_BACKOFF_PARAMS='4096,100' [host] KMP_STACKOFFSET='64' [host] KMP_STACKPAD='0' [host] KMP_STACKSIZE='4M' [host] KMP_STORAGE_MAP='FALSE' [host] KMP_TASKING='2' [host] KMP_TASK_STEALING_CONSTRAINT='1' [host] KMP_USER_LEVEL_MWAIT='FALSE' [host] KMP_VERSION='FALSE' [host] KMP_WARNINGS='TRUE' [host] OMP_CANCELLATION='FALSE' [host] OMP_DEFAULT_DEVICE='0' [host] OMP_DISPLAY_ENV='VERBOSE' [host] OMP_DYNAMIC='FALSE' [host] OMP_MAX_ACTIVE_LEVELS='2147483647' [host] OMP_MAX_TASK_PRIORITY='0' [host] OMP_NESTED='FALSE' 
[host] OMP_NUM_THREADS='32' [host] OMP_PLACES='threads' [host] OMP_PROC_BIND='spread' [host] OMP_SCHEDULE='static' [host] OMP_STACKSIZE='4M' [host] OMP_THREAD_LIMIT='2147483647' [host] OMP_WAIT_POLICY='PASSIVE' [host] KMP_AFFINITY='noverbose,warnings,respect,granularity=thread,noduplicates,compact,0,0' OPENMP DISPLAY ENVIRONMENT END For comparison, the Cray 8.5.4 OpenMP runtime (which produces the same thread affinity as the Intel 17.0.2 OpenMP runtime in the aforementioned example) outputs the following when OMP_DISPLAY_ENV=verbose: OPENMP DISPLAY ENVIRONMENT BEGIN _OPENMP='201307' OMP_SCHEDULE='static,0' OMP_NUM_THREADS='32' OMP_DYNAMIC='TRUE' OMP_NESTED='FALSE' OMP_STACKSIZE='128MB' OMP_WAIT_POLICY='ACTIVE' OMP_MAX_ACTIVE_LEVELS='1023' OMP_THREAD_LIMIT='256' CRAY_OMP_CHECK_AFFINITY='FALSE' OMP_PROC_BIND='spread' OMP_PLACES='threads' OMP_CANCELLATION='FALSE' OMP_DISPLAY_ENV='VERBOSE' OMP_DEFAULT_DEVICE='0' CRAY_OMP_GUARD_SIZE='0B' CRAY_OMP_TASK_Q_LIMIT='256' CRAY_OMP_CONTENTION_POLICY='Automatic' OPENMP DISPLAY ENVIRONMENT END Also, in this environment, with OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close, the libgomp affinity results in both threads being pinned to different sockets: $ OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close ./xthi-omp.gnu | sort -k 4n,4n Hello from thread 0, on nid00015. (core affinity = 0) Hello from thread 1, on nid00015. (core affinity = 1) Both the Intel and Cray OpenMP runtimes pin the threads to the same physical core: $ OMP_NUM_THREADS=2 OMP_PLACES=threads OMP_PROC_BIND=close ./xthi-omp.intel | sort -k 4n,4n Hello from thread 0, on nid00015. (core affinity = 0) Hello from thread 1, on nid00015. (core affinity = 32) It does seem that the OpenMP 4.5 specification can be interpreted to support the libgomp behavior (e.g., p. 52 lines 33-38), though it at least seems counterintuitive.
[Bug rtl-optimization/79801] Disable ira.c:add_store_equivs for some targets?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79801 Alan Modra changed:

           What       |Removed     |Added
           Status     |UNCONFIRMED |RESOLVED
           Resolution |---         |WONTFIX

--- Comment #2 from Alan Modra --- Given the spec result (thanks Pat!) I think this issue can be closed.
[Bug c++/80544] result of const_cast should be cv-unqualified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80544 --- Comment #6 from Jonathan Wakely --- GCC now accepts the original testcase, and with -Wignored-qualifiers (which is included in -Wextra) prints:

q.cc: In function ‘int main()’:
q.cc:8:30: warning: type qualifiers ignored on cast result type [-Wignored-qualifiers]
   f(const_cast(&i));
                              ^

I couldn't figure out how to get the caret to point to the ignored qualifier.
[Bug c/80868] "Duplicate const" warning emitted in `const typeof(foo) bar;`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80868 --- Comment #3 from George Burgess IV --- Thanks for the response! From the standpoint of consistency, I agree. My point is more that GCC isn't bound by the standard to be as strict with `typeof`, and making an exception for `typeof` here would make it easier to use in macros. I believe the gain in usability here outweighs the cost of having this inconsistency. (I also feel that this warning in general isn't useful when the only "duplicate" const has been inferred from an expression, but it seems that __auto_type has the same "duplicate const" behavior as typeof, so...)
[Bug c++/80544] result of const_cast should be cv-unqualified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80544 Jonathan Wakely changed:

           What             |Removed |Added
           Status           |NEW     |RESOLVED
           Resolution       |---     |FIXED
           Target Milestone |---     |8.0

--- Comment #5 from Jonathan Wakely --- Fixed for GCC 8.
[Bug c++/80544] result of const_cast should be cv-unqualified
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80544 --- Comment #4 from Jonathan Wakely --- Author: redi Date: Wed May 24 22:16:59 2017 New Revision: 248432 URL: https://gcc.gnu.org/viewcvs?rev=248432&root=gcc&view=rev Log: PR c++/80544 strip cv-quals from cast results gcc/cp: PR c++/80544 * tree.c (reshape_init): Use unqualified type for direct enum init. * typeck.c (maybe_warn_about_cast_ignoring_quals): New. (build_static_cast_1, build_reinterpret_cast_1): Strip cv-quals from non-class destination types. (build_const_cast_1): Strip cv-quals from destination types. (build_static_cast, build_reinterpret_cast, build_const_cast) (cp_build_c_cast): Add calls to maybe_warn_about_cast_ignoring_quals. gcc/testsuite: PR c++/80544 * g++.dg/expr/cast11.C: New test. Added: trunk/gcc/testsuite/g++.dg/expr/cast11.C Modified: trunk/gcc/cp/ChangeLog trunk/gcc/cp/decl.c trunk/gcc/cp/typeck.c trunk/gcc/testsuite/ChangeLog
[Bug bootstrap/80867] gnat bootstrap broken on powerpc64le-linux-gnu with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80867 Matthias Klose changed:

           What           |Removed                        |Added
           Status         |WAITING                        |UNCONFIRMED
           CC             |                               |nicolas.boulenguez at free dot fr
           Summary        |[7 Regression] gnat bootstrap  |gnat bootstrap broken on
                          |broken on powerpc64le-linux-gnu|powerpc64le-linux-gnu with -O3
           Ever confirmed |1                              |0

--- Comment #3 from Matthias Klose --- There was no backtrace. And it is not a regression: it was not introduced by the above revision, but by building libada with -O3. Reverting to -O2 lets the build succeed.
[Bug c/80731] poor -Woverflow warnings, missing detail
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80731 Martin Sebor changed:

           What       |Removed  |Added
           Status     |ASSIGNED |RESOLVED
           Resolution |---      |FIXED

--- Comment #4 from Martin Sebor --- Implemented in r248431.
[Bug c/80731] poor -Woverflow warnings, missing detail
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80731 --- Comment #3 from Martin Sebor --- Author: msebor Date: Wed May 24 22:07:21 2017 New Revision: 248431 URL: https://gcc.gnu.org/viewcvs?rev=248431&root=gcc&view=rev Log: PR c/80731 - poor -Woverflow warnings gcc/c-family/ChangeLog: PR c/80731 * c-common.h (unsafe_conversion_p): Add a function argument. * c-common.c (unsafe_conversion_p): Same. Add type names and values to diagnostics. (scalar_to_vector): Adjust. * c-warn.c (constant_expression_error): Add a function argument. Add type names and values to diagnostics. (conversion_warning): Add a function argument. Add type names and values to diagnostics. (warnings_for_convert_and_check): Same. gcc/c/ChangeLog: PR c/80731 * c-fold.c (c_fully_fold_internal): Adjust. * c-typeck.c (parser_build_unary_op): Adjust. gcc/cp/ChangeLog: PR c/80731 * call.c (fully_fold_internal): Adjust. gcc/testsuite/ChangeLog: PR c/80731 * c-c++-common/Wfloat-conversion.c: Adjust. * c-c++-common/dfp/convert-int-saturate.c: Same. * c-c++-common/pr68657-1.c: Same. * g++.dg/ext/utf-cvt.C: Same. * g++.dg/ext/utf16-4.C: Same. * g++.dg/warn/Wconversion-real-integer-3.C: Same. * g++.dg/warn/Wconversion-real-integer2.C: Same. * g++.dg/warn/Wconversion3.C: Same. * g++.dg/warn/Wconversion4.C: Same. * g++.dg/warn/Wsign-conversion.C: Same. * g++.dg/warn/overflow-warn-1.C: Same. * g++.dg/warn/overflow-warn-3.C: Same. * g++.dg/warn/overflow-warn-4.C: Same. * g++.dg/warn/pr35635.C: Same. * g++.old-deja/g++.mike/enum1.C: Same. * gcc.dg/Wconversion-3.c: Same. * gcc.dg/Wconversion-5.c: Same. * gcc.dg/Wconversion-complex-c99.c: Same. * gcc.dg/Wconversion-complex-gnu.c: Same. * gcc.dg/Wconversion-integer.c: Same. * gcc.dg/Wsign-conversion.c: Same. * gcc.dg/bitfld-2.c: Same. * gcc.dg/c90-const-expr-11.c: Same. * gcc.dg/c90-const-expr-7.c: Same. * gcc.dg/c99-const-expr-7.c: Same. * gcc.dg/overflow-warn-1.c: Same. * gcc.dg/overflow-warn-2.c: Same. * gcc.dg/overflow-warn-3.c: Same. 
* gcc.dg/overflow-warn-4.c: Same. * gcc.dg/overflow-warn-5.c: Same. * gcc.dg/overflow-warn-8.c: Same. * gcc.dg/overflow-warn-9.c: New test. * gcc.dg/pr35635.c: Adjust. * gcc.dg/pr59940.c: Same. * gcc.dg/pr59963-2.c: Same. * gcc.dg/pr60114.c: Same. * gcc.dg/switch-warn-2.c: Same. * gcc.dg/utf-cvt.c: Same. * gcc.dg/utf16-4.c: Same. Added: trunk/gcc/testsuite/gcc.dg/overflow-warn-9.c Modified: trunk/gcc/c-family/ChangeLog trunk/gcc/c-family/c-common.c trunk/gcc/c-family/c-common.h trunk/gcc/c-family/c-warn.c trunk/gcc/c/ChangeLog trunk/gcc/c/c-fold.c trunk/gcc/c/c-typeck.c trunk/gcc/cp/ChangeLog trunk/gcc/cp/call.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/c-c++-common/Wfloat-conversion.c trunk/gcc/testsuite/c-c++-common/dfp/convert-int-saturate.c trunk/gcc/testsuite/c-c++-common/pr68657-1.c trunk/gcc/testsuite/g++.dg/ext/utf-cvt.C trunk/gcc/testsuite/g++.dg/ext/utf16-4.C trunk/gcc/testsuite/g++.dg/warn/Wconversion-real-integer-3.C trunk/gcc/testsuite/g++.dg/warn/Wconversion-real-integer2.C trunk/gcc/testsuite/g++.dg/warn/Wconversion3.C trunk/gcc/testsuite/g++.dg/warn/Wconversion4.C trunk/gcc/testsuite/g++.dg/warn/Wsign-conversion.C trunk/gcc/testsuite/g++.dg/warn/overflow-warn-1.C trunk/gcc/testsuite/g++.dg/warn/overflow-warn-3.C trunk/gcc/testsuite/g++.dg/warn/overflow-warn-4.C trunk/gcc/testsuite/g++.dg/warn/pr35635.C trunk/gcc/testsuite/g++.old-deja/g++.mike/enum1.C trunk/gcc/testsuite/gcc.dg/Wconversion-3.c trunk/gcc/testsuite/gcc.dg/Wconversion-5.c trunk/gcc/testsuite/gcc.dg/Wconversion-complex-c99.c trunk/gcc/testsuite/gcc.dg/Wconversion-complex-gnu.c trunk/gcc/testsuite/gcc.dg/Wconversion-integer.c trunk/gcc/testsuite/gcc.dg/Wsign-conversion.c trunk/gcc/testsuite/gcc.dg/bitfld-2.c trunk/gcc/testsuite/gcc.dg/c90-const-expr-11.c trunk/gcc/testsuite/gcc.dg/c90-const-expr-7.c trunk/gcc/testsuite/gcc.dg/c99-const-expr-7.c trunk/gcc/testsuite/gcc.dg/overflow-warn-1.c trunk/gcc/testsuite/gcc.dg/overflow-warn-2.c trunk/gcc/testsuite/gcc.dg/overflow-warn-3.c 
trunk/gcc/testsuite/gcc.dg/overflow-warn-4.c trunk/gcc/testsuite/gcc.dg/overflow-warn-5.c trunk/gcc/testsuite/gcc.dg/overflow-warn-8.c trunk/gcc/testsuite/gcc.dg/pr35635.c trunk/gcc/testsuite/gcc.dg/pr59940.c trunk/gcc/testsuite/gcc.dg/pr59963-2.c trunk/gcc/testsuite/gcc.dg/pr60114.c trunk/gcc/testsuite/gcc.dg/switch-warn-2.c trunk/gcc/testsuite/gcc.dg/utf-cvt.c trunk/gcc/testsuite/
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #2 from Peter Cordes --- (In reply to Richard Biener from comment #1)
> That is, it was supposed to end up using pslldq

I think you mean PSRLDQ. Byte zero is the right-most when drawn in a way that makes bit/byte shift directions all match up with the diagram. Opposite of array-initializer order.

PSRLDQ is sub-optimal without AVX. It needs a MOVDQA to copy-and-shuffle. For integer shuffles, PSHUFD is what you want (and PSHUFLW for a final step with 16-bit elements). None of the x86 copy-and-shuffle instructions can zero an element, only copy from one of the source elements.

PSHUFD can easily swap the high and low halves, but maybe some other targets can't do that as efficiently as just duplicating the high half into the low half or something. (I only really know x86 SIMD).

Ideally we could tell the back-end that the high half values are actually don't-care and can be anything, so it can choose the best shuffles to extract the high half for each successive narrowing step. I think clang does this: its shuffle-optimizer makes different choices in a function that returns __m128 vs. returning the low element of a vector as a scalar float. (e.g. for hand-coded horizontal sum using Intel _mm_ intrinsics).

---

To get truly optimal code, the backend needs more choice in what order to do the shuffles. e.g. with SSE3, the optimal sequence for FP hsum is probably:

movshdup %xmm0, %xmm1   # DCBA -> DDBB
addps    %xmm1, %xmm0   # D+D C+D B+B A+B (elements 0 and 2 are valuable)
movhlps  %xmm0, %xmm1   # Do this second so a dependency on xmm1 can't be a problem
addss    %xmm1, %xmm0   # addps saves 1 byte of code.

We could use MOVHLPS first to maintain the usual successive-narrowing pattern, but (to avoid a MOVAPS) only if we have a scratch register that was ready earlier in xmm0's dep chain (so introducing a dependency on it can't hurt the critical path).
It also needs to be holding FP values that won't cause a slowdown from denormals in the high two elements for the first addps. (NaN / infinity are ok for all current SSE hardware). Adding a number to itself is safe enough, so a shuffle that duplicates the high-half values is good. However, when auto-vectorizing an FP reduction with -fassociative-math but without the rest of -ffast-math, I guess we need to avoid spurious exceptions from values in elements that are otherwise don't-care. Swapping high/low halves is always safe, e.g. using movaps + shufps for both steps: DCBA -> BADC and get D+B C+A B+D A+C. WXYZ -> XWZY and get D+B+C+A repeated four times With AVX, the MOVAPS instructions go away, but vshufps's immediate byte still makes it 1 byte larger than vmovhlps or vunpckhpd. x86 has very few FP copy-and-shuffle instructions, so it's a trickier problem than for integer code where you can always just use PSHUFD unless tuning for SlowShuffle CPUs like first-gen Core2, or K8. With AVX, VPERMILPS with an immediate operand is pointless, except I guess with a memory source. It always needs a 3-byte VEX, but VSHUFPS can use a 2-byte VEX prefix and do the same copy+in-lane-shuffle just as fast on all CPUs (using the same register as both sources), except KNL where single-source shuffles are faster. Moreover, 3-operand AVX makes it possible to use VMOVHLPS or VUNPCKHPD as a copy+shuffle. -- > demote this to first add the two halves and then continue with the reduction > scheme. Sounds good. With x86 AVX512, it takes two successive narrowing steps to get down to 128bit vectors. Narrowing to 256b allows shorter instructions (VEX instead of EVEX). Even with 512b or 256b execution units, narrowing to 128b is about as efficient as possible. Doing the lower-latency in-lane shuffles first would let more instructions retire earlier, but only by a couple cycles. 
I don't think it makes any sense to special-case for AVX512 and do in-lane 256b ops ending with vextractf128, especially since gcc's current design does most of the reduction strategy in target-independent code. IDK if any AVX512 CPU will ever handle wide vectors as two or four 128b uops. It seems unlikely since 512b lane-crossing shuffles would have to decode to *many* uops, especially stuff like VPERMI2PS (select elements from two 512b sources). Still, AMD seems to like the half-width idea, so I wouldn't be surprised to see an AVX512 CPU with 256b execution units. Even 128b is plausible (especially for low-power / small-core CPUs like Jaguar or Silvermont).
[Bug other/80803] libgo appears to be miscompiled on powerpc64le since r247848
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80803 Bill Schmidt changed:

           What    |Removed                        |Added
           Summary |libgo appears to be miscompiled|libgo appears to be miscompiled
                   |on powerpc64le since r247923   |on powerpc64le since r247848

--- Comment #13 from Bill Schmidt --- I had a bad bisection due to revisions that broke bootstrap in between. Building just c,c++,go I was able to determine that the bug started happening with r247848, which is just the big merge of changes to the go frontend and libraries. (Unfortunately that doesn't provide any clues.) Nathan, sorry for the noise!
[Bug other/80803] libgo appears to be miscompiled on powerpc64le since r247923
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80803 --- Comment #12 from Ian Lance Taylor --- A global variable that cannot be statically initialized would be initialized by a function named "net..import", invoked before the Go main function starts. Since the net.IPv4 function is trivial, it is probably being inlined into net..import.
[Bug other/80803] libgo appears to be miscompiled on powerpc64le since r247923
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80803 --- Comment #11 from boger at us dot ibm.com --- The first failure happens in TestParseIP from ip_test.go because the "out" entries in the var parseIPTests are not initialized correctly. This causes the failures because the actual value (which is correct) doesn't match the expected value (which is incorrect).

var parseIPTests = []struct {
	in  string
	out IP
}{
	{"127.0.1.2", IPv4(127, 0, 1, 2)},
	{"127.0.0.1", IPv4(127, 0, 0, 1)},
	{"127.001.002.003", IPv4(127, 1, 2, 3)},
	{":::127.1.2.3", IPv4(127, 1, 2, 3)},
	{":::127.001.002.003", IPv4(127, 1, 2, 3)},
	{":::7f01:0203", IPv4(127, 1, 2, 3)},
	{"0:0:0:0:::127.1.2.3", IPv4(127, 1, 2, 3)},
	{"0:0:0:0:00::127.1.2.3", IPv4(127, 1, 2, 3)},
	{"0:0:0:0:::127.1.2.3", IPv4(127, 1, 2, 3)},
	.

I believe this is a static var, and the initialization of "out" is done through a call to IPv4 (which does a make), but I'm not sure where and when this initialization is supposed to occur. I tried to use gdb and set a watch where I thought it should be initialized, and it didn't trigger. Another oddity: when the entire net.test is run, the output from the call to UnmarshalText is extremely long, and that is what causes the output file to get so large; but if the test ParseIP is run by itself, it fails without generating the excessive output. In both cases, though, the initial failure is due to bad initialization of the "out" entries.
[Bug fortran/37131] inline matmul for small matrix sizes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37131 Bug 37131 depends on bug 66094, which changed state.

Bug 66094 Summary: Handle transpose(A) in inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66094

           What       |Removed |Added
           Status     |NEW     |RESOLVED
           Resolution |---     |FIXED
[Bug fortran/66094] Handle transpose(A) in inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66094 Thomas Koenig changed:

           What       |Removed |Added
           Status     |NEW     |RESOLVED
           Resolution |---     |FIXED

--- Comment #11 from Thomas Koenig --- All significant use cases are handled now. Closing.
[Bug fortran/66094] Handle transpose(A) in inline matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66094 --- Comment #10 from Thomas Koenig --- Author: tkoenig Date: Wed May 24 18:44:35 2017 New Revision: 248425 URL: https://gcc.gnu.org/viewcvs?rev=248425&root=gcc&view=rev Log: 2017-05-24 Thomas Koenig PR fortran/66094 * frontend-passes.c (matrix_case): Add A2TB2. (inline_limit_check): Handle MATMUL(TRANSPOSE(A),B) (inline_matmul_assign): Likewise. 2017-05-24 Thomas Koenig PR fortran/66094 * gfortran.dg/inline_matmul_16.f90: New test. Added: trunk/gcc/testsuite/gfortran.dg/inline_matmul_16.f90 Modified: trunk/gcc/fortran/ChangeLog trunk/gcc/fortran/frontend-passes.c trunk/gcc/testsuite/ChangeLog
[Bug sanitizer/80875] [7/8 Regression] UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875 Marek Polacek changed:

           What     |Removed                       |Added
           Status   |NEW                           |ASSIGNED
           Assignee |unassigned at gcc dot gnu.org |mpolacek at gcc dot gnu.org

--- Comment #3 from Marek Polacek --- I'll look.
[Bug sanitizer/80875] [7/8 Regression] UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875 Marek Polacek changed:

           What             |Removed                    |Added
           Keywords         |                           |ice-on-valid-code
           Target Milestone |---                        |7.2
           Summary          |UBSAN: compile time crash  |[7/8 Regression] UBSAN:
                            |in fold_binary_loc at      |compile time crash in
                            |fold-const.c:9817          |fold_binary_loc at
                            |                           |fold-const.c:9817
[Bug sanitizer/80875] UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875 --- Comment #2 from Marek Polacek ---

commit 0123775a88c6cf1035e4633fde7823a3e9889809
Author: rguenth
Date:   Wed Oct 28 13:41:25 2015 +

    2015-10-28  Richard Biener

    	* fold-const.c (negate_expr_p): Adjust the division case to
    	properly avoid introducing undefined overflow.
    	(fold_negate_expr): Likewise.

    git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@229484 138bc75d-0d04-0410-961f-82ee72b054a4
[Bug sanitizer/80875] UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875 Marek Polacek changed:

           What             |Removed     |Added
           Status           |UNCONFIRMED |NEW
           Last reconfirmed |            |2017-05-24
           Ever confirmed   |0           |1

--- Comment #1 from Marek Polacek --- Confirmed.
[Bug sanitizer/80875] New: UBSAN: compile time crash in fold_binary_loc at fold-const.c:9817
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80875

            Bug ID: 80875
           Summary: UBSAN: compile time crash in fold_binary_loc at
                    fold-const.c:9817
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: sanitizer
          Assignee: unassigned at gcc dot gnu.org
          Reporter: babokin at gmail dot com
                CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
                    jakub at gcc dot gnu.org, kcc at gcc dot gnu.org,
                    marxin at gcc dot gnu.org
  Target Milestone: ---

gcc rev248384, x86_64.

> cat f.cpp
void foo() {
  ~2147483647 * (0 / 0);
}

> g++ -fsanitize=undefined -w -c f.cpp
f.cpp: In function ‘void foo()’:
f.cpp:3:1: internal compiler error: tree check: expected class ‘constant’, have ‘unary’ (negate_expr) in fold_binary_loc, at fold-const.c:9817
 }
 ^
0x10384a7 tree_class_check_failed(tree_node const*, tree_code_class, char const*, int, char const*)
        ../../gcc_svn/gcc/tree.c:9909
<...>
[Bug c++/78591] [c++1z] ICE when using decomposition identifier from closure object
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78591 --- Comment #1 from Paolo Carlini --- The released 7.1.0 doesn't ICE.
[Bug rtl-optimization/80754] [8 Regression][LRA] Invalid smull instruction generated in lra-remat
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80754 --- Comment #5 from Wilco --- Author: wilco Date: Wed May 24 17:06:55 2017 New Revision: 248424 URL: https://gcc.gnu.org/viewcvs?rev=248424&root=gcc&view=rev Log: When lra-remat rematerializes an instruction with a clobber, it checks that the clobber does not kill live registers. However it fails to check that the clobber also doesn't overlap with the destination register of the final rematerialized instruction. As a result it is possible to generate illegal instructions with the same hard register as the destination and a clobber. Fix this by also checking for overlaps with the destination register. gcc/ PR rtl-optimization/80754 * lra-remat.c (do_remat): Add overlap checks for dst_regno. Modified: trunk/gcc/ChangeLog trunk/gcc/lra-remat.c
[Bug c++/71451] [5/6/7/8 Regression] ICE on invalid C++11 code on x86_64-linux-gnu: in dependent_type_p, at cp/pt.c:22599
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71451 Paolo Carlini changed:

           What |Removed |Added
           CC   |        |paolo.carlini at oracle dot com

--- Comment #4 from Paolo Carlini --- This seems already fixed in trunk. I guess we may as well add the testcase there, and of course keep the bug open.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #19 from Thorsten Kurth --- Thank you very much. I am sorry that I do not have a simpler test case. The kernel which is executed is in the same directory as ABecLaplacian and is called MG_3D_cpp.cpp. We have seen similar problems with the fortran kernels (they are scattered across multiple files), but the fortran kernels and our C++ ports give the same performance with the original OpenMP parallelization. In any case, I wonder why the compiler honors the target region even if -march=knl is specified. Please let me know if you have further questions; I can guide you through the code. The code is big, but the relevant files are really only 2 or 3, and the relevant lines of code are not many either.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #18 from Jakub Jelinek --- Ok, I'll grab your git code and will have a look tomorrow what's going on.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #17 from Thorsten Kurth --- The result, though, is correct; I verified that both codes generate the correct output.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #16 from Thorsten Kurth --- FYI, the code is: https://github.com/zronaghi/BoxLib.git in branch cpp_kernels_openmp4dot5, and then in Src/LinearSolvers/C_CellMG the file ABecLaplacian.cpp. For example, lines 542 and 543 can be commented out and commented in, and when the test case is run you get a significant slowdown when the code is compiled with that stuff commented in. I did not map all the scalar stuff, so it might be that this is a problem. But in any case, it should not create copies of that stuff at all in my opinion. Please don't look at that code too closely right now because it is a bit convoluted; I just wanted to show that this issue appears. So when I have the target section I mentioned above commented in, running:

#!/bin/bash
export OMP_NESTED=false
export OMP_NUM_THREADS=64
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export OMP_MAX_ACTIVE_LEVELS=1

execpath="/project/projectdirs/mpccc/tkurth/Portability/BoxLib/Tutorials/MultiGrid_C"
exec=`ls -latr ${execpath}/main3d.*.MPI.OMP.ex | awk '{print $9}'`

#execute
${exec} inputs

gives the following:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size     : 1 (length unit)
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS      : -2.68882138776405e-17
Solving with BoxLib C++ solver
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs= 135.516568492921
MultiGrid: Initial residual = 135.516568492921
MultiGrid: Iteration 1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration 2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration 3 resid/bnorm = 0.000551321916982188
MultiGrid: Iteration 4 resid/bnorm = 3.55014555643671e-05
MultiGrid: Iteration 5 resid/bnorm = 2.57082340920002e-06
MultiGrid: Iteration 6 resid/bnorm = 1.90970439886018e-07
MultiGrid: Iteration 7 resid/bnorm = 1.44525222814178e-08
MultiGrid: Iteration 8 resid/bnorm = 1.10675190626368e-09
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11
, Solve time: 5.84898591041565, CG time: 0.162226438522339
Converged res < eps_rel*max(bnorm,res_norm)
Run time : 5.98936820030212
Unused ParmParse Variables:
  [TOP]::hypre.solver_flag(nvals = 1)  :: [1]
  [TOP]::hypre.pfmg_rap_type(nvals = 1)  :: [1]
  [TOP]::hypre.pfmg_relax_type(nvals = 1)  :: [2]
  [TOP]::hypre.num_pre_relax(nvals = 1)  :: [2]
  [TOP]::hypre.num_post_relax(nvals = 1)  :: [2]
  [TOP]::hypre.skip_relax(nvals = 1)  :: [1]
  [TOP]::hypre.print_level(nvals = 1)  :: [1]
done.

When I comment it out and recompile, I get:

tkurth@nid06760:/global/cscratch1/sd/tkurth/boxlib_omp45> ./run_example.sh
MPI initialized with 1 MPI processes
OMP initialized with 64 OMP threads
Using Dirichlet or Neumann boundary conditions.
Grid resolution : 128 (cells)
Domain size     : 1 (length unit)
Max_grid_size   : 32 (cells)
Number of grids : 64
Sum of RHS      : -2.68882138776405e-17
Solving with BoxLib C++ solver
WARNING: using C++ kernels in LinOp
WARNING: using C++ MG solver with C kernels
MultiGrid: Initial rhs      = 135.516568492921
MultiGrid: Initial residual = 135.516568492921
MultiGrid: Iteration 1 resid/bnorm = 0.379119045820053
MultiGrid: Iteration 2 resid/bnorm = 0.0107971623268356
MultiGrid: Iteration 3 resid/bnorm = 0.000551321916981978
MultiGrid: Iteration 4 resid/bnorm = 3.5501455563633e-05
MultiGrid: Iteration 5 resid/bnorm = 2.5708234090034e-06
MultiGrid: Iteration 6 resid/bnorm = 1.90970439781153e-07
MultiGrid: Iteration 7 resid/bnorm = 1.44525225042545e-08
MultiGrid: Iteration 8 resid/bnorm = 1.10675108045705e-09
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11
MultiGrid: Iteration 9 resid/bnorm = 8.55424251440489e-11, Solve time: 0.759385108947754, CG time: 0.14183521270752
Converged res < eps_rel*max(bnorm,res_norm)
Run time : 0.879786014556885
Unused ParmParse Variables:
[TOP]::hypre.solver_flag(nvals = 1) :: [1]
[TOP]::hypre.pfmg_rap_type(nvals = 1) :: [1]
[TOP]::hypre.pfmg_relax_type(nvals = 1) :: [2]
[TOP]::hypre.num_pre_relax(nvals = 1) :: [2]
[TOP]::hypre.num_post_relax(nvals = 1) :: [2]
[TOP]::hypre.skip_relax(nvals = 1) :: [1]
[TOP]::hypre.print_level(nvals = 1) :: [1]
done.

That is about a 7.3x slowdown. The smoothing kernel (Gauss-Seidel red-black)
is the most expensive kernel in the multigrid code, so I see the biggest
effect there. But the other kernels (prolongation, restriction, dot products
etc.) have slowdowns as well, amounting to a total of more than 10x for the
whole app.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #15 from Thorsten Kurth ---
The code I care about definitely has optimization enabled. For the Fortran
stuff it does (for example):

ftn -g -O3 -ffree-line-length-none -fno-range-check -fno-second-underscore -Jo/3d.gnu.MPI.OMP.EXE -I o/3d.gnu.MPI.OMP.EXE -fimplicit-none -fopenmp -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/LinearSolvers/F_MG/itsol.f90 -o o/3d.gnu.MPI.OMP.EXE/itsol.o
Compiling cc_mg_tower_smoother.f90 ...

and for the C++ stuff it does:

CC -g -O3 -std=c++14 -fopenmp -g -DCG_USE_OLD_CONVERGENCE_CRITERIA -DBL_OMP_FABS -DDEVID=0 -DNUM_TEAMS=1 -DNUM_THREADS_PER_BOX=1 -march=knl -DNDEBUG -DBL_USE_MPI -DBL_USE_OMP -DBL_GCC_VERSION='6.3.0' -DBL_GCC_MAJOR_VERSION=6 -DBL_GCC_MINOR_VERSION=3 -DBL_SPACEDIM=3 -DBL_FORT_USE_UNDERSCORE -DBL_Linux -DMG_USE_FBOXLIB -DBL_USE_F_BASELIB -DBL_USE_FORTRAN_MPI -DUSE_F90_SOLVERS -I. -I../../Src/C_BoundaryLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I../../Src/C_BaseLib -I../../Src/C_BoundaryLib -I../../Src/C_BaseLib -I../../Src/LinearSolvers/C_CellMG -I../../Src/LinearSolvers/C_CellMG4 -I/opt/intel/vtune_amplifier_xe_2017.2.0.499904/include -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/C_to_F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/LinearSolvers/F_MG -I../../Src/F_BaseLib -I../../Src/F_BaseLib -c ../../Src/C_BaseLib/FPC.cpp -o o/3d.gnu.MPI.OMP.EXE/FPC.o
Compiling Box.cpp ...

But the kernels I care about are in C++.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #14 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #13)
> the compiler options are just -fopenmp. I am sure it does not have to do
> anything with vectorization as I compare the code runtime with and without
> the target directives and thus vectorization should be the same between
> them. The remaining OpenMP sections are the same. In our work we have not
> seen 10x because of insufficient vectorization, it is usually because of
> cache locality but that is the same for OMP 4.5 and OMP 3 because the loops
> are not touched.
> I do not specify an ISA choice, but I will try specifying KNL now and will
> tell you what the compiler is going to do.

The compiler doesn't optimize by default (i.e. the default is -O0), so if you
are measuring -O0 -fopenmp performance or code size, that is something that
is completely uninteresting. For -O0 the most important thing is compilation
speed, not quality of generated code. For runtime performance of generated
code, only -O2, -O3 or -Ofast are optimization levels that make sense.
[Bug fortran/28004] Warn if intent(out) dummy variable is used before being defined
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28004

Thomas Koenig changed:

           What    |Removed             |Added
   Last reconfirmed|2007-07-03 21:06:36 |2017-05-24

--- Comment #12 from Thomas Koenig ---
Still current with current trunk.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #13 from Thorsten Kurth ---
Hello Jakub,

the compiler options are just -fopenmp. I am sure it does not have anything
to do with vectorization, as I compare the code runtime with and without the
target directives, so vectorization should be the same between them. The
remaining OpenMP sections are the same. In our work we have not seen 10x
slowdowns because of insufficient vectorization; it is usually because of
cache locality, but that is the same for OMP 4.5 and OMP 3 because the loops
are not touched.

I do not specify an ISA choice, but I will try specifying KNL now and will
tell you what the compiler is going to do.

Best
Thorsten
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #12 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #11)
> yes, you are right. I thought that map(tofrom:) is the default mapping
> but I might be wrong. In any case, teams is always 1. So this code is

Variables that aren't pointers nor scalars are still implicitly map(tofrom:),
scalars are implicitly firstprivate(), pointers are map(alloc:ptr[0:0]).

> basically just data streaming so there is no need for a detailed
> performance analysis. When I timed the code (not profiling it) the OpenMP
> 4.5 code had a tiny bit more overhead, but not significant.
> However, we might nevertheless learn from that.

What kind of compiler options do you use? -O2 -fopenmp, -O3 -fopenmp,
-Ofast -fopenmp, something different? What ISA choice? -march=native,
-mavx2, ...? The 10x slowdown could most likely be explained by the inner
loop being vectorized in one case and not the other. You aren't using
#pragma omp parallel for simd, with which you'd explicitly ask for
vectorization even at -O2 -fopenmp.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #11 from Thorsten Kurth --- Hello Jakub, yes, you are right. I thought that map(tofrom:) is the default mapping but I might be wrong. In any case, teams is always 1. So this code is basically just data streaming so there is no need for a detailed performance analysis. When I timed the code (not profiling it) the OpenMP 4.5 code had a tiny bit more overhead, but not significant. However, we might nevertheless learn from that. Best Thorsten
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #10 from Jakub Jelinek ---
(In reply to Thorsten Kurth from comment #7)
> Hello Jakub,
>
> thanks for your comment but I think the parallel for is not racey. Every
> thread is working a block of i-indices so that is fine. The dotprod kernel
> is actually a kernel from the OpenMP standard documentation and I am sure
> that this is not racey.

I was not talking about the parallel for, but about the parallel I've cited.
Even if you write the same value from all threads, at least pedantically it
is racy, even when you might get away with it. Which is why you should
assign it just once, e.g. through #pragma omp master or single.

> The example with the regions you mentioned I do not see a problem with that
> either. By default, everything is shared so the variable is updated by all
> the threads/teams with the same value.

The omp target I've cited above is by default handled in OpenMP 4.0 as
#pragma omp target teams map(tofrom:num_teams)
and will work that way, although it is again pedantically racy, as multiple
teams write the same value. In OpenMP 4.5 it is
#pragma omp target teams firstprivate(num_teams)
and you will always end up with 1, even if there is an accelerator that has
say 1024 teams by default. So you really need an explicit map(from:num_teams)
or similar to get the value back. And to be pedantically correct, also
assign it only once, e.g. by doing the assignment only
if (omp_get_team_num () == 0).

> Concerning splitting distribute and parallel: I tried both combinations and
> found that they behave the same. But in the end I split it so that I could
> comment out the distribute section to see if that makes a performance
> difference (and it does).

I was just asking why you are doing it; I haven't yet analyzed the code to
see if there is something that could be easily improved.
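The fix described above — an explicit map(from:) plus assigning from a single team — can be sketched as follows. This is an illustrative snippet, not code from the testcase; `count_teams` is a made-up name, and the snippet falls back to one team when built without OpenMP support:

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Under OpenMP 4.5 a scalar on "omp target teams" is implicitly
// firstprivate, so the host would never see the update.  map(from:)
// gets the value back, and the team-0 guard avoids the (pedantic)
// race of every team writing the same variable.
int count_teams()
{
    int num_teams = 1;
#ifdef _OPENMP
    #pragma omp target teams map(from: num_teams)
    {
        if (omp_get_team_num() == 0)
            num_teams = omp_get_num_teams();
    }
#endif
    return num_teams;
}
```

On a host-only build this simply reports one team; with offloading enabled, the accelerator's team count comes back through the map clause.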
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #9 from Thorsten Kurth --- Sorry, in the second run I set the number of threads to 12. I think the code works as expected.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #8 from Thorsten Kurth ---
Here is the output of the get_num_threads section:

[tkurth@cori02 omp_3_vs_45_test]$ export OMP_NUM_THREADS=32
[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 32 threads.

and:

[tkurth@cori02 omp_3_vs_45_test]$ ./nested_test_omp_4dot5.x
We got 1 teams and 12 threads.

I think the code is OK.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #7 from Thorsten Kurth ---
Hello Jakub,

thanks for your comment, but I think the parallel for is not racey. Every
thread is working on a block of i-indices, so that is fine. The dotprod
kernel is actually a kernel from the OpenMP standard documentation and I am
sure that it is not racey.

The example with the regions you mentioned: I do not see a problem with that
either. By default, everything is shared, so the variable is updated by all
the threads/teams with the same value. The issue is that num_teams=1 is only
true for CPU; for GPU it is OS, driver, architecture and whatever dependent.

Concerning splitting distribute and parallel: I tried both combinations and
found that they behave the same. But in the end I split it so that I could
comment out the distribute section to see if that makes a performance
difference (and it does).

I believe that the overhead instructions are responsible for the bad
performance, because that is the only thing which distinguishes the target
annotated code from the plain OpenMP code. I used VTune to look at thread
utilization and they look similar; L1 and L2 hit rates are very close (100%
vs. 99% and 92% vs. 89%) for the plain OpenMP and the target annotated code.
BUT the performance of the target annotated code can be up to 10x worse. So
I think there might be register spilling due to copying a large amount of
variables.

If you like, I can point you to the GitHub repo code (BoxLib) which clearly
exhibits this issue. This small test case only shows minor overhead of
OpenMP 4.5 vs., say, OpenMP 3, but it clearly generates some additional
overhead.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #6 from Jakub Jelinek ---
movq/pushq etc. aren't that expensive; if it affects performance, it must be
something in the inner loops.

A compiler switch that ignores omp target, teams and distribute would
basically create a new OpenMP version, because it would ignore the
requirements on those constructs. You can achieve it yourself by using those
in _Pragma in some macro and defining it conditionally based on whether you
want offloading or not; then the "you can ignore all side effects" part is
decided by you. For OpenMP 5.0 there is some work on prescriptive vs.
descriptive clauses/constructs, where in your case you could just describe
that the loop could be parallelized, simdized and/or offloaded and leave it
up to the implementation what it does with that.

What we perhaps could do when not offloading is try to simplify omp
distribute (if we know omp_get_num_teams () will always be 1), either just
by folding the library calls in that case to 1 or 0, or perhaps doing some
more.

#pragma omp target teams
{
  num_teams = omp_get_num_teams();
}
#pragma omp parallel
{
  num_threads = omp_get_num_threads();
}

in your testcase is just wrong. The target would be ok in OpenMP 4.0, but it
is not in 4.5: num_teams, being a scalar variable, is firstprivate, so you
won't get the value back. The parallel is racy; to avoid races you'd need
#pragma omp single or #pragma omp master.

Why are you using separate distribute and parallel for constructs and
prescribing what they handle, instead of just using

#pragma omp distribute parallel for
for (int i = 0; i < N; ++i)
  D[i] += B[i] * C[i];

? Do you expect or see any gains from that?
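A minimal sketch of the _Pragma-in-a-macro idea from this comment; `OMP_FOR`, `USE_OFFLOAD` and `daxpy` are made-up names, and a real offload build would also want explicit map clauses on the directive:

```cpp
#include <cassert>

// Offload build: full target construct.  Host build: plain parallel for,
// so target/teams/distribute disappear entirely at preprocessing time.
#ifdef USE_OFFLOAD
#  define OMP_FOR _Pragma("omp target teams distribute parallel for")
#else
#  define OMP_FOR _Pragma("omp parallel for")
#endif

void daxpy(int n, double *d, const double *b, const double *c)
{
    OMP_FOR
    for (int i = 0; i < n; ++i)
        d[i] += b[i] * c[i];
}
```

Either expansion computes the same result; only the directive the compiler sees changes, which is exactly the "decided by you" part above.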
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #5 from Thorsten Kurth ---
To clarify the problem: I think that the additional movq, pushq and other
instructions generated when using the target directive can cause a big hit
on performance. I understand that these instructions are necessary when
offloading is used, but when I compile for the native architecture they
should not be there. So maybe I am just missing a GNU compiler flag which
disables offloading and lets the compiler ignore the target, teams and
distribute directives at compile time while still honoring all the other
OpenMP constructs. Is there a way to do that right now, and if not, could
such a flag be added?
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #4 from Thorsten Kurth --- Created attachment 41415 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41415&action=edit Testcase This is the test case. The files ending on .as contain the assembly code with and without target region
[Bug libfortran/78379] Processor-specific versions for matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78379

--- Comment #36 from Jerry DeLisle ---
Results look very good. Gfortran 7, no patch, gives:

$ gfc7 -static -Ofast -ftree-vectorize compare.f90
$ ./a.out
 =                MEASURED GIGAFLOPS                 =
 =        Matmul     Matmul fixed     Matmul variable
  Size  Loops  explicit  refMatmul   assumed  explicit
     2   2000     4.706      0.046     0.094     0.162
     4   2000     1.246      0.246     0.305     0.351
     8   2000     1.410      0.605     0.958     1.791
    16   2000     5.413      2.787     2.228     2.615
    32   2000     4.676      3.416     4.622     4.618
    64   2000     6.368      2.652     6.339     6.167
   128   2000     8.165      2.998     8.118     8.260
   256    477     9.334      3.202     9.248     9.355
   512     59     8.730      2.239     8.596     8.730
  1024      7     8.805      1.378     8.673     8.812
  2048      1     8.781      1.728     8.649     8.789

Latest gfortran trunk with patch gives:

$ gfc -static -Ofast -ftree-vectorize compare.f90
$ ./a.out
 =                MEASURED GIGAFLOPS                 =
 =        Matmul     Matmul fixed     Matmul variable
  Size  Loops  explicit  refMatmul   assumed  explicit
     2   2000     4.738      0.048     0.092     0.172
     4   2000     1.438      0.248     0.305     0.378
     8   2000     1.511      0.617     1.177     1.955
    16   2000     5.426      2.810     1.854     2.881
    32   2000     4.688      3.314     4.357     5.091
    64   2000     6.669      2.674     6.629     7.110
   128   2000     9.139      3.000     9.076     9.131
   256    477    10.495      3.184    10.466    10.516
   512     59     9.577      2.189     9.477     9.635
  1024      7     9.593      1.381     9.519     9.658
  2048      1     9.722      1.709     9.625     9.785
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859 --- Comment #3 from Thorsten Kurth --- Created attachment 41414 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41414&action=edit OpenMP 4.5 Testcase This is the source code
[Bug bootstrap/80843] [8 Regression] bootstrap fails in stage1 on powerpc-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80843

Segher Boessenkool changed:

           What    |Removed |Added
                 CC|        |segher at gcc dot gnu.org

--- Comment #2 from Segher Boessenkool ---
I suspect my patch for PR80860 has fixed this as well; Matthias, can you
check please?
[Bug bootstrap/80860] AIX Bootstrap failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80860

Segher Boessenkool changed:

           What    |Removed  |Added
             Status|ASSIGNED |RESOLVED
         Resolution|---      |FIXED

--- Comment #6 from Segher Boessenkool ---
Should be fixed now. Please reopen if not.
[Bug bootstrap/80843] [8 Regression] bootstrap fails in stage1 on powerpc-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80843

--- Comment #1 from Segher Boessenkool ---
Author: segher
Date: Wed May 24 14:33:11 2017
New Revision: 248421

URL: https://gcc.gnu.org/viewcvs?rev=248421&root=gcc&view=rev
Log:
rs6000: Fix for separate shrink-wrapping for fp (PR80860, PR80843)

After my r248256, rs6000_components_for_bb allocates an sbitmap of size only
32 while it can use up to 64. This patch fixes it. It moves the n_components
variable into the machine_function struct so that other hooks can use it.

	PR bootstrap/80860
	PR bootstrap/80843
	* config/rs6000/rs6000.c (struct machine_function): Add new field
	n_components.
	(rs6000_get_separate_components): Init that field, use it.
	(rs6000_components_for_bb): Use the field.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/rs6000.c
[Bug bootstrap/80860] AIX Bootstrap failure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80860

--- Comment #5 from Segher Boessenkool ---
Author: segher
Date: Wed May 24 14:33:11 2017
New Revision: 248421

URL: https://gcc.gnu.org/viewcvs?rev=248421&root=gcc&view=rev
Log:
rs6000: Fix for separate shrink-wrapping for fp (PR80860, PR80843)

After my r248256, rs6000_components_for_bb allocates an sbitmap of size only
32 while it can use up to 64. This patch fixes it. It moves the n_components
variable into the machine_function struct so that other hooks can use it.

	PR bootstrap/80860
	PR bootstrap/80843
	* config/rs6000/rs6000.c (struct machine_function): Add new field
	n_components.
	(rs6000_get_separate_components): Init that field, use it.
	(rs6000_components_for_bb): Use the field.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/rs6000.c
[Bug libfortran/78379] Processor-specific versions for matmul
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78379

--- Comment #35 from Jerry DeLisle ---
(In reply to Thomas Koenig from comment #34)
> Created attachment 41410 [details]
> Patch which has all the files
>
> Well, I suspect my way of splitting the previous patch into
> one real patch and one *.tar.gz file was not really the best way
> to go :-)
>
> Here is a patch which should include all the new files.
>
> At least it fits into the 1000 kb limit.

I am finishing a build in maintainer mode, so I will try the first approach
and, if that fails, will try the new patch. Everything looks reasonable; I
just think we should test on my AMD boxes.
[Bug tree-optimization/80874] New: gcc does not emit cmov for minmax
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80874

            Bug ID: 80874
           Summary: gcc does not emit cmov for minmax
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: denis.campredon at gmail dot com
  Target Milestone: ---

Hello,

Considering the following code:

--
struct pair {
    int min, max;
};

pair minmax1(int x, int y) {
    if (x > y)
        return {y, x};
    else
        return {x, y};
}

#include <algorithm>

std::pair<int, int> minmax2(int x, int y) {
    return std::minmax(x, y);
}

auto minmax3(int x, int y) {
    return std::minmax(x, y);
}
---

I've found that for minmax1 and minmax2, gcc fails to emit cmov at -O3.
Instead it produces the following:

minmax1(int, int):
        cmp     edi, esi
        jle     .L2
        mov     eax, edi
        mov     edi, esi
        mov     esi, eax
.L2:
        mov     eax, edi
        sal     rsi, 32
        or      rax, rsi
        ret

For minmax3, the asm should be the same (I think), but it produces more
complex code.
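For comparison, a formulation that typically does compile to two independent conditional moves is to compute min and max separately instead of swapping in a branch. This is a sketch of a workaround, not part of the report, and `minmax_branchless` is an illustrative name:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// std::min and std::max usually each lower to a compare + cmov at -O3,
// and the two results have no control-flow dependence on each other.
std::pair<int, int> minmax_branchless(int x, int y)
{
    return { std::min(x, y), std::max(x, y) };
}
```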
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #12 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #4)
> MMX is also a saving in code-size: one fewer prefix byte vs. SSE2 integer
> instructions. It's also another set of 8 registers for 32-bit mode.

After touching an MMX register, the compiler needs to emit an emms insn, so
MMX moves are practically unusable as generic moves.
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #11 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #0)
> A lower-latency xmm->int strategy would be:
>
>     movd    %xmm0, %eax
>     pextrd  $1, %xmm0, %edx

The proposed patch implements the above for generic moves.

> Or without SSE4, -mtune=sandybridge (anything that excludes Nehalem and
> other CPUs where an FP shuffle has bypass delay between integer ops):
>
>     movd     %xmm0, %eax
>     movshdup %xmm0, %xmm0   # saves 1B of code-size vs. psrldq, I think.
>     movd     %xmm0, %edx
>
> Or without SSE3:
>
>     movd    %xmm0, %eax
>     psrldq  $4, %xmm0       # 1 m-op cheaper than pshufd on K8
>     movd    %xmm0, %edx

The above two proposals are not suitable for generic moves. We should not
clobber the input value, and we are not allowed to use a temporary.
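The movd/psrldq sequence quoted above can be exercised from C++ with SSE2 intrinsics. `split64` is a hypothetical helper for illustration, and the pure-integer fallback keeps the sketch portable off x86-64:

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>
#endif

// Split a 64-bit value held in an xmm register into its two 32-bit
// halves: movd for the low half, psrldq $4 + movd for the high half.
void split64(uint64_t v, uint32_t *lo, uint32_t *hi)
{
#if defined(__SSE2__) && defined(__x86_64__)
    __m128i x = _mm_cvtsi64_si128((int64_t)v);  // movq  %rdi, %xmm0
    *lo = (uint32_t)_mm_cvtsi128_si32(x);       // movd  %xmm0, %eax
    x = _mm_srli_si128(x, 4);                   // psrldq $4, %xmm0
    *hi = (uint32_t)_mm_cvtsi128_si32(x);       // movd  %xmm0, %edx
#else
    *lo = (uint32_t)v;                          // plain shifts elsewhere
    *hi = (uint32_t)(v >> 32);
#endif
}
```

Note this version destroys the value in the register, which is exactly why the comment rules the sequence out for generic moves.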
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #10 from Uroš Bizjak ---
(In reply to Peter Cordes from comment #0)
> Scalar 64-bit integer ops in vector regs may be useful in general in 32-bit
> code in some cases, especially if it helps with register pressure.

We have a scalar-to-vector pass (-mstv) that does the above, but it chooses
not to convert the above code due to costs.
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #9 from Uroš Bizjak ---
(In reply to Uroš Bizjak from comment #8)
> movq    %xmm0, (%esp)   <<-- unneeded store due to RA problem

For some reason, reload "fixes" direct DImode register moves and passes the
value via memory. Later passes partially merge these moves, but leave the
above insn.
[Bug c++/80873] ICE in tsubst_copy when trying to use an overloaded function without a definition in a lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80873 --- Comment #2 from Morris Hafner --- Created attachment 41413 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41413&action=edit Minimal example code (valid)
[Bug c++/80873] ICE in tsubst_copy when trying to use an overloaded function without a definition in a lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80873

--- Comment #1 from Morris Hafner ---
I managed to create an example that is a valid program:

struct Buffer {};

auto parse(Buffer b);

template <typename T>
void parse(T target);

template <typename T>
auto field(T target) {
    return [&] { parse(target); };
}

template <typename T>
void parse(T target) {}

auto parse(Buffer b) { field(0); }

int main() { }
[Bug other/80803] libgo appears to be miscompiled on powerpc64le since r247923
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80803

boger at us dot ibm.com changed:

           What    |Removed |Added
                 CC|        |boger at us dot ibm.com

--- Comment #10 from boger at us dot ibm.com ---
Bill, I've been out for a few days but can help debug this now that I'm back.

It looks like the test case gets the correct answer, but the string it thinks
is correct is wrong (always nil). Either the initialization of the expected
values is wrong to begin with, or they are being corrupted during the run.
We should be able to figure that out with gdb.
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

--- Comment #8 from Uroš Bizjak ---
The patch from comment #7 generates:

a) DImode move for 32-bit targets:

--cut here--
long long test (long long a)
{
  asm ("" : "+x" (a));
  return a;
}
--cut here--

gcc -O2 -msse4.1 -mtune=intel -mregparm=2:

        movd    %eax, %xmm0
        pinsrd  $1, %edx, %xmm0
        movq    %xmm0, (%esp)   <<-- unneeded store due to RA problem
        movd    %xmm0, %eax
        pextrd  $1, %xmm0, %edx
        leal    12(%esp), %esp

b) TImode move for 64-bit targets:

--cut here--
__int128 test (__int128 a)
{
  asm ("" : "+x" (a));
  return a;
}
--cut here--

gcc -O2 -msse4.1 -mtune=intel:

        movq    %rdi, %xmm0
        pinsrq  $1, %rsi, %xmm0
        pextrq  $1, %xmm0, %rdx
        movq    %xmm0, %rax
[Bug tree-optimization/46186] Clang creates code running 1600 times faster than gcc's
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Raphael C changed:

           What    |Removed |Added
                 CC|        |drraph at gmail dot com

--- Comment #26 from Raphael C ---
If I understood this PR correctly, this simpler code shows the same issue:

unsigned long f(unsigned long a)
{
    unsigned long sum = 0;
    for (; a < 10; a++)
        sum += a;
    return sum;
}

With gcc 7.1 and -O3 -march=native I get:

f:
        cmp     rdi, 9
        ja      .L7
        mov     eax, 9
        mov     ecx, 10
        sub     rax, rdi
        sub     rcx, rdi
        cmp     rax, 7
        jbe     .L8
        vmovq   xmm3, rdi
        mov     rdx, rcx
        vpxor   xmm0, xmm0, xmm0
        xor     eax, eax
        vpbroadcastq ymm1, xmm3
        vmovdqa ymm2, YMMWORD PTR .LC1[rip]
        vpaddq  ymm1, ymm1, YMMWORD PTR .LC0[rip]
        shr     rdx, 2
.L4:
        add     rax, 1
        vpaddq  ymm0, ymm0, ymm1
        vpaddq  ymm1, ymm1, ymm2
        cmp     rax, rdx
        jb      .L4
        vpxor   xmm1, xmm1, xmm1
        mov     rdx, rcx
        vperm2i128 ymm2, ymm0, ymm1, 33
        and     rdx, -4
        vpaddq  ymm0, ymm0, ymm2
        add     rdi, rdx
        vperm2i128 ymm1, ymm0, ymm1, 33
        vpalignr ymm1, ymm1, ymm0, 8
        vpaddq  ymm0, ymm0, ymm1
        vmovq   rax, xmm0
        cmp     rcx, rdx
        je      .L33
        vzeroupper
.L3:
        lea     rdx, [rdi+1]
        add     rax, rdi
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+2]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+3]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+4]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+5]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        lea     rdx, [rdi+6]
        cmp     rdx, 10
        je      .L31
        add     rax, rdx
        add     rdi, 7
        lea     rdx, [rax+rdi]
        cmp     rdi, 10
        cmovne  rax, rdx
        ret
.L7:
        xor     eax, eax
.L31:
        ret
.L33:
        vzeroupper
        ret
.L8:
        xor     eax, eax
        jmp     .L3

However with clang I get:

f:                              # @f
        cmp     rdi, 9
        ja      .LBB0_1
        mov     eax, 9
        sub     rax, rdi
        lea     rcx, [rdi + 1]
        imul    rcx, rax
        mov     edx, 8
        sub     rdx, rdi
        mul     rdx
        shl     rdx, 63
        shr     rax
        or      rax, rdx
        add     rcx, rdi
        add     rcx, rax
        mov     rax, rcx
        ret
.LBB0_1:
        xor     ecx, ecx
        mov     rax, rcx
        ret

which is much simpler and avoids looping altogether.

What is the current status of this (very old) PR? Do people think it is
worth addressing?
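What clang emits here is just the closed form of the arithmetic series a + (a+1) + ... + 9. A scalar sketch of that transformation, checked against the original loop (`sum_loop` and `sum_closed` are illustrative names):

```cpp
#include <cassert>

// The loop clang eliminates.
unsigned long sum_loop(unsigned long a)
{
    unsigned long sum = 0;
    for (; a < 10; a++)
        sum += a;
    return sum;
}

// Closed form: the n terms from a to 9 sum to n * (first + last) / 2.
// One of n = 10 - a and (a + 9) is always even, so the division is exact.
unsigned long sum_closed(unsigned long a)
{
    if (a >= 10)
        return 0;
    unsigned long n = 10 - a;
    return n * (a + 9) / 2;
}
```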
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #7 from Uroš Bizjak --- Created attachment 41412 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41412&action=edit Prototype patch Patch that emits mov/pinsr or mov/pextr pairs for DImode (x86_32) and TImode (x86_64) moves.
[Bug tree-optimization/80844] OpenMP SIMD doesn't know how to efficiently zero a vector (its stores zeros and reloads)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844

--- Comment #5 from Richard Biener ---
(In reply to Jakub Jelinek from comment #2)
> (In reply to Richard Biener from comment #1)
> > If OMP SIMD always zeros the vector then it could also emit the maybe
> > easier to optimize
> >
> > WITH_SIZE_EXPR<_3, D.2841> = {};
>
> It doesn't always zero, it can be pretty arbitrary.

Ah, the memset gets exposed by loop distribution. Before that we have

  [5.67%]:
  # _28 = PHI <_27(13), 0(10)>
  D.2357[_28] = 0.0;
  _27 = _28 + 1;
  if (_15 > _27)
    goto ; [85.00%]
  else
    goto ; [15.00%]

  [4.82%]:
  goto ; [100.00%]

so indeed the other cases will be more "interesting". For your latest idea
to work we have to make sure the prologue/epilogue loop doesn't get unrolled
or pattern matched. I'll still look at enhancing memset folding (it's pretty
conservative in the cases it handles).

> For the default reductions on integral/floating point types it does zero
> for +/-/|/^/|| reductions, but e.g. 1 for */&&, or ~0 for &, or maximum or
> minimum for min or max. For user defined reductions it can be whatever the
> user requests, constructor for some class type, function call, set to
> arbitrary value etc. For other privatization clauses it is again something
> different (uninitialized for private/lastprivate, some other var + some
> bias for linear, ...).
> And then after the simd loop there is again a reduction or something
> similar, but again can be quite complex in the general case. If it helps,
> we could mark the pre-simd and post-simd loops somehow in the loop
> structure or something, but the actual work needs to be done later,
> especially after inlining, including the vectorizer and other passes.
> E.g. for the typical reduction where the vectorizer computes the "simd
> array" in a vector temporary (or collection of them), it would be nice if
> we were able to pattern recognize simple cases and turn those into vector
> reduction patterns.
[Bug libstdc++/80826] Compilation Time for many of std::map insertions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80826

Jan Hubicka changed:

           What    |Removed                        |Added
             Status|NEW                            |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot gnu.org

--- Comment #9 from Jan Hubicka ---
Will take a look at the cgraph issue.
[Bug c++/80864] Brace-initialization of a constexpr variable of an array in a POD triggers ICE from templates
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80864

Richard Biener changed:

           What    |Removed     |Added
           Keywords|            |ice-on-valid-code
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2017-05-24
     Ever confirmed|0           |1
      Known to fail|            |7.1.0
[Bug bootstrap/80867] [7 Regression] gnat bootstrap broken on powerpc64le-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80867

Richard Biener changed:

           What    |Removed |Added
   Target Milestone|---     |7.2
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

Richard Biener changed:

           What    |Removed     |Added
           Keywords|            |missed-optimization, openmp
             Status|UNCONFIRMED |WAITING
   Last reconfirmed|            |2017-05-24
     Ever confirmed|0           |1
[Bug c++/80856] [7/8 Regression] ICE from template local overload resolution
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80856

Richard Biener changed:

           What    |Removed |Added
           Priority|P3      |P2
   Target Milestone|---     |7.2
[Bug middle-end/80853] [6/7/8 Regression] OpenMP ICE in build_outer_var_ref with array reduction
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80853

Richard Biener changed:

           What    |Removed |Added
           Priority|P3      |P2
   Target Milestone|---     |6.4
[Bug middle-end/80823] [8 Regression] ICE: verify_flow_info failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80823

Peter Bergner changed:

           What    |Removed  |Added
             Status|RESOLVED |CLOSED

--- Comment #7 from Peter Bergner ---
Closing as fixed.
[Bug middle-end/80823] [8 Regression] ICE: verify_flow_info failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80823

Peter Bergner changed:

           What    |Removed |Added
             Status|NEW     |RESOLVED
                URL|        |https://gcc.gnu.org/ml/gcc-patches/2017-05/msg01791.html
         Resolution|---     |FIXED

--- Comment #6 from Peter Bergner ---
Fixed in revision r248408.
[Bug middle-end/80823] [8 Regression] ICE: verify_flow_info failed
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80823

--- Comment #5 from Peter Bergner ---
Author: bergner
Date: Wed May 24 12:10:54 2017
New Revision: 248408

URL: https://gcc.gnu.org/viewcvs?rev=248408&root=gcc&view=rev
Log:
gcc/
	PR middle-end/80823
	* tree-cfg.c (group_case_labels_stmt): Delete increment of "i";

gcc/testsuite/
	PR middle-end/80823
	* gcc.dg/pr80823.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr80823.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-cfg.c
[Bug target/80725] [7/8 Regression] s390x ICE on alsa-lib
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80725

--- Comment #5 from Andreas Krebbel ---
Author: krebbel
Date: Wed May 24 11:36:54 2017
New Revision: 248407

URL: https://gcc.gnu.org/viewcvs?rev=248407&root=gcc&view=rev
Log:
S/390: Fix PR80725.

gcc/ChangeLog:

2017-05-24  Andreas Krebbel

	PR target/80725
	* config/s390/s390.c (s390_check_qrst_address): Check incoming
	address against address_operand predicate.
	* config/s390/s390.md ("*indirect_jump"): Swap alternatives.

gcc/testsuite/ChangeLog:

2017-05-24  Andreas Krebbel

	* gcc.target/s390/pr80725.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/s390/pr80725.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/s390/s390.c
    trunk/gcc/config/s390/s390.md
    trunk/gcc/testsuite/ChangeLog
[Bug c++/80851] All versions that support C++11 are confused by combination of inherited constructors with member initializer that captures this
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80851

Richard Biener changed:

           What            |Removed     |Added
------------------------------------------------
           Keywords        |            |rejects-valid
             Status        |UNCONFIRMED |NEW
   Last reconfirmed        |            |2017-05-24
     Ever confirmed        |0           |1

--- Comment #2 from Richard Biener ---
Confirmed. clang accepts this.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

Richard Biener changed:

           What            |Removed                       |Added
------------------------------------------------------------------
             Status        |UNCONFIRMED                   |ASSIGNED
   Last reconfirmed        |                              |2017-05-24
                 CC        |                              |uros at gcc dot gnu.org
             Blocks        |                              |53947
           Assignee        |unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
     Ever confirmed        |0                             |1

--- Comment #1 from Richard Biener ---
So the vectorizer uses "whole vector shift" to do the final reduction:

  vect_sum_11.8_5 = VEC_PERM_EXPR ;
  vect_sum_11.8_20 = vect_sum_11.8_5 + vect_sum_11.6_6;
  vect_sum_11.8_19 = VEC_PERM_EXPR ;
  vect_sum_11.8_18 = vect_sum_11.8_19 + vect_sum_11.8_20;
  vect_sum_11.8_13 = VEC_PERM_EXPR ;
  vect_sum_11.8_26 = vect_sum_11.8_13 + vect_sum_11.8_18;
  stmp_sum_11.7_27 = BIT_FIELD_REF ;

I can see that for Zen this is bad (and eventually for avx256 in general,
because it crosses lanes). That is, it was supposed to end up using pslldq,
not the vperm + palign combos.

That said, the vectorizer could "easily" demote this to first add the two
halves and then continue with the reduction scheme. The GIMPLE
representation of this is BIT_FIELD_REFs, which I hope would end up being
expanded in a way the x86 backend can handle (hi/lo subregs?).

I'll see to handle this better in the vectorizer.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
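For context, the reduction epilogue above comes from an auto-vectorized integer sum; a minimal sketch of the kind of source loop that produces it (my own reconstruction, not the attached testcase) is:

```c
#include <stddef.h>

/* With -O3 -mavx2 GCC vectorizes this loop with a 256-bit accumulator;
   the final horizontal sum of that accumulator is the reduction
   epilogue discussed above.  Narrowing to 128 bits first
   (vextracti128 + paddd) avoids the cross-lane vperm sequence. */
int sum_array(const int *a, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

The scalar semantics are unaffected by how the backend sequences the horizontal reduction; only the epilogue instruction choice differs.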
[Bug libstdc++/71579] type_traits miss checks for type completeness in some traits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71579 --- Comment #6 from Antony Polukhin --- C++ LWG related issue: http://cplusplus.github.io/LWG/lwg-active.html#2797
[Bug c++/80873] New: ICE in tsubst_copy when trying to use an overloaded function without a definition in a lambda
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80873

Bug ID: 80873
Summary: ICE in tsubst_copy when trying to use an overloaded function without a definition in a lambda
Product: gcc
Version: 7.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: hafnermorris at gmail dot com
Target Milestone: ---

Created attachment 41411
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41411&action=edit
Minimal example code

The following invalid code causes an ICE:

struct S {};

auto overloaded(S &);
template <typename T> int overloaded(T &) { return 0; }

template <typename T> auto returns_lambda(T &param)
{
    return [&] { overloaded(param); };
}

int main()
{
    S s;
    returns_lambda(s);
}

On Wandbox: https://wandbox.org/permlink/bU36doHcn0MoXWrK

Only gcc versions 7.1 and up seem to be affected. No compiler flags are
required.
[Bug c++/79583] ICE (internal compiler error) upon instantiation of class template with `auto` template parameter containing inner class template
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79583 --- Comment #2 from Paolo Carlini --- The released 7.1.0, current gcc-7-branch and trunk are fine. I'm adding the testcase and closing the bug.
[Bug c/80868] "Duplicate const" warning emitted in `const typeof(foo) bar;`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80868

Marek Polacek changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |mpolacek at gcc dot gnu.org

--- Comment #2 from Marek Polacek ---
We're supposed to complain for

  const const int x;

and

  typedef const int t;
  const t x;

and I think we should thus also warn for this (-std=gnu89 -pedantic only):

  const int a;
  const __typeof(a) x;

because __typeof() doesn't strip outermost type qualifications. There were
discussions about adding __nonqual_typeof() but that hasn't been added yet.
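To illustrate the point being argued (my own sketch, not from the report): `__typeof__` carries the `const` qualifier along, so spelling `const` again in the declaration is redundant, just like the `const const` and typedef cases:

```c
/* Illustration only.  __typeof__(a) is already 'const int', so the
   extra 'const' below is redundant -- the case the comment argues
   should get the duplicate-const warning with -std=gnu89 -pedantic. */
const int a = 1;
const __typeof__(a) x = 2;   /* type is still const int */

int get_sum(void)
{
    /* "x = 3;" would be rejected here: x is const through
       __typeof__ alone, without the explicit qualifier. */
    return x + a;
}
```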
[Bug c++/68578] [5 Regression] ICE on invalid template declaration and instantiation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68578

Paolo Carlini changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |paolo.carlini at oracle dot com

--- Comment #6 from Paolo Carlini ---
I have just confirmed that we don't ICE in 6.x and 7.x. Frankly, at this
point it seems highly unlikely that we are going to fix this ICE-on-invalid
in the gcc-5-branch, thus I'm adding a testcase and in a few hours I will
resolve the bug.
[Bug c/80872] New: There is no warning on accidental infinite loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80872

Bug ID: 80872
Summary: There is no warning on accidental infinite loops
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: david at westcontrol dot com
Target Milestone: ---

Would it be possible to add warnings on accidental infinite loops, such as:

void foo(void)
{
    for (int i = 0; i <= 0x7fffffff; i++) {
        // ...
    }
}

The compiler (correctly) translates this to an infinite loop, in all
versions of gcc that I tested, in both C and C++, with optimisation
enabled. But no warning is given, even with -Wall -Wextra. That includes
the -Wtype-limits warning, which I thought should trigger here. Perhaps
the order of passes is such that the code is simplified to an infinite
loop before the type-limits checking is done?

Replacing "<= 0x7fffffff" with "< 0x80000000" triggers -Wsign-compare but
not -Wtype-limits, which is relevant because in C -Wsign-compare is in
-Wextra but not -Wall. Exceeding the limits of unsigned int in the literal
here correctly triggers -Wtype-limits.
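For reference, a loop that genuinely needs to visit every value up to its type's maximum can be written without the signed-overflow trap (my own sketch, not from the report) by testing before the increment:

```c
#include <limits.h>

/* Counts the iterations of an inclusive-bound loop.  Because the exit
   test runs before the increment, calling this with hi == INT_MAX is
   well defined, unlike "for (int i = 0; i <= INT_MAX; i++)", which
   increments past INT_MAX and becomes the accidental infinite loop
   described in the report. */
long count_inclusive(int lo, int hi)
{
    long n = 0;
    for (int i = lo; ; i++) {
        n++;
        if (i == hi)      /* exit before incrementing past hi */
            break;
    }
    return n;
}
```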
[Bug c++/80396] New builtin to make std::make_integer_sequence efficient and scalable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80396

Christophe Lyon changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |clyon at gcc dot gnu.org

--- Comment #3 from Christophe Lyon ---
Hi Jason,

One of the new tests (integer-pack2.C) fails on arm* targets. The log says:

Excess errors:
/testsuite/g++.dg/ext/integer-pack2.C:10:48: error: overflow in constant expression [-fpermissive]
/testsuite/g++.dg/ext/integer-pack2.C:10:48: error: overflow in constant expression [-fpermissive]
[Bug c++/80812] [8 Regression] ICE: in build_value_init_noctor, at cp/init.c:483
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80812

Ville Voutilainen changed:

           What            |Removed                       |Added
------------------------------------------------------------------
             Status        |NEW                           |ASSIGNED
                 CC        |                              |ville.voutilainen at gmail dot com
           Assignee        |unassigned at gcc dot gnu.org |ville.voutilainen at gmail dot com

--- Comment #3 from Ville Voutilainen ---
Mine.
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

--- Comment #2 from Jakub Jelinek ---
Also, even for host fallback there is a separate set of ICVs and many other
properties; the target region can't just be ignored, for many reasons, even
if there is no data sharing. Of course, if you provide small testcases, we
can discuss in detail what can and can't be optimized.
[Bug tree-optimization/80844] OpenMP SIMD doesn't know how to efficiently zero a vector (its stores zeros and reloads)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844 --- Comment #4 from Jakub Jelinek --- What we should do is first vectorize the main simd loop and then, once we've determined the vectorization factor thereof etc., see if there is any related preparation and finalization loop around it and try to vectorize those with the same vectorization factor.
[Bug bootstrap/80867] [7 Regression] gnat bootstrap broken on powerpc64le-linux-gnu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80867

Eric Botcazou changed:

           What            |Removed     |Added
------------------------------------------------
             Status        |UNCONFIRMED |WAITING
   Last reconfirmed        |            |2017-05-24
     Ever confirmed        |0           |1

--- Comment #1 from Eric Botcazou ---
Do you build only powerpc64le with "-gnatn -g -O3", or other platforms too?
If the latter, do they still build correctly at the same revision? I'm
trying to find out whether this was introduced by the only Ada patch in the
range:

2017-05-22  Eric Botcazou

	* gcc-interface/decl.c (gnat_to_gnu_entity): Skip regular
	processing for Itypes that are E_Access_Subtype.
	: Use the DECL of the base type directly.

--- Comment #2 from Eric Botcazou ---
Btw, can you post a backtrace?
[Bug c++/80859] Performance Problems with OpenMP 4.5 support
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80859

Jakub Jelinek changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |jakub at gcc dot gnu.org

--- Comment #1 from Jakub Jelinek ---
I don't see any attachments.

The target directives can't be ignored; some clauses have a significant
role even when doing host fallback (data sharing). Plus the question is
what you mean by "no offloading": a compiler configured without any
offloading support, a compiler configured with offloading support but with
offloading not selected, or a compiler with offloading support and
offloading code generated, but deciding at runtime that it has to use host
fallback? In the second case, e.g., the decision whether offloading will be
supported is made not at compile time but at link time, where one can
choose whether to emit offloading code and for which subset of the
configured offloading targets.
[Bug tree-optimization/78969] bogus snprintf truncation warning due to missing range info
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78969 --- Comment #8 from Jakub Jelinek --- idx_10 addition is a consequence of TODO_update_ssa in vrp1's todo_flags, triggered by jump threading creating the bb6.
[Bug tree-optimization/78969] bogus snprintf truncation warning due to missing range info
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78969

Jakub Jelinek changed:

           What            |Removed     |Added
------------------------------------------------
                 CC        |            |jakub at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek ---
This is because the PHI at that point is only created during CFG changes at
the end of the VRP1 pass. After creating ASSERT_EXPRs, we still have:

  [1.00%]:
  goto ; [100.00%]

  [99.00%]:
  # RANGE [0, 1000] NONZERO 1023
  idx_7 = ASSERT_EXPR ;
  __builtin_snprintf (p_4(D), 4, "%d", idx_7);
  # RANGE [1, 1000] NONZERO 1023
  idx_6 = idx_7 + 1;

  [100.00%]:
  # RANGE [0, 1000] NONZERO 1023
  # idx_1 = PHI <0(2), idx_6(3)>
  if (idx_1 <= 999)
    goto ; [99.00%]
  else
    goto ; [1.00%]

  [1.00%]:
  return;

Then VRP1 correctly determines:

  idx_1: [0, 1000]
  .MEM_2: VARYING
  idx_6: [1, 1000]
  idx_7: [0, 999]

  EQUIVALENCES: { idx_1 } (1 elements)

Then the ASSERT_EXPRs are removed, which means that wherever idx_7 was used
we now use idx_1, and finally the loop is changed and idx_8 and idx_10 are
created:

  [1.00%]:
  goto ; [100.00%]

  [99.00%]:
  # RANGE [0, 1000] NONZERO 1023
  # idx_10 = PHI
  __builtin_snprintf (p_4(D), 4, "%d", idx_10);
  # RANGE [1, 1000] NONZERO 1023
  idx_6 = idx_10 + 1;

  [99.00%]:
  # RANGE [0, 1000] NONZERO 1023
  # idx_1 = PHI
  if (idx_1 != 1000)
    goto ; [98.99%]
  else
    goto ; [1.01%]

  [1.00%]:
  return;

  [1.00%]:
  # RANGE [0, 1000] NONZERO 1023
  # idx_8 = PHI <0(2)>
  goto ; [100.00%]

But there is no obvious connection between idx_10 (which indeed could
nicely hold the [0, 999] range) and idx_7 (the already removed
ASSERT_EXPR); similarly, idx_8 could very well have RANGE [0, 0] but
doesn't (though in that case it isn't that big a deal, because it is going
to be removed immediately afterwards).
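The IL above appears to come from a loop of roughly this shape (a guess reconstructed from the dump, not copied from the PR):

```c
#include <stdio.h>

/* Hypothetical reconstruction: idx stays in [0, 999], so "%d" prints
   at most 3 digits and the 4-byte destination is never truncated.
   The truncation warning the PR calls bogus fires anyway, because the
   [0, 999] range on the snprintf argument is lost when the
   ASSERT_EXPRs are removed and the loop is restructured, as the
   comment explains. */
void fill(char *p)
{
    for (int idx = 0; idx < 1000; idx++)
        snprintf(p, 4, "%d", idx);
}
```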