[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #26 from GCC Commits --- The master branch has been updated by Jakub Jelinek : https://gcc.gnu.org/g:8fc5593df8e0d36cc5bd8ea21097a491a634a866 commit r14-9639-g8fc5593df8e0d36cc5bd8ea21097a491a634a866 Author: Jakub Jelinek Date: Sat Mar 23 11:17:44 2024 +0100 predcom: Punt for steps which aren't multiples of access size [PR111683] On the following testcases, there is no overlap between data references within a single iteration, but the data references have size which is twice as large as the step, which means the data references overlap with the next iteration which predcom doesn't take into account. As discussed in the PR, even if the reference size is smaller than step, if step isn't a multiple of the reference size, there could be overlaps with some other iteration later on. The initial version of the patch regressed (test still passed, but predcom didn't optimize anymore) pr71083.c which has a packed char, short structure and was reading/writing the short 2 bytes in there with step 3. The following patch deals with that by retrying for COMPONENT_REFs also the aggregate sizes etc., so that it then compares 3 bytes against step 3. In make check-gcc/check-g++ this patch I believe affects code generation for only the 2 new testcases according to statistics I've gathered. 2024-03-23 Jakub Jelinek PR middle-end/111683 * tree-predcom.cc (pcom_worker::suitable_component_p): If has_write and comp_step is RS_NONZERO, return false if any reference in the component doesn't have DR_STEP a multiple of access size. * gcc.dg/pr111683-1.c: New test. * gcc.dg/pr111683-2.c: New test.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #25 from rguenther at suse dot de --- On Fri, 22 Mar 2024, jakub at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 > > --- Comment #24 from Jakub Jelinek --- > (In reply to Richard Biener from comment #23) > > It looks like this could go to suitable_reference_p instead? > > You mean return false for those making them not suitable at all? > I thought without a write such references would act more like RS_ANY, but > a reason I didn't just treat such references as RS_ANY rather than RS_NONZERO > in suitable_reference_p was because of the assert that all refs in a component > have the same ref_step_type but nothing actually comparing it before the > assertion. Hmm, true. > But if you think I should just return false from suitable_reference_p if the > step isn't a multiple of the sizes, I can surely try that. > > > That said, I do wonder why with my patch split_data_refs_to_components > > doesn't fixup. I think it's supposed to handle the case where > > dependences are unknown conservatively... > > I'm afraid I'm not familiar with data refs enough to know what was going on. I tried to follow what happens there and I'm also somewhat lost. Anyway, I think fixing this in predcom in a convenient place even if it might be not the true fix is OK. You might want to put a comment before any such fix indicating there might be more latent issues in predcom or dependence analysis in general.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #24 from Jakub Jelinek --- (In reply to Richard Biener from comment #23) > It looks like this could go to suitable_reference_p instead? You mean return false for those making them not suitable at all? I thought without a write such references would act more like RS_ANY, but a reason I didn't just treat such references as RS_ANY rather than RS_NONZERO in suitable_reference_p was because of the assert that all refs in a component have the same ref_step_type but nothing actually comparing it before the assertion. But if you think I should just return false from suitable_reference_p if the step isn't a multiple of the sizes, I can surely try that. > That said, I do wonder why with my patch split_data_refs_to_components > doesn't fixup. I think it's supposed to handle the case where > dependences are unknown conservatively... I'm afraid I'm not familiar with data refs enough to know what was going on.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #23 from Richard Biener --- It looks like this could go to suitable_reference_p instead? That said, I do wonder why with my patch split_data_refs_to_components doesn't fixup. I think it's supposed to handle the case where dependences are unknown conservatively...
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Jakub Jelinek changed: What|Removed |Added Attachment #57768|0 |1 is obsolete|| --- Comment #22 from Jakub Jelinek --- Created attachment 57769 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57769=edit gcc14-pr111683.patch Actually for bit-fields, TREE_TYPE on the ref is the underlying type, so either we'd need to somehow better try to understand the actual reference size in that case, or simply look throught the BIT_FIELD_REF COMPONENT_REFs.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #21 from Jakub Jelinek --- Created attachment 57768 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57768=edit gcc14-pr111683.patch Updated patch which uses wi::multiple_of_p but doesn't regress pr71083.c. To be precise, it actually wasn't about foo with the bitfield, that used to be optimized before/after, but bar which uses packed struct with char and short members, step was 3 but the short access had size 2. But given that the 2 sized access is a COMPONENT_REF of the 3 byte sized struct, if step is a multiple of 3, that is fine as well.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #20 from Jakub Jelinek --- That is true. I've been also wondering whether e.g. for the pr71083.c case we couldn't just look through all COMPONENT_REFs of the DR_REF (and maybe ARRAY_REFs with constant indexes) and check type size of the aggregate it is in against the size. That could perhaps fix the predcom of pr71083.c, because it is a 24-bit bitfield in 4 byte structure. Will you take this over, or should I tweak my patch? Note, statistics I've gathered with the above posted patch was that the patch did something in /home/jakub/src/gcc/gcc/testsuite/c-c++-common/torture/pr101636.c bar /home/jakub/src/gcc/gcc/testsuite/gcc.c-torture/compile/pr32399.c f /home/jakub/src/gcc/gcc/testsuite/gcc.dg/autopar/pr69109-2.c f /home/jakub/src/gcc/gcc/testsuite/gcc.dg/graphite/pr71575-2.c fn1 /home/jakub/src/gcc/gcc/testsuite/gcc.dg/loop-versioning-6.c f1 /home/jakub/src/gcc/gcc/testsuite/gcc.dg/pr111683-1.c main /home/jakub/src/gcc/gcc/testsuite/gcc.dg/pr111683-2.c main but only changed the generated code in the last 2 (those are intentional). The older patch with wi::multiple_of_p affected /home/jakub/src/gcc/gcc/testsuite/c-c++-common/torture/pr101636.c bar /home/jakub/src/gcc/gcc/testsuite/gcc.c-torture/compile/pr32399.c f /home/jakub/src/gcc/gcc/testsuite/gcc.c-torture/execute/pr71083.c bar /home/jakub/src/gcc/gcc/testsuite/gcc.dg/autopar/pr69109-2.c f /home/jakub/src/gcc/gcc/testsuite/gcc.dg/graphite/pr71575-2.c fn1 /home/jakub/src/gcc/gcc/testsuite/gcc.dg/loop-versioning-6.c f1 /home/jakub/src/gcc/gcc/testsuite/gcc.dg/pr111683-1.c main /home/jakub/src/gcc/gcc/testsuite/gcc.dg/pr111683-2.c main /home/jakub/src/gcc/gcc/testsuite/gcc.dg/torture/pr112736.c fn1 /home/jakub/src/gcc/gcc/testsuite/gcc.dg/torture/pr68379.c fn1 /home/jakub/src/gcc/gcc/testsuite/gcc.dg/tree-ssa/pr87022.c main /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr101445.c foo /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr68502-2.c reset_nodes /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/pr81740-2.c main /home/jakub/src/gcc/gcc/testsuite/gcc.dg/vect/vect-avg-16.c f /home/jakub/src/gcc/gcc/testsuite/g++.dg/vect/pr95297.cc test and changed code generation of the 2 expected tests and pr71083.c. All this statistics was solely from make check-gcc check-g++.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #19 from Richard Biener --- But note that {0, +, 3 } and {2, + , 3} with size 2 will still eventually overlap. See dr_analyze_indices where we attempt to make DR_INIT a multiple of the element size (probably also not a perfect solution). In the end representing the access function for the pointer with byte granularity is a mistake and it should be translated to be based on the access size. That of course makes a byte step of 3 with an access size of 2 not representable (but it avoids these kind of issues). The other possibility is to filter the remaining problematic cases during dependence analysis itself where we see both DR_INITs. As said the index based dependence analysis doesn't know about such thing as partial element overlaps. For the PR at hand your patch looks reasonable, but I think there's a problem in how we handle the pointer-based access functions.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #18 from Jakub Jelinek --- Using wi::multiple_of_p is what I've tried first but that regressed the generated code for pr71083.c (the test still PASSed, but predcom no longer triggered on it). I think it has access size 3 and step 4, so step is not a multiple of access size, but that is not a problem, all we actually care about is just if access size isn't bigger than step.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #17 from Jakub Jelinek --- Created attachment 57767 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57767=edit gcc14-pr111683.patch Patch I've tested overnight for this.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #16 from Richard Biener --- Created attachment 57766 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57766=edit patch Oddly enough the following patch doesn't fix it but it correctly prevents us from recording the access functions and data dependence analysis now says (Data Dep: #(Data Ref: # bb: 3 # stmt: _2 = MEM[(U * {ref-all})_1]; # ref: MEM[(U * {ref-all})_1]; # base_object: MEM[(U * {ref-all})_1]; #) #(Data Ref: # bb: 3 # stmt: _7 = MEM[(U * {ref-all})_6]; # ref: MEM[(U * {ref-all})_6]; # base_object: MEM[(U * {ref-all})_6]; #) (don't know) but predictive commoning still does Store-loads chain 0x48229b0 max distance 2, may reuse first inits MEM[(U * {ref-all})] MEM[(U * {ref-all}) + 8B] references: MEM[(U * {ref-all})_6] (id 3, write) offset -2 distance 0 MEM[(U * {ref-all})_1] (id 0) offset 0 distance 2 Executing predictive commoning without unrolling. predcom has its own analysis it seems and it makes a similar mistake.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Richard Biener changed: What|Removed |Added CC||rsandifo at gcc dot gnu.org --- Comment #15 from Richard Biener --- the problem is in dependence analysis itself - by design classic dependence analysis relies on a[i] and a[i+1] not aliasing but for DRs like Creating dr for MEM[(U * {ref-all})_1] analyze_innermost: success. base_address: offset from base address: 0 constant offset from base address: 0 step: 8 base alignment: 32 base misalignment: 0 offset alignment: 128 step alignment: 8 base_object: MEM[(U * {ref-all})] Access function 0: {0B, +, 8}_1 The access function evolves in a way that this doesn't hold. It might be tempting to fail data-ref analysis itself but it's really subscript dependence testing that doesn't work. OTOH we already have code in dr_analyze_indices to deal with similar cases concerning overlap by adjusted base offset. Note in theory we handle non-constant increments, so we eventually would have to ditch support for those at all.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Jakub Jelinek changed: What|Removed |Added CC||rguenth at gcc dot gnu.org --- Comment #14 from Jakub Jelinek --- It is indeed pcom. Slightly better testcase which doesn't use double arithmetics but long long: long long b[6] = { 3, 4, 5, 6, 7, 8 }; long long c[16]; typedef long long U __attribute__ ((vector_size(16), may_alias, aligned(1))); typedef long long V __attribute__ ((vector_size(16), may_alias)); int main () { for (int f = 0; f < 6; f++) { *(U *) [f] = *(U *) [f] + (V) { b[f], b[f] }; *(U *) [f + 2] = *(U *) [f + 2] + (V) { b[f], b[f] }; } if (c[1] != 7) __builtin_abort (); return 0; } And it doesn't need to be about vector types either, nor use any aligned(1) loads/stores in the source: int b[6] = { 3, 4, 5, 6, 7, 8 }, c[12]; int d[16] = { 0, 1, 3, 6, 10, 14, 12, 9, 5, 0, 0, 0 }; int main () { int i; if (sizeof (int) * 2 != sizeof (long long)) return 0; for (i = 0; i < 6; i++) { long long a; __builtin_memcpy (, [i], sizeof (a)); a += (((long long) i) << (sizeof (int) * __CHAR_BIT__)) + i; __builtin_memcpy ([i], , sizeof (a)); __builtin_memcpy (, [i + 2], sizeof (a)); a += (((long long) i) << (sizeof (int) * __CHAR_BIT__)) + i; __builtin_memcpy ([i + 2], , sizeof (a)); } if (__builtin_memcmp ([0], [0], sizeof (c))) __builtin_abort (); return 0; } On the last testcase the *.pcom change against previous dump is: int main () { + unsigned long D__lsm1.7; + unsigned long D__lsm0.6; long long int a; int i; int * _1; @@ -21,18 +31,24 @@ int main () unsigned long _20; sizetype _25; sizetype _28; + unsigned long _33; + unsigned long _37; unsigned int ivtmp_38; unsigned int ivtmp_39; [local count: 153437704]: + _37 = MEM [(char * {ref-all})]; + _33 = MEM [(char * {ref-all}) + 4B]; [local count: 920304121]: # i_26 = PHI # ivtmp_39 = PHI + # D__lsm0.6_34 = PHI + # D__lsm1.7_35 = PHI _28 = (sizetype) i_26; _9 = _28 * 4; _1 = + _9; - _12 = MEM [(char * {ref-all})_1]; + _12 = D__lsm0.6_34; a_13 = (long long int) _12; _19 = _28 * 4294967297; _2 = (long long int) _19; @@ -48,6 +64,7 @@ int main () _6 = _2 + a_18; _20 = (unsigned long) _6; MEM [(char * {ref-all})_5] = _20; + D__lsm0.6_36 = _20; i_23 = i_26 + 1; ivtmp_38 = ivtmp_39 - 1; if (ivtmp_38 != 0) The data refs in the loop have access fns {0B, +, 4}_1 and {8B, +, 4}_1, so within the same iteration there is no overlap between the two, but because they are actually 8 byte loads/stores, there is partial overlap between the adjacent iterations in them that pcom doesn't take into account. So, shall we somewhere punt for references with larger access sizes than their DR_STEP? Something else?
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #13 from Andrew Pinski --- (In reply to Jakub Jelinek from comment #12) > Further simplified without any headers, -O1 works, -O2 fails. For this testcase (at -O3), GCC 4.7.4 works but 4.8.1 fails.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #12 from Jakub Jelinek --- Further simplified without any headers, -O1 works, -O2 fails. double a[2] = { 0.8147, 0.9058 }; double b[6] = { 0.1576, 0.9706, 1, 1, 1, 1 }; double c[16]; typedef double U __attribute__ ((vector_size(16), may_alias, aligned(1))); typedef double V __attribute__ ((vector_size(16), may_alias)); int main () { for (int f = 0; f < 6; f++) { *(U *) [f] = *(U *) [f] + (V) { b[f], b[f] } * *(U *) a; *(U *) [f + 2] = *(U *) [f + 2] + (V) { b[f], b[f] } * *(U *) a; } if (c[1] <= 0.93 || c[1] >= 0.94) __builtin_abort (); return 0; }
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Jakub Jelinek changed: What|Removed |Added Priority|P1 |P2 --- Comment #11 from Jakub Jelinek --- GCC 7-13 has been released with this bug, so P2.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #10 from Andrew Pinski --- Created attachment 57678 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57678=edit testcase that fails all the way back to GCC 4.8.x This one fails until GCC 4.8.1, works in GCC 4.7.4 but that is because the outer loop is unrolled in GCC 4.7.4.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #9 from Andrew Pinski --- (In reply to Andrew Pinski from comment #8) > Note with the reduced testcase in comment #7, -fno-tree-vrp has no effect on > it any more so it might make sense to do another bisection on that testcase. I am suspecting r7-104-gfc9cf6da84a899 exposed the issue now. in GCC 6, _mm_storeu_pd/_mm_loadu_pd were still builtins, but in GCC 7, they become directly unaligned loads (starting with r7-104-gfc9cf6da84a899 ). I suspect if manually change _mm_storeu_pd/_mm_loadu_pd to be similar to what GCC 7 had, then we might even up with this failing before GCC 7.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Andrew Pinski changed: What|Removed |Added Keywords|needs-reduction | --- Comment #8 from Andrew Pinski --- Note with the reduced testcase in comment #7, -fno-tree-vrp has no effect on it any more so it might make sense to do another bisection on that testcase.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #7 from Andrew Pinski --- Created attachment 57677 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57677=edit Slightly more reduced Removes the extra induction variables and manually unroll the inner loop. What I am seeing is the first store is being removed.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #6 from Sam James --- Created attachment 57660 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57660=edit reduced.c Bit smaller.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3 since r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 --- Comment #5 from kugan at gcc dot gnu.org --- -O3 -fno-tree-vectorize and -O3 -fno-tree-vrp works. I looked at the ever dump and it is not doing anything suspicious. Looks like range_info usage in vectoriser is causing the problem.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Jeffrey A. Law changed: What|Removed |Added CC||law at gcc dot gnu.org Priority|P3 |P1
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org Keywords|needs-bisection | --- Comment #4 from Jakub Jelinek --- Started with r7-3163-g973625a04b3d9351f2485e37f7d3382af2aed87e Slightly reduced: #include void convn_script(const double in9[10], const double in10[7], double out5[16]) { int iB = 0; int iC = 0; __builtin_memset([0], 0, 16U * sizeof(double)); for (int i = 0; i < 7; i++) { int b_i; int vectorUB; if (i + 10 <= 15) b_i = 9; else b_i = 15 - i; vectorUB = (((b_i + 1) / 2) << 1) - 2; for (int r = 0; r <= vectorUB; r += 2) { __m128d b_r; b_i = iC + r; b_r = _mm_loadu_pd([b_i]); _mm_storeu_pd([b_i], _mm_add_pd(b_r, _mm_mul_pd(_mm_set1_pd(in10[iB]), _mm_loadu_pd([r]; } iC = iB + 1; iB++; } } int main() { double in9[10] = {0.8147, 0.9058, 0.1270, 0.9134, 0.6324, 0.0975, 0.2785, 0.5469, 0.9575, 0.9649}; double in10[7] = {0.1576, 0.9706, 0.9572, 0.4854, 0.8003, 0.1419, 0.4218}; double out5[16]; convn_script(in9, in10, out5); for(int i = 0; i < 16; i++) __builtin_printf ("%d %f\n", i, out5[i]); }
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #3 from Alexander Monakov --- This is predcom. It's easier to see what's going wrong with #pragma GCC unroll 99 added to the innermost loop so it's unrolled at -O2 and comparing g++ -O2 -fpredictive-commoning --param=max-unroll-times=1 vs. plain g++ -O2 (or diffing 'pcom' dump against the preceding pass).
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Richard Biener changed: What|Removed |Added Known to fail||13.2.1 --- Comment #2 from Richard Biener --- Just to say, -fno-strict-aliasing -fstack-reuse=none doesn't help. convn_script can be made 'noipa' and still reproduces the failure. -fno-tree-vectorize changes code and result but still differs from -O2.
[Bug middle-end/111683] [11/12/13/14 Regression] Incorrect answer when using SSE2 intrinsics with -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111683 Andrew Pinski changed: What|Removed |Added Ever confirmed|0 |1 Known to fail||7.1.0 Keywords||needs-bisection, ||needs-reduction Last reconfirmed||2023-10-03 Target Milestone|--- |11.5 Summary|Incorrect answer when using |[11/12/13/14 Regression] |SSE2 intrinsics with -O3|Incorrect answer when using ||SSE2 intrinsics with -O3 Status|UNCONFIRMED |NEW Known to work||6.1.0, 6.4.0 --- Comment #1 from Andrew Pinski --- Confirmed. Started between GCC 6 and GCC 7. Looks unrolling related too.