Re: [PATCH] PR88497 - Extend reassoc for vector bit_field_ref
On 2019/3/11 6:24 PM, Richard Biener wrote:
> On Mon, 11 Mar 2019, Kewen.Lin wrote:
>> on 2019/3/8 6:57 PM, Richard Biener wrote:
>>> On Fri, Mar 8, 2019 at 2:40 AM Kewen.Lin wrote:
>>>> Hi,
>>>>
>>>> As PR88497 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497),
>>>> when we meet some code pattern like:
>>>>
>>>>    V1[0] + V1[1] + ... + V1[k] + V2[0] + ... + V2[k] + ... Vn[k]
>>>>    // V1...Vn of VECTOR_TYPE
>>>>
>>>> We can teach reassoc to transform it to:
>>>>
>>>>    Vs = (V1 + V2 + ... + Vn)
>>>>    Vs[0] + Vs[1] + ... + Vs[k]
>>>>
>>>> It saves addition and bit_field_ref operations and exposes more
>>>> opportunities for downstream passes.  I noticed that even if a
>>>> target doesn't support the vector type, so that it gets expanded
>>>> in veclower, it's still better to have this transform, since the
>>>> generated sequence is more friendly for widening_mul.  (With one
>>>> more DCE run after forwprop, it should be the same.)
>>>>
>>>> Bootstrapped/regtested on powerpc64le-linux-gnu, ok for trunk?
>>>
>>> Hmm.  You are basically vectorizing the reduction partially here
>>> (note basic-block vectorization doesn't handle reductions yet).
>>> So I'm not sure whether we should eventually just change op sort
>>> order to sort v1[1] after v2[0], to sort v1[0] and v2[0] together.
>>> That would maybe also be an intermediate step that makes
>>> implementing the vectorization part "easier"?  I also imagine that
>>> the "vectorization" would be profitable even if there's just two
>>> elements of the vectors summed and the real vectors are larger.
>>
>> Hi Richard,
>>
>> Sorry, I have no idea about basic-block vectorization (SLP?), did
>> you suggest supporting this there rather than in reassoc?  Changing
>> op sort order for its expected pattern?
>
> I was thinking you're smashing two things together here - one part
> suitable for reassociation and one that's on the border.  The
> reassoc pass has already gathered quite some stuff that doesn't
> really belong there; we should at least think twice before adding
> to that.

Got it!  Since the discussion context of the PR is mainly about
reassociation, I started from there. :)

>>> Note that the code you transform contains no vector operations
>>> (just element extracts) and thus you need to check for target
>>> support before creating the vector add.
>>
>> I had the same concern before.  But I thought that if this VECTOR
>> type is valid on the target, the addition operation should be
>> fundamental on the target, then it's OK; if it's invalid, then
>> veclower will expand it into a scalar type, as well as the addition
>> operation.  So it looks fine to guard it.  Does it make sense?
>
> But there's a reassoc phase after veclower.  Generally we avoid
> generating vector code not supported by the target.  You are not
> checking whether the vector type is valid on the target either, btw.
> The user might have written an optimal scalar sequence for a
> non-natively supported vector type (and veclower wouldn't touch the
> original scalar code), and veclower generally makes a mess of
> things, likely worse than the original code.

Thanks for your explanation!  I'll update the code to check whether
the vector type is supported by the target or not.

>>> This is for GCC 10.  Also it doesn't fix PR88497, does it?
>>
>> Yes, it's low priority and for GCC 10.  I think it does.  Did I
>> miss something?
>
> Wasn't the original testcase in the PR for _vectorized_ code?  The
> PR then got an additional testcase which you indeed fix.

The original case is actually optimal with -Ofast, but the additional
case isn't. :)

>> For the comment 1 and comment 7 cases, trunk gcc generates the
>> expected code with -Ofast.  There aren't any issues with the
>> original loop.  But for the expanded code in comment 1 (the
>> manually expanded case), gcc can't generate better code with
>> multiply-add.
>>
>> From the comment 9 case (same as the comment 1 expanded version):
>>
>> without this patch, optimized tree and final asm:
>>
>>   [local count: 1073741824]:
>>   _1 = *arg2_26(D);
>>   _2 = *arg3_27(D);
>>   _3 = _1 * _2;
>>   _4 = BIT_FIELD_REF <_3, 64, 0>;
>>   _5 = BIT_FIELD_REF <_3, 64, 64>;
>>   _18 = _4 + _5;
>>   _7 = MEM[(__vector double *)arg2_26(D) + 16B];
>>   _8 = MEM[(__vector double *)arg3_27(D) + 16B];
>>   _9 = _7 * _8;
>>   _10 = BIT_FIELD_REF <_9, 64, 0>;
>>   _11 = BIT_FIELD_REF <_9, 64, 64>;
>>   _31 = _10 + _11;
>>   _29 = _18 + _31;
>>   _13 = MEM[(__vector double *)arg2_26(D) + 32B];
>>   _14 = MEM[(__vector double *)arg3_27(D) + 32B];
>>   _15 = _13 * _14;
>>   _16 = BIT_FIELD_REF <_15, 64, 0>;
>>   _17 = BIT_FIELD_REF <_15, 64, 64>;
>>   _6 = _16 + _17;
>>   _12 = _6 + accumulator_28(D);
>>   _37 = _12 + _29;
>>   _19 = MEM[(__vector double *)arg2_26(D) + 48B];
>>   _20 = MEM[(__vector double *)arg3_27(D) + 48B];
>>   _21 = _19 * _20;
>>   _22 = BIT_FIELD_REF <_21, 64, 0>;
>>   _23 = BIT_FIELD_REF <_21, 64, 64>;
>>   _24 = _22 + _23;
>>   accumulator_32 = _24 + _37;
Re: [PATCH] PR88497 - Extend reassoc for vector bit_field_ref
On Mon, 11 Mar 2019, Kewen.Lin wrote:
> on 2019/3/8 6:57 PM, Richard Biener wrote:
>> On Fri, Mar 8, 2019 at 2:40 AM Kewen.Lin wrote:
>>>
>>> Hi,
>>>
>>> As PR88497 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497),
>>> when we meet some code pattern like:
>>>
>>>    V1[0] + V1[1] + ... + V1[k] + V2[0] + ... + V2[k] + ... Vn[k]
>>>    // V1...Vn of VECTOR_TYPE
>>>
>>> We can teach reassoc to transform it to:
>>>
>>>    Vs = (V1 + V2 + ... + Vn)
>>>    Vs[0] + Vs[1] + ... + Vs[k]
>>>
>>> It saves addition and bit_field_ref operations and exposes more
>>> opportunities for downstream passes.  I noticed that even if a
>>> target doesn't support the vector type, so that it gets expanded
>>> in veclower, it's still better to have this transform, since the
>>> generated sequence is more friendly for widening_mul.  (With one
>>> more DCE run after forwprop, it should be the same.)
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu, ok for trunk?
>>
>> Hmm.  You are basically vectorizing the reduction partially here
>> (note basic-block vectorization doesn't handle reductions yet).
>> So I'm not sure whether we should eventually just change op sort
>> order to sort v1[1] after v2[0], to sort v1[0] and v2[0] together.
>> That would maybe also be an intermediate step that makes
>> implementing the vectorization part "easier"?  I also imagine that
>> the "vectorization" would be profitable even if there's just two
>> elements of the vectors summed and the real vectors are larger.
>
> Hi Richard,
>
> Sorry, I have no idea about basic-block vectorization (SLP?), did
> you suggest supporting this there rather than in reassoc?  Changing
> op sort order for its expected pattern?

I was thinking you're smashing two things together here - one part
suitable for reassociation and one that's on the border.  The reassoc
pass has already gathered quite some stuff that doesn't really belong
there; we should at least think twice before adding to that.

>> Note that the code you transform contains no vector operations
>> (just element extracts) and thus you need to check for target
>> support before creating the vector add.
>
> I had the same concern before.  But I thought that if this VECTOR
> type is valid on the target, the addition operation should be
> fundamental on the target, then it's OK; if it's invalid, then
> veclower will expand it into a scalar type, as well as the addition
> operation.  So it looks fine to guard it.  Does it make sense?

But there's a reassoc phase after veclower.  Generally we avoid
generating vector code not supported by the target.  You are not
checking whether the vector type is valid on the target either, btw.
The user might have written an optimal scalar sequence for a
non-natively supported vector type (and veclower wouldn't touch the
original scalar code), and veclower generally makes a mess of things,
likely worse than the original code.

>> This is for GCC 10.  Also it doesn't fix PR88497, does it?
>
> Yes, it's low priority and for GCC 10.  I think it does.  Did I
> miss something?

Wasn't the original testcase in the PR for _vectorized_ code?  The PR
then got an additional testcase which you indeed fix.

> For the comment 1 and comment 7 cases, trunk gcc generates the
> expected code with -Ofast.  There aren't any issues with the
> original loop.  But for the expanded code in comment 1 (the
> manually expanded case), gcc can't generate better code with
> multiply-add.
>
> From the comment 9 case (same as the comment 1 expanded version):
>
> without this patch, optimized tree and final asm:
>
>   [local count: 1073741824]:
>   _1 = *arg2_26(D);
>   _2 = *arg3_27(D);
>   _3 = _1 * _2;
>   _4 = BIT_FIELD_REF <_3, 64, 0>;
>   _5 = BIT_FIELD_REF <_3, 64, 64>;
>   _18 = _4 + _5;
>   _7 = MEM[(__vector double *)arg2_26(D) + 16B];
>   _8 = MEM[(__vector double *)arg3_27(D) + 16B];
>   _9 = _7 * _8;
>   _10 = BIT_FIELD_REF <_9, 64, 0>;
>   _11 = BIT_FIELD_REF <_9, 64, 64>;
>   _31 = _10 + _11;
>   _29 = _18 + _31;
>   _13 = MEM[(__vector double *)arg2_26(D) + 32B];
>   _14 = MEM[(__vector double *)arg3_27(D) + 32B];
>   _15 = _13 * _14;
>   _16 = BIT_FIELD_REF <_15, 64, 0>;
>   _17 = BIT_FIELD_REF <_15, 64, 64>;
>   _6 = _16 + _17;
>   _12 = _6 + accumulator_28(D);
>   _37 = _12 + _29;
>   _19 = MEM[(__vector double *)arg2_26(D) + 48B];
>   _20 = MEM[(__vector double *)arg3_27(D) + 48B];
>   _21 = _19 * _20;
>   _22 = BIT_FIELD_REF <_21, 64, 0>;
>   _23 = BIT_FIELD_REF <_21, 64, 64>;
>   _24 = _22 + _23;
>   accumulator_32 = _24 + _37;
>   return accumulator_32;
>
>   :
>    0:   01 00 e5 f4     lxv     vs7,0(r5)
>    4:   11 00 05 f4     lxv     vs0,16(r5)
>    8:   21 00 24 f5     lxv     vs9,32(r4)
>    c:   21 00 c5 f4     lxv     vs6,32(r5)
>   10:   01 00 44 f5     lxv     vs10,0(r4)
>   14:   11 00 64 f5     lxv     vs11,16(r4)
>   18:   31 00 05 f5     lxv     vs8,48(r5)
>   1c:   31 00 84 f5     lxv     vs12,48(r4)
>   20:   80 33 29
Re: [PATCH] PR88497 - Extend reassoc for vector bit_field_ref
on 2019/3/8 6:57 PM, Richard Biener wrote:
> On Fri, Mar 8, 2019 at 2:40 AM Kewen.Lin wrote:
>>
>> Hi,
>>
>> As PR88497 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497),
>> when we meet some code pattern like:
>>
>>    V1[0] + V1[1] + ... + V1[k] + V2[0] + ... + V2[k] + ... Vn[k]
>>    // V1...Vn of VECTOR_TYPE
>>
>> We can teach reassoc to transform it to:
>>
>>    Vs = (V1 + V2 + ... + Vn)
>>    Vs[0] + Vs[1] + ... + Vs[k]
>>
>> It saves addition and bit_field_ref operations and exposes more
>> opportunities for downstream passes.  I noticed that even if a
>> target doesn't support the vector type, so that it gets expanded
>> in veclower, it's still better to have this transform, since the
>> generated sequence is more friendly for widening_mul.  (With one
>> more DCE run after forwprop, it should be the same.)
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu, ok for trunk?
>
> Hmm.  You are basically vectorizing the reduction partially here
> (note basic-block vectorization doesn't handle reductions yet).  So
> I'm not sure whether we should eventually just change op sort order
> to sort v1[1] after v2[0], to sort v1[0] and v2[0] together.  That
> would maybe also be an intermediate step that makes implementing the
> vectorization part "easier"?  I also imagine that the
> "vectorization" would be profitable even if there's just two
> elements of the vectors summed and the real vectors are larger.

Hi Richard,

Sorry, I have no idea about basic-block vectorization (SLP?), did you
suggest supporting this there rather than in reassoc?  Changing op
sort order for its expected pattern?

> Note that the code you transform contains no vector operations (just
> element extracts) and thus you need to check for target support
> before creating the vector add.

I had the same concern before.  But I thought that if this VECTOR type
is valid on the target, the addition operation should be fundamental
on the target, then it's OK; if it's invalid, then veclower will
expand it into a scalar type, as well as the addition operation.  So
it looks fine to guard it.  Does it make sense?

> This is for GCC 10.  Also it doesn't fix PR88497, does it?

Yes, it's low priority and for GCC 10.  I think it does.  Did I miss
something?

For the comment 1 and comment 7 cases, trunk gcc generates the
expected code with -Ofast.  There aren't any issues with the original
loop.  But for the expanded code in comment 1 (the manually expanded
case), gcc can't generate better code with multiply-add.

From the comment 9 case (same as the comment 1 expanded version):

without this patch, optimized tree and final asm:

  [local count: 1073741824]:
  _1 = *arg2_26(D);
  _2 = *arg3_27(D);
  _3 = _1 * _2;
  _4 = BIT_FIELD_REF <_3, 64, 0>;
  _5 = BIT_FIELD_REF <_3, 64, 64>;
  _18 = _4 + _5;
  _7 = MEM[(__vector double *)arg2_26(D) + 16B];
  _8 = MEM[(__vector double *)arg3_27(D) + 16B];
  _9 = _7 * _8;
  _10 = BIT_FIELD_REF <_9, 64, 0>;
  _11 = BIT_FIELD_REF <_9, 64, 64>;
  _31 = _10 + _11;
  _29 = _18 + _31;
  _13 = MEM[(__vector double *)arg2_26(D) + 32B];
  _14 = MEM[(__vector double *)arg3_27(D) + 32B];
  _15 = _13 * _14;
  _16 = BIT_FIELD_REF <_15, 64, 0>;
  _17 = BIT_FIELD_REF <_15, 64, 64>;
  _6 = _16 + _17;
  _12 = _6 + accumulator_28(D);
  _37 = _12 + _29;
  _19 = MEM[(__vector double *)arg2_26(D) + 48B];
  _20 = MEM[(__vector double *)arg3_27(D) + 48B];
  _21 = _19 * _20;
  _22 = BIT_FIELD_REF <_21, 64, 0>;
  _23 = BIT_FIELD_REF <_21, 64, 64>;
  _24 = _22 + _23;
  accumulator_32 = _24 + _37;
  return accumulator_32;

  :
   0:   01 00 e5 f4     lxv     vs7,0(r5)
   4:   11 00 05 f4     lxv     vs0,16(r5)
   8:   21 00 24 f5     lxv     vs9,32(r4)
   c:   21 00 c5 f4     lxv     vs6,32(r5)
  10:   01 00 44 f5     lxv     vs10,0(r4)
  14:   11 00 64 f5     lxv     vs11,16(r4)
  18:   31 00 05 f5     lxv     vs8,48(r5)
  1c:   31 00 84 f5     lxv     vs12,48(r4)
  20:   80 33 29 f1     xvmuldp vs9,vs9,vs6
  24:   80 3b 4a f1     xvmuldp vs10,vs10,vs7
  28:   80 03 6b f1     xvmuldp vs11,vs11,vs0
  2c:   80 43 8c f1     xvmuldp vs12,vs12,vs8
  30:   50 4b 09 f1     xxspltd vs8,vs9,1
  34:   50 5b eb f0     xxspltd vs7,vs11,1
  38:   50 53 0a f0     xxspltd vs0,vs10,1
  3c:   2a 48 28 fd     fadd    f9,f8,f9
  40:   2a 58 67 fd     fadd    f11,f7,f11
  44:   2a 50 00 fc     fadd    f0,f0,f10
  48:   50 63 4c f1     xxspltd vs10,vs12,1
  4c:   2a 60 8a fd     fadd    f12,f10,f12
  50:   2a 08 29 fd     fadd    f9,f9,f1
  54:   2a 58 00 fc     fadd    f0,f0,f11
  58:   2a 48 20 fc     fadd    f1,f0,f9
  5c:   2a 08 2c fc     fadd    f1,f12,f1
  60:   20 00 80 4e     blr

with this patch, optimized tree and final asm:

  _1 = *arg2_26(D);
  _2 = *arg3_27(D);
  _7 = MEM[(__vector double *)arg2_26(D) + 16B];
  _8 = MEM[(__vector double *)arg3_27(D) + 16B];
  _9 = _7 * _8;
  _5 = .FMA (_1, _2, _9);
  _13 = MEM[(__vector double *)arg2_26(D) + 32B];
  _14 = MEM[(__vector double *)arg3_27(D) + 32B];
  _19 = MEM[(__vector double *)arg2_26(D) + 48B];
Re: [PATCH] PR88497 - Extend reassoc for vector bit_field_ref
Hi Kewen,

Just one quick note:

On Fri, Mar 08, 2019 at 09:40:30AM +0800, Kewen.Lin wrote:
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ffast-math -fdump-tree-reassoc1" } */
> +typedef double v2df __attribute__ ((vector_size (16)));
> +double
> +test (double accumulator, v2df arg1[], v2df arg2[])
> +{
> +  v2df temp;
> +  temp = arg1[0] * arg2[0];
> +  accumulator += temp[0] + temp[1];
> +  temp = arg1[1] * arg2[1];
> +  accumulator += temp[0] + temp[1];
> +  temp = arg1[2] * arg2[2];
> +  accumulator += temp[0] + temp[1];
> +  temp = arg1[3] * arg2[3];
> +  accumulator += temp[0] + temp[1];
> +  return accumulator;
> +}
> +/* { dg-final { scan-tree-dump-times "BIT_FIELD_REF" 2 "reassoc1" } } */

Please write a word or two about what that test is testing?  This
helps a lot when a test starts failing.


Segher
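Following up on Segher's request, a leading comment could look like the sketch below (the comment wording is only a suggestion, not part of the posted patch; the function body is the testcase from the patch):

```c
/* PR88497: the reassoc1 pass should combine the per-element
   BIT_FIELD_REF extracts of the vector multiplies below into a single
   vector sum, so that only two BIT_FIELD_REFs remain in the dump.  */
typedef double v2df __attribute__ ((vector_size (16)));

double
test (double accumulator, v2df arg1[], v2df arg2[])
{
  v2df temp;
  temp = arg1[0] * arg2[0];
  accumulator += temp[0] + temp[1];
  temp = arg1[1] * arg2[1];
  accumulator += temp[0] + temp[1];
  temp = arg1[2] * arg2[2];
  accumulator += temp[0] + temp[1];
  temp = arg1[3] * arg2[3];
  accumulator += temp[0] + temp[1];
  return accumulator;
}
```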
Re: [PATCH] PR88497 - Extend reassoc for vector bit_field_ref
On Fri, Mar 8, 2019 at 2:40 AM Kewen.Lin wrote:
>
> Hi,
>
> As PR88497 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497),
> when we meet some code pattern like:
>
>    V1[0] + V1[1] + ... + V1[k] + V2[0] + ... + V2[k] + ... Vn[k]
>    // V1...Vn of VECTOR_TYPE
>
> We can teach reassoc to transform it to:
>
>    Vs = (V1 + V2 + ... + Vn)
>    Vs[0] + Vs[1] + ... + Vs[k]
>
> It saves addition and bit_field_ref operations and exposes more
> opportunities for downstream passes.  I noticed that even if a
> target doesn't support the vector type, so that it gets expanded
> in veclower, it's still better to have this transform, since the
> generated sequence is more friendly for widening_mul.  (With one
> more DCE run after forwprop, it should be the same.)
>
> Bootstrapped/regtested on powerpc64le-linux-gnu, ok for trunk?

Hmm.  You are basically vectorizing the reduction partially here (note
basic-block vectorization doesn't handle reductions yet).  So I'm not
sure whether we should eventually just change op sort order to sort
v1[1] after v2[0], to sort v1[0] and v2[0] together.  That would maybe
also be an intermediate step that makes implementing the vectorization
part "easier"?  I also imagine that the "vectorization" would be
profitable even if there's just two elements of the vectors summed and
the real vectors are larger.

Note that the code you transform contains no vector operations (just
element extracts) and thus you need to check for target support before
creating the vector add.

This is for GCC 10.  Also it doesn't fix PR88497, does it?

Richard.

> Thanks in advance!
>
>
> gcc/ChangeLog
>
> 2019-03-08  Kewen Lin
>
>         PR target/88497
>         * tree-ssa-reassoc.c (reassociate_bb): Swap the positions of
>         GIMPLE_BINARY_RHS check and gimple_visited_p check, call new
>         function undistribute_bitref_for_vector.
>         (undistribute_bitref_for_vector): New function.
>         (cleanup_vinfo_map): Likewise.
>         (unsigned_cmp): Likewise.
>
> gcc/testsuite/ChangeLog
>
> 2019-03-08  Kewen Lin
>
>         * gcc.dg/tree-ssa/pr88497.c: New test.
>
> ---
>  gcc/testsuite/gcc.dg/tree-ssa/pr88497.c |  18 +++
>  gcc/tree-ssa-reassoc.c                  | 274 +++++++++++++++++++++-
>  2 files changed, 287 insertions(+), 5 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
>
> diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
> b/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
> new file mode 100644
> index 000..4d9ac67
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ffast-math -fdump-tree-reassoc1" } */
> +typedef double v2df __attribute__ ((vector_size (16)));
> +double
> +test (double accumulator, v2df arg1[], v2df arg2[])
> +{
> +  v2df temp;
> +  temp = arg1[0] * arg2[0];
> +  accumulator += temp[0] + temp[1];
> +  temp = arg1[1] * arg2[1];
> +  accumulator += temp[0] + temp[1];
> +  temp = arg1[2] * arg2[2];
> +  accumulator += temp[0] + temp[1];
> +  temp = arg1[3] * arg2[3];
> +  accumulator += temp[0] + temp[1];
> +  return accumulator;
> +}
> +/* { dg-final { scan-tree-dump-times "BIT_FIELD_REF" 2 "reassoc1" } } */
> diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
> index e1c4dfe..fc0e297 100644
> --- a/gcc/tree-ssa-reassoc.c
> +++ b/gcc/tree-ssa-reassoc.c
> @@ -1772,6 +1772,263 @@ undistribute_ops_list (enum tree_code opcode,
>    return changed;
>  }
>
> +/* Hold the information of one specific VECTOR_TYPE SSA_NAME.
> +    - offsets: for different BIT_FIELD_REF offsets accessing the
> +      same VECTOR.
> +    - ops_indexes: the index of vec ops* for each relevant
> +      BIT_FIELD_REF.  */
> +struct v_info
> +{
> +  auto_vec<unsigned> offsets;
> +  auto_vec<unsigned> ops_indexes;
> +};
> +
> +typedef struct v_info *v_info_ptr;
> +
> +/* Comparison function for qsort on unsigned BIT_FIELD_REF
> +   offsets.  */
> +static int
> +unsigned_cmp (const void *p_i, const void *p_j)
> +{
> +  if (*(const unsigned *) p_i >= *(const unsigned *) p_j)
> +    return 1;
> +  else
> +    return -1;
> +}
> +
> +/* Cleanup hash map for VECTOR information.  */
> +static void
> +cleanup_vinfo_map (hash_map<tree, v_info_ptr> &info_map)
> +{
> +  for (hash_map<tree, v_info_ptr>::iterator it = info_map.begin ();
> +       it != info_map.end (); ++it)
> +    {
> +      v_info_ptr info = (*it).second;
> +      delete info;
> +      (*it).second = NULL;
> +    }
> +}
> +
> +/* Perform un-distribution of BIT_FIELD_REF on VECTOR_TYPE.
> +     V1[0] + V1[1] + ... + V1[k] + V2[0] + V2[1] + ... + V2[k] + ... Vn[k]
> +   is transformed to
> +     Vs = (V1 + V2 + ... + Vn)
> +     Vs[0] + Vs[1] + ... + Vs[k]
> +
> +   The basic steps are listed below:
> +
> +    1) Check the addition chain *OPS by looking for those summands
> +       coming from a VECTOR bit_field_ref on a VECTOR type.  Put the
> +       information into v_info_map for each satisfied summand, using
> +       the VECTOR SSA_NAME as the key.
> +
> +    2) For each key (VECTOR SSA_NAME), validate all its BIT_FIELD_REFs are
[PATCH] PR88497 - Extend reassoc for vector bit_field_ref
Hi,

As PR88497 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88497),
when we meet some code pattern like:

   V1[0] + V1[1] + ... + V1[k] + V2[0] + ... + V2[k] + ... Vn[k]
   // V1...Vn of VECTOR_TYPE

We can teach reassoc to transform it to:

   Vs = (V1 + V2 + ... + Vn)
   Vs[0] + Vs[1] + ... + Vs[k]

It saves addition and bit_field_ref operations and exposes more
opportunities for downstream passes.  I noticed that even if a target
doesn't support the vector type, so that it gets expanded in veclower,
it's still better to have this transform, since the generated sequence
is more friendly for widening_mul.  (With one more DCE run after
forwprop, it should be the same.)

Bootstrapped/regtested on powerpc64le-linux-gnu, ok for trunk?

Thanks in advance!


gcc/ChangeLog

2019-03-08  Kewen Lin

	PR target/88497
	* tree-ssa-reassoc.c (reassociate_bb): Swap the positions of
	GIMPLE_BINARY_RHS check and gimple_visited_p check, call new
	function undistribute_bitref_for_vector.
	(undistribute_bitref_for_vector): New function.
	(cleanup_vinfo_map): Likewise.
	(unsigned_cmp): Likewise.

gcc/testsuite/ChangeLog

2019-03-08  Kewen Lin

	* gcc.dg/tree-ssa/pr88497.c: New test.
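In plain C terms, the transform described above replaces per-element extracts and scalar adds with one vector add chain followed by a single round of extracts.  A minimal sketch for k = 1 and n = 3 (the function names are illustrative only, not from the patch):

```c
typedef double v2df __attribute__ ((vector_size (16)));

/* The shape reassoc currently sees: one extract and one scalar add
   per vector element.  */
double
sum_by_element (v2df v1, v2df v2, v2df v3)
{
  return v1[0] + v1[1] + v2[0] + v2[1] + v3[0] + v3[1];
}

/* The shape after the proposed transform: a single vector add chain,
   then one extract per lane of the combined vector.  */
double
sum_vectors_first (v2df v1, v2df v2, v2df v3)
{
  v2df vs = v1 + v2 + v3;
  return vs[0] + vs[1];
}
```

Both functions compute the same sum; the second form needs n - 1 vector adds and k + 1 extracts instead of n * (k + 1) extracts and scalar adds.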
---
 gcc/testsuite/gcc.dg/tree-ssa/pr88497.c |  18 +++
 gcc/tree-ssa-reassoc.c                  | 274 +++++++++++++++++++++-
 2 files changed, 287 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/pr88497.c

diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c b/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
new file mode 100644
index 000..4d9ac67
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr88497.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -ffast-math -fdump-tree-reassoc1" } */
+typedef double v2df __attribute__ ((vector_size (16)));
+double
+test (double accumulator, v2df arg1[], v2df arg2[])
+{
+  v2df temp;
+  temp = arg1[0] * arg2[0];
+  accumulator += temp[0] + temp[1];
+  temp = arg1[1] * arg2[1];
+  accumulator += temp[0] + temp[1];
+  temp = arg1[2] * arg2[2];
+  accumulator += temp[0] + temp[1];
+  temp = arg1[3] * arg2[3];
+  accumulator += temp[0] + temp[1];
+  return accumulator;
+}
+/* { dg-final { scan-tree-dump-times "BIT_FIELD_REF" 2 "reassoc1" } } */
diff --git a/gcc/tree-ssa-reassoc.c b/gcc/tree-ssa-reassoc.c
index e1c4dfe..fc0e297 100644
--- a/gcc/tree-ssa-reassoc.c
+++ b/gcc/tree-ssa-reassoc.c
@@ -1772,6 +1772,263 @@ undistribute_ops_list (enum tree_code opcode,
   return changed;
 }
 
+/* Hold the information of one specific VECTOR_TYPE SSA_NAME.
+    - offsets: for different BIT_FIELD_REF offsets accessing the
+      same VECTOR.
+    - ops_indexes: the index of vec ops* for each relevant
+      BIT_FIELD_REF.  */
+struct v_info
+{
+  auto_vec<unsigned> offsets;
+  auto_vec<unsigned> ops_indexes;
+};
+
+typedef struct v_info *v_info_ptr;
+
+/* Comparison function for qsort on unsigned BIT_FIELD_REF offsets.  */
+static int
+unsigned_cmp (const void *p_i, const void *p_j)
+{
+  if (*(const unsigned *) p_i >= *(const unsigned *) p_j)
+    return 1;
+  else
+    return -1;
+}
+
+/* Cleanup hash map for VECTOR information.  */
+static void
+cleanup_vinfo_map (hash_map<tree, v_info_ptr> &info_map)
+{
+  for (hash_map<tree, v_info_ptr>::iterator it = info_map.begin ();
+       it != info_map.end (); ++it)
+    {
+      v_info_ptr info = (*it).second;
+      delete info;
+      (*it).second = NULL;
+    }
+}
+
+/* Perform un-distribution of BIT_FIELD_REF on VECTOR_TYPE.
+     V1[0] + V1[1] + ... + V1[k] + V2[0] + V2[1] + ... + V2[k] + ... Vn[k]
+   is transformed to
+     Vs = (V1 + V2 + ... + Vn)
+     Vs[0] + Vs[1] + ... + Vs[k]
+
+   The basic steps are listed below:
+
+    1) Check the addition chain *OPS by looking for those summands
+       coming from a VECTOR bit_field_ref on a VECTOR type.  Put the
+       information into v_info_map for each satisfied summand, using
+       the VECTOR SSA_NAME as the key.
+
+    2) For each key (VECTOR SSA_NAME), validate that all its
+       BIT_FIELD_REFs are contiguous and cover the whole VECTOR
+       perfectly without any holes.  Obtain one VECTOR list which
+       contains the candidates to be transformed.
+
+    3) Build the addition statements for all VECTOR candidates and
+       generate BIT_FIELD_REFs accordingly.
+
+   TODO: The current implementation restricts all candidate VECTORs to
+   the same VECTOR type; it can be extended to handle different groups
+   by VECTOR type in future if any profitable cases are found.  */
+static bool
+undistribute_bitref_for_vector (enum tree_code opcode,
+				vec<operand_entry *> *ops,
+				struct loop *loop)
+{
+  if (ops->length () <= 1 || opcode != PLUS_EXPR)
+    return false;
+
+  hash_map<tree, v_info_ptr> v_info_map;
+  operand_entry *oe1;
+  unsigned i;
+
+  /* Find those summands from VECTOR BIT_FIELD_REF in the addition
+     chain, and put the information into the map.  */
+  FOR_EACH_VEC_ELT (*ops, i, oe1)
+    {
+      enum tree_code dcode;
+      gimple *oe1def;
+
+      if (TREE_CODE (oe1->op) != SSA_NAME)
+	continue;
+      oe1def =
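The validation in step 2 of the quoted comment can be sketched independently of the GCC internals.  In this standalone sketch, `offset_cmp` and `offsets_cover_vector` are hypothetical names (not from the patch), and unlike the patch's `unsigned_cmp` the comparator returns 0 for equal offsets:

```c
#include <stdlib.h>

/* qsort comparator on unsigned BIT_FIELD_REF bit offsets; plays the
   role of the patch's unsigned_cmp, but returns 0 on equality.  */
static int
offset_cmp (const void *p_i, const void *p_j)
{
  unsigned a = *(const unsigned *) p_i;
  unsigned b = *(const unsigned *) p_j;
  return (a > b) - (a < b);
}

/* Step 2 in miniature: a vector SSA_NAME qualifies only if its
   recorded offsets, once sorted, tile the whole vector contiguously
   starting at bit 0, with no holes and no overlaps.  */
static int
offsets_cover_vector (unsigned *offsets, unsigned n,
                      unsigned elem_bits, unsigned vec_bits)
{
  /* The element count must exactly account for the vector size.  */
  if (n == 0 || n * elem_bits != vec_bits)
    return 0;
  qsort (offsets, n, sizeof (unsigned), offset_cmp);
  /* After sorting, offset i must be exactly i * elem_bits.  */
  for (unsigned i = 0; i < n; i++)
    if (offsets[i] != i * elem_bits)
      return 0;
  return 1;
}
```

For the v2df testcase, each candidate vector must contribute exactly the offsets 0 and 64 (two 64-bit lanes of a 128-bit vector); a duplicated or missing lane rejects the candidate.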