Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Friday, February 10, 2023 3:57 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: Tamar Christina via Gcc-patches <gcc-patches@gcc.gnu.org>; nd
>> <n...@arm.com>; rguent...@suse.de; j...@ventanamicro.com
>> Subject: Re: [PATCH 1/2]middle-end: Fix wrong overmatching of div-bitmask
>> by using new optabs [PR108583]
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >> > a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
>> >> > index 6934aebc69f231af24668f0a1c3d140e97f55487..e39d7e6b362ef44eb2fc467f3369de2afea139d6 100644
>> >> > --- a/gcc/tree-vect-patterns.cc
>> >> > +++ b/gcc/tree-vect-patterns.cc
>> >> > @@ -3914,12 +3914,82 @@ vect_recog_divmod_pattern (vec_info *vinfo,
>> >> >        return pattern_stmt;
>> >> >      }
>> >> >    else if ((cst = uniform_integer_cst_p (oprnd1))
>> >> > -           && targetm.vectorize.can_special_div_by_const (rhs_code, vectype,
>> >> > -                                                          wi::to_wide (cst),
>> >> > -                                                          NULL, NULL_RTX,
>> >> > -                                                          NULL_RTX))
>> >> > +           && TYPE_UNSIGNED (itype)
>> >> > +           && rhs_code == TRUNC_DIV_EXPR
>> >> > +           && vectype
>> >> > +           && direct_internal_fn_supported_p (IFN_ADDH, vectype,
>> >> > +                                              OPTIMIZE_FOR_SPEED))
>> >> >      {
>> >> > -      return NULL;
>> >> > +      /* div optimizations using narrowings
>> >> > +         we can do the division e.g. shorts by 255 faster by calculating it as
>> >> > +         (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
>> >> > +         double the precision of x.
>> >> > +
>> >> > +         If we imagine a short as being composed of two blocks of bytes then
>> >> > +         adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
>> >> > +         adding 1 to each sub component:
>> >> > +
>> >> > +               short value of 16-bits
>> >> > +         ┌──────────────┬────────────────┐
>> >> > +         │              │                │
>> >> > +         └──────────────┴────────────────┘
>> >> > +           8-bit part1 ▲    8-bit part2 ▲
>> >> > +                       │                │
>> >> > +                       │                │
>> >> > +                      +1               +1
>> >> > +
>> >> > +         after the first addition, we have to shift right by 8, and narrow the
>> >> > +         results back to a byte.  Remember that the addition must be done in
>> >> > +         double the precision of the input.  However if we know that the addition
>> >> > +         `x + 257` does not overflow then we can do the operation in the current
>> >> > +         precision.  In which case we don't need the pack and unpacks.  */
>> >> > +      auto wcst = wi::to_wide (cst);
>> >> > +      int pow = wi::exact_log2 (wcst + 1);
>> >> > +      if (pow == (int) (element_precision (vectype) / 2))
>> >> > +        {
>> >> > +          wide_int min,max;
>> >> > +          /* If we're in a pattern we need to find the orginal definition.  */
>> >> > +          tree op0 = oprnd0;
>> >> > +          gimple *stmt = SSA_NAME_DEF_STMT (oprnd0);
>> >> > +          stmt_vec_info stmt_info = vinfo->lookup_stmt (stmt);
>> >> > +          if (is_pattern_stmt_p (stmt_info))
>> >> > +            {
>> >> > +              auto orig_stmt = STMT_VINFO_RELATED_STMT (stmt_info);
>> >> > +              if (is_gimple_assign (STMT_VINFO_STMT (orig_stmt)))
>> >> > +                op0 = gimple_assign_lhs (STMT_VINFO_STMT (orig_stmt));
>> >> > +            }
>> >>
>> >> If this is generally safe (I'm skipping thinking about it in the
>> >> interests of a quick review :-)), then I think it should be done in
>> >> vect_get_range_info instead.  Using gimple_get_lhs would be more
>> >> general than handling just assignments.
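As an aside, the (x + ((x + 257) >> 8)) >> 8 identity described in the
quoted comment is easy to sanity-check in scalar code.  A minimal sketch
(plain C, exhaustively checking all 16-bit inputs, with the arithmetic done
in 32 bits, i.e. the "double the precision" case from the comment):

  #include <stdint.h>
  #include <assert.h>

  int
  main (void)
  {
    for (uint32_t x = 0; x <= UINT16_MAX; x++)
      {
        /* x + 257 can exceed 16 bits, so do the arithmetic in 32 bits.  */
        uint32_t q = (x + ((x + 257) >> 8)) >> 8;
        assert (q == x / 255);
      }
    return 0;
  }

If the addition is known not to overflow the 16-bit type, the same
computation can stay in 16 bits, which is what the range check further
down is meant to establish.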
>> >>
>> >> > +
>> >> > +          /* Check that no overflow will occur.  If we don't have range
>> >> > +             information we can't perform the optimization.  */
>> >> > +          if (vect_get_range_info (op0, &min, &max))
>> >> > +            {
>> >> > +              wide_int one = wi::to_wide (build_one_cst (itype));
>> >> > +              wide_int adder = wi::add (one, wi::lshift (one, pow));
>> >> > +              wi::overflow_type ovf;
>> >> > +              /* We need adder and max in the same precision.  */
>> >> > +              wide_int zadder
>> >> > +                = wide_int_storage::from (adder, wi::get_precision (max),
>> >> > +                                          UNSIGNED);
>> >> > +              wi::add (max, zadder, UNSIGNED, &ovf);
>> >>
>> >> Could you explain this a bit more?  When do we have mismatched
>> >> precisions?
>> >
>> > C promotion rules will promote e.g.
>> >
>> > void fun2(uint8_t* restrict pixel, uint8_t level, int n) {
>> >   for (int i = 0; i < n; i+=1)
>> >     pixel[i] = (pixel[i] + level) / 0xff;
>> > }
>> >
>> > and have the addition be done as a 32-bit integer.  The vectorizer
>> > will demote this down to a short, but range information is not stored
>> > for patterns.  So in the above the range will correctly be 0x1fe, but
>> > the precision will be that of the original expression, so 32.  This
>> > will be a mismatch with itype, which is derived from the size the
>> > vectorizer will perform the operation in.
>>
>> Gah, missed this first time round, sorry.
>>
>> Richi would know better than me, but I think it's dangerous to rely on
>> the orig/pattern link for range information.  The end result of a
>> pattern (vect_stmt_to_vectorize) has to have the same type as the lhs
>> of the original statement.  But the other statements in the pattern
>> sequence can do arbitrary things.  Their range isn't predictable from
>> the range of the original statement result.
>>
>> IIRC, the addition above is converted to:
>>
>>   a' = (uint16_t) pixel[i]
>>   b' = (uint16_t) level
>>   sum' = a' + b'
>>   sum = (int) sum'
>>
>> where sum is the direct replacement of "pixel[i] + level", with the
>> same type and range.  The division then uses sum' instead of sum.
>>
>> But the fact that sum' is part of the same pattern as sum doesn't
>> guarantee that sum' has the same range as sum.  E.g. the pattern
>> statements added by the division optimisation wouldn't have this
>> property.
>
> So my assumption is that no pattern would replace a statement with
> something that has higher precision than the C statement.  The pattern
> above is demoted by the vectorizer based on range information already.
> My assumption was that the precision can only ever be smaller, because
> otherwise the pattern has violated the semantics of the C code, which
> would be dangerous if e.g. the expression escapes?
IMO the difference in precisions was a symptom of the problem rather than
the direct cause.  The point is more that "B = vect_orig_stmt(A)" just
says "A is used somehow in a new calculation of B".  A might equal B (if A
replaces B), or A might be an arbitrary temporary result.  The code above
is instead using it to mean "A equals B, expressed in a different type".
That happens to be true for sum' in the sequence above, but it isn't true
of non-final pattern statements in general.

In other words, the code hasn't proved that the path from A to
vect_stmt_to_vectorize(B) just involves conversions.  Applying the range
of a pattern result to all temporary results in the pattern could lead to
wrong results even when the precisions are all the same.

>> Is it possible to tell ranger to compute the range of expressions that
>> haven't been added to the IL?  (Genuine question, haven't looked.
>> It seems pretty powerful though.)
>
> I don't know either, I guess for things it has explicit knowledge about
> it's ok, so +w or *w would be fine, but with a random IFN_ it'll likely
> have to punt as varying.

Yeah.  But sum' above involves simple arithmetic and conversions, so IFNs
shouldn't be a problem.

Thanks,
Richard
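For concreteness, the property the overflow check above is trying to
establish can be spelled out for the fun2 example: the sum of two uint8_t
values has range [0, 0x1fe], and the adder for a divisor of 0xff is
1 + (1 << 8) = 0x101.  A minimal sketch of that arithmetic (illustrative
only, not the patch code):

  #include <stdint.h>
  #include <assert.h>

  int
  main (void)
  {
    uint32_t max = 0x1fe;               /* max of pixel[i] + level.  */
    int pow = 8;                        /* divisor 0xff == (1 << 8) - 1.  */
    uint32_t adder = 1 + (1u << pow);   /* the "+ 257" constant.  */

    /* If max + adder still fits in the 16-bit type the vectorizer uses,
       x + 257 cannot wrap and the operation can stay in 16-bit precision.  */
    assert (max + adder <= UINT16_MAX); /* 0x2ff <= 0xffff, so it is safe.  */
    return 0;
  }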