[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #36 from Wilco --- (In reply to Orr Shalom Dvory from comment #35) > Hi, thanks for your respond. can someone mark this bug as need to be > improved? > Does anyone agree/disagree with my new proposed method? It's best to create a new bug if there are still easy cases that could be optimized. Also it seems the constants it uses are quite complex - it may be feasible to simplify them. Eg. long f(long x) { return x % 35 == 0; }
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #35 from Orr Shalom Dvory --- Hi, thanks for your respond. can someone mark this bug as need to be improved? Does anyone agree/disagree with my new proposed method?
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 krux changed: What|Removed |Added CC||hoganmeier at gmail dot com --- Comment #34 from krux --- Also fixes the duplicate https://gcc.gnu.org/bugzilla/show_bug.cgi?id=12849. Can't close it though.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Orr Shalom Dvory changed: What|Removed |Added CC||dvoreader at gmail dot com --- Comment #33 from Orr Shalom Dvory --- (In reply to Jakub Jelinek from comment #25) > Created attachment 44657 [details] > gcc9-pr82853.patch > > Full untested patch. For x % C1 == C2 it handles all unsigned cases where > C1 is odd, if C1 is even, just cases where C2 <= -1U % C1, if signed modulo, > just x % C1 == 0 cases (where C1 is not INT_MIN). Hi, I will try to help. I've got a more accurate formula. look at my comment over here https://reviews.llvm.org/D50222 I'm copying it here for the sake of all. Hi guys, I found the magical formula for unsigned integers that works also with even numbers without the need to check for overflows with any remainder: from divisor d and reminder r, I calculate 4 constants. any d!=0 should fit. void calculate_constants64(uint64_t d, uint64_t r, uint64_t ,uint64_t , uint64_t ,uint64_t& u) { k=__builtin_ctzll(d);/* the power of 2 */ uint64_t d_odd=d>>k; mmi=find_mod_mul_inverse(d_odd,64); /* 64 is word size*/ s=(r*mmi); u=(ULLONG_MAX-r)/d;/* note that I divide by d, not d_odd */ } A little bit background: the constant (u +1) is the count of the possible values in the range of 64 bit number that will yield the correct modulo. The constant s should zero the first k bits if the given value have modulo of r. it will also zero the modulo of d_odd. then the compiler should generate the following code with the given constants: int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x) { uint64_t o=((x*mmi)-s); o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/ return o<=u; } this replace the following: /* d is the divisor, r is the remainder */ int checkrem64(uint64_t x) { return x%d==r; } this is the code to find modular inverse.. uint64_t find_mod_mul_inverse(uint64_t x, uint64_t bits) { if (bits > 64 || ((x&1)==0)) return 0;// invalid parameters uint64_t mask; if (bits == 64) mask = -1; else { mask = 1; mask<<=bits; mask--; } x&=mask; uint64_t result=1, state=x, ctz=0; while(state!=1ULL) { ctz=__builtin_ctzll(state^1); result|=1ULL<>k; mmi=find_mod_mul_inverse(d_odd,64); /* 64 is word size*/ //this is the added line to make it work with signed integers r+=0x8000 % d; s=(r*mmi); u=(ULLONG_MAX-r)/d; } int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x) { //this is the added line to make it work with signed integers //x came as signed number but was casted to unsigned x^=0x8000 ;// this is addition simplified to xor, spaces for clarity. uint64_t o=((x*mmi)-s); o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/ return o<=u; } There is a way to tweak u and s to make it work on negative only numbers or positive only numbers, when r is negative or positive... but for r=0, this should work. please tell me the exact requirements, and I will do the math.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Alexander Monakov changed: What|Removed |Added CC||vermaelen.wouter at gmail dot com --- Comment #32 from Alexander Monakov --- *** Bug 49552 has been marked as a duplicate of this bug. ***
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Jakub Jelinek changed: What|Removed |Added Status|ASSIGNED|RESOLVED Resolution|--- |FIXED --- Comment #31 from Jakub Jelinek --- Fixed on the trunk. We might still try to improve some == non-zero cases with conditional code, and certainly should repeat this in the vectorizer's pattern recognizer.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #30 from Jakub Jelinek --- Author: jakub Date: Wed Sep 12 18:28:20 2018 New Revision: 264248 URL: https://gcc.gnu.org/viewcvs?rev=264248=gcc=rev Log: PR middle-end/82853 * expr.h (maybe_optimize_mod_cmp): Declare. * expr.c (mod_inv): New function. (maybe_optimize_mod_cmp): New function. (do_store_flag): Use it. * cfgexpand.c (expand_gimple_cond): Likewise. * gcc.target/i386/pr82853-1.c: New test. * gcc.target/i386/pr82853-2.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/pr82853-1.c trunk/gcc/testsuite/gcc.target/i386/pr82853-2.c Modified: trunk/gcc/ChangeLog trunk/gcc/cfgexpand.c trunk/gcc/expr.c trunk/gcc/expr.h trunk/gcc/testsuite/ChangeLog
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Jakub Jelinek changed: What|Removed |Added CC||antoshkka at gmail dot com --- Comment #29 from Jakub Jelinek --- *** Bug 87232 has been marked as a duplicate of this bug. ***
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #28 from Jakub Jelinek --- On i9-7960X I get (cc1 is -O0 checking build, so bootstrapped compiler might be much faster), will repeat that with bootstrapped compiler if it succeeds. The __int128 and unsigned __int128 tests are clearly too expensive. for i in 1 2 3 4 5 6; do time ./cc1 -quiet -O2 -o pr82853-$i.{s,c}; gcc -o pr82853-$i{,.s}; time ./pr82853-$i; echo $?; done real0m11.273s user0m11.182s sys 0m0.039s real0m2.997s user0m2.993s sys 0m0.001s 0 real0m8.145s user0m8.082s sys 0m0.026s real0m2.166s user0m2.165s sys 0m0.000s 0 real0m11.683s user0m11.597s sys 0m0.033s real0m5.315s user0m5.312s sys 0m0.000s 0 real0m7.972s user0m7.903s sys 0m0.032s real0m3.801s user0m3.798s sys 0m0.001s 0 real0m12.846s user0m12.762s sys 0m0.028s real0m17.471s user0m17.458s sys 0m0.001s 0 real0m8.546s user0m8.486s sys 0m0.022s real0m13.738s user0m13.728s sys 0m0.000s
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #27 from Richard Earnshaw --- (In reply to Jakub Jelinek from comment #26) > A test generator for x % c1 == c2 expansion for unsigned, int, unsigned long > long, long long, unsigned int128 and int128 types (assuming ilp32 or lp64) > plus tests for those types. Takes about 2 minutes to compile + run on a fast > box and uses random (), so not really sure the tests should go into the > testsuite. Thoughts on that? After all, the generator isn't extra smart and > doesn't try to find problematic corner cases. If you work on the 'guess' of a 10x slowdown when using simulators, that's probably too long particularly if it's not delivering high value. Ultimately, it will depend on how much of that 2 minutes is during generate/compile and how much during run time.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #26 from Jakub Jelinek --- Created attachment 44658 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44658=edit pr82853-tests.tar.xz A test generator for x % c1 == c2 expansion for unsigned, int, unsigned long long, long long, unsigned int128 and int128 types (assuming ilp32 or lp64) plus tests for those types. Takes about 2 minutes to compile + run on a fast box and uses random (), so not really sure the tests should go into the testsuite. Thoughts on that? After all, the generator isn't extra smart and doesn't try to find problematic corner cases.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Jakub Jelinek changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |jakub at gcc dot gnu.org --- Comment #25 from Jakub Jelinek --- Created attachment 44657 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44657=edit gcc9-pr82853.patch Full untested patch. For x % C1 == C2 it handles all unsigned cases where C1 is odd, if C1 is even, just cases where C2 <= -1U % C1, if signed modulo, just x % C1 == 0 cases (where C1 is not INT_MIN).
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #24 from Jakub Jelinek --- (In reply to Wilco from comment #23) > (In reply to Jakub Jelinek from comment #22) > > So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0, > > because > > for x == 0 the test will yield a different value, as 0xU % 3U is 0 > > and > > 0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaabU <= > > 0x5554U, but x % 3U == 0 is equivalent to x * 0xaaabU <= > > 0xU. > > Now to see if something useful can be used also for the even divisors. > > Yes for this case it is safe to do (x - 1) * 0xaaab < 0x, but > you can also do x * 0xaaab >= 0xaaab which is even simpler. Yeah, a special case, todo++. > Basically (x % C) == N is simpler for 2 values of N. I meant that for C odd it works for any value, x % C == D for C odd and D <= -1 % C then (x - D) * mod_inv (C) <= -1 / C, otherwise if C is odd and D > -1 % C then (x - D) * mod_inv (C) < -1 / C. Just if C is even it is more complicated. For C positive odd and signed modulo with D == 0 it can be done easily too (will implement tomorrow).
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Wilco changed: What|Removed |Added CC||wdijkstr at arm dot com --- Comment #23 from Wilco --- (In reply to Jakub Jelinek from comment #22) > So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0, > because > for x == 0 the test will yield a different value, as 0xU % 3U is 0 > and > 0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaabU <= > 0x5554U, but x % 3U == 0 is equivalent to x * 0xaaabU <= 0xU. > Now to see if something useful can be used also for the even divisors. Yes for this case it is safe to do (x - 1) * 0xaaab < 0x, but you can also do x * 0xaaab >= 0xaaab which is even simpler. Basically (x % C) == N is simpler for 2 values of N.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #22 from Jakub Jelinek --- So, while it isn't correct to replace x % 3U == 1 by (x - 1) % 3U == 0, because for x == 0 the test will yield a different value, as 0xU % 3U is 0 and 0 % 3U is also 0, x % 3U == 1 is equivalent to (x - 1) * 0xaaabU <= 0x5554U, but x % 3U == 0 is equivalent to x * 0xaaabU <= 0xU. Now to see if something useful can be used also for the even divisors.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #21 from Jakub Jelinek --- Probably should punt early if integer_onep (treeop1), that case should have been optimized earlier, but if it isn't, we shouldn't miscompile. Another thing is if *arg1 is < 0 or >= treeop1, again, I'd hope we have optimized that already to false (or true for NE_EXPR), but if not, we shouldn't miscompile. Another thing is that C4 needs to be adjusted, if *arg1 is bigger than all_ones%treeop1, then we need to subtract 1 from C4 - e.g. for x % 3 == 0 we want to expand it as x * 0xaaab <= 0x, while for x % 3 == 1 we should expand as x * 0xaaab - 1 * 0xaaab <= 0x - 1, because 0x % 3 == 0 and thus there is one more case where x % 3 == 0 compared to x % 3 == 1 and x % 3 == 2 cases. Hacker's Delight 10-17 mentions <= (2^prec - 1) / c1 rather than <= (2^prec) / c1, probably should do that too.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #20 from Jakub Jelinek --- Testcase I've been eyeballing so far: unsigned f1 (unsigned x) { return (x % 679U) == 0; } unsigned f2 (unsigned x, unsigned *y) { *y = x / 679U; return (x % 679U) == 0; } unsigned f3 (unsigned x) { return (x % 1738U) == 0; } void bar (void); void f4 (unsigned x) { if (x % 3 == 0) bar (); } void f5 (unsigned x) { if (x % 3 == 1) bar (); } void f6 (unsigned x) { if (x % 3 == 2) bar (); }
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #19 from Jakub Jelinek --- Created attachment 44656 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=44656=edit gcc9-pr82853-wip.patch Untested WIP patch which does this during expansion if it is cheaper according to target costs. As the FIXMEs say, I'm not yet trying Alex' proposal with doing the subtraction on X (can't be used unconditionally, need to prove that values from -C2 to -1 u% C1 aren't 0), and am not trying even C1 when rotates aren't available (either could just expand the rotate anyway using 2 shifts + or, or comparison + mask + comparison + bitwise and (though that would complicate the callers equite a bit).
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #18 from Wilco --- (In reply to Alexander Monakov from comment #17) > (In reply to Jakub Jelinek from comment #16) > > For unsigned x % y == z if y is odd constant we can handle it for any > > constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and > > using x * m - (z * m) < d . > > Is that preferable to testing (x - z) % y == 0? Why? That would require checking that it when x - z underflows if returns the correct answer (or adding a && x >= z).
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #17 from Alexander Monakov --- (In reply to Jakub Jelinek from comment #16) > For unsigned x % y == z if y is odd constant we can handle it for any > constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and > using x * m - (z * m) < d . Is that preferable to testing (x - z) % y == 0? Why?
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #16 from Jakub Jelinek --- For unsigned x % y == z if y is odd constant we can handle it for any constant z, by computing m = mul_inv (y, 2^prec) and d = (2^prec / y) and using x * m - (z * m) < d . For even y, not sure if it can work for anything but z == 0; for z == 0 we can do s = ctz (y); y_ = y >> s; m = mul_inv (y_, 2^prec); d = (2^prec / y_); and use ((x * m) r>> s) < d .
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 ktkachov at gcc dot gnu.org changed: What|Removed |Added CC||ktkachov at gcc dot gnu.org --- Comment #15 from ktkachov at gcc dot gnu.org --- I tried to do something similar at https://gcc.gnu.org/ml/gcc-patches/2015-07/msg02005.html and the feedback was that we should do this at expand time
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #14 from Jakub Jelinek --- http://duriansoftware.com/joe/Optimizing-is-multiple-checks-with-modular-arithmetic.html Do we want to do this at GIMPLE time ignoring costs, or during expansion time? Doing it later has the benefit that we can compare costs and could avoid breaking say divmod recognition, or finding out multiple uses of the modulo, etc. Doing it earlier has the benefit for vectorization I guess, otherwise we need a pattern recognizer that will work like the expansion.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #13 from Jakub Jelinek --- *** Bug 87203 has been marked as a duplicate of this bug. ***
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #12 from Wilco --- (In reply to Marc Glisse from comment #11) > (In reply to Wilco from comment #9) > > It works for any C where (divisor*C) MOD 2^32 == 1 (or -1). > > For x%3==0, i.e. z==0 for x==3*y+z with 0<=y<5556 and 0<=z<3. > Indeed, x*0xaaab==y+z*0xaaab is in the right range precisely for > z==0 and the same can be done for any odd number. You can make it work for even numbers as well, eg. (x % 10) == 0 -> (x % 5) == 0 && (x & 1) == 0. If you can avoid branches then it may still be faster. > > You can support any kind of comparison, it doesn't need to be with 0 (but > > zero is the easiest). > > Any ==cst will yield a range test. It is less obvious that inequalities are > transformed to a contiguous range... (try x%7<3 maybe) Indeed, expanding into x % 7 == 0 || x % 7 == 1 || x % 7 == 2 is unlikely to be a good idea. However you can change x % 7 >= 1 into x % 7 != 0 or x % 7 < 6 into x % 7 != 6 which are single ranges.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #11 from Marc Glisse --- (In reply to Wilco from comment #9) > It works for any C where (divisor*C) MOD 2^32 == 1 (or -1). For x%3==0, i.e. z==0 for x==3*y+z with 0<=y<5556 and 0<=z<3. Indeed, x*0xaaab==y+z*0xaaab is in the right range precisely for z==0 and the same can be done for any odd number. > You can support any kind of comparison, it doesn't need to be with 0 (but > zero is the easiest). Any ==cst will yield a range test. It is less obvious that inequalities are transformed to a contiguous range... (try x%7<3 maybe) > I forgot whether I made it work for signed too, but it's certainly > possible to skip the sign handling in x % 4 == 0 even if x is signed. 4 is a completely different story, as a power of 2.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #10 from Richard Biener --- Might be even more special-cased with x % 3 == 0 ? C1 : C2 aka in if-conversion context.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Wilco changed: What|Removed |Added CC||wilco at gcc dot gnu.org --- Comment #9 from Wilco --- (In reply to Andi Kleen from comment #8) > I'm not sure if it works with other numbers too. > > (need to dig through Hacker's delight & Matters Computational to see if they > have anything on it) > > But it could be extended for other word lengths at least > > BTW there are some other cases, will file a bug shortly on those too. It works for any C where (divisor*C) MOD 2^32 == 1 (or -1). You can support any kind of comparison, it doesn't need to be with 0 (but zero is the easiest). I forgot whether I made it work for signed too, but it's certainly possible to skip the sign handling in x % 4 == 0 even if x is signed. While this is faster in general than computing the full modulo result, it wouldn't be if you need the division result as well. So this works best on single-use modulos used in comparisons, at the same time as general divisions are expanded.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #8 from Andi Kleen --- I'm not sure if it works with other numbers too. (need to dig through Hacker's delight & Matters Computational to see if they have anything on it) But it could be extended for other word lengths at least BTW there are some other cases, will file a bug shortly on those too.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #7 from Marc Glisse --- Is that a special case of a more generic transformation, which might apply for other values of 3, 0, == etc, or is this meant only literally for x%3==0?
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Andrew Pinski changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2017-11-05 Ever confirmed|0 |1 --- Comment #6 from Andrew Pinski --- (In reply to Andi Kleen from comment #5) > Also I'm not sure why you would want it in the middle end. It should all > work at the tree level Because we lose meaning and there is already code to handle the %N (where N is a constant) to a multiplication in the middle-end so reusing that code would be easy. Plus 0 is simpler than 0x5556.
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #5 from Andi Kleen --- Also I'm not sure why you would want it in the middle end. It should all work at the tree level
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 --- Comment #4 from Andi Kleen --- Right it's about special casing the complete expression
[Bug middle-end/82853] Optimize x % 3 == 0 without modulo
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82853 Andrew Pinski changed: What|Removed |Added Component|target |middle-end --- Comment #3 from Andrew Pinski --- For your code aarch64 produces: mod3: mov w1, 43691 movkw1, 0x, lsl 16 umull x1, w0, w1 lsr x1, x1, 32 lsr w1, w1, 1 add w1, w1, w1, lsl 1 cmp w0, w1 csetw0, eq ret Oh you mean not to produce a%3 followed by == part :). Anyways this code should go into the middle-end expanders and not be part of match.pd.