Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Christophe Lyon wrote:
> This causes my builds (all arm and aarch64 targets) to fail:

Richard Biener already committed a fix in r254498 (thanks). It seems constants in match.pd now need wi::to_wide.

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Hi Wilco,

On 7 November 2017 at 13:28, Wilco Dijkstra wrote:
> Sudi Das wrote:
>
>> Thanks, I have made the changes to the patch.
>> Also can someone please apply it for me. I do not have commit access.
>>
>> 2017-10-10  Sudakshina Das
>>
>> 	PR middle-end/80131
>> 	* match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.
>>
>> 2017-10-10  Sudakshina Das
>>
>> 	PR middle-end/80131
>> 	* testsuite/gcc.dg/pr80131-1.c: New Test.
>>
>> With regards to the existing missed optimizations needed to the x86 RTL
>> expansion, I think the discussions can take place on the bug report that
>> I created and maybe someone will pick it up.
>
> I've committed this as r254496.

This causes my builds (all arm and aarch64 targets) to fail:

g++ -fno-PIE -c -g -O2 -DIN_GCC -DCROSS_DIRECTORY_STRUCTURE -fno-exceptions -fno-rtti
  -fasynchronous-unwind-tables -W -Wall -Wwrite-strings -Wcast-qual
  -Wmissing-format-attribute -Woverloaded-virtual -pedantic -Wno-long-long
  -Wno-variadic-macros -Wno-overlength-strings -fno-common -Wno-unused -DHAVE_CONFIG_H
  -I. -I.
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/.
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../include
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libcpp/include
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/builds/gcc-fsf-gccsrc/obj-aarch64-none-linux-gnu/gcc1/./gmp
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gmp
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/builds/gcc-fsf-gccsrc/obj-aarch64-none-linux-gnu/gcc1/./mpfr/src
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/mpfr/src
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/mpc/src
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libdecnumber
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libdecnumber/dpd
  -I../libdecnumber
  -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libbacktrace
  -o gimple-match.o -MT gimple-match.o -MMD -MP -MF ./.deps/gimple-match.TPo gimple-match.c

In file included from /tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/coretypes.h:397,
                 from /tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/gimple-match-head.c:22,
                 from gimple-match.c:4:
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h: In function ‘bool wi::eq_p(const T1&, const T2&) [with T1 = tree_node*, T2 = int]’:
gimple-match.c:49533: instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
gimple-match.c:49533: instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h: In function ‘unsigned int wi::get_binary_precision(const T1&, const T2&) [with T1 = tree_node*, T2 = int]’:
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1763: instantiated from ‘bool wi::eq_p(const T1&, const T2&) [with T1 = tree_node*, T2 = int]’
gimple-match.c:49533: instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1696: error: incomplete type ‘wi::int_traits ’ used in nested name specifier
make[2]: *** [gimple-match.o] Error 1

Can you have a look?

> Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Sudi Das wrote:
> Thanks, I have made the changes to the patch.
> Also can someone please apply it for me. I do not have commit access.
>
> 2017-10-10  Sudakshina Das
>
> 	PR middle-end/80131
> 	* match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.
>
> 2017-10-10  Sudakshina Das
>
> 	PR middle-end/80131
> 	* testsuite/gcc.dg/pr80131-1.c: New Test.
>
> With regards to the existing missed optimizations needed to the x86 RTL
> expansion, I think the discussions can take place on the bug report that
> I created and maybe someone will pick it up.

I've committed this as r254496.

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Thanks, I have made the changes to the patch. Also can someone please apply it for me. I do not have commit access.

2017-10-10  Sudakshina Das  <sudi@arm.com>

	PR middle-end/80131
	* match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.

2017-10-10  Sudakshina Das  <sudi@arm.com>

	PR middle-end/80131
	* testsuite/gcc.dg/pr80131-1.c: New Test.

With regards to the existing missed optimizations needed to the x86 RTL expansion, I think the discussions can take place on the bug report that I created and maybe someone will pick it up.

Thanks
Sudi

From: Wilco Dijkstra
Sent: Monday, October 9, 2017 2:02 PM
To: Richard Biener; Sudi Das
Cc: Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)

Richard Biener wrote:
> I think the patch is ok with these changes but obviously we should try
> to address the code-generation issue on x86 at RTL expansion time.
> They are sort-of existing missing optimizations.

Note the only x64 specific issue is the backend expansion of 64-bit immediates which could be improved like I suggested. However what we're really missing is a generic optimization pass that tries to simplify immediates using accurate target costs. In eg. x >= C or x > C we can use either C or C-1 - on many targets one option may be a single instruction, while the other might take 2 or even 3.

Wilco

diff --git a/gcc/match.pd b/gcc/match.pd
index e58a65a..7a25a1b 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -600,6 +600,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && tree_nop_conversion_p (type, TREE_TYPE (@1)))
    (lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x).  */
+(simplify
+ (lshift integer_onep@0 (minus@1 INTEGER_CST@2 @3))
+ (if (INTEGRAL_TYPE_P (type)
+      && wi::eq_p (@2, TYPE_PRECISION (type) - 1)
+      && single_use (@1))
+  (if (TYPE_UNSIGNED (type))
+   (rshift (lshift @0 @2) @3)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert (rshift (lshift (convert:utype @0) @2) @3))))))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1.c b/gcc/testsuite/gcc.dg/pr80131-1.c
new file mode 100644
index 000..0bfe1f4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr80131-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target int32plus } */
+/* { dg-options "-fdump-tree-gimple" } */
+
+/* Checks the simplification of:
+   1 << (C - x) to (1 << C) >> x, where C = precision (type) - 1.
+   f1 is not simplified but f2, f3 and f4 are.  */
+
+__INT64_TYPE__ f1 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (31 - i);
+}
+
+__INT64_TYPE__ f2 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (63 - i);
+}
+
+__UINT64_TYPE__ f3 (__INT64_TYPE__ i)
+{
+  return (__UINT64_TYPE__)1 << (63 - i);
+}
+
+__INT32_TYPE__ f4 (__INT32_TYPE__ i)
+{
+  return (__INT32_TYPE__)1 << (31 - i);
+}
+
+/* { dg-final { scan-tree-dump-times "= 31 -" 1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "9223372036854775808 >>" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump "2147483648 >>" "gimple" } } */
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Richard Biener wrote:
> I think the patch is ok with these changes but obviously we should try
> to address the code-generation issue on x86 at RTL expansion time.
> They are sort-of existing missing optimizations.

Note the only x64 specific issue is the backend expansion of 64-bit immediates which could be improved like I suggested. However what we're really missing is a generic optimization pass that tries to simplify immediates using accurate target costs. In eg. x >= C or x > C we can use either C or C-1 - on many targets one option may be a single instruction, while the other might take 2 or even 3.

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Tue, Sep 26, 2017 at 2:44 PM, Sudi Das <sudi@arm.com> wrote:
>
> Still waiting on Jakub's comment on whether there are more things needed at
> the backend. But I have updated the patch according to Richard's comments.

+  (if (TYPE_UNSIGNED(type))

space before '(type)'.

+   (rshift (lshift @0 @2) @3)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert (rshift (lshift (convert:utype @0) @2) @3))))))

I think your testcase needs a { dg-require-effective-target int32plus } as otherwise it'll fail on AVR.

I think the patch is ok with these changes but obviously we should try to address the code-generation issue on x86 at RTL expansion time. They are sort-of existing missing optimizations.

Richard.

> Thanks
> Sudi
>
> From: Richard Biener <richard.guent...@gmail.com>
> Sent: Friday, August 4, 2017 11:16 AM
> To: Sudi Das
> Cc: Wilco Dijkstra; Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <sudi@arm.com> wrote:
>>
>> Sorry about the delayed response but looking at the above discussion, should
>> I conclude that this is a valid tree simplification?
>
> Yes, I think so. Jakub requested code to undo this at RTL expansion
> based on target costs, not sure if we really should
> require that from you given the user could have written the target
> sequence himself.
>
> Few comments about the patch:
>
> +/* Fold (1 << (C - x)) where C = precision(type) - 1
> +   into ((1 << C) >> x). */
> +(simplify
> + (lshift integer_onep@0 (minus INTEGER_CST@1 @2))
>
> I think this warrants a single_use check on the minus (note :s isn't enough
> as with the unsigned case we'd happily ignore it by design).
>
> + (if (INTEGRAL_TYPE_P (type)
> +      && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
> +      && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))
>
> You can relax this with using
>
>   && wi::eq_p (@1, TYPE_PRECISION (type) - 1)
>
> + (if (TYPE_UNSIGNED(type))
> +  (rshift (lshift @0 @1) @2)
> +  (with
> +   { tree utype = unsigned_type_for (type); }
> +   (convert:type (rshift (lshift (convert:utype @0) @1) @2))))
> +
>
> You can write (convert (rshift ...)), without the :type.
>
> I'm leaving it to Jakub whether you need to write that RTL expansion tweak.
>
> Thanks,
> Richard.
>
>> I am pasting the diff of the assembly that AArch64 generates with the test
>> case that I added. I see fewer instructions generated with the patch.
>>
>> --- pr80131-1.s	2017-08-01 10:02:43.243374174 +0100
>> +++ pr80131-1.s-patched	2017-08-01 10:00:54.776455630 +0100
>> @@ -24,10 +24,8 @@
>>  	str	x0, [sp, 8]
>>  	ldr	x0, [sp, 8]
>>  	mov	w1, w0
>> -	mov	w0, 63
>> -	sub	w0, w0, w1
>> -	mov	x1, 1
>> -	lsl	x0, x1, x0
>> +	mov	x0, -9223372036854775808
>> +	lsr	x0, x0, x1
>>  	add	sp, sp, 16
>>  	ret
>>  	.size	f2, .-f2
>> @@ -39,10 +37,8 @@
>>  	str	x0, [sp, 8]
>>  	ldr	x0, [sp, 8]
>>  	mov	w1, w0
>> -	mov	w0, 63
>> -	sub	w0, w0, w1
>> -	mov	x1, 1
>> -	lsl	x0, x1, x0
>> +	mov	x0, -9223372036854775808
>> +	lsr	x0, x0, x1
>>  	add	sp, sp, 16
>>  	ret
>>  	.size	f3, .-f3
>> @@ -52,11 +48,9 @@
>>  f4:
>>  	sub	sp, sp, #16
>>  	str	w0, [sp, 12]
>> -	mov	w1, 31
>>  	ldr	w0, [sp, 12]
>> -	sub	w0, w1, w0
>> -	mov	w1, 1
>> -	lsl	w0, w1, w0
>> +	mov	w1, -2147483648
>> +	lsr	w0, w1, w0
>>  	add	sp, sp, 16
>>  	ret
>>  	.size	f4, .-f4
>>
>> Thanks
>>
>> Sudi
>>
>> From: Wilco Dijkstra
>> Sent: Thursday, April 13, 2017 1:01 PM
>> To: Richard Biener; Jakub Jelinek
>> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
>> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>>
>> Richard Biener wrote:
>>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>>
>>>    movabsq $-9223372036854775808, %rax
>>>
>>> so this should then have been generated as 1<<63?
>>>
>>> At some point variable shifts were quite expensive as well..
>>
>> Yes I don't see a major difference between movabsq and
>>
>> 	movl	$1, %eax
>> 	salq	$63, %rax
>>
>> on my Sandy Bridge, but if the above is faster then that is what the x64
>> backend should emit - it's 1 byte smaller as well, so probably better in all
>> cases.
>>
>> Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Hi Jakub,

As per the discussions, I have created a bug report for the possible regression this may cause:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82454

Sudi

From: Wilco Dijkstra
Sent: Tuesday, September 26, 2017 2:20 PM
To: Sudi Das; Jakub Jelinek
Cc: Richard Biener; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)

Jakub Jelinek wrote:
> Well, we don't want to regress performance wise on one of the most important
> primary targets. I don't care that much if the RTL/backend work is done
> together with the patch, or as a follow-up during stage1/3, but it should be
> done, the testcases I've posted can be used as a basis of a P1 runtime
> performance regression.

It should be sufficient to file a bug about inefficient 64-bit constant expansions on x64. I didn't see a significant difference in my benchmarking of it on x64, so I'd say it's only a performance regression if large benchmarks regress measurably (quite unlikely).

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Jakub Jelinek wrote:
> Well, we don't want to regress performance wise on one of the most important
> primary targets. I don't care that much if the RTL/backend work is done
> together with the patch, or as a follow-up during stage1/3, but it should be
> done, the testcases I've posted can be used as a basis of a P1 runtime
> performance regression.

It should be sufficient to file a bug about inefficient 64-bit constant expansions on x64. I didn't see a significant difference in my benchmarking of it on x64, so I'd say it's only a performance regression if large benchmarks regress measurably (quite unlikely).

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Tue, Sep 26, 2017 at 12:44:10PM +, Sudi Das wrote:
>
> Still waiting on Jakub's comment on whether there are more things needed
> at the backend. But I have updated the patch according to Richard's
> comments.

Well, we don't want to regress performance wise on one of the most important primary targets. I don't care that much if the RTL/backend work is done together with the patch, or as a follow-up during stage1/3, but it should be done; the testcases I've posted can be used as a basis of a P1 runtime performance regression.

Jakub
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Still waiting on Jakub's comment on whether there are more things needed at the backend. But I have updated the patch according to Richard's comments.

Thanks
Sudi

From: Richard Biener <richard.guent...@gmail.com>
Sent: Friday, August 4, 2017 11:16 AM
To: Sudi Das
Cc: Wilco Dijkstra; Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)

On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <sudi@arm.com> wrote:
>
> Sorry about the delayed response but looking at the above discussion, should
> I conclude that this is a valid tree simplification?

Yes, I think so. Jakub requested code to undo this at RTL expansion based on target costs, not sure if we really should require that from you given the user could have written the target sequence himself.

Few comments about the patch:

+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))

I think this warrants a single_use check on the minus (note :s isn't enough as with the unsigned case we'd happily ignore it by design).

+ (if (INTEGRAL_TYPE_P (type)
+      && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+      && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))

You can relax this with using

  && wi::eq_p (@1, TYPE_PRECISION (type) - 1)

+ (if (TYPE_UNSIGNED(type))
+  (rshift (lshift @0 @1) @2)
+  (with
+   { tree utype = unsigned_type_for (type); }
+   (convert:type (rshift (lshift (convert:utype @0) @1) @2))))
+

You can write (convert (rshift ...)), without the :type.

I'm leaving it to Jakub whether you need to write that RTL expansion tweak.

Thanks,
Richard.

> I am pasting the diff of the assembly that AArch64 generates with the test
> case that I added. I see fewer instructions generated with the patch.
>
> --- pr80131-1.s	2017-08-01 10:02:43.243374174 +0100
> +++ pr80131-1.s-patched	2017-08-01 10:00:54.776455630 +0100
> @@ -24,10 +24,8 @@
>  	str	x0, [sp, 8]
>  	ldr	x0, [sp, 8]
>  	mov	w1, w0
> -	mov	w0, 63
> -	sub	w0, w0, w1
> -	mov	x1, 1
> -	lsl	x0, x1, x0
> +	mov	x0, -9223372036854775808
> +	lsr	x0, x0, x1
>  	add	sp, sp, 16
>  	ret
>  	.size	f2, .-f2
> @@ -39,10 +37,8 @@
>  	str	x0, [sp, 8]
>  	ldr	x0, [sp, 8]
>  	mov	w1, w0
> -	mov	w0, 63
> -	sub	w0, w0, w1
> -	mov	x1, 1
> -	lsl	x0, x1, x0
> +	mov	x0, -9223372036854775808
> +	lsr	x0, x0, x1
>  	add	sp, sp, 16
>  	ret
>  	.size	f3, .-f3
> @@ -52,11 +48,9 @@
>  f4:
>  	sub	sp, sp, #16
>  	str	w0, [sp, 12]
> -	mov	w1, 31
>  	ldr	w0, [sp, 12]
> -	sub	w0, w1, w0
> -	mov	w1, 1
> -	lsl	w0, w1, w0
> +	mov	w1, -2147483648
> +	lsr	w0, w1, w0
>  	add	sp, sp, 16
>  	ret
>  	.size	f4, .-f4
>
> Thanks
>
> Sudi
>
> From: Wilco Dijkstra
> Sent: Thursday, April 13, 2017 1:01 PM
> To: Richard Biener; Jakub Jelinek
> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> Richard Biener wrote:
>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>
>>    movabsq $-9223372036854775808, %rax
>>
>> so this should then have been generated as 1<<63?
>>
>> At some point variable shifts were quite expensive as well..
>
> Yes I don't see a major difference between movabsq and
>
> 	movl	$1, %eax
> 	salq	$63, %rax
>
> on my Sandy Bridge, but if the above is faster then that is what the x64
> backend should emit - it's 1 byte smaller as well, so probably better in all
> cases.
>
> Wilco

diff --git a/gcc/match.pd b/gcc/match.pd
index e9017e4..160c12d 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -600,6 +600,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && tree_nop_conversion_p (type, TREE_TYPE (@1)))
    (lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x).  */
+(simplify
+ (lshift integer_onep@0 (minus@1 INTEGER_CST@2 @3))
+ (if (INTEGRAL_TYPE_P (type)
+      && wi::eq_p (@2, TYPE_PRECISION (type) - 1)
+      && single_use (@1))
+  (if (TYPE_UNSIGNED (type))
+   (rshift (lshift @0 @2) @3)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert (rshift (lshift (convert:utype @0) @2) @3))))))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <sudi@arm.com> wrote:
>
> Sorry about the delayed response but looking at the above discussion, should
> I conclude that this is a valid tree simplification?

Yes, I think so. Jakub requested code to undo this at RTL expansion based on target costs, not sure if we really should require that from you given the user could have written the target sequence himself.

Few comments about the patch:

+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))

I think this warrants a single_use check on the minus (note :s isn't enough as with the unsigned case we'd happily ignore it by design).

+ (if (INTEGRAL_TYPE_P (type)
+      && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+      && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))

You can relax this with using

  && wi::eq_p (@1, TYPE_PRECISION (type) - 1)

+ (if (TYPE_UNSIGNED(type))
+  (rshift (lshift @0 @1) @2)
+  (with
+   { tree utype = unsigned_type_for (type); }
+   (convert:type (rshift (lshift (convert:utype @0) @1) @2))))
+

You can write (convert (rshift ...)), without the :type.

I'm leaving it to Jakub whether you need to write that RTL expansion tweak.

Thanks,
Richard.

> I am pasting the diff of the assembly that AArch64 generates with the test
> case that I added. I see fewer instructions generated with the patch.
>
> --- pr80131-1.s	2017-08-01 10:02:43.243374174 +0100
> +++ pr80131-1.s-patched	2017-08-01 10:00:54.776455630 +0100
> @@ -24,10 +24,8 @@
>  	str	x0, [sp, 8]
>  	ldr	x0, [sp, 8]
>  	mov	w1, w0
> -	mov	w0, 63
> -	sub	w0, w0, w1
> -	mov	x1, 1
> -	lsl	x0, x1, x0
> +	mov	x0, -9223372036854775808
> +	lsr	x0, x0, x1
>  	add	sp, sp, 16
>  	ret
>  	.size	f2, .-f2
> @@ -39,10 +37,8 @@
>  	str	x0, [sp, 8]
>  	ldr	x0, [sp, 8]
>  	mov	w1, w0
> -	mov	w0, 63
> -	sub	w0, w0, w1
> -	mov	x1, 1
> -	lsl	x0, x1, x0
> +	mov	x0, -9223372036854775808
> +	lsr	x0, x0, x1
>  	add	sp, sp, 16
>  	ret
>  	.size	f3, .-f3
> @@ -52,11 +48,9 @@
>  f4:
>  	sub	sp, sp, #16
>  	str	w0, [sp, 12]
> -	mov	w1, 31
>  	ldr	w0, [sp, 12]
> -	sub	w0, w1, w0
> -	mov	w1, 1
> -	lsl	w0, w1, w0
> +	mov	w1, -2147483648
> +	lsr	w0, w1, w0
>  	add	sp, sp, 16
>  	ret
>  	.size	f4, .-f4
>
> Thanks
>
> Sudi
>
> From: Wilco Dijkstra
> Sent: Thursday, April 13, 2017 1:01 PM
> To: Richard Biener; Jakub Jelinek
> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> Richard Biener wrote:
>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>
>>    movabsq $-9223372036854775808, %rax
>>
>> so this should then have been generated as 1<<63?
>>
>> At some point variable shifts were quite expensive as well..
>
> Yes I don't see a major difference between movabsq and
>
> 	movl	$1, %eax
> 	salq	$63, %rax
>
> on my Sandy Bridge, but if the above is faster then that is what the x64
> backend should emit - it's 1 byte smaller as well, so probably better in all
> cases.
>
> Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Sorry about the delayed response but looking at the above discussion, should I conclude that this is a valid tree simplification?

I am pasting the diff of the assembly that AArch64 generates with the test case that I added. I see fewer instructions generated with the patch.

--- pr80131-1.s	2017-08-01 10:02:43.243374174 +0100
+++ pr80131-1.s-patched	2017-08-01 10:00:54.776455630 +0100
@@ -24,10 +24,8 @@
 	str	x0, [sp, 8]
 	ldr	x0, [sp, 8]
 	mov	w1, w0
-	mov	w0, 63
-	sub	w0, w0, w1
-	mov	x1, 1
-	lsl	x0, x1, x0
+	mov	x0, -9223372036854775808
+	lsr	x0, x0, x1
 	add	sp, sp, 16
 	ret
 	.size	f2, .-f2
@@ -39,10 +37,8 @@
 	str	x0, [sp, 8]
 	ldr	x0, [sp, 8]
 	mov	w1, w0
-	mov	w0, 63
-	sub	w0, w0, w1
-	mov	x1, 1
-	lsl	x0, x1, x0
+	mov	x0, -9223372036854775808
+	lsr	x0, x0, x1
 	add	sp, sp, 16
 	ret
 	.size	f3, .-f3
@@ -52,11 +48,9 @@
 f4:
 	sub	sp, sp, #16
 	str	w0, [sp, 12]
-	mov	w1, 31
 	ldr	w0, [sp, 12]
-	sub	w0, w1, w0
-	mov	w1, 1
-	lsl	w0, w1, w0
+	mov	w1, -2147483648
+	lsr	w0, w1, w0
 	add	sp, sp, 16
 	ret
 	.size	f4, .-f4

Thanks
Sudi

From: Wilco Dijkstra
Sent: Thursday, April 13, 2017 1:01 PM
To: Richard Biener; Jakub Jelinek
Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)

Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.
>
>   movabsq $-9223372036854775808, %rax
>
> so this should then have been generated as 1<<63?
>
> At some point variable shifts were quite expensive as well..

Yes I don't see a major difference between movabsq and

	movl	$1, %eax
	salq	$63, %rax

on my Sandy Bridge, but if the above is faster then that is what the x64 backend should emit - it's 1 byte smaller as well, so probably better in all cases.

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.
>
>   movabsq $-9223372036854775808, %rax
>
> so this should then have been generated as 1<<63?
>
> At some point variable shifts were quite expensive as well..

Yes I don't see a major difference between movabsq and

	movl	$1, %eax
	salq	$63, %rax

on my Sandy Bridge, but if the above is faster then that is what the x64 backend should emit - it's 1 byte smaller as well, so probably better in all cases.

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Thu, Apr 13, 2017 at 01:49:01PM +0200, Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.

As I said, we can do it as GIMPLE canonicalization, but we should have code to undo it if beneficial at RTL level. And the patch has not included that.

> >   movabsq $-9223372036854775808, %rax
>
> so this should then have been generated as 1<<63?

Maybe. But it seems to be still more expensive than the original code, though better than movabsq:

__attribute__((noinline, noclone)) unsigned long long int
foo (int x)
{
  asm volatile ("" : : : "memory");
  return 1ULL << (63 - x);
}

__attribute__((noinline, noclone)) unsigned long long int
bar (int x)
{
  asm volatile ("" : : : "memory");
  return (1ULL << 63) >> x;
}

__attribute__((noinline, noclone)) unsigned long long int
baz (int x)
{
  unsigned long long int y = 1;
  asm volatile ("" : "+r" (y) : : "memory");
  return (y << 63) >> x;
}

int
main (int argc, const char **argv)
{
  int i;
  if (argc == 1)
    for (i = 0; i < 10; i++)
      asm volatile ("" : : "r" (foo (13)));
  else if (argc == 2)
    for (i = 0; i < 10; i++)
      asm volatile ("" : : "r" (bar (13)));
  else if (argc == 3)
    for (i = 0; i < 10; i++)
      asm volatile ("" : : "r" (baz (13)));
  return 0;
}

Jakub
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Thu, Apr 13, 2017 at 1:41 PM, Jakub Jelinek wrote:
> On Thu, Apr 13, 2017 at 11:33:12AM +, Wilco Dijkstra wrote:
>> Jakub Jelinek wrote:
>>
>>> No. Some constants sometimes even 7 instructions (e.g. sparc64; not talking
>>> in particular about 1ULL << 63 constant), or have one instruction
>>> that is more expensive than normal small constant load. Compare say x86_64
>>> movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
>>> CPUs. So no, I think it isn't an unconditional win.
>>
>> We're specifically only talking about the constants (1L << 63), (1 << 31)
>> and (1 << 15).
>> On all targets these need at most 2 simple instructions. That makes it an
>> unconditional win.
>
> It is not a win on at least Haswell-E:
>
> __attribute__((noinline, noclone)) unsigned long long int
> foo (int x)
> {
>   asm volatile ("" : : : "memory");
>   return 1ULL << (63 - x);
> }
>
> __attribute__((noinline, noclone)) unsigned long long int
> bar (int x)
> {
>   asm volatile ("" : : : "memory");
>   return (1ULL << 63) >> x;
> }
>
> int
> main (int argc, const char **argv)
> {
>   int i;
>   if (argc == 1)
>     for (i = 0; i < 10; i++)
>       asm volatile ("" : : "r" (foo (13)));
>   else
>     for (i = 0; i < 10; i++)
>       asm volatile ("" : : "r" (bar (13)));
>   return 0;
> }
>
> $ time /tmp/test
>
> real	0m1.290s
> user	0m1.288s
> sys	0m0.002s
>
> $ time /tmp/test 1
>
> real	0m1.542s
> user	0m1.540s
> sys	0m0.002s
>
> As I said, movabsq is expensive.

It is IMHO a valid GIMPLE optimization / canonicalization.

	movabsq $-9223372036854775808, %rax

so this should then have been generated as 1<<63?

At some point variable shifts were quite expensive as well..

Richard.

> Jakub
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Thu, Apr 13, 2017 at 11:33:12AM +, Wilco Dijkstra wrote:
> Jakub Jelinek wrote:
>
> > No. Some constants sometimes even 7 instructions (e.g. sparc64; not talking
> > in particular about 1ULL << 63 constant), or have one instruction
> > that is more expensive than normal small constant load. Compare say x86_64
> > movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
> > CPUs. So no, I think it isn't an unconditional win.
>
> We're specifically only talking about the constants (1L << 63), (1 << 31) and
> (1 << 15).
> On all targets these need at most 2 simple instructions. That makes it an
> unconditional win.

It is not a win on at least Haswell-E:

__attribute__((noinline, noclone)) unsigned long long int
foo (int x)
{
  asm volatile ("" : : : "memory");
  return 1ULL << (63 - x);
}

__attribute__((noinline, noclone)) unsigned long long int
bar (int x)
{
  asm volatile ("" : : : "memory");
  return (1ULL << 63) >> x;
}

int
main (int argc, const char **argv)
{
  int i;
  if (argc == 1)
    for (i = 0; i < 10; i++)
      asm volatile ("" : : "r" (foo (13)));
  else
    for (i = 0; i < 10; i++)
      asm volatile ("" : : "r" (bar (13)));
  return 0;
}

$ time /tmp/test

real	0m1.290s
user	0m1.288s
sys	0m0.002s

$ time /tmp/test 1

real	0m1.542s
user	0m1.540s
sys	0m0.002s

As I said, movabsq is expensive.

Jakub
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Jakub Jelinek wrote:
> No. Some constants sometimes even 7 instructions (e.g. sparc64; not talking
> in particular about 1ULL << 63 constant), or have one instruction
> that is more expensive than normal small constant load. Compare say x86_64
> movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
> CPUs. So no, I think it isn't an unconditional win.

We're specifically only talking about the constants (1L << 63), (1 << 31) and (1 << 15). On all targets these need at most 2 simple instructions. That makes it an unconditional win.

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Thu, Apr 13, 2017 at 11:16:08AM +, Wilco Dijkstra wrote:
> > On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > > Hi all
> > >
> > > This is a fix for PR 80131.
> > > Currently the code A << (B - C) is not simplified.
> > > However at least a more specific case of 1U << (C - x) where C =
> > > precision(type) - 1 can be simplified to (1 << C) >> x.
> >
> > Is that always a win though?
>
> Yes, assuming left and right shift have the same performance.
>
> > Some constants have higher costs than others on various targets, some
> > significantly higher. This change might be beneficial only
> > if C is as expensive as 1; then you get rid of one (typically cheap)
> > operation.
>
> Most targets can create the constant cheaply. Targets that can't would need 2

No. Some constants sometimes even 7 instructions (e.g. sparc64; not talking in particular about 1ULL << 63 constant), or have one instruction that is more expensive than normal small constant load. Compare say x86_64 movl/movq vs. movabsq, I think the latter has 3 times longer latency on many CPUs. So no, I think it isn't an unconditional win.

Jakub
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
> On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > Hi all
> >
> > This is a fix for PR 80131.
> > Currently the code A << (B - C) is not simplified.
> > However at least a more specific case of 1U << (C - x) where C =
> > precision(type) - 1 can be simplified to (1 << C) >> x.
>
> Is that always a win though?

Yes, assuming left and right shift have the same performance.

> Some constants have higher costs than others on various targets, some
> significantly higher. This change might be beneficial only
> if C is as expensive as 1; then you get rid of one (typically cheap)
> operation.

Most targets can create the constant cheaply. Targets that can't would need 2 instructions (move+shift) or a literal load. That's not worse compared to the original sequence (3 operations). The constant can be shared or lifted out of a loop, so we're saving 1 subtract per use of the sequence.

> Which makes me wonder whether this should be done at GIMPLE time and not
> at RTL time (expansion or combine etc.) when one can compare the RTX costs.
> Or do this at match.pd as canonicalization and then have an RTL transformation
> to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> shorter).

I can't see why that would be useful for just this pattern. There are lots more useful cases where GCC should simplify immediates to fit the target but it doesn't currently. For example it changes x < 0x10 to x <= 0xf, which is worse on many targets.

Wilco
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Wed, Apr 12, 2017 at 08:59:34PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 12, 2017 at 01:15:56PM -0500, Segher Boessenkool wrote:
> > On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> > > On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> > > > This is a fix for PR 80131
> > > > Currently the code A << (B - C) is not simplified.
> > > > However at least a more specific case of 1U << (C - x) where C =
> > > > precision(type) - 1 can be simplified to (1 << C) >> x.
> > >
> > > Is that always a win though?
> > > Some constants have higher costs than others on various targets, some
> > > significantly higher.  This change might be beneficial only
> > > if C is as expensive as 1, then you get rid of one (typically cheap)
> > > operation.
> > > Which makes me wonder whether this should be done at GIMPLE time and not
> > > at RTL time (expansion or combine etc.) when one can compare the RTX
> > > costs.
> >
> > Yeah, either combine or simplify-rtx I'd say.
> >
> > The transform A << (B - C) ---> (A << B) >> C
> > only works if A << B does not overflow but A << (B + 1) does (and then
>
> Is that really true?  The A << B does not overflow is obvious precondition.
> But consider say A 5U, B 29 and C (not compile time known) -2.

Ah yes, A unsigned.  Bah.

> 5U << 31 is valid 0x80000000U, but (5U << 29) >> (-2) is UB.
> Isn't the other condition instead that either C must be non-negative, or
> B must be number of bits in A's type - 1, i.e. that for negative C
> A << (B - C) would already be always UB?
> But then unless we know C is non-negative, A must be really just 1,
> otherwise A << B overflows.

Yeah.

Segher
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Wed, Apr 12, 2017 at 01:15:56PM -0500, Segher Boessenkool wrote:
> Hi,
>
> On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> > On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> > > This is a fix for PR 80131
> > > Currently the code A << (B - C) is not simplified.
> > > However at least a more specific case of 1U << (C - x) where C =
> > > precision(type) - 1 can be simplified to (1 << C) >> x.
> >
> > Is that always a win though?
> > Some constants have higher costs than others on various targets, some
> > significantly higher.  This change might be beneficial only
> > if C is as expensive as 1, then you get rid of one (typically cheap)
> > operation.
> > Which makes me wonder whether this should be done at GIMPLE time and not
> > at RTL time (expansion or combine etc.) when one can compare the RTX costs.
>
> Yeah, either combine or simplify-rtx I'd say.
>
> The transform A << (B - C) ---> (A << B) >> C
> only works if A << B does not overflow but A << (B + 1) does (and then

Is that really true?  The A << B does not overflow is obvious precondition.
But consider say A 5U, B 29 and C (not compile time known) -2.
5U << 31 is valid 0x80000000U, but (5U << 29) >> (-2) is UB.
Isn't the other condition instead that either C must be non-negative, or
B must be number of bits in A's type - 1, i.e. that for negative C
A << (B - C) would already be always UB?
But then unless we know C is non-negative, A must be really just 1,
otherwise A << B overflows.

> always does work afaics).  Or if we know C is non-negative and A << B
> does not overflow.  So realistically A and B need to be constant.
>
> > Or do this at match.pd as canonicalization and then have RTL transformation
> > to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> > shorter).
>
> The inverse transform only works for A=1, not for the more general case.

	Jakub
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
Hi,

On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> > This is a fix for PR 80131
> > Currently the code A << (B - C) is not simplified.
> > However at least a more specific case of 1U << (C - x) where C =
> > precision(type) - 1 can be simplified to (1 << C) >> x.
>
> Is that always a win though?
> Some constants have higher costs than others on various targets, some
> significantly higher.  This change might be beneficial only
> if C is as expensive as 1, then you get rid of one (typically cheap)
> operation.
> Which makes me wonder whether this should be done at GIMPLE time and not
> at RTL time (expansion or combine etc.) when one can compare the RTX costs.

Yeah, either combine or simplify-rtx I'd say.

The transform A << (B - C) ---> (A << B) >> C
only works if A << B does not overflow but A << (B + 1) does (and then
always does work afaics).  Or if we know C is non-negative and A << B
does not overflow.  So realistically A and B need to be constant.

> Or do this at match.pd as canonicalization and then have RTL transformation
> to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> shorter).

The inverse transform only works for A=1, not for the more general case.

Segher
Re: [PATCH][GCC] Simplification of 1U << (31 - x)
On Wed, Apr 12, 2017 at 09:29:55AM +0000, Sudi Das wrote:
> Hi all
>
> This is a fix for PR 80131
> Currently the code A << (B - C) is not simplified.
> However at least a more specific case of 1U << (C - x) where C =
> precision(type) - 1 can be simplified to (1 << C) >> x.

Is that always a win though?
Some constants have higher costs than others on various targets, some
significantly higher.  This change might be beneficial only
if C is as expensive as 1; then you get rid of one (typically cheap)
operation.
Which makes me wonder whether this should be done at GIMPLE time and not
at RTL time (expansion or combine etc.) when one can compare the RTX costs.
Or do this at match.pd as canonicalization and then have RTL transformation
to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
shorter).

	Jakub
[PATCH][GCC] Simplification of 1U << (31 - x)
Hi all

This is a fix for PR 80131
Currently the code A << (B - C) is not simplified.
However at least a more specific case of 1U << (C - x) where C =
precision(type) - 1 can be simplified to (1 << C) >> x.
This is done by adding a new simplification rule in match.pd

So for a test case:

unsigned int f1(unsigned int i)
{
  return 1U << (31 - i);
}

We see a gimple dump of

f1 (unsigned int i)
{
  unsigned int D.3121;

  D.3121 = 2147483648 >> i;
  return D.3121;
}

instead of

f1 (unsigned int i)
{
  unsigned int D.3121;

  _1 = 31 - i;
  D.3121 = 1 << _1;
  return D.3121;
}

Add a new test case and checked for regressions on bootstrapped
aarch64-none-linux-gnu.
Ok for stage 1?

Thanks
Sudi

2017-03-23  Sudakshina Das

	PR middle-end/80131
	* match.pd: Simplify 1 << (C - x) where C = precision (type) - 1.

2017-03-23  Sudakshina Das

	PR middle-end/80131
	* testsuite/gcc.dg/pr80131-1.c: New Test.

diff --git a/gcc/match.pd b/gcc/match.pd
index 7b96800..be20fb7 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -508,6 +508,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && tree_nop_conversion_p (type, TREE_TYPE (@1)))
    (lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x).  */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))
+ (if (INTEGRAL_TYPE_P (type)
+      && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+      && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))
+  (if (TYPE_UNSIGNED (type))
+   (rshift (lshift @0 @1) @2)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert:type (rshift (lshift (convert:utype @0) @1) @2))))))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)

diff --git a/gcc/testsuite/gcc.dg/pr80131-1.c b/gcc/testsuite/gcc.dg/pr80131-1.c
new file mode 100644
index 000..2bb6ff3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr80131-1.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-fdump-tree-gimple" } */
+
+/* Checks the simplification of:
+   1 << (C - x) to (1 << C) >> x, where C = precision (type) - 1
+   f1 is not simplified but f2, f3 and f4 are.  */
+
+__INT64_TYPE__ f1 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (31 - i);
+}
+
+__INT64_TYPE__ f2 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (63 - i);
+}
+
+__UINT64_TYPE__ f3 (__INT64_TYPE__ i)
+{
+  return (__UINT64_TYPE__)1 << (63 - i);
+}
+
+__INT32_TYPE__ f4 (__INT32_TYPE__ i)
+{
+  return (__INT32_TYPE__)1 << (31 - i);
+}
+
+/* { dg-final { scan-tree-dump-times "= 31 -" 1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "9223372036854775808 >>" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump "2147483648 >>" "gimple" } } */