Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-11-07 Thread Wilco Dijkstra
Christophe Lyon wrote:

> This causes my builds (all arm and aarch64 targets) to fail:

Richard Biener already committed a fix in r254498 (thanks).
It seems constants in match.pd now need wi::to_wide.

Wilco

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-11-07 Thread Christophe Lyon
Hi Wilco

On 7 November 2017 at 13:28, Wilco Dijkstra  wrote:
> Sudi Das wrote:
>
>> Thanks, I have made the changes to the patch.
>> Also can someone please apply it for me. I do not have commit access.
>>
>> 2017-10-10  Sudakshina Das  
>>
>>PR middle-end/80131
>>* match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.
>>
>> 2017-10-10  Sudakshina Das  
>>
>>PR middle-end/80131
>>* testsuite/gcc.dg/pr80131-1.c: New Test.
>>
>>
>> With regards to the existing missed optimizations needed to the x86 RTL 
>> expansion,
>> I think the discussions can take place on the bug report that I created and 
>> maybe someone will pick it up.
>
> I've committed this as r254496.

This causes my builds (all arm and aarch64 targets) to fail:
g++ -fno-PIE -c   -g -O2 -DIN_GCC  -DCROSS_DIRECTORY_STRUCTURE
-fno-exceptions -fno-rtti -fasynchronous-unwind-tables -W -Wall
-Wwrite-strings -Wcast-qual -Wmissing-format-attribute
-Woverloaded-virtual -pedantic -Wno-long-long -Wno-variadic-macros
-Wno-overlength-strings -fno-common -Wno-unused -DHAVE_CONFIG_H -I.
-I. -I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/.
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../include
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libcpp/include
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/builds/gcc-fsf-gccsrc/obj-aarch64-none-linux-gnu/gcc1/./gmp
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gmp
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/builds/gcc-fsf-gccsrc/obj-aarch64-none-linux-gnu/gcc1/./mpfr/src
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/mpfr/src
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/mpc/src
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libdecnumber
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libdecnumber/dpd
-I../libdecnumber
-I/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/../libbacktrace
  -o gimple-match.o -MT gimple-match.o -MMD -MP -MF
./.deps/gimple-match.TPo gimple-match.c
In file included from
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/coretypes.h:397,
 from
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/gimple-match-head.c:22,
 from gimple-match.c:4:
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:
In function ‘bool wi::eq_p(const T1&, const T2&) [with T1 =
tree_node*, T2 = int]’:
gimple-match.c:49533:   instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
gimple-match.c:49533:   instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1764:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:
In function ‘unsigned int wi::get_binary_precision(const T1&, const
T2&) [with T1 = tree_node*, T2 = int]’:
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1763:
  instantiated from ‘bool wi::eq_p(const T1&, const T2&) [with T1 =
tree_node*, T2 = int]’
gimple-match.c:49533:   instantiated from here
/tmp/1717606_6.tmpdir/aci-gcc-fsf/sources/gcc-fsf/gccsrc/gcc/wide-int.h:1696:
error: incomplete type ‘wi::int_traits’ used in nested
name specifier
make[2]: *** [gimple-match.o] Error 1

Can you have a look?

>
> Wilco


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-11-07 Thread Wilco Dijkstra
Sudi Das wrote:

> Thanks, I have made the changes to the patch.
> Also can someone please apply it for me. I do not have commit access.
>
> 2017-10-10  Sudakshina Das  
>
>    PR middle-end/80131
>    * match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.
>
> 2017-10-10  Sudakshina Das  
>
>    PR middle-end/80131
>    * testsuite/gcc.dg/pr80131-1.c: New Test.
>
>
> With regards to the existing missed optimizations needed to the x86 RTL 
> expansion, 
> I think the discussions can take place on the bug report that I created and 
> maybe someone will pick it up.

I've committed this as r254496.

Wilco

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-10-10 Thread Sudi Das


Thanks, I have made the changes to the patch.
Also can someone please apply it for me. I do not have commit access.

2017-10-10  Sudakshina Das  <sudi@arm.com>

PR middle-end/80131
* match.pd: Simplify 1 << (C - x) where C = precision (x) - 1.

2017-10-10  Sudakshina Das  <sudi@arm.com>

PR middle-end/80131
* testsuite/gcc.dg/pr80131-1.c: New Test.


With regards to the existing missed optimizations needed to the x86 RTL 
expansion, I think the discussions can take place on the bug report that I 
created and maybe someone will pick it up.

Thanks
Sudi




From: Wilco Dijkstra
Sent: Monday, October 9, 2017 2:02 PM
To: Richard Biener; Sudi Das
Cc: Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
    
Richard Biener wrote:

> I think the patch is ok with these changes but obviously we should try
> to address
> the code-generation issue on x86 at RTL expansion time.  They are sort-of
> existing missing optimizations.

Note the only x64 specific issue is the backend expansion of 64-bit immediates
which could be improved like I suggested. However what we're really missing
is a generic optimization pass that tries to simplify immediates using accurate
target costs. In eg. x >= C or x > C we can use either C or C-1 - on many 
targets
one option may be a single instruction, while the other might take 2 or even 3.

Wilcodiff --git a/gcc/match.pd b/gcc/match.pd
index e58a65a..7a25a1b 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -600,6 +600,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
&& tree_nop_conversion_p (type, TREE_TYPE (@1)))
(lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus@1 INTEGER_CST@2 @3))
+  (if (INTEGRAL_TYPE_P (type)
+   && wi::eq_p (@2, TYPE_PRECISION (type) - 1)
+   && single_use (@1))
+   (if (TYPE_UNSIGNED (type))
+ (rshift (lshift @0 @2) @3)
+   (with
+{ tree utype = unsigned_type_for (type); }
+(convert (rshift (lshift (convert:utype @0) @2) @3))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1.c b/gcc/testsuite/gcc.dg/pr80131-1.c
new file mode 100644
index 000..0bfe1f4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr80131-1.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target int32plus } */
+/* { dg-options "-fdump-tree-gimple" } */
+
+/* Checks the simplification of:
+   1 << (C - x) to (1 << C) >> x, where C = precision (type) - 1
+   f1 is not simplified but f2, f3 and f4 are. */
+
+__INT64_TYPE__ f1 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (31 - i);
+}
+
+__INT64_TYPE__ f2 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (63 - i);
+}
+
+__UINT64_TYPE__ f3 (__INT64_TYPE__ i)
+{
+  return (__UINT64_TYPE__)1 << (63 - i);
+}
+
+__INT32_TYPE__ f4 (__INT32_TYPE__ i)
+{
+  return (__INT32_TYPE__)1 << (31 - i);
+}
+
+/* { dg-final { scan-tree-dump-times "= 31 -"  1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "9223372036854775808 >>" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump "2147483648 >>" "gimple" } } */


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-10-09 Thread Wilco Dijkstra
Richard Biener wrote:

> I think the patch is ok with these changes but obviously we should try
> to address
> the code-generation issue on x86 at RTL expansion time.  They are sort-of
> existing missing optimizations.

Note the only x64 specific issue is the backend expansion of 64-bit immediates
which could be improved like I suggested. However what we're really missing
is a generic optimization pass that tries to simplify immediates using accurate
target costs. In eg. x >= C or x > C we can use either C or C-1 - on many 
targets
one option may be a single instruction, while the other might take 2 or even 3.

Wilco

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-10-09 Thread Richard Biener
On Tue, Sep 26, 2017 at 2:44 PM, Sudi Das <sudi@arm.com> wrote:
>
> Still waiting on Jakub's comment on whether there are more things needed at 
> the backend. But I have updated the patch according to Richard's comments.

+   (if (TYPE_UNSIGNED(type))

space before '(type)'.

+ (rshift (lshift @0 @2) @3)
+   (with
+{ tree utype = unsigned_type_for (type); }
+(convert (rshift (lshift (convert:utype @0) @2) @3))

I think your testcase needs a

{ dg-require-effective-target int32plus }

as otherwise it'll fail on AVR.

I think the patch is ok with these changes but obviously we should try
to address
the code-generation issue on x86 at RTL expansion time.  They are sort-of
existing missing optimizations.

Richard.

> Thanks
> Sudi
>
>
>
> From: Richard Biener <richard.guent...@gmail.com>
> Sent: Friday, August 4, 2017 11:16 AM
> To: Sudi Das
> Cc: Wilco Dijkstra; Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James 
> Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <sudi@arm.com> wrote:
>>
>>
>>
>>
>> Sorry about the delayed response but looking at the above discussion, should 
>> I conclude that this is a valid tree simplification?
>
> Yes, I think so.  Jakub requested code to undo this at RTL expansion
> based on target costs, not sure if we really should
> require that from you given the user could have written the target
> sequence himself.
>
> Few comments about the patch:
>
> +/* Fold (1 << (C - x)) where C = precision(type) - 1
> +   into ((1 << C) >> x). */
> +(simplify
> + (lshift integer_onep@0 (minus INTEGER_CST@1 @2))
>
> I think this warrants a single_use check on the minus (note :s isn't enough
> as with the unsigned case we'd happily ignore it by design).
>
> +  (if (INTEGRAL_TYPE_P (type)
> +   && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
> +   && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))
>
> You can relax this with using
>
>   && wi::eq_p (@1, TYPE_PRECISION (type) - 1)
>
> +   (if (TYPE_UNSIGNED(type))
> + (rshift (lshift @0 @1) @2)
> +   (with
> +{ tree utype = unsigned_type_for (type); }
> +(convert:type (rshift (lshift (convert:utype @0) @1) @2))
> +
>
> You can write (convert (rshift ...)), without the :type.
>
> I'm leaving it to Jakub whether you need to write that RTL expansion tweak.
>
> Thanks,
> Richard.
>
>> I am pasting the diff of the assembly that AArch64 generates with the test 
>> case that I added. I see fewer instructions generated with the patch.
>>
>> --- pr80131-1.s2017-08-01 10:02:43.243374174 +0100
>> +++ pr80131-1.s-patched2017-08-01 10:00:54.776455630 +0100
>> @@ -24,10 +24,8 @@
>>  strx0, [sp, 8]
>>  ldrx0, [sp, 8]
>>  movw1, w0
>> -movw0, 63
>> -subw0, w0, w1
>> -movx1, 1
>> -lslx0, x1, x0
>> +movx0, -9223372036854775808
>> +lsrx0, x0, x1
>>  addsp, sp, 16
>>  ret
>>  .sizef2, .-f2
>> @@ -39,10 +37,8 @@
>>  strx0, [sp, 8]
>>  ldrx0, [sp, 8]
>>  movw1, w0
>> -movw0, 63
>> -subw0, w0, w1
>> -movx1, 1
>> -lslx0, x1, x0
>> +movx0, -9223372036854775808
>> +lsrx0, x0, x1
>>  addsp, sp, 16
>>  ret
>>  .sizef3, .-f3
>> @@ -52,11 +48,9 @@
>>  f4:
>>  subsp, sp, #16
>>  strw0, [sp, 12]
>> -movw1, 31
>>  ldrw0, [sp, 12]
>> -subw0, w1, w0
>> -movw1, 1
>> -lslw0, w1, w0
>> +movw1, -2147483648
>> +lsrw0, w1, w0
>>  addsp, sp, 16
>>  ret
>>  .sizef4, .-f4
>>
>>
>> Thanks
>>
>> Sudi
>>
>>
>>
>>
>> From: Wilco Dijkstra
>> Sent: Thursday, April 13, 2017 1:01 PM
>> To: Richard Biener; Jakub Jelinek
>> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
>> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>>
>> Richard Biener wrote:
>>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>>
>>>movabsq $-9223372036854775808, %rax
>>>
>>> so this should then have been generated as 1<<63?
>>>
>>> At some point variable shifts were quite expensive as well..
>>
>> Yes I don't see a major difference between movabsq and
>>
>> movl$1, %eax
>> salq$63, %rax
>>
>> on my Sandy Bridge, but if the above is faster then that is what the x64
>> backend should emit - it's 1 byte smaller as well, so probably better in all
>> cases.
>>
>> Wilco
>


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-10-06 Thread Sudi Das

Hi Jakub

As per the discussions, I have a created a bug report for the possible 
regression this may cause.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82454

Sudi


From: Wilco Dijkstra
Sent: Tuesday, September 26, 2017 2:20 PM
To: Sudi Das; Jakub Jelinek
Cc: Richard Biener; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
    
Jakub Jelinek wrote:

> Well, we don't want to regress performance wise on one of the most important
> primary targets.  I don't care that much if the RTL/backend work is done
> together with the patch, or as a follow-up during stage1/3, but it should be
> done, the testcases I've posted can be used as a basis of a P1 runtime
> performance regression.

It should be sufficient to file a bug about inefficient 64-bit constant 
expansions on
x64. I didn't see a significant difference in my benchmarking of it on x64, so 
I'd say
it's only a performance regression if large benchmarks regress measurably (quite
unlikely).

Wilco

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-09-26 Thread Wilco Dijkstra
Jakub Jelinek wrote:

> Well, we don't want to regress performance wise on one of the most important
> primary targets.  I don't care that much if the RTL/backend work is done
> together with the patch, or as a follow-up during stage1/3, but it should be
> done, the testcases I've posted can be used as a basis of a P1 runtime
> performance regression.

It should be sufficient to file a bug about inefficient 64-bit constant 
expansions on
x64. I didn't see a significant difference in my benchmarking of it on x64, so 
I'd say
it's only a performance regression if large benchmarks regress measurably (quite
unlikely).

Wilco

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-09-26 Thread Jakub Jelinek
On Tue, Sep 26, 2017 at 12:44:10PM +, Sudi Das wrote:
> 
> Still waiting on Jakub's comment on whether there are more things needed
> at the backend.  But I have updated the patch according to Richard's
> comments.

Well, we don't want to regress performance wise on one of the most important
primary targets.  I don't care that much if the RTL/backend work is done
together with the patch, or as a follow-up during stage1/3, but it should be
done, the testcases I've posted can be used as a basis of a P1 runtime
performance regression.

Jakub


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-09-26 Thread Sudi Das

Still waiting on Jakub's comment on whether there are more things needed at the 
backend. But I have updated the patch according to Richard's comments.

Thanks
Sudi



From: Richard Biener <richard.guent...@gmail.com>
Sent: Friday, August 4, 2017 11:16 AM
To: Sudi Das
Cc: Wilco Dijkstra; Jakub Jelinek; GCC Patches; nd; Richard Earnshaw; James 
Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
    
On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <sudi@arm.com> wrote:
>
>
>
>
> Sorry about the delayed response but looking at the above discussion, should 
> I conclude that this is a valid tree simplification?

Yes, I think so.  Jakub requested code to undo this at RTL expansion
based on target costs, not sure if we really should
require that from you given the user could have written the target
sequence himself.

Few comments about the patch:

+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))

I think this warrants a single_use check on the minus (note :s isn't enough
as with the unsigned case we'd happily ignore it by design).

+  (if (INTEGRAL_TYPE_P (type)
+   && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+   && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))

You can relax this with using

  && wi::eq_p (@1, TYPE_PRECISION (type) - 1)

+   (if (TYPE_UNSIGNED(type))
+ (rshift (lshift @0 @1) @2)
+   (with
+    { tree utype = unsigned_type_for (type); }
+    (convert:type (rshift (lshift (convert:utype @0) @1) @2))
+

You can write (convert (rshift ...)), without the :type.

I'm leaving it to Jakub whether you need to write that RTL expansion tweak.

Thanks,
Richard.

> I am pasting the diff of the assembly that AArch64 generates with the test 
> case that I added. I see fewer instructions generated with the patch.
>
> --- pr80131-1.s    2017-08-01 10:02:43.243374174 +0100
> +++ pr80131-1.s-patched    2017-08-01 10:00:54.776455630 +0100
> @@ -24,10 +24,8 @@
>  str    x0, [sp, 8]
>  ldr    x0, [sp, 8]
>  mov    w1, w0
> -    mov    w0, 63
> -    sub    w0, w0, w1
> -    mov    x1, 1
> -    lsl    x0, x1, x0
> +    mov    x0, -9223372036854775808
> +    lsr    x0, x0, x1
>  add    sp, sp, 16
>  ret
>  .size    f2, .-f2
> @@ -39,10 +37,8 @@
>  str    x0, [sp, 8]
>  ldr    x0, [sp, 8]
>  mov    w1, w0
> -    mov    w0, 63
> -    sub    w0, w0, w1
> -    mov    x1, 1
> -    lsl    x0, x1, x0
> +    mov    x0, -9223372036854775808
> +    lsr    x0, x0, x1
>  add    sp, sp, 16
>  ret
>  .size    f3, .-f3
> @@ -52,11 +48,9 @@
>  f4:
>  sub    sp, sp, #16
>  str    w0, [sp, 12]
> -    mov    w1, 31
>  ldr    w0, [sp, 12]
> -    sub    w0, w1, w0
> -    mov    w1, 1
> -    lsl    w0, w1, w0
> +    mov    w1, -2147483648
> +    lsr    w0, w1, w0
>  add    sp, sp, 16
>  ret
>  .size    f4, .-f4
>
>
> Thanks
>
> Sudi
>
>
>
>
> From: Wilco Dijkstra
> Sent: Thursday, April 13, 2017 1:01 PM
> To: Richard Biener; Jakub Jelinek
> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> Richard Biener wrote:
>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>
>>    movabsq $-9223372036854775808, %rax
>>
>> so this should then have been generated as 1<<63?
>>
>> At some point variable shifts were quite expensive as well..
>
> Yes I don't see a major difference between movabsq and
>
> movl    $1, %eax
> salq    $63, %rax
>
> on my Sandy Bridge, but if the above is faster then that is what the x64
> backend should emit - it's 1 byte smaller as well, so probably better in all
> cases.
>
> Wilco
diff --git a/gcc/match.pd b/gcc/match.pd
index e9017e4..160c12d 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -600,6 +600,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
&& tree_nop_conversion_p (type, TREE_TYPE (@1)))
(lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus@1 INTEGER_CST@2 @3))
+  (if (INTEGRAL_TYPE_P (type)
+   && wi::eq_p (@2, TYPE_PRECISION (type) - 1)
+   && single_use (@1))
+   (if (TYPE_UNSIGNED(type))
+ (rshift (lshift @0 @2) @3)
+   (with
+{ tree utype = unsigned_type_for (type); }
+(convert (rshift (lshift (convert:utype @0) @2) @3))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-08-04 Thread Richard Biener
On Tue, Aug 1, 2017 at 11:14 AM, Sudi Das <sudi@arm.com> wrote:
>
>
>
>
> Sorry about the delayed response but looking at the above discussion, should 
> I conclude that this is a valid tree simplification?

Yes, I think so.  Jakub requested code to undo this at RTL expansion
based on target costs, not sure if we really should
require that from you given the user could have written the target
sequence himself.

Few comments about the patch:

+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))

I think this warrants a single_use check on the minus (note :s isn't enough
as with the unsigned case we'd happily ignore it by design).

+  (if (INTEGRAL_TYPE_P (type)
+   && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+   && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))

You can relax this with using

  && wi::eq_p (@1, TYPE_PRECISION (type) - 1)

+   (if (TYPE_UNSIGNED(type))
+ (rshift (lshift @0 @1) @2)
+   (with
+{ tree utype = unsigned_type_for (type); }
+(convert:type (rshift (lshift (convert:utype @0) @1) @2))
+

You can write (convert (rshift ...)), without the :type.

I'm leaving it to Jakub whether you need to write that RTL expansion tweak.

Thanks,
Richard.

> I am pasting the diff of the assembly that AArch64 generates with the test 
> case that I added. I see fewer instructions generated with the patch.
>
> --- pr80131-1.s2017-08-01 10:02:43.243374174 +0100
> +++ pr80131-1.s-patched2017-08-01 10:00:54.776455630 +0100
> @@ -24,10 +24,8 @@
>  strx0, [sp, 8]
>  ldrx0, [sp, 8]
>  movw1, w0
> -movw0, 63
> -subw0, w0, w1
> -movx1, 1
> -lslx0, x1, x0
> +movx0, -9223372036854775808
> +lsrx0, x0, x1
>  addsp, sp, 16
>  ret
>  .sizef2, .-f2
> @@ -39,10 +37,8 @@
>  strx0, [sp, 8]
>  ldrx0, [sp, 8]
>  movw1, w0
> -movw0, 63
> -subw0, w0, w1
> -movx1, 1
> -lslx0, x1, x0
> +movx0, -9223372036854775808
> +lsrx0, x0, x1
>  addsp, sp, 16
>  ret
>  .sizef3, .-f3
> @@ -52,11 +48,9 @@
>  f4:
>  subsp, sp, #16
>  strw0, [sp, 12]
> -movw1, 31
>  ldrw0, [sp, 12]
> -subw0, w1, w0
> -movw1, 1
> -lslw0, w1, w0
> +movw1, -2147483648
> +lsrw0, w1, w0
>  addsp, sp, 16
>  ret
>  .sizef4, .-f4
>
>
> Thanks
>
> Sudi
>
>
>
>
> From: Wilco Dijkstra
> Sent: Thursday, April 13, 2017 1:01 PM
> To: Richard Biener; Jakub Jelinek
> Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
> Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
>
> Richard Biener wrote:
>> It is IMHO a valid GIMPLE optimization / canonicalization.
>>
>>movabsq $-9223372036854775808, %rax
>>
>> so this should then have been generated as 1<<63?
>>
>> At some point variable shifts were quite expensive as well..
>
> Yes I don't see a major difference between movabsq and
>
> movl$1, %eax
> salq$63, %rax
>
> on my Sandy Bridge, but if the above is faster then that is what the x64
> backend should emit - it's 1 byte smaller as well, so probably better in all
> cases.
>
> Wilco


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-08-01 Thread Sudi Das


 

Sorry about the delayed response but looking at the above discussion, should I 
conclude that this is a valid tree simplification?
I am pasting the diff of the assembly that AArch64 generates with the test case 
that I added. I see fewer instructions generated with the patch.

--- pr80131-1.s    2017-08-01 10:02:43.243374174 +0100
+++ pr80131-1.s-patched    2017-08-01 10:00:54.776455630 +0100
@@ -24,10 +24,8 @@
     str    x0, [sp, 8]
     ldr    x0, [sp, 8]
     mov    w1, w0
-    mov    w0, 63
-    sub    w0, w0, w1
-    mov    x1, 1
-    lsl    x0, x1, x0
+    mov    x0, -9223372036854775808
+    lsr    x0, x0, x1
     add    sp, sp, 16
     ret
     .size    f2, .-f2
@@ -39,10 +37,8 @@
     str    x0, [sp, 8]
     ldr    x0, [sp, 8]
     mov    w1, w0
-    mov    w0, 63
-    sub    w0, w0, w1
-    mov    x1, 1
-    lsl    x0, x1, x0
+    mov    x0, -9223372036854775808
+    lsr    x0, x0, x1
     add    sp, sp, 16
     ret
     .size    f3, .-f3
@@ -52,11 +48,9 @@
 f4:
     sub    sp, sp, #16
     str    w0, [sp, 12]
-    mov    w1, 31
     ldr    w0, [sp, 12]
-    sub    w0, w1, w0
-    mov    w1, 1
-    lsl    w0, w1, w0
+    mov    w1, -2147483648
+    lsr    w0, w1, w0
     add    sp, sp, 16
     ret
     .size    f4, .-f4


Thanks

Sudi 




From: Wilco Dijkstra
Sent: Thursday, April 13, 2017 1:01 PM
To: Richard Biener; Jakub Jelinek
Cc: Sudi Das; GCC Patches; nd; Richard Earnshaw; James Greenhalgh
Subject: Re: [PATCH][GCC] Simplification of 1U << (31 - x)
    
Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.
>
>    movabsq $-9223372036854775808, %rax
>
> so this should then have been generated as 1<<63?
>
> At some point variable shifts were quite expensive as well..

Yes I don't see a major difference between movabsq and

    movl    $1, %eax
    salq    $63, %rax

on my Sandy Bridge, but if the above is faster then that is what the x64
backend should emit - it's 1 byte smaller as well, so probably better in all
cases.

Wilco 

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-13 Thread Wilco Dijkstra
Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.
>
>    movabsq $-9223372036854775808, %rax
>
> so this should then have been generated as 1<<63?
>
> At some point variable shifts were quite expensive as well..

Yes I don't see a major difference between movabsq and

movl$1, %eax
salq$63, %rax

on my Sandy Bridge, but if the above is faster then that is what the x64
backend should emit - it's 1 byte smaller as well, so probably better in all
cases.

Wilco

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-13 Thread Jakub Jelinek
On Thu, Apr 13, 2017 at 01:49:01PM +0200, Richard Biener wrote:
> It is IMHO a valid GIMPLE optimization / canonicalization.

As I said, we can do it as GIMPLE canonicalization, but
we should have code to undo it if beneficial at RTL level.
And the patch has not included that.
> 
> movabsq $-9223372036854775808, %rax
> 
> so this should then have been generated as 1<<63?

Maybe.  But it seems to be still more expensive than the original code,
though better than movabsq:

__attribute__((noinline, noclone)) unsigned long long int
foo (int x)
{
  asm volatile ("" : : : "memory");
  return 1ULL << (63 - x);
}

__attribute__((noinline, noclone)) unsigned long long int
bar (int x)
{
  asm volatile ("" : : : "memory");
  return (1ULL << 63) >> x;
}

__attribute__((noinline, noclone)) unsigned long long int
baz (int x)
{
  unsigned long long int y = 1;
  asm volatile ("" : "+r" (y) : : "memory");
  return (y << 63) >> x;
}

int
main (int argc, const char **argv)
{
  int i;
  if (argc == 1)
for (i = 0; i < 10; i++)
  asm volatile ("" : : "r" (foo (13)));
  else if (argc == 2)
for (i = 0; i < 10; i++)
  asm volatile ("" : : "r" (bar (13)));
  else if (argc == 3)
for (i = 0; i < 10; i++)
  asm volatile ("" : : "r" (baz (13)));
  return 0;
}

Jakub


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-13 Thread Richard Biener
On Thu, Apr 13, 2017 at 1:41 PM, Jakub Jelinek  wrote:
> On Thu, Apr 13, 2017 at 11:33:12AM +, Wilco Dijkstra wrote:
>> Jakub Jelinek wrote:
>>
>> > No.  Some constants sometimes even 7 instructions (e.g. sparc64; not 
>> > talking
>> > in particular about 1ULL << 63 constant), or have one instruction
>> > that is more expensive than normal small constant load.  Compare say x86_64
>> > movl/movq vs. movabsq, I think the latter has 3 times longer latency on 
>> > many
>> > CPUs.  So no, I think it isn't an unconditional win.
>>
>> We're specifically only talking about the constants (1L << 63), (1 << 31) 
>> and (1 << 15).
>> On all targets these need at most 2 simple instructions. That makes it an 
>> unconditional win.
>
> It is not a win on at least Haswell-E:
> __attribute__((noinline, noclone)) unsigned long long int
> foo (int x)
> {
>   asm volatile ("" : : : "memory");
>   return 1ULL << (63 - x);
> }
>
> __attribute__((noinline, noclone)) unsigned long long int
> bar (int x)
> {
>   asm volatile ("" : : : "memory");
>   return (1ULL << 63) >> x;
> }
>
> int
> main (int argc, const char **argv)
> {
>   int i;
>   if (argc == 1)
> for (i = 0; i < 10; i++)
>   asm volatile ("" : : "r" (foo (13)));
>   else
> for (i = 0; i < 10; i++)
>   asm volatile ("" : : "r" (bar (13)));
>   return 0;
> }
>
> $ time /tmp/test
>
> real0m1.290s
> user0m1.288s
> sys 0m0.002s
> $ time /tmp/test 1
>
> real0m1.542s
> user0m1.540s
> sys 0m0.002s
>
> As I said, movabsq is expensive.

It is IMHO a valid GIMPLE optimization / canonicalization.

movabsq $-9223372036854775808, %rax

so this should then have been generated as 1<<63?

At some point variable shifts were quite expensive as well..

Richard.

> Jakub


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-13 Thread Jakub Jelinek
On Thu, Apr 13, 2017 at 11:33:12AM +, Wilco Dijkstra wrote:
> Jakub Jelinek wrote:
>     
> > No.  Some constants sometimes even 7 instructions (e.g. sparc64; not talking
> > in particular about 1ULL << 63 constant), or have one instruction
> > that is more expensive than normal small constant load.  Compare say x86_64
> > movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
> > CPUs.  So no, I think it isn't an unconditional win.
> 
> We're specifically only talking about the constants (1L << 63), (1 << 31) and 
> (1 << 15).
> On all targets these need at most 2 simple instructions. That makes it an 
> unconditional win.

It is not a win on at least Haswell-E:
__attribute__((noinline, noclone)) unsigned long long int
foo (int x)
{
  asm volatile ("" : : : "memory");
  return 1ULL << (63 - x);
}

__attribute__((noinline, noclone)) unsigned long long int
bar (int x)
{
  asm volatile ("" : : : "memory");
  return (1ULL << 63) >> x;
}

int
main (int argc, const char **argv)
{
  int i;
  if (argc == 1)
for (i = 0; i < 10; i++)
  asm volatile ("" : : "r" (foo (13)));
  else
for (i = 0; i < 10; i++)
  asm volatile ("" : : "r" (bar (13)));
  return 0;
}

$ time /tmp/test

real0m1.290s
user0m1.288s
sys 0m0.002s
$ time /tmp/test 1

real0m1.542s
user0m1.540s
sys 0m0.002s

As I said, movabsq is expensive.

Jakub


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-13 Thread Wilco Dijkstra
Jakub Jelinek wrote:
    
> No.  Some constants sometimes even 7 instructions (e.g. sparc64; not talking
> in particular about 1ULL << 63 constant), or have one instruction
> that is more expensive than normal small constant load.  Compare say x86_64
> movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
> CPUs.  So no, I think it isn't an unconditional win.

We're specifically only talking about the constants (1L << 63), (1 << 31) and 
(1 << 15).
On all targets these need at most 2 simple instructions. That makes it an 
unconditional win.

Wilco

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-13 Thread Jakub Jelinek
On Thu, Apr 13, 2017 at 11:16:08AM +, Wilco Dijkstra wrote:
> >On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > > Hi all
> > > 
> > > This is a fix for PR 80131 
> > > Currently the code A << (B - C) is not simplified.
> >> However at least a more specific case of 1U << (C -x) where C = 
> >> precision(type) - 1 can be simplified to (1 << C) >> x.
> >
> > Is that always a win though?
> 
> Yes assuming left and right shift have the same performance.
> 
> > Some constants have higher costs than others on various targets, some
> > significantly higher.  This change might be beneficial only
> > if if C is as expensive as 1, then you get rid of a one (typically cheap)
> > operation.
> 
> Most targets can create the constant cheaply. Targets that can't would need 2

No.  Some constants sometimes even 7 instructions (e.g. sparc64; not talking
in particular about 1ULL << 63 constant), or have one instruction
that is more expensive than normal small constant load.  Compare say x86_64
movl/movq vs. movabsq, I think the latter has 3 times longer latency on many
CPUs.  So no, I think it isn't an unconditional win.

Jakub


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-13 Thread Wilco Dijkstra
>On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > Hi all
> > 
> > This is a fix for PR 80131 
> > Currently the code A << (B - C) is not simplified.
>> However at least a more specific case of 1U << (C -x) where C = 
>> precision(type) - 1 can be simplified to (1 << C) >> x.
>
> Is that always a win though?

Yes assuming left and right shift have the same performance.

> Some constants have higher costs than others on various targets, some
> significantly higher.  This change might be beneficial only
> if if C is as expensive as 1, then you get rid of a one (typically cheap)
> operation.

Most targets can create the constant cheaply. Targets that can't would need 2
instructions (move+shift) or a literal load. That's not worse compared to the
original sequence (3 operations). The constant can be shared or lifted out of
a loop, so we're saving 1 subtract per use of the sequence.

> Which makes me wonder whether this should be done at GIMPLE time and not
> at RTL time (expansion or combine etc.) when one can compare the RTX costs.
> Or do this at match.pd as canonicalization and then have RTL transformation
> to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> shorter).

I can't see why that would be useful for just this pattern. There are lots more 
useful
cases where GCC should simplify immediates to fit the target but it doesn't 
currently.
For example it changes x < 0x10 to x <= 0xf which is worse on many 
targets.

Wilco

Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Segher Boessenkool
On Wed, Apr 12, 2017 at 08:59:34PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 12, 2017 at 01:15:56PM -0500, Segher Boessenkool wrote:
> > On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> > > On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > > > This is a fix for PR 80131 
> > > > Currently the code A << (B - C) is not simplified.
> > > > However at least a more specific case of 1U << (C -x) where C = 
> > > > precision(type) - 1 can be simplified to (1 << C) >> x.
> > > 
> > > Is that always a win though?
> > > Some constants have higher costs than others on various targets, some
> > > significantly higher.  This change might be beneficial only
> > > if if C is as expensive as 1, then you get rid of a one (typically cheap)
> > > operation.
> > > Which makes me wonder whether this should be done at GIMPLE time and not
> > > at RTL time (expansion or combine etc.) when one can compare the RTX 
> > > costs.
> > 
> > Yeah, either combine or simplify-rtx I'd say.
> > 
> > The transformA << (B - C)  --->   (A << B) >> C
> > only works if A << B does not overflow but A << (B + 1) does (and then
> 
> Is that really true?  The A << B does not overflow is obvious precondition.
> 
> But consider say A 5U, B 29 and C (not compile time known) -2.

Ah yes, A unsigned.  Bah.

> 5U << 31 is valid 0x8000U, but (5U << 29) >> (-2) is UB.
> Isn't the other condition instead that either C must be non-negative, or
> B must be number of bits in A's type - 1, i.e. that for negative C
> A << (B - C) would already be always UB?
> But then unless we know C is non-negative, A must be really just 1,
> otherwise A << B overflows.

Yeah.


Segher


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Jakub Jelinek
On Wed, Apr 12, 2017 at 01:15:56PM -0500, Segher Boessenkool wrote:
> Hi,
> 
> On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> > On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > > This is a fix for PR 80131 
> > > Currently the code A << (B - C) is not simplified.
> > > However at least a more specific case of 1U << (C -x) where C = 
> > > precision(type) - 1 can be simplified to (1 << C) >> x.
> > 
> > Is that always a win though?
> > Some constants have higher costs than others on various targets, some
> > significantly higher.  This change might be beneficial only
> > if if C is as expensive as 1, then you get rid of a one (typically cheap)
> > operation.
> > Which makes me wonder whether this should be done at GIMPLE time and not
> > at RTL time (expansion or combine etc.) when one can compare the RTX costs.
> 
> Yeah, either combine or simplify-rtx I'd say.
> 
> The transformA << (B - C)  --->   (A << B) >> C
> only works if A << B does not overflow but A << (B + 1) does (and then

Is that really true?  The A << B does not overflow is obvious precondition.

But consider say A 5U, B 29 and C (not compile time known) -2.
5U << 31 is valid 0x8000U, but (5U << 29) >> (-2) is UB.
Isn't the other condition instead that either C must be non-negative, or
B must be number of bits in A's type - 1, i.e. that for negative C
A << (B - C) would already be always UB?
But then unless we know C is non-negative, A must be really just 1,
otherwise A << B overflows.

> always does work afaics).  Or if we know C is non-negative and A << B
> does not overflow.  So realistically A and B need to be constant.
> 
> > Or do this at match.pd as canonicalization and then have RTL transformation
> > to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> > shorter).
> 
> The inverse transform only works for A=1, not for the more general case.

Jakub


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Segher Boessenkool
Hi,

On Wed, Apr 12, 2017 at 07:06:38PM +0200, Jakub Jelinek wrote:
> On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> > This is a fix for PR 80131 
> > Currently the code A << (B - C) is not simplified.
> > However at least a more specific case of 1U << (C -x) where C = 
> > precision(type) - 1 can be simplified to (1 << C) >> x.
> 
> Is that always a win though?
> Some constants have higher costs than others on various targets, some
> significantly higher.  This change might be beneficial only
> if if C is as expensive as 1, then you get rid of a one (typically cheap)
> operation.
> Which makes me wonder whether this should be done at GIMPLE time and not
> at RTL time (expansion or combine etc.) when one can compare the RTX costs.

Yeah, either combine or simplify-rtx I'd say.

The transformA << (B - C)  --->   (A << B) >> C
only works if A << B does not overflow but A << (B + 1) does (and then
always does work afaics).  Or if we know C is non-negative and A << B
does not overflow.  So realistically A and B need to be constant.

> Or do this at match.pd as canonicalization and then have RTL transformation
> to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
> shorter).

The inverse transform only works for A=1, not for the more general case.


Segher


Re: [PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Jakub Jelinek
On Wed, Apr 12, 2017 at 09:29:55AM +, Sudi Das wrote:
> Hi all
> 
> This is a fix for PR 80131 
> Currently the code A << (B - C) is not simplified.
> However at least a more specific case of 1U << (C -x) where C = 
> precision(type) - 1 can be simplified to (1 << C) >> x.

Is that always a win though?
Some constants have higher costs than others on various targets, some
significantly higher.  This change might be beneficial only
if if C is as expensive as 1, then you get rid of a one (typically cheap)
operation.
Which makes me wonder whether this should be done at GIMPLE time and not
at RTL time (expansion or combine etc.) when one can compare the RTX costs.
Or do this at match.pd as canonicalization and then have RTL transformation
to rewrite such (1U << C) >> X as 1U << (C - X) if the latter is faster (or
shorter).

Jakub


[PATCH][GCC] Simplification of 1U << (31 - x)

2017-04-12 Thread Sudi Das
Hi all

This is a fix for PR 80131 
Currently the code A << (B - C) is not simplified.
However at least a more specific case of 1U << (C -x) where C = precision(type) 
- 1 can be simplified to (1 << C) >> x.

This is done by adding a new simplification rule in match.pd

So for a test case :

unsigned int f1(unsigned int i)
{
  return 1U << (31 - i);
}

We see a gimple dump of 

f1 (unsigned int i)
{
  unsigned int D.3121;

  D.3121 = 2147483648 >> i;
  return D.3121;
}

instead of 

f1 (unsigned int i)
{
  unsigned int D.3121;

  _1 = 31 - i;
  D.3121 = 1 << _1;
  return D.3121;
}


Add a new test case and checked for regressions on bootstrapped 
aarch64-none-linux-gnu.
Ok for stage 1?

Thanks
Sudi

2017-03-23  Sudakshina Das  

PR middle-end/80131
* match.pd: Simplify 1 << (C - x) where C = precision (type) - 1.

2017-03-23  Sudakshina Das  

PR middle-end/80131
* testsuite/gcc.dg/pr80131-1.c: New Test.diff --git a/gcc/match.pd b/gcc/match.pd
index 7b96800..be20fb7 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -508,6 +508,19 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
&& tree_nop_conversion_p (type, TREE_TYPE (@1)))
(lshift @0 @2)))
 
+/* Fold (1 << (C - x)) where C = precision(type) - 1
+   into ((1 << C) >> x). */
+(simplify
+ (lshift integer_onep@0 (minus INTEGER_CST@1 @2))
+  (if (INTEGRAL_TYPE_P (type)
+   && TYPE_PRECISION (type) <= HOST_BITS_PER_WIDE_INT
+   && tree_to_uhwi (@1) == (unsigned)(TYPE_PRECISION (type) - 1))
+   (if (TYPE_UNSIGNED(type))
+ (rshift (lshift @0 @1) @2)
+   (with
+{ tree utype = unsigned_type_for (type); }
+(convert:type (rshift (lshift (convert:utype @0) @1) @2))
+
 /* Fold (C1/X)*C2 into (C1*C2)/X.  */
 (simplify
  (mult (rdiv@3 REAL_CST@0 @1) REAL_CST@2)
diff --git a/gcc/testsuite/gcc.dg/pr80131-1.c b/gcc/testsuite/gcc.dg/pr80131-1.c
new file mode 100644
index 000..2bb6ff3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/pr80131-1.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-options "-fdump-tree-gimple" } */
+
+/* Checks the simplification of:
+   1 << (C - x) to (1 << C) >> x, where C = precision (type) - 1
+   f1 is not simplified but f2, f3 and f4 are. */
+
+__INT64_TYPE__ f1 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (31 - i);
+}
+
+__INT64_TYPE__ f2 (__INT64_TYPE__ i)
+{
+  return (__INT64_TYPE__)1 << (63 - i);
+}
+
+__UINT64_TYPE__ f3 (__INT64_TYPE__ i)
+{
+  return (__UINT64_TYPE__)1 << (63 - i);
+}
+
+__INT32_TYPE__ f4 (__INT32_TYPE__ i)
+{
+  return (__INT32_TYPE__)1 << (31 - i);
+}
+
+/* { dg-final { scan-tree-dump-times "= 31 -"  1 "gimple" } } */
+/* { dg-final { scan-tree-dump-times "9223372036854775808 >>" 2 "gimple" } } */
+/* { dg-final { scan-tree-dump "2147483648 >>" "gimple" } } */