RE: GCC nvptx-none Target Testing (was: New page "nvptx" in the GCC wiki to document --target=nvptx-none configurations)

2024-09-23 Thread Prathamesh Kulkarni via Gcc


> -Original Message-
> From: Thomas Schwinge 
> Sent: Monday, September 23, 2024 7:37 PM
> To: gcc@gcc.gnu.org; Prathamesh Kulkarni 
> Cc: Tom de Vries ; Roger Sayle
> 
> Subject: GCC nvptx-none Target Testing (was: New page "nvptx" in the GCC
> wiki to document --target=nvptx-none configurations)
> 
> 
> 
> Hi!
> 
> On 2017-02-16T21:10:20+0100, I wrote:
> > I created a new page "nvptx" in the GCC wiki to document
> > --target=nvptx-none configurations: .
> 
> (I've revised that one -- it's been a few years... -- and in particular
> then) I've added more details re "GCC nvptx-none Target Testing":
> , which
> Prathamesh was asking for at the recent GNU Tools Cauldron 2024.
Hi Thomas,
This looks great, thanks a lot!

Best Regards,
Prathamesh
> 
> 
> Grüße
>  Thomas


RE: [RFC] Summary of libgomp failures for offloading to nvptx from AArch64

2024-08-12 Thread Prathamesh Kulkarni via Gcc


> -Original Message-
> From: Andrew Pinski 
> Sent: Monday, August 12, 2024 12:28 PM
> To: Prathamesh Kulkarni 
> Cc: Richard Biener ; gcc@gcc.gnu.org;
> tschwi...@baylibre.com
> Subject: Re: [RFC] Summary of libgomp failures for offloading to nvptx
> from AArch64
> 
> [The rest of this message is a deeply nested quote of the earlier
> messages in this thread, truncated by the archive; the full text
> appears in the original 2024-07-25 message below.]

RE: [RFC] Summary of libgomp failures for offloading to nvptx from AArch64

2024-08-11 Thread Prathamesh Kulkarni via Gcc


> -Original Message-
> From: Richard Biener 
> Sent: Monday, July 29, 2024 7:18 PM
> To: Prathamesh Kulkarni 
> Cc: gcc@gcc.gnu.org
> Subject: Re: [RFC] Summary of libgomp failures for offloading to nvptx
> from AArch64
> 
> [The rest of this message is a deeply nested quote of the earlier
> messages in this thread, truncated by the archive; the full text
> appears in the original 2024-07-25 message below.]
RE: [RFC] Summary of libgomp failures for offloading to nvptx from AArch64

2024-07-29 Thread Prathamesh Kulkarni via Gcc


> -Original Message-
> From: Richard Biener 
> Sent: Friday, July 26, 2024 6:51 PM
> To: Prathamesh Kulkarni 
> Cc: gcc@gcc.gnu.org
> Subject: Re: [RFC] Summary of libgomp failures for offloading to nvptx
> from AArch64
> 
> [The rest of this message is a quote of the original 2024-07-25
> message, truncated by the archive; the full text appears below.]

[RFC] Summary of libgomp failures for offloading to nvptx from AArch64

2024-07-25 Thread Prathamesh Kulkarni via Gcc
Hi,
I am working on enabling offloading to nvptx from an AArch64 host. As mentioned
on the wiki (https://gcc.gnu.org/wiki/Offloading#Running_.27make_check.27),
I ran make check-target-libgomp on the AArch64 host (with no GPU), with the
following results:

=== libgomp Summary ===

# of expected passes            14568
# of unexpected failures        1023
# of expected failures          309
# of untested testcases         54
# of unresolved testcases       992
# of unsupported tests          644

It seems the majority of the tests fail due to the following four issues:

* Compiling a minimal test-case:

int main()
{
  int x;
  #pragma omp target map (to: x)
  {
x = 0;
  }
  return x;
}

Compiling with -fopenmp -foffload=nvptx-none results in the following issues:

(1) Differing values of NUM_POLY_INT_COEFFS between host and accelerator, which
results in the following ICE:

0x1a6e0a7 pp_quoted_string
../../gcc/gcc/pretty-print.cc:2277
 0x1a6ffb3 pp_format(pretty_printer*, text_info*, urlifier const*)
../../gcc/gcc/pretty-print.cc:1634
 0x1a4a3f3 diagnostic_context::report_diagnostic(diagnostic_info*)
../../gcc/gcc/diagnostic.cc:1612
 0x1a4a727 diagnostic_impl
../../gcc/gcc/diagnostic.cc:1775
 0x1a4e20b fatal_error(unsigned int, char const*, ...)
../../gcc/gcc/diagnostic.cc:2218
 0xb3088f lto_input_mode_table(lto_file_decl_data*)
 ../../gcc/gcc/lto-streamer-in.cc:2121
 0x6f5cdf lto_file_finalize
../../gcc/gcc/lto/lto-common.cc:2285
 0x6f5cdf lto_create_files_from_ids
../../gcc/gcc/lto/lto-common.cc:2309
 0x6f5cdf lto_file_read
../../gcc/gcc/lto/lto-common.cc:2364
 0x6f5cdf read_cgraph_and_symbols(unsigned int, char const**)
../../gcc/gcc/lto/lto-common.cc:2812
 0x6cfb93 lto_main()
../../gcc/gcc/lto/lto.cc:658

This is already tracked in https://gcc.gnu.org/PR96265 (and related PRs).

Streaming out mode_table:
mode = SI, mclass = 2, size = 4, prec = 32
mode = DI, mclass = 2, size = 8, prec = 64

Streaming in mode_table (in lto_input_mode_table):
mclass = 2, size = 4, prec = 0
(and then calculates the correct mode value by iterating over all modes of 
mclass starting from narrowest mode)

The issue is that the value of prec is not streamed in correctly for SImode, as
seen above. While streaming out from the AArch64 host it is 32, but while
streaming in for nvptx it is 0. This happens because of the differing values of
NUM_POLY_INT_COEFFS between the AArch64 and nvptx backends.

Since NUM_POLY_INT_COEFFS is 2 for aarch64, the streamed-out values for size and
precision would be <4, 0> and <32, 0> respectively (streamed out in
bp_pack_poly_value). Both zeros come from coeffs[1] of size and prec. While
streaming in, however, NUM_POLY_INT_COEFFS is 1 for nvptx, so it incorrectly
treats <4, 0> as size and precision respectively, which is why precision gets
streamed in as 0 and the above ICE is triggered.

Supporting non-VLA code with offloading:

In the general case, it's hard to support offloading for arbitrary poly_ints
when NUM_POLY_INT_COEFFS differs between host and accelerator. For example,
it's not possible to represent a degree-2 poly_int like 4 + 4x (as-is) on an
accelerator with NUM_POLY_INT_COEFFS == 1.

However, IIUC, we can support offloading for the restricted set of poly_ints
whose degree <= the accelerator's NUM_POLY_INT_COEFFS, since those can be
represented on the accelerator. For a hypothetical example, if the host's
NUM_POLY_INT_COEFFS == 3 and the accelerator's NUM_POLY_INT_COEFFS == 2, then
I suppose we could represent a degree-2 poly_int on the accelerator, but not a
degree-3 poly_int like 3 + 4x + 5x^2?

Based on that, I have come up with the following approach in the attached
"quick-and-dirty" patch (p-163-2.diff): stream out the host's
NUM_POLY_INT_COEFFS, and while streaming in during lto1, compare it with the
accelerator's NUM_POLY_INT_COEFFS as follows:

Stream in host_num_poly_int_coeffs;
if (host_num_poly_int_coeffs == NUM_POLY_INT_COEFFS)
  {
    /* Both are equal (NUM_POLY_INT_COEFFS here is the accelerator's
       value); proceed to unpack NUM_POLY_INT_COEFFS words from the
       bitstream.  */
  }
else if (host_num_poly_int_coeffs < NUM_POLY_INT_COEFFS)
  {
    /* Unpack host_num_poly_int_coeffs words and zero out the remaining
       higher coeffs (similar to zero-extension).  */
  }
else
  {
    /* Unpack host_num_poly_int_coeffs words and ensure that the degree
       of the streamed-out poly_int is <= NUM_POLY_INT_COEFFS.  */
  }

For example, with host NUM_POLY_INT_COEFFS == 2 and accel NUM_POLY_INT_COEFFS
== 1, this will allow streaming of "degree-1" poly_ints like 4 + 0x (which
degenerate to the constant 4), but give an error for streaming a degree-2
poly_int like 4 + 4x.

Following this approach, I am assuming we can support AArch64/nvptx offloading
for non-VLA code, since the poly_ints used for representing various artefacts
like mode_size, mode_precision, vector length etc. will be degree-1 poly_ints
for scalar variables and fixed-length vectors (and thus degenerate to
constants). With the

Re: [RFC] analyzer: allocation size warning

2022-06-17 Thread Prathamesh Kulkarni via Gcc
On Fri, 17 Jun 2022 at 21:25, Tim Lange  wrote:
>
> Hi everyone,
Hi Tim,
Thanks for posting the POC patch!
Just a couple of comments (inline)
>
> tracked in PR105900 [0], I'd like to add support for a new warning on
> dubious allocation sizes. The new checker emits a warning when the
> allocation size is not a multiple of the type's size. With the checker,
> following mistakes are detected:
>   int *arr = malloc(3); // forgot to multiply by sizeof
>   arr[0] = ...;
>   arr[1] = ...;
> or
>   int *buf = malloc (n + sizeof(int)); // probably should be * instead
> of +
> Because it is implemented inside the analyzer, it also emits warnings
> when the buffer is first of type void* and later on casted to something
> else. Though, this also inherits a limitation. The checker can not
> distinguish 2 * sizeof(short) from sizeof(int) because sizeof is
> resolved and constants are folded at the point when the analyzer runs.
> As a mitigation, I plan to implement a check in the frontend that emits
> a warning if sizeof(lhs pointee type) is not part of the malloc
> argument.
IMHO, warning when sizeof (lhs pointee type) is not present inside the malloc
argument might not be a good idea, because it would reject valid calls to
malloc.
For eg:
(1)
size_t size = sizeof(int);
int *p = malloc (size);

(2)
void *p = malloc (sizeof(int));
int *q = p;
>
> I'm looking for a first feedback on the phrasing of the diagnostics as
> well on the preliminary patch [1].
>
> On constant buffer sizes, the warnings looks like this:
> warning: Allocated buffer size is not a multiple of the pointee's size
> [CWE-131] [-Wanalyzer-allocation-size]
>22 | int *ptr = malloc (10 + sizeof(int)); /* { dg-line malloc2 } */
>   | ^
>   ‘test_2’: event 1
> |
> | 22 | int *ptr = malloc (10 + sizeof(int)); /* { dg-line malloc2 }
> */
> | | ^
> | | |
> | | (1) Casting a 14 byte buffer to ‘int *’ leaves 2 trailing
> bytes; either the allocated size is bogus or the type on the left-hand
> side is wrong
> |
>
> On symbolic buffer sizes:
> warning: Allocated buffer size is not a multiple of the pointee's size
> [CWE-131] [-Wanalyzer-allocation-size]
>33 | int *ptr = malloc (n + sizeof(int)); /* { dg-line malloc3 } */
>   | ^~~~
>   ‘test_3’: event 1
> |
> | 33 | int *ptr = malloc (n + sizeof(int)); /* { dg-line malloc3 }
> */
> | | ^~~~
> | | |
> | | (1) Allocation is incompatible with ‘int *’; either the
> allocated size is bogus or the type on the left-hand side is wrong
> |
Won't the warning be incorrect if 'n' is a multiple of sizeof(int) ?
I assume by symbolic buffer size, 'n' is not known at compile time.

Thanks,
Prathamesh
>
> And this is how a simple flow looks like:
> warning: Allocated buffer size is not a multiple of the pointee's size
> [CWE-131] [-Wanalyzer-allocation-size]
>39 | int *iptr = (int *)ptr; /* { dg-line assign } */
>   | ^~~~
>   ‘test_4’: events 1-2
> |
> | 38 | void *ptr = malloc (n * sizeof (short)); /* { dg-message } */
> | | ^~~
> | | |
> | | (1) allocated here
> | 39 | int *iptr = (int *)ptr; /* { dg-line assign } */
> | | 
> | | |
> | | (2) ‘ptr’ is incompatible with ‘int *’; either the
> allocated size at (1) is bogus or the type on the left-hand side is
> wrong
> |
>
> There are some things to discuss from my side:
> * The tests with the "toy re-implementation of CPython's object
> model"[2] fail due to a extra warning emitted. Because the analyzer
> can't know the calculation actually results in a correct buffer size
> when viewed as a string_obj later on, it emits a warning, e.g. at line
> 61 in data-model-5.c. The only mitigation would be to disable the
> warning for structs entirely. Now, the question is to rather have noise
> on these cases or disable the warning for structs entirely?
> * I'm unable to emit a warning whenever the cast happens at an
> assignment with a call as the rhs, e.g. test_1 in allocation-size-4.c.
> This is because I'm unable to access a region_svalue for the returned
> value. Even in the new_program_state, the svalue of the lhs is still a
> conjured_svalue. Maybe David can lead me to a place where I can access
> the return value's region_svalue or do I have to adapt the engine?
> * attr-malloc-6.c and pr96639.c did both contain structs without an
> implementation. Something in the analyzer must have triggered another
> warning about the usage of those without them having an implementation.
> I changed those structs to have an empty implementation, such that the
> additional warning are gone. I think this shouldn't change the test
> case, so is this change okay?
>
> - Tim
>
> [0] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105900
> [1] While all tests except the cpython ones work, I have yet to test it
> on large C projects
> [2] FAIL: gcc.dg/analyzer/data-model-5.c (test for excess erro

Re: [RFC] Support for nonzero attribute

2022-06-11 Thread Prathamesh Kulkarni via Gcc
On Mon, 6 Jun 2022 at 01:39, Miika via Gcc  wrote:
>
> Based on Jakub's and Yair's comments I created a new attribute "inrange".
> Inrange takes three arguments: pos, min, and max.
> Pos is the argument's position in the function, and min and max define the
> range of valid integers. Both min and max are inclusive and work with enums.
> Warnings are enabled with the new flag "-Winrange".
>
> The attribute definition would look something like this:
> inrange(pos, min, max)
>
>
> So for example compiling this with "gcc foo.c -Winrange":
>
> #include 
> void foo(int d) __attribute__ ((inrange (1, 100, 200)));
> void foo(int d) {
> printf("%d\n", d == 0);
> }
>
> int main() {
> foo(0); // warning
> foo(100); // no warning
> }
>
> Would give the following error:
>
> foo.c: In function 'main':
> foo.c:8:9: warning: argument in position 1 not in rage of 100..200 [-Winrange]
> 8 | foo(0); // warning
>   | ^~~
>
>
> I thought about having separate minval and maxval attributes but I personally
> prefer that min and max values have to be defined explicitly.
>
> If this looks good, I could look into applying inrange to types and variables
> and after that I could start looking into optimization.
>
> Patch for adding inrange is attached below
Hi,
Thanks for the POC patch!
A few suggestions:

(1) It doesn't apply as-is because of the transition from .c to .cc filenames.
Perhaps you're working from an older version of trunk?

(2) AFAIK, this warning will need an entry in doc/invoke.texi for
documenting it.

(3) While this patch addresses the warning, I suppose it could be extended so
the tree optimizer can take advantage of the value-range info provided by the
attribute. For example, the condition d > 20 could be optimized away in the
following function by inferring the range from the attribute.

__attribute__((inrange (1, 10, 20)))
void foo(int d)
{
  if (d > 20)
__builtin_abort ();
}

>
> Miika
>
> ---
> diff --git a/gcc/builtin-attrs.def b/gcc/builtin-attrs.def
> index 3239311b5a4..2f5732b3ed2 100644
> --- a/gcc/builtin-attrs.def
> +++ b/gcc/builtin-attrs.def
> @@ -98,6 +98,7 @@ DEF_ATTR_IDENT (ATTR_FORMAT, "format")
>  DEF_ATTR_IDENT (ATTR_FORMAT_ARG, "format_arg")
>  DEF_ATTR_IDENT (ATTR_MALLOC, "malloc")
>  DEF_ATTR_IDENT (ATTR_NONNULL, "nonnull")
> +DEF_ATTR_IDENT (ATTR_INRANGE, "inrange")
>  DEF_ATTR_IDENT (ATTR_NORETURN, "noreturn")
>  DEF_ATTR_IDENT (ATTR_NOTHROW, "nothrow")
>  DEF_ATTR_IDENT (ATTR_LEAF, "leaf")
> diff --git a/gcc/c-family/c-attribs.c b/gcc/c-family/c-attribs.c
> index ac936d5..d6dc9c37723 100644
> --- a/gcc/c-family/c-attribs.c
> +++ b/gcc/c-family/c-attribs.c
> @@ -119,6 +119,7 @@ static tree handle_novops_attribute (tree *, tree, tree, 
> int, bool *);
>  static tree handle_vector_size_attribute (tree *, tree, tree, int,
>   bool *);
>  static tree handle_nonnull_attribute (tree *, tree, tree, int, bool *);
> +static tree handle_inrange_attribute (tree *, tree, tree, int, bool *);
>  static tree handle_nonstring_attribute (tree *, tree, tree, int, bool *);
>  static tree handle_nothrow_attribute (tree *, tree, tree, int, bool *);
>  static tree handle_cleanup_attribute (tree *, tree, tree, int, bool *);
> @@ -379,6 +380,8 @@ const struct attribute_spec c_common_attribute_table[] =
>   handle_tls_model_attribute, NULL },
>{ "nonnull",0, -1, false, true, true, false,
>   handle_nonnull_attribute, NULL },
> +  { "inrange",3, 3, false, true, true, false,
> + handle_inrange_attribute, NULL },
>{ "nonstring",  0, 0, true, false, false, false,
>   handle_nonstring_attribute, NULL },
>{ "nothrow",0, 0, true,  false, false, false,
> @@ -3754,6 +3757,59 @@ handle_nonnull_attribute (tree *node, tree name,
>return NULL_TREE;
>  }
>
> +/* Handle the "inrange" attribute.  */
> +
> +static tree
> +handle_inrange_attribute (tree *node, tree name,
> + tree args, int ARG_UNUSED (flags),
> + bool *no_add_attrs)
> +{
> +  tree type = *node;
> +
> +  /* Test the position argument  */
> +  tree pos = TREE_VALUE (args);
> +  if (!positional_argument (type, name, pos, INTEGER_TYPE, 0))
> +*no_add_attrs = true;
> +
> +  /* Make sure that range args are INTEGRALs  */
> +  bool range_err = false;
> +  for (tree range = TREE_CHAIN (args); range; range = TREE_CHAIN (range))
> +{
> +  tree val = TREE_VALUE (range);
> +  if (val && TREE_CODE (val) != IDENTIFIER_NODE
> + && TREE_CODE (val) != FUNCTION_DECL)
> +   val = default_conversion (val);
> +
> +  if (TREE_CODE (val) != INTEGER_CST
> + || !INTEGRAL_TYPE_P (TREE_TYPE (val)))
Um, why the check for INTEGRAL_TYPE_P here ?
IIUC, this will also accept non-constant integer values.
For eg, the following compiles without any warning:
in

Re: How to run C++ IPA tests?

2021-10-27 Thread Prathamesh Kulkarni via Gcc
On Wed, 27 Oct 2021 at 20:09, Marek Polacek via Gcc  wrote:
>
> On Wed, Oct 27, 2021 at 04:29:32PM +0200, Erick Ochoa via Gcc wrote:
> > Hi,
> >
> > I have been adding tests to the gcc/testsuite/gcc.dg/ipa folder
> > successfully for a while now. I am starting to add some tests into
> > gcc/testsuite/g++.dg/ipa now but I am having some issues.
> >
> > 1. Using `make check-g++` returns the following error message "No rule
> > to make target 'check-g++'".
> > 2. While `make check-gcc` works, this seems to be only for C tests
> > (which is correct).
> > 3. When I type `make check RUNTESTS="ipa.exp"`  but the section "g++
> > Summary" looks empty.
> > 4. I have added a test that I know will fail, but I don't see it
> > (because I don't think I have run g++ tests correctly).
> >
> > To make sure that I haven't changed anything with my transformation I
> > actually checked out releases/gcc-11.2.0
> >
> > Can someone tell me how to run the C++ IPA tests? I'm sure there is
> > something silly that I'm doing wrong, but can't find out what it is.
>
> There's no ipa.exp in gcc/testsuite/g++.dg/ipa.  Instead you probably
> want
>
> make check-c++ RUNTESTFLAGS=dg.exp=ipa/*
Perhaps make check-gcc-c++ would be slightly better (to avoid also running the
tests for libitm, libgomp, etc.)?

Thanks,
Prathamesh
>
> Marek
>


Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-19 Thread Prathamesh Kulkarni via Gcc
On Wed, 18 Aug 2021 at 20:10, Martin Sebor  wrote:
>
> On 8/18/21 12:52 AM, Prathamesh Kulkarni wrote:
> > [The rest of this message is a nested quote chain, truncated by the
> > archive; the same content appears more fully in the 2021-08-17
> > message below.]

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-17 Thread Prathamesh Kulkarni via Gcc
On Fri, 13 Aug 2021 at 22:44, Martin Sebor  wrote:
>
> On 8/12/21 2:32 AM, Prathamesh Kulkarni wrote:
> > On Sat, 7 Aug 2021 at 02:09, Martin Sebor  wrote:
> >>
> >> On 8/6/21 4:51 AM, Richard Earnshaw wrote:
> >>>
> >>>
> >>> On 06/08/2021 01:06, Martin Sebor via Gcc wrote:
> >>>> On 8/4/21 3:46 AM, Richard Earnshaw wrote:
> >>>>>
> >>>>>
> >>>>> On 03/08/2021 18:44, Martin Sebor wrote:
> >>>>>> On 8/3/21 4:11 AM, Prathamesh Kulkarni via Gcc wrote:
> >>>>>>> On Tue, 27 Jul 2021 at 13:49, Richard Biener
> >>>>>>>  wrote:
> >>>>>>>>
> >>>>>>>> On Mon, Jul 26, 2021 at 11:06 AM Prathamesh Kulkarni via Gcc
> >>>>>>>>  wrote:
> >>>>>>>>>
> >>>>>>>>> On Fri, 23 Jul 2021 at 23:29, Andrew Pinski 
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc
> >>>>>>>>>>  wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>> Continuing from this thread,
> >>>>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
> >>>>>>>>>>> The proposal is to provide a mechanism to mark a parameter in a
> >>>>>>>>>>> function as a literal constant.
> >>>>>>>>>>>
> >>>>>>>>>>> Motivation:
> >>>>>>>>>>> Consider the following intrinsic vshl_n_s32 from arrm/arm_neon.h:
> >>>>>>>>>>>
> >>>>>>>>>>> __extension__ extern __inline int32x2_t
> >>>>>>>>>>> __attribute__  ((__always_inline__, __gnu_inline__,
> >>>>>>>>>>> __artificial__))
> >>>>>>>>>>> vshl_n_s32 (int32x2_t __a, const int __b)
> >>>>>>>>>>> {
> >>>>>>>>>>> return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> and it's caller:
> >>>>>>>>>>>
> >>>>>>>>>>> int32x2_t f (int32x2_t x)
> >>>>>>>>>>> {
> >>>>>>>>>>>  return vshl_n_s32 (x, 1);
> >>>>>>>>>>> }
> >>>>>>>>>>
> >>>>>>>>>> Can't you do similar to what is done already in the aarch64
> >>>>>>>>>> back-end:
> >>>>>>>>>> #define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0]))
> >>>>>>>>>> #define __AARCH64_LANE_CHECK(__vec, __idx)  \
> >>>>>>>>>>   __builtin_aarch64_im_lane_boundsi (sizeof(__vec),
> >>>>>>>>>> sizeof(__vec[0]), __idx)
> >>>>>>>>>>
> >>>>>>>>>> ?
> >>>>>>>>>> Yes this is about lanes but you could even add one for min/max
> >>>>>>>>>> which
> >>>>>>>>>> is generic and such; add an argument to say the intrinsics name
> >>>>>>>>>> even.
> >>>>>>>>>> You could do this as a non-target builtin if you want and reuse it
> >>>>>>>>>> also for the aarch64 backend.
> >>>>>>>>> Hi Andrew,
> >>>>>>>>> Thanks for the suggestions. IIUC, we could use this approach to
> >>>>>>>>> check
> >>>>>>>>> if the argument
> >>>>>>>>> falls within a certain range (min / max), but I am not sure how it
> >>>>>>>>> will help to determine
> >>>>>>>>> if the arg is a constant immediate ? AFAIK, vshl_n intrinsics
> >>>>>>>>> require
> >>>>>>>>> that the 2nd arg is immediate ?
> >>>>>>>>>
> >>>

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-12 Thread Prathamesh Kulkarni via Gcc
On Sat, 7 Aug 2021 at 02:09, Martin Sebor  wrote:
>
> On 8/6/21 4:51 AM, Richard Earnshaw wrote:
> >
> >
> > On 06/08/2021 01:06, Martin Sebor via Gcc wrote:
> >> On 8/4/21 3:46 AM, Richard Earnshaw wrote:
> >>>
> >>>
> >>> On 03/08/2021 18:44, Martin Sebor wrote:
> >>>> On 8/3/21 4:11 AM, Prathamesh Kulkarni via Gcc wrote:
> >>>>> On Tue, 27 Jul 2021 at 13:49, Richard Biener
> >>>>>  wrote:
> >>>>>>
> >>>>>> On Mon, Jul 26, 2021 at 11:06 AM Prathamesh Kulkarni via Gcc
> >>>>>>  wrote:
> >>>>>>>
> >>>>>>> On Fri, 23 Jul 2021 at 23:29, Andrew Pinski 
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc
> >>>>>>>>  wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>> Continuing from this thread,
> >>>>>>>>> https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
> >>>>>>>>> The proposal is to provide a mechanism to mark a parameter in a
> >>>>>>>>> function as a literal constant.
> >>>>>>>>>
> >>>>>>>>> Motivation:
> >>>>>>>>> Consider the following intrinsic vshl_n_s32 from arrm/arm_neon.h:
> >>>>>>>>>
> >>>>>>>>> __extension__ extern __inline int32x2_t
> >>>>>>>>> __attribute__  ((__always_inline__, __gnu_inline__,
> >>>>>>>>> __artificial__))
> >>>>>>>>> vshl_n_s32 (int32x2_t __a, const int __b)
> >>>>>>>>> {
> >>>>>>>>>return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> and it's caller:
> >>>>>>>>>
> >>>>>>>>> int32x2_t f (int32x2_t x)
> >>>>>>>>> {
> >>>>>>>>> return vshl_n_s32 (x, 1);
> >>>>>>>>> }
> >>>>>>>>
> >>>>>>>> Can't you do similar to what is done already in the aarch64
> >>>>>>>> back-end:
> >>>>>>>> #define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0]))
> >>>>>>>> #define __AARCH64_LANE_CHECK(__vec, __idx)  \
> >>>>>>>>  __builtin_aarch64_im_lane_boundsi (sizeof(__vec),
> >>>>>>>> sizeof(__vec[0]), __idx)
> >>>>>>>>
> >>>>>>>> ?
> >>>>>>>> Yes this is about lanes but you could even add one for min/max
> >>>>>>>> which
> >>>>>>>> is generic and such; add an argument to say the intrinsics name
> >>>>>>>> even.
> >>>>>>>> You could do this as a non-target builtin if you want and reuse it
> >>>>>>>> also for the aarch64 backend.
> >>>>>>> Hi Andrew,
> >>>>>>> Thanks for the suggestions. IIUC, we could use this approach to
> >>>>>>> check
> >>>>>>> if the argument
> >>>>>>> falls within a certain range (min / max), but I am not sure how it
> >>>>>>> will help to determine
> >>>>>>> if the arg is a constant immediate ? AFAIK, vshl_n intrinsics
> >>>>>>> require
> >>>>>>> that the 2nd arg is immediate ?
> >>>>>>>
> >>>>>>> Even the current RTL builtin checking is not consistent across
> >>>>>>> optimization levels:
> >>>>>>> For eg:
> >>>>>>> int32x2_t f(int32_t *restrict a)
> >>>>>>> {
> >>>>>>>int32x2_t v = vld1_s32 (a);
> >>>>>>>int b = 2;
> >>>>>>>return vshl_n_s32 (v, b);
> >>>>>>> }
> >>>>>>>
> >>>>>>> With pristine trunk, compiling with -O2 results in no errors because
> >>>>>>> constant propagatio

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-05 Thread Prathamesh Kulkarni via Gcc
On Wed, 4 Aug 2021 at 18:30, Richard Earnshaw
 wrote:
>
> On 04/08/2021 13:46, Segher Boessenkool wrote:
> > On Wed, Aug 04, 2021 at 05:20:58PM +0530, Prathamesh Kulkarni wrote:
> >> On Wed, 4 Aug 2021 at 15:49, Segher Boessenkool
> >>  wrote:
> >>> Both __builtin_constant_p and __is_constexpr will not work in your use
> >>> case (since a function argument is not a constant, let alone an ICE).
> >>> It only becomes a constant value later on.  The manual (for the former)
> >>> says:
> >>>   You may use this built-in function in either a macro or an inline
> >>>   function. However, if you use it in an inlined function and pass an
> >>>   argument of the function as the argument to the built-in, GCC never
> >>>   returns 1 when you call the inline function with a string constant or
> >>>   compound literal (see Compound Literals) and does not return 1 when you
> >>>   pass a constant numeric value to the inline function unless you specify
> >>>   the -O option.
> >> Indeed, that's why I was thinking if we should use an attribute to mark 
> >> param as
> >> a constant, so during type-checking the function call, the compiler
> >> can emit a diagnostic if the passed arg
> >> is not a constant.
> >
> > That will depend on the vagaries of what optimisations the compiler
> > managed to do :-(
> >
> >> Alternatively -- as you suggest, we could define a new builtin, say
> >> __builtin_ice(x) that returns true if 'x' is an ICE.
> >
> > (That is a terrible name, it's not clear at all to the reader, just
> > write it out?  It is fun if you know what it means, but infuriating
> > otherwise.)
> >
> >> And wrap the intrinsic inside a macro that would check if the arg is an 
> >> ICE ?
> >
> > That will work yeah.  Maybe not as elegant as you'd like, but not all
> > that bad, and it *works*.  Well, hopefully it does :-)
> >
> >> For eg:
> >>
> >> __extension__ extern __inline int32x2_t
> >> __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> >> vshl_n_s32_1 (int32x2_t __a, const int __b)
> >> {
> >>   return __builtin_neon_vshl_nv2si (__a, __b);
> >> }
> >>
> >> #define vshl_n_s32(__a, __b) \
> >> ({ typeof (__a) a = (__a); \
> >>_Static_assert (__builtin_constant_p ((__b)), #__b " is not an integer constant"); \
> >>vshl_n_s32_1 (a, (__b)); })
> >>
> >> void f(int32x2_t x, const int y)
> >> {
> >>   vshl_n_s32 (x, 2);
> >>   vshl_n_s32 (x, y);
> >>
> >>   int z = 1;
> >>   vshl_n_s32 (x, z);
> >> }
> >>
> >> With this, the compiler rejects vshl_n_s32 (x, y) and vshl_n_s32 (x,
> >> z) at all optimization levels since neither 'y' nor 'z' is an ICE.
> >
> > You used __builtin_constant_p though, which works differently, so the
> > test is not conclusive, might not show what you want to show.
> >
> >> Instead of __builtin_constant_p, we could use __builtin_ice.
> >> Would that be a reasonable approach ?
> >
> > I think it will work, yes.
> >
> >> But this changes the semantics of intrinsic from being an inline
> >> function to a macro, and I am not sure if that's a good idea.
> >
> > Well, what happens if you call the actual builtin directly, with some
> > non-constant parameter?  That just fails with a more cryptic error,
> > right?  So you can view this as some syntactic sugar to make these
> > intrinsics easier to use.
> >
> > Hrm I now remember a place I could have used this:
> >
> > #define mtspr(n, x) do { asm("mtspr %1,%0" : : "r"(x), "n"(n)); } while (0)
> > #define mfspr(n) ({ \
> >   u32 x; asm volatile("mfspr %0,%1" : "=r"(x) : "n"(n)); x; \
> > })
> >
> > It is quite similar to your builtin code really, and I did resort to
> > macros there, for similar reasons :-)
> >
> >
> > Segher
> >
>
> We don't want to have to resort to macros.  Not least because at some
> point we want to replace the content of arm_neon.h with a single #pragma
> directive to remove all the parsing of the header that's needed.  What's
> more, if we had a suitable pragma we'd stand a fighting chance of being
> able to extend support to other languages as well that don't use the
> pre-processor, such as Fortran or Ada (not that that is on the cards
> right now).
Hi,
IIUC, a more general issue here is that the intrinsics require
special type-checking of arguments, beyond what is dictated by the
Standard.
An argument needing to be an ICE could be seen as one instance.

So perhaps there should be some mechanism to tell the FE to let the
target do additional checking for a particular function call, say by
explicitly marking it with an "intrinsic" attribute ? While
type-checking a call to a function marked with the "intrinsic"
attribute, the FE could invoke a target handler with the name of the
function and the corresponding arguments, and then leave any further
checking to the target.
For the vshl_n case, the target hook would check that the 2nd arg is an
integer constant within the permissible range.

I propose to do this only for intrinsics that need special checking
and can be entirely implemented with C extensions and won't 

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-04 Thread Prathamesh Kulkarni via Gcc
On Wed, 4 Aug 2021 at 15:49, Segher Boessenkool
 wrote:
>
> On Wed, Aug 04, 2021 at 03:20:45PM +0530, Prathamesh Kulkarni wrote:
> > On Wed, 4 Aug 2021 at 03:27, Segher Boessenkool
> >  wrote:
> > > The Linux kernel has a macro __is_constexpr to test if something is an
> > > integer constant expression, see  .  That is a much
> > > better idea imo.  There could be a builtin for that of course, but an
> > > attribute is less powerful, less usable, less useful.
> > Hi Segher,
> > Thanks for the suggestions. I am not sure tho if we could use a macro
> > similar to __is_constexpr
> > to check if parameter is constant inside an inline function (which is
> > the case for intrinsics) ?
>
> I said we can make a builtin that returns if its arg is an ICE -- we do
> not have to do tricky tricks :-)
>
> The macro would work fine in an inline function though, or, where do you
> see potential problems?
>
> > For eg:
> > #define __is_constexpr(x) \
> > (sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int 
> > *)8)))
> >
> > inline int foo(const int x)
> > {
> >   _Static_assert (__is_constexpr (x));
> >   return x;
> > }
> >
> > int main()
> > {
> >   return foo (1);
> > }
> >
> > results in:
> > foo.c: In function ‘foo’:
> > foo.c:8:3: error: static assertion failed
> > 8 |   _Static_assert (__is_constexpr (x));
>
> And that is correct, x is *not* an integer constant expression here.
> Because it is a variable, instead :-)
>
> If you do this in a macro it should work though?
>
> > Initially we tried to use __Static_assert (__builtin_constant_p (arg))
> > for the same purpose but that did not work
> > because while parsing the intrinsic function, the FE cannot determine
> > if the arg is indeed a constant.
>
> Yes.  If you want something like that you need to test very late during
> compilation whether something is a constant then: it will not be earlier.
>
> > I guess the static assertion or __is_constexpr would work only if the
> > intrinsic were defined as a macro instead of an inline function ?
> > Or am I misunderstanding ?
>
> Both __builtin_constant_p and __is_constexpr will not work in your use
> case (since a function argument is not a constant, let alone an ICE).
> It only becomes a constant value later on.  The manual (for the former)
> says:
>   You may use this built-in function in either a macro or an inline
>   function. However, if you use it in an inlined function and pass an
>   argument of the function as the argument to the built-in, GCC never
>   returns 1 when you call the inline function with a string constant or
>   compound literal (see Compound Literals) and does not return 1 when you
>   pass a constant numeric value to the inline function unless you specify
>   the -O option.
Indeed, that's why I was wondering whether we should use an attribute to
mark the param as a constant, so that while type-checking the function
call, the compiler can emit a diagnostic if the passed arg is not a
constant.

Alternatively -- as you suggest, we could define a new builtin, say
__builtin_ice(x) that returns true if 'x' is an ICE.
And wrap the intrinsic inside a macro that would check if the arg is an ICE ?

For eg:

__extension__ extern __inline int32x2_t
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vshl_n_s32_1 (int32x2_t __a, const int __b)
{
  return __builtin_neon_vshl_nv2si (__a, __b);
}

#define vshl_n_s32(__a, __b) \
({ typeof (__a) a = (__a); \
   _Static_assert (__builtin_constant_p ((__b)), #__b " is not an integer constant"); \
   vshl_n_s32_1 (a, (__b)); })

void f(int32x2_t x, const int y)
{
  vshl_n_s32 (x, 2);
  vshl_n_s32 (x, y);

  int z = 1;
  vshl_n_s32 (x, z);
}

With this, the compiler rejects vshl_n_s32 (x, y) and vshl_n_s32 (x,
z) at all optimization levels since neither 'y' nor 'z' is an ICE.
Instead of __builtin_constant_p, we could use __builtin_ice.
Would that be a reasonable approach ?

But this changes the semantics of the intrinsic from being an inline
function to a macro, and I am not sure if that's a good idea.

Thanks,
Prathamesh

> An integer constant expression is well-defined whatever the optimisation
> level is, it is a feature of the language.
>
> If some x is an ICE you can do
>   asm ("" :: "n"(x));
> and if it is a constant you can do
>   asm ("" :: "i"(x));
> (not that that gets you much further, but it might help explorng this).
>
>
> Segher


Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-04 Thread Prathamesh Kulkarni via Gcc
On Wed, 4 Aug 2021 at 03:27, Segher Boessenkool
 wrote:
>
> Hi!
>
> On Fri, Jul 23, 2021 at 04:23:42PM +0530, Prathamesh Kulkarni via Gcc wrote:
> > The constraint here is that, vshl_n intrinsics require that the
> > second arg (__b),
> > should be an immediate value.
>
> Something that matches the "n" constraint, not necessarily a literal,
> but stricter than just "immediate".  It probably is a good idea to allow
> only "integer constant expression"s, so that the validity of the source
> code does not depend on what the optimisers do with the code.
>
> > As Richard suggested, sth like:
> > void foo(int x __attribute__((literal_constant (min_val, max_val)));
>
> The Linux kernel has a macro __is_constexpr to test if something is an
> integer constant expression, see  .  That is a much
> better idea imo.  There could be a builtin for that of course, but an
> attribute is less powerful, less usable, less useful.
Hi Segher,
Thanks for the suggestions. I am not sure though if we could use a macro
similar to __is_constexpr
to check if a parameter is constant inside an inline function (which is
the case for intrinsics) ?

For eg:
#define __is_constexpr(x) \
(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))

inline int foo(const int x)
{
  _Static_assert (__is_constexpr (x));
  return x;
}

int main()
{
  return foo (1);
}

results in:
foo.c: In function ‘foo’:
foo.c:8:3: error: static assertion failed
8 |   _Static_assert (__is_constexpr (x));

Initially we tried to use __Static_assert (__builtin_constant_p (arg))
for the same purpose but that did not work
because while parsing the intrinsic function, the FE cannot determine
if the arg is indeed a constant.
I guess the static assertion or __is_constexpr would work only if the
intrinsic were defined as a macro instead of an inline function ?
Or am I misunderstanding ?

Thanks,
Prathamesh
>
>
> Segher


Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-03 Thread Prathamesh Kulkarni via Gcc
On Tue, 3 Aug 2021 at 15:41, Prathamesh Kulkarni
 wrote:
>
> On Tue, 27 Jul 2021 at 13:49, Richard Biener  
> wrote:
> >
> > On Mon, Jul 26, 2021 at 11:06 AM Prathamesh Kulkarni via Gcc
> >  wrote:
> > >
> > > On Fri, 23 Jul 2021 at 23:29, Andrew Pinski  wrote:
> > > >
> > > > On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc
> > > >  wrote:
> > > > >
> > > > > Hi,
> > > > > Continuing from this thread,
> > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
> > > > > The proposal is to provide a mechanism to mark a parameter in a
> > > > > function as a literal constant.
> > > > >
> > > > > Motivation:
> > > > > Consider the following intrinsic vshl_n_s32 from arrm/arm_neon.h:
> > > > >
> > > > > __extension__ extern __inline int32x2_t
> > > > > __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> > > > > vshl_n_s32 (int32x2_t __a, const int __b)
> > > > > {
> > > > >   return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
> > > > > }
> > > > >
> > > > > and it's caller:
> > > > >
> > > > > int32x2_t f (int32x2_t x)
> > > > > {
> > > > >return vshl_n_s32 (x, 1);
> > > > > }
> > > >
> > > > Can't you do similar to what is done already in the aarch64 back-end:
> > > > #define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0]))
> > > > #define __AARCH64_LANE_CHECK(__vec, __idx)  \
> > > > __builtin_aarch64_im_lane_boundsi (sizeof(__vec),
> > > > sizeof(__vec[0]), __idx)
> > > >
> > > > ?
> > > > Yes this is about lanes but you could even add one for min/max which
> > > > is generic and such; add an argument to say the intrinsics name even.
> > > > You could do this as a non-target builtin if you want and reuse it
> > > > also for the aarch64 backend.
> > > Hi Andrew,
> > > Thanks for the suggestions. IIUC, we could use this approach to check
> > > if the argument
> > > falls within a certain range (min / max), but I am not sure how it
> > > will help to determine
> > > if the arg is a constant immediate ? AFAIK, vshl_n intrinsics require
> > > that the 2nd arg is immediate ?
> > >
> > > Even the current RTL builtin checking is not consistent across
> > > optimization levels:
> > > For eg:
> > > int32x2_t f(int32_t *restrict a)
> > > {
> > >   int32x2_t v = vld1_s32 (a);
> > >   int b = 2;
> > >   return vshl_n_s32 (v, b);
> > > }
> > >
> > > With pristine trunk, compiling with -O2 results in no errors because
> > > constant propagation replaces 'b' with 2, and during expansion,
> > > expand_builtin_args is happy. But at -O0, it results in the error -
> > > "argument 2 must be a constant immediate".
> > >
> > > So I guess we need some mechanism to mark a parameter as a constant ?
> >
> > I guess you want to mark it in a way that the frontend should force
> > constant evaluation and error if that's not possible?   C++ doesn't
> > allow to declare a parameter as 'constexpr' but something like
> >
> > void foo (consteval int i);
> >
> > since I guess you do want to allow passing constexpr arguments
> > in C++ or in C extended forms of constants like
> >
> > static const int a[4];
> >
> > foo (a[1]);
> >
> > ?  But yes, this looks useful to me.
> Hi Richard,
> Thanks for the suggestions and sorry for late response.
> I have attached a prototype patch that implements consteval attribute.
> As implemented, the attribute takes at least one argument(s), which
> refer to parameter position,
> and the corresponding parameter must be const qualified, failing
> which, the attribute is ignored.
>
> The patch does type-checking for arguments in
> check_function_consteval_attr, which
> simply does a linear search to see if the corresponding argument
> number is included in consteval attribute,
> and if yes, it checks if the argument is CONSTANT_CLASS_P.
>
> This works for simple constants and the intrinsics use-case, but
> rejects "extended constants" as in your above example.
> I guess we can special case to detect cases like a[i] where 'a' is
> const and 'i' is an immediate,

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-08-03 Thread Prathamesh Kulkarni via Gcc
On Tue, 27 Jul 2021 at 13:49, Richard Biener  wrote:
>
> On Mon, Jul 26, 2021 at 11:06 AM Prathamesh Kulkarni via Gcc
>  wrote:
> >
> > On Fri, 23 Jul 2021 at 23:29, Andrew Pinski  wrote:
> > >
> > > On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc
> > >  wrote:
> > > >
> > > > Hi,
> > > > Continuing from this thread,
> > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
> > > > The proposal is to provide a mechanism to mark a parameter in a
> > > > function as a literal constant.
> > > >
> > > > Motivation:
> > > > Consider the following intrinsic vshl_n_s32 from arrm/arm_neon.h:
> > > >
> > > > __extension__ extern __inline int32x2_t
> > > > __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> > > > vshl_n_s32 (int32x2_t __a, const int __b)
> > > > {
> > > >   return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
> > > > }
> > > >
> > > > and it's caller:
> > > >
> > > > int32x2_t f (int32x2_t x)
> > > > {
> > > >return vshl_n_s32 (x, 1);
> > > > }
> > >
> > > Can't you do similar to what is done already in the aarch64 back-end:
> > > #define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0]))
> > > #define __AARCH64_LANE_CHECK(__vec, __idx)  \
> > > __builtin_aarch64_im_lane_boundsi (sizeof(__vec),
> > > sizeof(__vec[0]), __idx)
> > >
> > > ?
> > > Yes this is about lanes but you could even add one for min/max which
> > > is generic and such; add an argument to say the intrinsics name even.
> > > You could do this as a non-target builtin if you want and reuse it
> > > also for the aarch64 backend.
> > Hi Andrew,
> > Thanks for the suggestions. IIUC, we could use this approach to check
> > if the argument
> > falls within a certain range (min / max), but I am not sure how it
> > will help to determine
> > if the arg is a constant immediate ? AFAIK, vshl_n intrinsics require
> > that the 2nd arg is immediate ?
> >
> > Even the current RTL builtin checking is not consistent across
> > optimization levels:
> > For eg:
> > int32x2_t f(int32_t *restrict a)
> > {
> >   int32x2_t v = vld1_s32 (a);
> >   int b = 2;
> >   return vshl_n_s32 (v, b);
> > }
> >
> > With pristine trunk, compiling with -O2 results in no errors because
> > constant propagation replaces 'b' with 2, and during expansion,
> > expand_builtin_args is happy. But at -O0, it results in the error -
> > "argument 2 must be a constant immediate".
> >
> > So I guess we need some mechanism to mark a parameter as a constant ?
>
> I guess you want to mark it in a way that the frontend should force
> constant evaluation and error if that's not possible?   C++ doesn't
> allow to declare a parameter as 'constexpr' but something like
>
> void foo (consteval int i);
>
> since I guess you do want to allow passing constexpr arguments
> in C++ or in C extended forms of constants like
>
> static const int a[4];
>
> foo (a[1]);
>
> ?  But yes, this looks useful to me.
Hi Richard,
Thanks for the suggestions and sorry for the late response.
I have attached a prototype patch that implements the consteval attribute.
As implemented, the attribute takes at least one argument, each
referring to a parameter position, and the corresponding parameter must
be const qualified, failing which the attribute is ignored.

The patch does type-checking for arguments in
check_function_consteval_attr, which simply does a linear search to see
if the corresponding argument number is included in the consteval
attribute, and if yes, it checks whether the argument is
CONSTANT_CLASS_P.

This works for simple constants and the intrinsics use-case, but
rejects "extended constants" as in your above example.
I guess we can special-case detection of cases like a[i] where 'a' is
const and 'i' is an immediate,
but I am not sure if there's a way to force constant evaluation in the FE ?
I tried using c_fully_fold (argarray[i], false, &maybe_const) but that
didn't seem to work.
Do you have any suggestions on how to detect those in the front-end ?

Example:

__attribute__((consteval(1, 2)))
void f(const int x, int *p)
{}

Calling it with:
1) f(0, (int *) 0) is accepted.

2)
void g(int *p)
{
  f (0, p);
}

emits the following error:
test.c: In function ‘g’:
test.c:7:9: error: argument 2 is not a constant.
7 |   f (0, p);
 

Re: [RFC] Adding a new attribute to function param to mark it as constant

2021-07-26 Thread Prathamesh Kulkarni via Gcc
On Fri, 23 Jul 2021 at 23:29, Andrew Pinski  wrote:
>
> On Fri, Jul 23, 2021 at 3:55 AM Prathamesh Kulkarni via Gcc
>  wrote:
> >
> > Hi,
> > Continuing from this thread,
> > https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
> > The proposal is to provide a mechanism to mark a parameter in a
> > function as a literal constant.
> >
> > Motivation:
> > Consider the following intrinsic vshl_n_s32 from arrm/arm_neon.h:
> >
> > __extension__ extern __inline int32x2_t
> > __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> > vshl_n_s32 (int32x2_t __a, const int __b)
> > {
> >   return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
> > }
> >
> > and it's caller:
> >
> > int32x2_t f (int32x2_t x)
> > {
> >return vshl_n_s32 (x, 1);
> > }
>
> Can't you do similar to what is done already in the aarch64 back-end:
> #define __AARCH64_NUM_LANES(__v) (sizeof (__v) / sizeof (__v[0]))
> #define __AARCH64_LANE_CHECK(__vec, __idx)  \
> __builtin_aarch64_im_lane_boundsi (sizeof(__vec),
> sizeof(__vec[0]), __idx)
>
> ?
> Yes this is about lanes but you could even add one for min/max which
> is generic and such; add an argument to say the intrinsics name even.
> You could do this as a non-target builtin if you want and reuse it
> also for the aarch64 backend.
Hi Andrew,
Thanks for the suggestions. IIUC, we could use this approach to check
if the argument falls within a certain range (min / max), but I am not
sure how it will help to determine if the arg is a constant immediate ?
AFAIK, vshl_n intrinsics require that the 2nd arg is immediate ?

Even the current RTL builtin checking is not consistent across
optimization levels:
For eg:
int32x2_t f(int32_t *restrict a)
{
  int32x2_t v = vld1_s32 (a);
  int b = 2;
  return vshl_n_s32 (v, b);
}

With pristine trunk, compiling with -O2 results in no errors because
constant propagation replaces 'b' with 2, and during expansion,
expand_builtin_args is happy. But at -O0, it results in the error -
"argument 2 must be a constant immediate".

So I guess we need some mechanism to mark a parameter as a constant ?

Thanks,
Prathamesh
>
> Thanks,
> Andrew Pinski
>
> >
> > The constraint here is that, vshl_n intrinsics require that the
> > second arg (__b),
> > should be an immediate value.
> > Currently, this check is performed by arm_expand_builtin_args, and if
> > a non-constant
> > value gets passed, it emits the following diagnostic:
> >
> > ../armhf-build/gcc/include/arm_neon.h:4904:10: error: argument 2 must
> > be a constant immediate
> >  4904 |   return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
> >   |  ^~~
> >
> > However, we're trying to replace builtin calls with gcc's C vector
> > extensions where
> > possible (PR66791), because the builtins are opaque to the optimizers.
> >
> > Unfortunately, we lose type checking of immediate value if we replace
> > the builtin
> > with << operator:
> >
> > __extension__ extern __inline int32x2_t
> > __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
> > vshl_n_s32 (int32x2_t __a, const int __b)
> > {
> >   return __a << __b;
> > }
> >
> > So, I was wondering if we should have an attribute for a parameter to
> > specifically
> > mark it as a constant value with optional range value info ?
> > As Richard suggested, sth like:
> > void foo(int x __attribute__((literal_constant (min_val, max_val)));
> >
> > Thanks,
> > Prathamesh


[RFC] Adding a new attribute to function param to mark it as constant

2021-07-23 Thread Prathamesh Kulkarni via Gcc
Hi,
Continuing from this thread,
https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575920.html
The proposal is to provide a mechanism to mark a parameter in a
function as a literal constant.

Motivation:
Consider the following intrinsic vshl_n_s32 from arm/arm_neon.h:

__extension__ extern __inline int32x2_t
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vshl_n_s32 (int32x2_t __a, const int __b)
{
  return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
}

and its caller:

int32x2_t f (int32x2_t x)
{
   return vshl_n_s32 (x, 1);
}

The constraint here is that, vshl_n intrinsics require that the
second arg (__b),
should be an immediate value.
Currently, this check is performed by arm_expand_builtin_args, and if
a non-constant
value gets passed, it emits the following diagnostic:

../armhf-build/gcc/include/arm_neon.h:4904:10: error: argument 2 must
be a constant immediate
 4904 |   return (int32x2_t)__builtin_neon_vshl_nv2si (__a, __b);
  |  ^~~

However, we're trying to replace builtin calls with gcc's C vector
extensions where
possible (PR66791), because the builtins are opaque to the optimizers.

Unfortunately, we lose type checking of immediate value if we replace
the builtin
with << operator:

__extension__ extern __inline int32x2_t
__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
vshl_n_s32 (int32x2_t __a, const int __b)
{
  return __a << __b;
}

So, I was wondering if we should have an attribute for a parameter to
specifically
mark it as a constant value with optional range value info ?
As Richard suggested, sth like:
void foo(int x __attribute__((literal_constant (min_val, max_val)));

Thanks,
Prathamesh


Re: Successive hoisting and AVAIL_OUT in at least one successor heuristic

2021-05-07 Thread Prathamesh Kulkarni via Gcc
On Thu, 6 May 2021 at 18:51, Michael Matz  wrote:
>
> Hello,
>
> On Thu, 6 May 2021, Prathamesh Kulkarni via Gcc wrote:
>
> > Well, I was thinking of this test-case:
> >
> > int f(int cond1, int cond2, int cond3, int x, int y)
> > {
> >   void f1();
> >   void f2(int);
> >   void f3();
> >
> >   if (cond1)
> > f1 ();
> >   else
> > {
> >   if (cond2)
> > f2 (x + y);
> >   else
> > f3 ();
> > }
> >
> >   return x + y;
> > }
> > ...
> > And during second iteration, it hoists x + y into bb2. So we are
> > effectively hoisting x + y from bb5, bb7 into bb2, which doesn't seem to
> > benefit other two paths (bb2->bb3->bb7 and bb2->bb4->bb6->bb7) since
> > they don't contain x + y.
>
> But bb7 eventually does, so it also doesn't make those paths worse.
> That's the nature of partial redundancy elimination: it doesn't require
> benefits on all paths, only on one of them.  That's in difference to full
> redundancy eliminaton.
>
> As with all hoistings it can increase register pressure.  The counter
> measure to this is not to limit the hoisting (because that can have
> ripple down effects when some hoists don't happen anymore), but rather to
> tackle the register pressure problem somewhen later, in the register
> allocator (or some preparatory pass).  But I'm not even sure if this is
> the reason you're wondering about how PRE hoists.
Hi Michael,
Yes, my concern was primarily w.r.t. increased register pressure, and I
was trying to think of a "hack" that would disable hoisting in a block
if it could potentially increase the chances of spills, offsetting its
benefit (although the issue isn't limited to hoisting, I was
special-casing it because hoisting seems to regress a couple of
benchmarks on arm and aarch64).
But yes, that doesn't look like the right approach.

Thanks,
Prathamesh
>
>
> Ciao,
> Michael.


Re: Successive hoisting and AVAIL_OUT in at least one successor heuristic

2021-05-06 Thread Prathamesh Kulkarni via Gcc
On Thu, 6 May 2021 at 17:01, Richard Biener  wrote:
>
> On Thu, 6 May 2021, Prathamesh Kulkarni wrote:
>
> > On Thu, 6 May 2021 at 15:43, Richard Biener  wrote:
> > >
> > > On Thu, 6 May 2021, Prathamesh Kulkarni wrote:
> > >
> > > > Hi Richard,
> > > > I was just wondering if second (and higher) order hoistings may defeat
> > > > the "AVAIL_OUT in at least one successor heuristic" ?
> > > >
> > > > For eg:
> > > > bb2:
> > > > if (cond1) goto bb3 else goto bb4;
> > > >
> > > > bb3:
> > > > if (cond2) goto bb5 else goto bb6;
> > > >
> > > > bb5:
> > > > return x + y;
> > > >
> > > > bb6:
> > > > return x + y;
> > > >
> > > > bb4:
> > > > if (cond3) goto bb7 else goto bb8;
> > > >
> > > > bb7:
> > > > return x + y;
> > > >
> > > > bb8:
> > > > return x + y;
> > > >
> > > > In above CFG, IIUC, we will end up hoisting x + y in bb2 even if bb3
> > > > or bb4 don't originally
> > > > contain x + y since during first hoisting pass x + y will be hoisted
> > > > from bb5, bb6 into bb3,
> > > > and likewise from bb7, bb8 into bb4 and during second hoisting pass it
> > > > will hoist x + y into bb2, since x + y is now available in bb3 and
> > > > bb4.
> > >
> > > But that's intended - the logic is _supposed_ to do "everything
> > > at once", thus repeated runs of hoisting should not hoist more.
> > > Which is also why we no longer iterate hoist insertion.
> > >
> > > > This might become an issue if successive hoistings will end up
> > > > hoisting the expression "too far" resulting in long range hoisting ?
> > >
> > > I think this "long range hoisting" argument is a red herring...
> > >
> > > > Also if we're hoisting from post-dom block in which case the
> > > > expression may not be truly ANTIC, and eventually end up being
> > > > inserted into a successor block by successive hoisting passes.
> > >
> > > But this shouldn't happen - can you show a testcase where it does?
> > Well, I was thinking of this test-case:
> >
> > int f(int cond1, int cond2, int cond3, int x, int y)
> > {
> >   void f1();
> >   void f2(int);
> >   void f3();
> >
> >   if (cond1)
> > f1 ();
> >   else
> > {
> >   if (cond2)
> > f2 (x + y);
> >   else
> > f3 ();
> > }
> >
> >   return x + y;
> > }
> >
> > The input to PRE pass is:
> >
> > f (int cond1, int cond2, int cond3, int x, int y)
> > {
> >   int _1;
> >   int _11;
> >
> >   <bb 2> [local count: 1073741824]:
> >   if (cond1_3(D) != 0)
> >     goto <bb 3>; [33.00%]
> >   else
> >     goto <bb 4>; [67.00%]
> >
> >   <bb 3> [local count: 354334800]:
> >   f1 ();
> >   goto <bb 7>; [100.00%]
> >
> >   <bb 4> [local count: 719407025]:
> >   if (cond2_4(D) != 0)
> >     goto <bb 5>; [50.00%]
> >   else
> >     goto <bb 6>; [50.00%]
> >
> >   <bb 5> [local count: 359703512]:
> >   _1 = x_7(D) + y_8(D);
> >   f2 (_1);
> >   goto <bb 7>; [100.00%]
> >
> >   <bb 6> [local count: 359703512]:
> >   f3 ();
> >
> >   <bb 7> [local count: 1073741824]:
> >   _11 = x_7(D) + y_8(D);
> >   return _11;
> > }
> >
> > pre dump shows:
> > Inserting expression in block 4 for code hoisting:
> > {plus_expr,x_7(D),y_8(D)} (0001)
> > Inserted _12 = x_7(D) + y_8(D);
> >  in predecessor 4 (0001)
> > Found partial redundancy for expression {plus_expr,x_7(D),y_8(D)} (0001)
> > Inserted _13 = x_7(D) + y_8(D);
> >  in predecessor 3 (0001)
>
> You are clearly looking at old GCC - GCC 11+ first do all PRE
Oops, sorry! I was looking at an old branch.
> insertion and only then do a _single_ hoist insert run.  But still
> it happens exactly as designed - there's a partial redundancy
> for the x + y add on BB7 predecessors so we insert in BB6 and BB3.
> That in turn leaves us with a perfect opportunity for hoisting
> where I don't see any reason do _not_ perform it since it saves
> us two copies of x + y.
>
> Trunk:
>
> Starting iteration 1
> Starting insert iteration 1
> Found partial redundancy for expression {plus_expr,x_7(D),y_8(D)} (0001)
> Inserted _12 = x_7(D) + y_8(D);
>  in predecessor 3 (0001)
> Inserted _13 = x_7(D) + y_8(D);
>  in predecessor 6 (0001)
> Created phi prephitmp_14 = PHI <_12(3), _1(5), _13(6)>
>  in block 7 (0001)
> Inserting expression in block 4 for code hoisting:
> {plus_expr,x_7(D),y_8(D)} (0001)
> Inserted _15 = x_7(D) + y_8(D);
>  in predecessor 4 (0001)
> Inserting expression in block 2 for code hoisting:
> {plus_expr,x_7(D),y_8(D)} (0001)
> Inserted _16 = x_7(D) + y_8(D);
>  in predecessor 2 (0001)
>
> (the hoist insert into BB4 is spurious)
>
> > Created phi prephitmp_14 = PHI <_13(3), _12(5), _12(6)>
> >  in block 7 (0001)
> > Starting insert iteration 2
> > Inserting expression in block 2 for code hoisting:
> > {plus_expr,x_7(D),y_8(D)} (0001)
> > Inserted _15 = x_7(D) + y_8(D);
> >  in predecessor 2 (0001)
> >
> > So IIUC, during first insert iteration, it is hoisting x + y into bb4
> > and PRE inserts x + y into bb3.
> > And during second iteration, it hoists x + y into bb2.
> > So we are effectively hoisting x + y from bb5, bb7 into bb2, which
> > doesn't seem to benefit other
> > two paths (bb2->bb3->bb7 and bb2->bb4->bb6->bb7) since they don't contain x + y.

Re: Successive hoisting and AVAIL_OUT in at least one successor heuristic

2021-05-06 Thread Prathamesh Kulkarni via Gcc
On Thu, 6 May 2021 at 15:43, Richard Biener  wrote:
>
> On Thu, 6 May 2021, Prathamesh Kulkarni wrote:
>
> > Hi Richard,
> > I was just wondering if second (and higher) order hoistings may defeat
> > the "AVAIL_OUT in at least one successor heuristic" ?
> >
> > For eg:
> > bb2:
> > if (cond1) goto bb3 else goto bb4;
> >
> > bb3:
> > if (cond2) goto bb5 else goto bb6;
> >
> > bb5:
> > return x + y;
> >
> > bb6:
> > return x + y;
> >
> > bb4:
> > if (cond3) goto bb7 else goto bb8;
> >
> > bb7:
> > return x + y;
> >
> > bb8:
> > return x + y;
> >
> > In above CFG, IIUC, we will end up hoisting x + y in bb2 even if bb3
> > or bb4 don't originally
> > contain x + y since during first hoisting pass x + y will be hoisted
> > from bb5, bb6 into bb3,
> > and likewise from bb7, bb8 into bb4 and during second hoisting pass it
> > will hoist x + y into bb2, since x + y is now available in bb3 and
> > bb4.
>
> But that's intended - the logic is _supposed_ to do "everything
> at once", thus repeated runs of hoisting should not hoist more.
> Which is also why we no longer iterate hoist insertion.
>
> > This might become an issue if successive hoistings will end up
> > hoisting the expression "too far" resulting in long range hoisting ?
>
> I think this "long range hoisting" argument is a red herring...
>
> > Also if we're hoisting from post-dom block in which case the
> > expression may not be truly ANTIC, and eventually end up being
> > inserted into a successor block by successive hoisting passes.
>
> But this shouldn't happen - can you show a testcase where it does?
Well, I was thinking of this test-case:

int f(int cond1, int cond2, int cond3, int x, int y)
{
  void f1();
  void f2(int);
  void f3();

  if (cond1)
f1 ();
  else
{
  if (cond2)
f2 (x + y);
  else
f3 ();
}

  return x + y;
}

The input to PRE pass is:

f (int cond1, int cond2, int cond3, int x, int y)
{
  int _1;
  int _11;

  <bb 2> [local count: 1073741824]:
  if (cond1_3(D) != 0)
    goto <bb 3>; [33.00%]
  else
    goto <bb 4>; [67.00%]

  <bb 3> [local count: 354334800]:
  f1 ();
  goto <bb 7>; [100.00%]

  <bb 4> [local count: 719407025]:
  if (cond2_4(D) != 0)
    goto <bb 5>; [50.00%]
  else
    goto <bb 6>; [50.00%]

  <bb 5> [local count: 359703512]:
  _1 = x_7(D) + y_8(D);
  f2 (_1);
  goto <bb 7>; [100.00%]

  <bb 6> [local count: 359703512]:
  f3 ();

  <bb 7> [local count: 1073741824]:
  _11 = x_7(D) + y_8(D);
  return _11;
}

pre dump shows:
Inserting expression in block 4 for code hoisting:
{plus_expr,x_7(D),y_8(D)} (0001)
Inserted _12 = x_7(D) + y_8(D);
 in predecessor 4 (0001)
Found partial redundancy for expression {plus_expr,x_7(D),y_8(D)} (0001)
Inserted _13 = x_7(D) + y_8(D);
 in predecessor 3 (0001)
Created phi prephitmp_14 = PHI <_13(3), _12(5), _12(6)>
 in block 7 (0001)
Starting insert iteration 2
Inserting expression in block 2 for code hoisting:
{plus_expr,x_7(D),y_8(D)} (0001)
Inserted _15 = x_7(D) + y_8(D);
 in predecessor 2 (0001)

So IIUC, during first insert iteration, it is hoisting x + y into bb4
and PRE inserts x + y into bb3.
And during second iteration, it hoists x + y into bb2.
So we are effectively hoisting x + y from bb5, bb7 into bb2, which
doesn't seem to benefit other
two paths (bb2->bb3->bb7 and bb2->bb4->bb6->bb7) since they don't contain x + y.
I am not sure if we should be hoisting x + y into bb2 for this case ?

Thanks,
Prathamesh
>
> > I was wondering if we should check that the expression is "originally"
> > in AVAIL_OUT in at least one successor of block to avoid considering
> > expressions inserted by hoisting / PRE ?
>
> No, why should we?
>
> Richard.


Successive hoisting and AVAIL_OUT in at least one successor heuristic

2021-05-06 Thread Prathamesh Kulkarni via Gcc
Hi Richard,
I was just wondering if second (and higher) order hoistings may defeat
the "AVAIL_OUT in at least one successor heuristic" ?

For eg:
bb2:
if (cond1) goto bb3 else goto bb4;

bb3:
if (cond2) goto bb5 else goto bb6;

bb5:
return x + y;

bb6:
return x + y;

bb4:
if (cond3) goto bb7 else goto bb8;

bb7:
return x + y;

bb8:
return x + y;

In above CFG, IIUC, we will end up hoisting x + y in bb2 even if bb3
or bb4 don't originally
contain x + y since during first hoisting pass x + y will be hoisted
from bb5, bb6 into bb3,
and likewise from bb7, bb8 into bb4 and during second hoisting pass it
will hoist x + y into bb2, since x + y is now available in bb3 and
bb4.

This might become an issue if successive hoistings will end up
hoisting the expression "too far" resulting in long range hoisting ?
Also if we're hoisting from post-dom block in which case the
expression may not be truly ANTIC, and eventually end up being
inserted into a successor block by successive hoisting passes.

I was wondering if we should check that the expression is "originally"
in AVAIL_OUT in at least one successor of block to avoid considering
expressions inserted by hoisting / PRE ?
We could do that by "marking" blocks during first hoisting pass (or
before hoisting / PRE),
that intersect with AVAIL_OUT of at least one successor and use that
info to check the heuristic during successive hoisting passes ?

Thanks,
Prathamesh


Re: My 2nd attempt to devel for gcc

2021-04-14 Thread Prathamesh Kulkarni via Gcc
On Wed, 14 Apr 2021 at 22:54, Joseph Myers  wrote:
>
> On Wed, 14 Apr 2021, pawel k. via Gcc wrote:
>
> > My best guess is if we could hookify all target code everything callable
> > either from frontends or midend, we could try to severly cut this estimate.
>
> That's a 700-patch series (there are about 700 target macros).  For every
> target macro, it's necessary to work out the corresponding target hook
> interface, which shouldn't always correspond one-to-one to the macro
> interface (and hooks have well-defined argument types, which macros don't
> always), and deal with the conversion, covering all the irregular ways in
> which different targets define the macros (including e.g. cases where some
> architectures define a target macro differently for different target
> OSes).
Joseph has articulately summarized the process in this thread on
target macro conversion:
https://gcc.gnu.org/pipermail/gcc/2015-February/216586.html

Thanks,
Prathamesh
>
> --
> Joseph S. Myers
> jos...@codesourcery.com


Re: How to decide the what type is currently being used in tree_node?

2021-02-19 Thread Prathamesh Kulkarni via Gcc
On Fri, 19 Feb 2021 at 10:31, Shuai Wang via Gcc  wrote:
>
> Hello,
>
> I noticed that tree_node is implemented as a union (
> https://code.woboq.org/gcc/gcc/tree-core.h.html#tree_node). However, I
> cannot find a way of checking whether the current tree_node is really a
> base or type.
>
> For instance, currently when I am using:
>
> is_gimple_constant(v)
>
> Given `v` as a tree type but NOT base type, the above statement would
> crash. I am thinking there should be a method like:
>
> is_tree_base(v) == false
>
> or something like this; however, I couldn't find one. Can anyone shed some
> lights on this? Thank you very much!
If you want to ascertain which class the tree node belongs to, you can
use either TREE_CODE, or one of the _P macros defined in tree.h. If you
use an accessor that requires a tree node of a certain class, then you
need to check for it beforehand.
For example, before using DECL_NAME on a tree node, you need to
explicitly check that it's indeed a decl node, using DECL_P.
For is_gimple_constant(v), it will return true if v is one of the
constant nodes (*_CST), and false otherwise, so I don't see how it
would crash if you pass a non-constant node?

Thanks,
Prathamesh
>
> best,
> Shuai


Re: Performing inter-procedural dataflow analysis

2021-02-17 Thread Prathamesh Kulkarni via Gcc
On Thu, 18 Feb 2021 at 08:39, Shuai Wang via Gcc  wrote:
>
> Hello,
>
> I am doing interprocedural dataflow analysis and countered the following
> issue. Suppose I have an GIMPLE IR code as follows, which is after the
> "simdclone" pass while before my own SIMPLE IPA pass:
>
>
> foo (int a, int b)
> {
>   int c;
>   int d;
>   int D.2425;
>   int _5;
>
>:
> *  c_4 = a_2(D) + b_3(D);  *
> *  _5 = c_4;*
>   
>
> As you can see, while propagating certain dataflow facts among local
> variables are smooth (e.g., from c_4 --> c_4 --> _5), I can hardly somehow
> "link" function parameter "a" (or "b") with its first local usage "a_2(D)"
> or "b_3(D)".
>
> So my question is, when traversing the GIMPLE statements and encounter SSA
> variables like "a_2(D)" or "b_3(D)", how do I know they originate from
> parameter "a" and "b"?
You can use SSA_NAME_VAR to get the "base" variable corresponding to the
ssa name, and then check it against PARM_DECL (note that SSA_NAME_VAR is
NULL for anonymous SSA names such as _5, so check that first).
Sth like:
tree var = SSA_NAME_VAR (ssa_name);
is_param = var && TREE_CODE (var) == PARM_DECL;

Thanks,
Prathamesh
>
> Thank you!
>
> Best,
> Shuai


Re: Request for contribution to your project

2021-01-22 Thread Prathamesh Kulkarni via Gcc
On Thu, 21 Jan 2021 at 20:49, divyanshu jamloki via Gcc  wrote:
>
> Ma'am
>
> I am a 1st year computer science engineering undergraduate student at
> krishna engineering college (affiliated to Dr. A.P.J. Abdul Kalam Technical
> University (AKTU)) . I am actively looking for some GSoC organisation to
> contribute . I actively contribute 2 years to a website name
> livecultureofindia.com ( WordPress ) and have worked with html , c ,c++ . I
> am interested to contribute in your project and your view to change world
> by making others life easy .
>
> I saw your organisation in GSoC and thought of asking you if system can be
> a good organisation for me to start . If yes please guide me how to begin
> and let me know how can I contact you for further help.
>
>
> Hoping to hear from you soon
Hi Divyanshu,
As a starting point, go through the "Before you apply" section in the
GSoC wiki: https://gcc.gnu.org/wiki/SummerOfCode,
and the ideas listed on the page. Feel free to ask if you have any
questions about getting started with GCC or about the ideas listed under
the GSoC 2021 section.

Thanks,
Prathamesh
>
>  Thank you


Re: Help with PR97872

2020-12-10 Thread Prathamesh Kulkarni via Gcc
On Thu, 10 Dec 2020 at 17:11, Richard Biener  wrote:
>
> On Wed, 9 Dec 2020, Prathamesh Kulkarni wrote:
>
> > On Tue, 8 Dec 2020 at 14:36, Prathamesh Kulkarni
> >  wrote:
> > >
> > > On Mon, 7 Dec 2020 at 17:37, Hongtao Liu  wrote:
> > > >
> > > > On Mon, Dec 7, 2020 at 7:11 PM Prathamesh Kulkarni
> > > >  wrote:
> > > > >
> > > > > On Mon, 7 Dec 2020 at 16:15, Hongtao Liu  wrote:
> > > > > >
> > > > > > On Mon, Dec 7, 2020 at 5:47 PM Richard Biener  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > >
> > > > > > > > On Mon, 7 Dec 2020 at 13:01, Richard Biener  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, 4 Dec 2020 at 17:18, Richard Biener 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Thu, 3 Dec 2020 at 16:35, Richard Biener 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener 
> > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > For the test mentioned in PR, I was trying to 
> > > > > > > > > > > > > > > > see if we could do
> > > > > > > > > > > > > > > > specialized expansion for vcond in target when 
> > > > > > > > > > > > > > > > operands are -1 and 0.
> > > > > > > > > > > > > > > > arm_expand_vcond gets the following operands:
> > > > > > > > > > > > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > > > > > > > > > > > (reg:V8QI 117)
> > > > > > > > > > > > > > > > (reg:V8QI 118)
> > > > > > > > > > > > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > > > > > > > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > > > > > (reg/v:V8QI 116 [ b ])
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > where r117 and r118 are set to vector constants 
> > > > > > > > > > > > > > > > -1 and 0 respectively.
> > > > > > > > > > > > > > > > However, I am not sure if there's a way to 
> > > > > > > > > > > > > > > > check if the register is
> > > > > > > > > > > > > > > > constant during expansion time (since we don't 
> > > > > > > > > > > > > > > > have df analysis yet) ?
> > > > > >
> > > > > > It seems to me that all you need to do is relax the predicates of 
> > > > > > op1
> > > > > > and op2 in vcondmn to accept const0_rtx and constm1_rtx. I haven't
> > > > > > debugged it, but I see that vcondmn in neon.md only accepts
> > > > > > s_register_operand.
> > > > > >
> > > > > > (define_expand "vcond"
> > > > > >   [(set (match_operand:VDQW 0 "s_register_operand")
> > > > > > (if_then_else:VDQW
> > > > > >   (match_operator 3 "comparison_operator"
> > > > > > [(match_operand:VDQW 4 "s_register_operand")
> > > > > >  (match_operand:VDQW 5 "reg_or_zero_operand")])
> > > > > >   (match_operand:VDQW 1 "s_register_operand")
> > > > > >   (match_operand:VDQW 2 "s_register_operand")))]
> > > > > >   "TARGET_NEON && (! || 
> > > > > > flag_unsafe_math_optimizations)"
> > > > > > {
> > > > > >   arm_expand_vcond (operands, mode);
> > > > > >   DONE;
> > > > > > })
> > > > > >
> > > > > > in sse.md it's defined as
> > > > > > (define_expand "vcondu"
> > > > > >   [(set (match_operand:V_512 0 "register_operand")
> > > > > > (if_then_else:V_512
> > > > > >   (match_operator 3 ""
> > > > > > [(match_operand:VI_AVX512BW 4 "nonimmediate_operand")
> > > > > >  (match_operand:VI_AVX512BW 5 "nonimmediate_operand")])
> > > > > >   (match_operand:V_512 1 "general_operand")
> > > > > >   (match_operand:V_512 2 "general_operand")))]
> > > > > >   "TARGET_AVX512F
> > > > > >&& (GET_MODE_NUNITS (mode)
> > > > > >== GET_MODE_NUNITS (mode))"
> > > > > > {
> > > > > >   bool ok = ix86_expand_int_vcond (operands);
> > > > > >   gcc_assert (ok);
> > > > > >   DONE;
> > > > > > })
> > > > > >
> > > > > > then we can get operands[1] and operands[2] as
> > > > > >
> > > > > > (gdb) p debug_rtx (operands[1])
> > > > > >  (const_vector:V16QI [
> > > > > > (const_int -1 [0x]) repeated x16
> > > > > > ])
> > > > > > (gdb) p debug_rtx (operands[2])
> > > > > > (reg:V16QI 82 [ _2 ])
> > > > > > (const_vector:V16QI [
> > > > > > (const_int 0 [0]) repeated x16
> > > > > > ])
> > > > > Hi Hongtao,
> > > > > Thanks for the suggestions!
> > > > > However IIUC from vector extensions doc page, the result of vector
> > > > > comparison is defined to be 0
> > > > > or -1, so would it be better to canonicalize
> > > > > x cmp y ? -1 : 0 to x cmp y, on GIMPLE itself during gimple-isel and
> > > > > adjust targets if required ?

Re: Help with PR97872

2020-12-09 Thread Prathamesh Kulkarni via Gcc
On Tue, 8 Dec 2020 at 14:36, Prathamesh Kulkarni
 wrote:
>
> On Mon, 7 Dec 2020 at 17:37, Hongtao Liu  wrote:
> >
> > On Mon, Dec 7, 2020 at 7:11 PM Prathamesh Kulkarni
> >  wrote:
> > >
> > > On Mon, 7 Dec 2020 at 16:15, Hongtao Liu  wrote:
> > > >
> > > > On Mon, Dec 7, 2020 at 5:47 PM Richard Biener  wrote:
> > > > >
> > > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > > >
> > > > > > On Mon, 7 Dec 2020 at 13:01, Richard Biener  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > >
> > > > > > > > On Fri, 4 Dec 2020 at 17:18, Richard Biener  
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > >
> > > > > > > > > > On Thu, 3 Dec 2020 at 16:35, Richard Biener 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > For the test mentioned in PR, I was trying to see 
> > > > > > > > > > > > > > if we could do
> > > > > > > > > > > > > > specialized expansion for vcond in target when 
> > > > > > > > > > > > > > operands are -1 and 0.
> > > > > > > > > > > > > > arm_expand_vcond gets the following operands:
> > > > > > > > > > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > > > > > > > > > (reg:V8QI 117)
> > > > > > > > > > > > > > (reg:V8QI 118)
> > > > > > > > > > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > > > > > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > > > (reg/v:V8QI 116 [ b ])
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > where r117 and r118 are set to vector constants -1 
> > > > > > > > > > > > > > and 0 respectively.
> > > > > > > > > > > > > > However, I am not sure if there's a way to check if 
> > > > > > > > > > > > > > the register is
> > > > > > > > > > > > > > constant during expansion time (since we don't have 
> > > > > > > > > > > > > > df analysis yet) ?
> > > >
> > > > It seems to me that all you need to do is relax the predicates of op1
> > > > and op2 in vcondmn to accept const0_rtx and constm1_rtx. I haven't
> > > > debugged it, but I see that vcondmn in neon.md only accepts
> > > > s_register_operand.
> > > >
> > > > (define_expand "vcond"
> > > >   [(set (match_operand:VDQW 0 "s_register_operand")
> > > > (if_then_else:VDQW
> > > >   (match_operator 3 "comparison_operator"
> > > > [(match_operand:VDQW 4 "s_register_operand")
> > > >  (match_operand:VDQW 5 "reg_or_zero_operand")])
> > > >   (match_operand:VDQW 1 "s_register_operand")
> > > >   (match_operand:VDQW 2 "s_register_operand")))]
> > > >   "TARGET_NEON && (! || flag_unsafe_math_optimizations)"
> > > > {
> > > >   arm_expand_vcond (operands, mode);
> > > >   DONE;
> > > > })
> > > >
> > > > in sse.md it's defined as
> > > > (define_expand "vcondu"
> > > >   [(set (match_operand:V_512 0 "register_operand")
> > > > (if_then_else:V_512
> > > >   (match_operator 3 ""
> > > > [(match_operand:VI_AVX512BW 4 "nonimmediate_operand")
> > > >  (match_operand:VI_AVX512BW 5 "nonimmediate_operand")])
> > > >   (match_operand:V_512 1 "general_operand")
> > > >   (match_operand:V_512 2 "general_operand")))]
> > > >   "TARGET_AVX512F
> > > >&& (GET_MODE_NUNITS (mode)
> > > >== GET_MODE_NUNITS (mode))"
> > > > {
> > > >   bool ok = ix86_expand_int_vcond (operands);
> > > >   gcc_assert (ok);
> > > >   DONE;
> > > > })
> > > >
> > > > then we can get operands[1] and operands[2] as
> > > >
> > > > (gdb) p debug_rtx (operands[1])
> > > >  (const_vector:V16QI [
> > > > (const_int -1 [0x]) repeated x16
> > > > ])
> > > > (gdb) p debug_rtx (operands[2])
> > > > (reg:V16QI 82 [ _2 ])
> > > > (const_vector:V16QI [
> > > > (const_int 0 [0]) repeated x16
> > > > ])
> > > Hi Hongtao,
> > > Thanks for the suggestions!
> > > However IIUC from vector extensions doc page, the result of vector
> > > comparison is defined to be 0
> > > or -1, so would it be better to canonicalize
> > > x cmp y ? -1 : 0 to x cmp y, on GIMPLE itself during gimple-isel and
> > > adjust targets if required ?
> >
> > Yes, it would be more straightforward to handle it in gimple isel, I
> > would adjust the backend and testcase after you check in the patch.
> Thanks! I have committed the attached patch in
> 3a6e3ad38a17a03ee0139b49a0946e7b9ded1eb1.
Hi,
I was looking at a similar issue in PR97903 and wondering what would be
the right approach to lower (a & b) != 0 to vtst ?
For test-case:
#include <arm_neon.h>

int8x8_t f1(int8x8_t a, int8x8_t b) {
  return (a & b) != 0;
}

Re: Help with PR97872

2020-12-08 Thread Prathamesh Kulkarni via Gcc
On Mon, 7 Dec 2020 at 17:37, Hongtao Liu  wrote:
>
> On Mon, Dec 7, 2020 at 7:11 PM Prathamesh Kulkarni
>  wrote:
> >
> > On Mon, 7 Dec 2020 at 16:15, Hongtao Liu  wrote:
> > >
> > > On Mon, Dec 7, 2020 at 5:47 PM Richard Biener  wrote:
> > > >
> > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > >
> > > > > On Mon, 7 Dec 2020 at 13:01, Richard Biener  wrote:
> > > > > >
> > > > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > >
> > > > > > > On Fri, 4 Dec 2020 at 17:18, Richard Biener  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > >
> > > > > > > > > On Thu, 3 Dec 2020 at 16:35, Richard Biener 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > >
> > > > > > > > > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > For the test mentioned in PR, I was trying to see if 
> > > > > > > > > > > > > we could do
> > > > > > > > > > > > > specialized expansion for vcond in target when 
> > > > > > > > > > > > > operands are -1 and 0.
> > > > > > > > > > > > > arm_expand_vcond gets the following operands:
> > > > > > > > > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > > > > > > > > (reg:V8QI 117)
> > > > > > > > > > > > > (reg:V8QI 118)
> > > > > > > > > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > > > > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > > > (reg/v:V8QI 116 [ b ])
> > > > > > > > > > > > >
> > > > > > > > > > > > > where r117 and r118 are set to vector constants -1 
> > > > > > > > > > > > > and 0 respectively.
> > > > > > > > > > > > > However, I am not sure if there's a way to check if 
> > > > > > > > > > > > > the register is
> > > > > > > > > > > > > constant during expansion time (since we don't have 
> > > > > > > > > > > > > df analysis yet) ?
> > >
> > > It seems to me that all you need to do is relax the predicates of op1
> > > and op2 in vcondmn to accept const0_rtx and constm1_rtx. I haven't
> > > debugged it, but I see that vcondmn in neon.md only accepts
> > > s_register_operand.
> > >
> > > (define_expand "vcond"
> > >   [(set (match_operand:VDQW 0 "s_register_operand")
> > > (if_then_else:VDQW
> > >   (match_operator 3 "comparison_operator"
> > > [(match_operand:VDQW 4 "s_register_operand")
> > >  (match_operand:VDQW 5 "reg_or_zero_operand")])
> > >   (match_operand:VDQW 1 "s_register_operand")
> > >   (match_operand:VDQW 2 "s_register_operand")))]
> > >   "TARGET_NEON && (! || flag_unsafe_math_optimizations)"
> > > {
> > >   arm_expand_vcond (operands, mode);
> > >   DONE;
> > > })
> > >
> > > in sse.md it's defined as
> > > (define_expand "vcondu"
> > >   [(set (match_operand:V_512 0 "register_operand")
> > > (if_then_else:V_512
> > >   (match_operator 3 ""
> > > [(match_operand:VI_AVX512BW 4 "nonimmediate_operand")
> > >  (match_operand:VI_AVX512BW 5 "nonimmediate_operand")])
> > >   (match_operand:V_512 1 "general_operand")
> > >   (match_operand:V_512 2 "general_operand")))]
> > >   "TARGET_AVX512F
> > >&& (GET_MODE_NUNITS (mode)
> > >== GET_MODE_NUNITS (mode))"
> > > {
> > >   bool ok = ix86_expand_int_vcond (operands);
> > >   gcc_assert (ok);
> > >   DONE;
> > > })
> > >
> > > then we can get operands[1] and operands[2] as
> > >
> > > (gdb) p debug_rtx (operands[1])
> > >  (const_vector:V16QI [
> > > (const_int -1 [0x]) repeated x16
> > > ])
> > > (gdb) p debug_rtx (operands[2])
> > > (reg:V16QI 82 [ _2 ])
> > > (const_vector:V16QI [
> > > (const_int 0 [0]) repeated x16
> > > ])
> > Hi Hongtao,
> > Thanks for the suggestions!
> > However IIUC from vector extensions doc page, the result of vector
> > comparison is defined to be 0
> > or -1, so would it be better to canonicalize
> > x cmp y ? -1 : 0 to x cmp y, on GIMPLE itself during gimple-isel and
> > adjust targets if required ?
>
> Yes, it would be more straightforward to handle it in gimple isel, I
> would adjust the backend and testcase after you check in the patch.
Thanks! I have committed the attached patch in
3a6e3ad38a17a03ee0139b49a0946e7b9ded1eb1.

Regards,
Prathamesh
>
> > Alternatively, I could try fixing this in backend as you suggest above.
> >
> > Thanks,
> > Prathamesh
> > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Alternatively, should we add a target hook that 
> > > > > > > > > > > > > returns true if the
> > > > > > > > > > > > > result of vector comparison is set to all-ones or 
> > > > > > > > > > > > > all-zeros, and then
> > > > > > > > > > > > > use this hook in gimple ISEL to effectively turn 
> > > > > > > >

Re: Help with PR97872

2020-12-07 Thread Prathamesh Kulkarni via Gcc
On Mon, 7 Dec 2020 at 16:15, Hongtao Liu  wrote:
>
> On Mon, Dec 7, 2020 at 5:47 PM Richard Biener  wrote:
> >
> > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> >
> > > On Mon, 7 Dec 2020 at 13:01, Richard Biener  wrote:
> > > >
> > > > On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
> > > >
> > > > > On Fri, 4 Dec 2020 at 17:18, Richard Biener  wrote:
> > > > > >
> > > > > > On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > >
> > > > > > > On Thu, 3 Dec 2020 at 16:35, Richard Biener  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > >
> > > > > > > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > > For the test mentioned in PR, I was trying to see if we 
> > > > > > > > > > > could do
> > > > > > > > > > > specialized expansion for vcond in target when operands 
> > > > > > > > > > > are -1 and 0.
> > > > > > > > > > > arm_expand_vcond gets the following operands:
> > > > > > > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > > > > > > (reg:V8QI 117)
> > > > > > > > > > > (reg:V8QI 118)
> > > > > > > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > > > > > > (reg/v:V8QI 116 [ b ])
> > > > > > > > > > >
> > > > > > > > > > > where r117 and r118 are set to vector constants -1 and 0 
> > > > > > > > > > > respectively.
> > > > > > > > > > > However, I am not sure if there's a way to check if the 
> > > > > > > > > > > register is
> > > > > > > > > > > constant during expansion time (since we don't have df 
> > > > > > > > > > > analysis yet) ?
>
> It seems to me that all you need to do is relax the predicates of op1
> and op2 in vcondmn to accept const0_rtx and constm1_rtx. I haven't
> debugged it, but I see that vcondmn in neon.md only accepts
> s_register_operand.
>
> (define_expand "vcond<mode><mode>"
>   [(set (match_operand:VDQW 0 "s_register_operand")
> (if_then_else:VDQW
>   (match_operator 3 "comparison_operator"
> [(match_operand:VDQW 4 "s_register_operand")
>  (match_operand:VDQW 5 "reg_or_zero_operand")])
>   (match_operand:VDQW 1 "s_register_operand")
>   (match_operand:VDQW 2 "s_register_operand")))]
>   "TARGET_NEON && (!<Is_float_mode> || flag_unsafe_math_optimizations)"
> {
>   arm_expand_vcond (operands, <V_cmp_result>mode);
>   DONE;
> })
>
> in sse.md it's defined as
> (define_expand "vcondu<V_512:mode><VI_AVX512BW:mode>"
>   [(set (match_operand:V_512 0 "register_operand")
> (if_then_else:V_512
>   (match_operator 3 ""
> [(match_operand:VI_AVX512BW 4 "nonimmediate_operand")
>  (match_operand:VI_AVX512BW 5 "nonimmediate_operand")])
>   (match_operand:V_512 1 "general_operand")
>   (match_operand:V_512 2 "general_operand")))]
>   "TARGET_AVX512F
>&& (GET_MODE_NUNITS (<V_512:MODE>mode)
>== GET_MODE_NUNITS (<VI_AVX512BW:MODE>mode))"
> {
>   bool ok = ix86_expand_int_vcond (operands);
>   gcc_assert (ok);
>   DONE;
> })
>
> then we can get operands[1] and operands[2] as
>
> (gdb) p debug_rtx (operands[1])
>  (const_vector:V16QI [
> (const_int -1 [0xffffffffffffffff]) repeated x16
> ])
> (gdb) p debug_rtx (operands[2])
> (reg:V16QI 82 [ _2 ])
> (const_vector:V16QI [
> (const_int 0 [0]) repeated x16
> ])
Hi Hongtao,
Thanks for the suggestions!
However, IIUC from the vector extensions doc page, the result of a vector
comparison is defined to be 0 or -1, so would it be better to
canonicalize x cmp y ? -1 : 0 to x cmp y on GIMPLE itself during
gimple-isel, and adjust targets if required ?
Alternatively, I could try fixing this in backend as you suggest above.

Thanks,
Prathamesh
>
> > > > > > > > > > >
> > > > > > > > > > > Alternatively, should we add a target hook that returns 
> > > > > > > > > > > true if the
> > > > > > > > > > > result of vector comparison is set to all-ones or 
> > > > > > > > > > > all-zeros, and then
> > > > > > > > > > > use this hook in gimple ISEL to effectively turn 
> > > > > > > > > > > VEC_COND_EXPR into nop ?
> > > > > > > > > >
> > > > > > > > > > Would everything match-up for a .VEC_CMP IFN producing a 
> > > > > > > > > > non-mask
> > > > > > > > > > vector type?  ISEL could special case the a ? -1 : 0 case 
> > > > > > > > > > this way.
> > > > > > > > > I think the vec_cmp pattern matches but it produces a masked 
> > > > > > > > > vector type.
> > > > > > > > > In the attached patch, I simply replaced:
> > > > > > > > > _1 = a < b
> > > > > > > > > x = _1 ? -1 : 0
> > > > > > > > > with
> > > > > > > > > x = view_convert_expr<_1>
> > > > > > > > >
> > > > > > > > > For the test-case, isel generates:
> > > > > > > > >   vector(8) <signed-boolean:8> _1;
> > > > > > > > >   vector(8) signed char _2;
> > > > > > > > >   uint8x8_t _5;
> > > > > > > > >
> > > > > > > > >   <bb 2> [local count: 1073741824]:
> > > > > 

Re: Help with PR97872

2020-12-07 Thread Prathamesh Kulkarni via Gcc
On Mon, 7 Dec 2020 at 13:01, Richard Biener  wrote:
>
> On Mon, 7 Dec 2020, Prathamesh Kulkarni wrote:
>
> > On Fri, 4 Dec 2020 at 17:18, Richard Biener  wrote:
> > >
> > > On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
> > >
> > > > On Thu, 3 Dec 2020 at 16:35, Richard Biener  wrote:
> > > > >
> > > > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > > > >
> > > > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > For the test mentioned in PR, I was trying to see if we could do
> > > > > > > > specialized expansion for vcond in target when operands are -1 
> > > > > > > > and 0.
> > > > > > > > arm_expand_vcond gets the following operands:
> > > > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > > > (reg:V8QI 117)
> > > > > > > > (reg:V8QI 118)
> > > > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > > > (reg/v:V8QI 116 [ b ])
> > > > > > > >
> > > > > > > > where r117 and r118 are set to vector constants -1 and 0 
> > > > > > > > respectively.
> > > > > > > > However, I am not sure if there's a way to check if the 
> > > > > > > > register is
> > > > > > > > constant during expansion time (since we don't have df analysis 
> > > > > > > > yet) ?
> > > > > > > >
> > > > > > > > Alternatively, should we add a target hook that returns true if 
> > > > > > > > the
> > > > > > > > result of vector comparison is set to all-ones or all-zeros, 
> > > > > > > > and then
> > > > > > > > use this hook in gimple ISEL to effectively turn VEC_COND_EXPR 
> > > > > > > > into nop ?
> > > > > > >
> > > > > > > Would everything match-up for a .VEC_CMP IFN producing a non-mask
> > > > > > > vector type?  ISEL could special case the a ? -1 : 0 case this 
> > > > > > > way.
> > > > > > I think the vec_cmp pattern matches but it produces a masked vector 
> > > > > > type.
> > > > > > In the attached patch, I simply replaced:
> > > > > > _1 = a < b
> > > > > > x = _1 ? -1 : 0
> > > > > > with
> > > > > > x = view_convert_expr<_1>
> > > > > >
> > > > > > For the test-case, isel generates:
> > > > > >   vector(8) <signed-boolean:8> _1;
> > > > > >   vector(8) signed char _2;
> > > > > >   uint8x8_t _5;
> > > > > >
> > > > > >   <bb 2> [local count: 1073741824]:
> > > > > >   _1 = a_3(D) < b_4(D);
> > > > > >   _2 = VIEW_CONVERT_EXPR<vector(8) signed char>(_1);
> > > > > >   _5 = VIEW_CONVERT_EXPR<uint8x8_t>(_2);
> > > > > >   return _5;
> > > > > >
> > > > > > and results in desired code-gen:
> > > > > > f1:
> > > > > > vcgt.s8 d0, d1, d0
> > > > > > bx  lr
> > > > > >
> > > > > > Altho I guess, we should remove the redundant conversions during 
> > > > > > isel itself ?
> > > > > > and result in:
> > > > > > _1 = a_3(D) < b_4(D)
> > > > > > _5 = VIEW_CONVERT_EXPR<uint8x8_t>(_1)
> > > > > >
> > > > > > (Patch is lightly tested with only vect.exp)
> > > > >
> > > > > +  /* For targets where result of comparison is all-ones or all-zeros,
> > > > > + a < b ? -1 : 0 can be reduced to a < b.  */
> > > > > +
> > > > > +  if (integer_minus_onep (op1) && integer_zerop (op2))
> > > > > +{
> > > > >
> > > > > So this really belongs here:
> > > > >
> > > > >   tree op0_type = TREE_TYPE (op0);
> > > > >   tree op0a_type = TREE_TYPE (op0a);
> > > > >
> > > > > <---
> > > > >
> > > > >   if (used_vec_cond_exprs >= 2
> > > > >   && (get_vcond_mask_icode (mode, TYPE_MODE (op0_type))
> > > > >   != CODE_FOR_nothing)
> > > > >   && expand_vec_cmp_expr_p (op0a_type, op0_type, tcode))
> > > > > {
> > > > >
> > > > >
> > > > > +  gassign *def_stmt = dyn_cast (SSA_NAME_DEF_STMT 
> > > > > (op0));
> > > > > +  tree op0a = gimple_assign_rhs1 (def_stmt);
> > > > > +  tree op0_type = TREE_TYPE (op0);
> > > > > +  tree op0a_type = TREE_TYPE (op0a);
> > > > > +  enum tree_code tcode = gimple_assign_rhs_code (def_stmt);
> > > > > +
> > > > > +  if (expand_vec_cmp_expr_p (op0a_type, op0_type, tcode))
> > > > > +   {
> > > > > + tree conv_op = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), 
> > > > > op0);
> > > > > + gassign *new_stmt = gimple_build_assign (lhs, conv_op);
> > > > > + gsi_replace (gsi, new_stmt, true);
> > > > >
> > > > > and you need to verify that the mode of the lhs and the mode of op0
> > > > > agree and that the target can actually expand_vec_cmp_expr_p
> > > > Thanks for the suggestions, does the attached patch look OK ?
> > > > Sorry, I am not sure how to check if target can actually expand vec_cmp 
> > > > ?
> > > > I assume that since expand_vec_cmp_expr_p queries optab and if it gets
> > > > a valid cmp icode, that
> > > > should be sufficient ?
> > >
> > > Yes
> > Hi Richard,
> > I tested the patch, and it shows one regression for pr78102.c, because
> > of extra pcmpeqq in code-gen for x != y on x86.
> > For the test-case:
> > __v2di

Re: Help with PR97872

2020-12-06 Thread Prathamesh Kulkarni via Gcc
On Fri, 4 Dec 2020 at 17:18, Richard Biener  wrote:
>
> On Fri, 4 Dec 2020, Prathamesh Kulkarni wrote:
>
> > On Thu, 3 Dec 2020 at 16:35, Richard Biener  wrote:
> > >
> > > On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
> > >
> > > > On Tue, 1 Dec 2020 at 16:39, Richard Biener  wrote:
> > > > >
> > > > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > > > >
> > > > > > Hi,
> > > > > > For the test mentioned in PR, I was trying to see if we could do
> > > > > > specialized expansion for vcond in target when operands are -1 and 
> > > > > > 0.
> > > > > > arm_expand_vcond gets the following operands:
> > > > > > (reg:V8QI 113 [ _2 ])
> > > > > > (reg:V8QI 117)
> > > > > > (reg:V8QI 118)
> > > > > > (lt (reg/v:V8QI 115 [ a ])
> > > > > > (reg/v:V8QI 116 [ b ]))
> > > > > > (reg/v:V8QI 115 [ a ])
> > > > > > (reg/v:V8QI 116 [ b ])
> > > > > >
> > > > > > where r117 and r118 are set to vector constants -1 and 0 
> > > > > > respectively.
> > > > > > However, I am not sure if there's a way to check if the register is
> > > > > > constant during expansion time (since we don't have df analysis 
> > > > > > yet) ?
> > > > > >
> > > > > > Alternatively, should we add a target hook that returns true if the
> > > > > > result of vector comparison is set to all-ones or all-zeros, and 
> > > > > > then
> > > > > > use this hook in gimple ISEL to effectively turn VEC_COND_EXPR into 
> > > > > > nop ?
> > > > >
> > > > > Would everything match-up for a .VEC_CMP IFN producing a non-mask
> > > > > vector type?  ISEL could special case the a ? -1 : 0 case this way.
> > > > I think the vec_cmp pattern matches but it produces a masked vector 
> > > > type.
> > > > In the attached patch, I simply replaced:
> > > > _1 = a < b
> > > > x = _1 ? -1 : 0
> > > > with
> > > > x = view_convert_expr<_1>
> > > >
> > > > For the test-case, isel generates:
> > > >   vector(8) <signed-boolean:8> _1;
> > > >   vector(8) signed char _2;
> > > >   uint8x8_t _5;
> > > >
> > > >   <bb 2> [local count: 1073741824]:
> > > >   _1 = a_3(D) < b_4(D);
> > > >   _2 = VIEW_CONVERT_EXPR<vector(8) signed char>(_1);
> > > >   _5 = VIEW_CONVERT_EXPR<uint8x8_t>(_2);
> > > >   return _5;
> > > >
> > > > and results in desired code-gen:
> > > > f1:
> > > > vcgt.s8 d0, d1, d0
> > > > bx  lr
> > > >
> > > > Altho I guess, we should remove the redundant conversions during isel 
> > > > itself ?
> > > > and result in:
> > > > _1 = a_3(D) < b_4(D)
> > > > _5 = VIEW_CONVERT_EXPR<uint8x8_t>(_1)
> > > >
> > > > (Patch is lightly tested with only vect.exp)
> > >
> > > +  /* For targets where result of comparison is all-ones or all-zeros,
> > > + a < b ? -1 : 0 can be reduced to a < b.  */
> > > +
> > > +  if (integer_minus_onep (op1) && integer_zerop (op2))
> > > +{
> > >
> > > So this really belongs here:
> > >
> > >   tree op0_type = TREE_TYPE (op0);
> > >   tree op0a_type = TREE_TYPE (op0a);
> > >
> > > <---
> > >
> > >   if (used_vec_cond_exprs >= 2
> > >   && (get_vcond_mask_icode (mode, TYPE_MODE (op0_type))
> > >   != CODE_FOR_nothing)
> > >   && expand_vec_cmp_expr_p (op0a_type, op0_type, tcode))
> > > {
> > >
> > >
> > > +  gassign *def_stmt = dyn_cast (SSA_NAME_DEF_STMT (op0));
> > > +  tree op0a = gimple_assign_rhs1 (def_stmt);
> > > +  tree op0_type = TREE_TYPE (op0);
> > > +  tree op0a_type = TREE_TYPE (op0a);
> > > +  enum tree_code tcode = gimple_assign_rhs_code (def_stmt);
> > > +
> > > +  if (expand_vec_cmp_expr_p (op0a_type, op0_type, tcode))
> > > +   {
> > > + tree conv_op = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), op0);
> > > + gassign *new_stmt = gimple_build_assign (lhs, conv_op);
> > > + gsi_replace (gsi, new_stmt, true);
> > >
> > > and you need to verify that the mode of the lhs and the mode of op0
> > > agree and that the target can actually expand_vec_cmp_expr_p
> > Thanks for the suggestions, does the attached patch look OK ?
> > Sorry, I am not sure how to check if target can actually expand vec_cmp ?
> > I assume that since expand_vec_cmp_expr_p queries optab and if it gets
> > a valid cmp icode, that
> > should be sufficient ?
>
> Yes
Hi Richard,
I tested the patch, and it shows one regression for pr78102.c, because
of extra pcmpeqq in code-gen for x != y on x86.
For the test-case:
__v2di
baz (const __v2di x, const __v2di y)
{
  return x != y;
}

Before patch:
baz:
pcmpeqq %xmm1, %xmm0
pcmpeqd %xmm1, %xmm1
pandn   %xmm1, %xmm0
ret

After patch:
Before ISEL:
  vector(2) <signed-boolean:64> _1;
  __v2di _4;

  <bb 2> [local count: 1073741824]:
  _1 = x_2(D) != y_3(D);
  _4 = VEC_COND_EXPR <_1, { -1, -1 }, { 0, 0 }>;
  return _4;

After ISEL:
  vector(2) <signed-boolean:64> _1;
  __v2di _4;

  <bb 2> [local count: 1073741824]:
  _1 = x_2(D) != y_3(D);
  _4 = VIEW_CONVERT_EXPR<__v2di>(_1);
  return _4;

which results in:
pcmpeqq %xmm1, %xmm0
pxor%xmm1, %xmm1
pcmpeqq %xmm1, %xmm0
ret

IIUC, the new code-gen is essentially compari

Re: Help with PR97872

2020-12-04 Thread Prathamesh Kulkarni via Gcc
On Thu, 3 Dec 2020 at 16:35, Richard Biener  wrote:
>
> On Thu, 3 Dec 2020, Prathamesh Kulkarni wrote:
>
> > On Tue, 1 Dec 2020 at 16:39, Richard Biener  wrote:
> > >
> > > On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
> > >
> > > > Hi,
> > > > For the test mentioned in PR, I was trying to see if we could do
> > > > specialized expansion for vcond in target when operands are -1 and 0.
> > > > arm_expand_vcond gets the following operands:
> > > > (reg:V8QI 113 [ _2 ])
> > > > (reg:V8QI 117)
> > > > (reg:V8QI 118)
> > > > (lt (reg/v:V8QI 115 [ a ])
> > > > (reg/v:V8QI 116 [ b ]))
> > > > (reg/v:V8QI 115 [ a ])
> > > > (reg/v:V8QI 116 [ b ])
> > > >
> > > > where r117 and r118 are set to vector constants -1 and 0 respectively.
> > > > However, I am not sure if there's a way to check if the register is
> > > > constant during expansion time (since we don't have df analysis yet) ?
> > > >
> > > > Alternatively, should we add a target hook that returns true if the
> > > > result of vector comparison is set to all-ones or all-zeros, and then
> > > > use this hook in gimple ISEL to effectively turn VEC_COND_EXPR into nop 
> > > > ?
> > >
> > > Would everything match-up for a .VEC_CMP IFN producing a non-mask
> > > vector type?  ISEL could special case the a ? -1 : 0 case this way.
> > I think the vec_cmp pattern matches but it produces a masked vector type.
> > In the attached patch, I simply replaced:
> > _1 = a < b
> > x = _1 ? -1 : 0
> > with
> > x = view_convert_expr<_1>
> >
> > For the test-case, isel generates:
> >   vector(8) <signed-boolean:8> _1;
> >   vector(8) signed char _2;
> >   uint8x8_t _5;
> >
> >   <bb 2> [local count: 1073741824]:
> >   _1 = a_3(D) < b_4(D);
> >   _2 = VIEW_CONVERT_EXPR<vector(8) signed char>(_1);
> >   _5 = VIEW_CONVERT_EXPR<uint8x8_t>(_2);
> >   return _5;
> >
> > and results in desired code-gen:
> > f1:
> > vcgt.s8 d0, d1, d0
> > bx  lr
> >
> > Altho I guess, we should remove the redundant conversions during isel 
> > itself ?
> > and result in:
> > _1 = a_3(D) < b_4(D)
> > _5 = VIEW_CONVERT_EXPR<uint8x8_t>(_1)
> >
> > (Patch is lightly tested with only vect.exp)
>
> +  /* For targets where result of comparison is all-ones or all-zeros,
> + a < b ? -1 : 0 can be reduced to a < b.  */
> +
> +  if (integer_minus_onep (op1) && integer_zerop (op2))
> +{
>
> So this really belongs here:
>
>   tree op0_type = TREE_TYPE (op0);
>   tree op0a_type = TREE_TYPE (op0a);
>
> <---
>
>   if (used_vec_cond_exprs >= 2
>   && (get_vcond_mask_icode (mode, TYPE_MODE (op0_type))
>   != CODE_FOR_nothing)
>   && expand_vec_cmp_expr_p (op0a_type, op0_type, tcode))
> {
>
>
> +  gassign *def_stmt = dyn_cast (SSA_NAME_DEF_STMT (op0));
> +  tree op0a = gimple_assign_rhs1 (def_stmt);
> +  tree op0_type = TREE_TYPE (op0);
> +  tree op0a_type = TREE_TYPE (op0a);
> +  enum tree_code tcode = gimple_assign_rhs_code (def_stmt);
> +
> +  if (expand_vec_cmp_expr_p (op0a_type, op0_type, tcode))
> +   {
> + tree conv_op = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (lhs), op0);
> + gassign *new_stmt = gimple_build_assign (lhs, conv_op);
> + gsi_replace (gsi, new_stmt, true);
>
> and you need to verify that the mode of the lhs and the mode of op0
> agree and that the target can actually expand_vec_cmp_expr_p
Thanks for the suggestions, does the attached patch look OK ?
Sorry, I am not sure how to check if target can actually expand vec_cmp ?
I assume that since expand_vec_cmp_expr_p queries optab and if it gets
a valid cmp icode, that
should be sufficient ?

Thanks,
Prathamesh
>
> Richard.
>
>
>
> > Thanks,
> > Prathamesh
> > >
> > > > Thanks,
> > > > Prathamesh
> > > >
> > >
> > > --
> > > Richard Biener 
> > > SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> > > Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
> >
>
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


pr97282-2.diff
Description: Binary data


Re: Help with PR97872

2020-12-03 Thread Prathamesh Kulkarni via Gcc
On Tue, 1 Dec 2020 at 16:39, Richard Biener  wrote:
>
> On Tue, 1 Dec 2020, Prathamesh Kulkarni wrote:
>
> > Hi,
> > For the test mentioned in PR, I was trying to see if we could do
> > specialized expansion for vcond in target when operands are -1 and 0.
> > arm_expand_vcond gets the following operands:
> > (reg:V8QI 113 [ _2 ])
> > (reg:V8QI 117)
> > (reg:V8QI 118)
> > (lt (reg/v:V8QI 115 [ a ])
> > (reg/v:V8QI 116 [ b ]))
> > (reg/v:V8QI 115 [ a ])
> > (reg/v:V8QI 116 [ b ])
> >
> > where r117 and r118 are set to vector constants -1 and 0 respectively.
> > However, I am not sure if there's a way to check if the register is
> > constant during expansion time (since we don't have df analysis yet) ?
> >
> > Alternatively, should we add a target hook that returns true if the
> > result of vector comparison is set to all-ones or all-zeros, and then
> > use this hook in gimple ISEL to effectively turn VEC_COND_EXPR into nop ?
>
> Would everything match-up for a .VEC_CMP IFN producing a non-mask
> vector type?  ISEL could special case the a ? -1 : 0 case this way.
I think the vec_cmp pattern matches but it produces a masked vector type.
In the attached patch, I simply replaced:
_1 = a < b
x = _1 ? -1 : 0
with
x = view_convert_expr<_1>

For the test-case, isel generates:
  vector(8) <signed-boolean:8> _1;
  vector(8) signed char _2;
  uint8x8_t _5;

  <bb 2> [local count: 1073741824]:
  _1 = a_3(D) < b_4(D);
  _2 = VIEW_CONVERT_EXPR<vector(8) signed char>(_1);
  _5 = VIEW_CONVERT_EXPR<uint8x8_t>(_2);
  return _5;

and results in desired code-gen:
f1:
vcgt.s8 d0, d1, d0
bx  lr

Altho I guess, we should remove the redundant conversions during isel itself ?
and result in:
_1 = a_3(D) < b_4(D)
_5 = VIEW_CONVERT_EXPR<uint8x8_t>(_1)

(Patch is lightly tested with only vect.exp)

Thanks,
Prathamesh
>
> > Thanks,
> > Prathamesh
> >
>
> --
> Richard Biener 
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


pr97282-1.diff
Description: Binary data


Help with PR97872

2020-12-01 Thread Prathamesh Kulkarni via Gcc
Hi,
For the test mentioned in PR, I was trying to see if we could do
specialized expansion for vcond in target when operands are -1 and 0.
arm_expand_vcond gets the following operands:
(reg:V8QI 113 [ _2 ])
(reg:V8QI 117)
(reg:V8QI 118)
(lt (reg/v:V8QI 115 [ a ])
(reg/v:V8QI 116 [ b ]))
(reg/v:V8QI 115 [ a ])
(reg/v:V8QI 116 [ b ])

where r117 and r118 are set to vector constants -1 and 0 respectively.
However, I am not sure if there's a way to check if the register is
constant during expansion time (since we don't have df analysis yet) ?

Alternatively, should we add a target hook that returns true if the
result of vector comparison is set to all-ones or all-zeros, and then
use this hook in gimple ISEL to effectively turn VEC_COND_EXPR into nop ?

Thanks,
Prathamesh


Re: LTO slows down calculix by more than 10% on aarch64

2020-10-27 Thread Prathamesh Kulkarni via Gcc
On Wed, 21 Oct 2020 at 16:10, Richard Biener  wrote:
>
> On Wed, Oct 21, 2020 at 12:04 PM Prathamesh Kulkarni
>  wrote:
> >
> > On Thu, 24 Sep 2020 at 16:44, Richard Biener  
> > wrote:
> > >
> > > On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Wed, 23 Sep 2020 at 16:40, Richard Biener 
> > > >  wrote:
> > > > >
> > > > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov 
> > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I obtained perf stat results for following 
> > > > > > > > > > > > > > > > benchmark runs:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 7856832.692380  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > > > > > > > >           3758  context-switches   #  0.000 K/sec
> > > > > > > > > > > > > > > >             40  cpu-migrations     #  0.000 K/sec
> > > > > > > > > > > > > > > >          40847  page-faults        #  0.005 K/sec
> > > > > > > > > > > > > > > >  7856782413676  cycles             #  1.000 GHz
> > > > > > > > > > > > > > > >  6034510093417  instructions       #  0.77  insn per cycle
> > > > > > > > > > > > > > > >   363937274287  branches           #  46.321 M/sec
> > > > > > > > > > > > > > > >    48557110132  branch-misses      #  13.34% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a 
> > > > > > > > > > > > > > > profile over a minute should be
> > > > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 8319643.114380  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > > > > > > > >           4285  context-switches   #  0.001 K/sec
> > > > > > > > > > > > > > > >             28  cpu-migrations     #  0.000 K/sec
> > > > > > > > > > > > > > > >          40843  page-faults        #  0.005 K/sec
> > > > > > > > > > > > > > > >  8319591038295  cycles             #  1.000 GHz
> > > > > > > > > > > > > > > >  6276338800377  instructions       #  0.75  insn per cycle
> > > > > > > > > > > > > > > >   467400726106  branches           #  56.180 M/sec
> > > > > > > > > > > > > > > >    45986364011  branch-misses      #  9.84% of all branches
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and 
> > > > > > > > > > > > > > > +480e9 cycles, probably implying
> > > > > > > > > > > > > > > that extra instructions are appearing in this 
> > > > > > > > > > > > > > > loop nest, but not in the innermost
> > > > > > > > > > > > > > > loop. As a reminder for others, the innermost 
> > > > > > > > > > > > > > > loop has only 3 iterations.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > 

Re: LTO slows down calculix by more than 10% on aarch64

2020-10-21 Thread Prathamesh Kulkarni via Gcc
On Thu, 24 Sep 2020 at 16:44, Richard Biener  wrote:
>
> On Thu, Sep 24, 2020 at 12:36 PM Prathamesh Kulkarni
>  wrote:
> >
> > On Wed, 23 Sep 2020 at 16:40, Richard Biener  
> > wrote:
> > >
> > > On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Wed, 23 Sep 2020 at 13:22, Richard Biener 
> > > >  wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I obtained perf stat results for following 
> > > > > > > > > > > > > > benchmark runs:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 7856832.692380  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > > > > > >           3758  context-switches   #  0.000 K/sec
> > > > > > > > > > > > > >             40  cpu-migrations     #  0.000 K/sec
> > > > > > > > > > > > > >          40847  page-faults        #  0.005 K/sec
> > > > > > > > > > > > > >  7856782413676  cycles             #  1.000 GHz
> > > > > > > > > > > > > >  6034510093417  instructions       #  0.77  insn per cycle
> > > > > > > > > > > > > >   363937274287  branches           #  46.321 M/sec
> > > > > > > > > > > > > >    48557110132  branch-misses      #  13.34% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a 
> > > > > > > > > > > > > profile over a minute should be
> > > > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 8319643.114380  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > > > > > >           4285  context-switches   #  0.001 K/sec
> > > > > > > > > > > > > >             28  cpu-migrations     #  0.000 K/sec
> > > > > > > > > > > > > >          40843  page-faults        #  0.005 K/sec
> > > > > > > > > > > > > >  8319591038295  cycles             #  1.000 GHz
> > > > > > > > > > > > > >  6276338800377  instructions       #  0.75  insn per cycle
> > > > > > > > > > > > > >   467400726106  branches           #  56.180 M/sec
> > > > > > > > > > > > > >    45986364011  branch-misses      #  9.84% of all branches
> > > > > > > > > > > > >
> > > > > > > > > > > > > So +100e9 branches, but +240e9 instructions and 
> > > > > > > > > > > > > +480e9 cycles, probably implying
> > > > > > > > > > > > > that extra instructions are appearing in this loop 
> > > > > > > > > > > > > nest, but not in the innermost
> > > > > > > > > > > > > loop. As a reminder for others, the innermost loop 
> > > > > > > > > > > > > has only 3 iterations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 8207331.088040  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > > > > > >           2266  context-switches   #  0.000 K/sec
> > > > > > > > > > > > > >             32  cpu-migrations     #  0.000 K/sec
> > > > > > > > > > > > > >          40846  page-faults
> > > > > > > > > > > > > 

Re: LTO slows down calculix by more than 10% on aarch64

2020-09-24 Thread Prathamesh Kulkarni via Gcc
On Wed, 23 Sep 2020 at 16:40, Richard Biener  wrote:
>
> On Wed, Sep 23, 2020 at 12:11 PM Prathamesh Kulkarni
>  wrote:
> >
> > On Wed, 23 Sep 2020 at 13:22, Richard Biener  
> > wrote:
> > >
> > > On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Tue, 22 Sep 2020 at 16:36, Richard Biener 
> > > >  wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov 
> > > > > > > > > >  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I obtained perf stat results for following benchmark 
> > > > > > > > > > > > runs:
> > > > > > > > > > > >
> > > > > > > > > > > > -O2:
> > > > > > > > > > > >
> > > > > > > > > > > > 7856832.692380  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > > > >           3758  context-switches   #  0.000 K/sec
> > > > > > > > > > > >             40  cpu-migrations     #  0.000 K/sec
> > > > > > > > > > > >          40847  page-faults        #  0.005 K/sec
> > > > > > > > > > > >  7856782413676  cycles             #  1.000 GHz
> > > > > > > > > > > >  6034510093417  instructions       #  0.77  insn per cycle
> > > > > > > > > > > >   363937274287  branches           #  46.321 M/sec
> > > > > > > > > > > >    48557110132  branch-misses      #  13.34% of all branches
> > > > > > > > > > >
> > > > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile 
> > > > > > > > > > > over a minute should be
> > > > > > > > > > > enough for this kind of code)
> > > > > > > > > > >
> > > > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > > > >
> > > > > > > > > > > > 8319643.114380  task-clock (msec)  #  1.000 CPUs utilized
> > > > > > > > > > > >           4285  context-switches   #  0.001 K/sec
> > > > > > > > > > > >             28  cpu-migrations     #  0.000 K/sec
> > > > > > > > > > > >          40843  page-faults        #  0.005 K/sec
> > > > > > > > > > > >  8319591038295  cycles             #  1.000 GHz
> > > > > > > > > > > >  6276338800377  instructions       #  0.75  insn per cycle
> > > > > > > > > > > >   467400726106  branches           #  56.180 M/sec
> > > > > > > > > > > >    45986364011  branch-misses      #  9.84% of all branches
> > > > > > > > > > >
> > > > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 
> > > > > > > > > > > cycles, probably implying
> > > > > > > > > > > that extra instructions are appearing in this loop nest, 
> > > > > > > > > > > but not in the innermost
> > > > > > > > > > > loop. As a reminder for others, the innermost loop has 
> > > > > > > > > > > only 3 iterations.
> > > > > > > > > > >
> > > > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes 
> > > > > > > > > > > > the extra branches):
> > > > > > > > > > > >
> > > > > > > > > > > >8207331.088040  task-clock (msec)   #1.000 
> > > > > > > > > > > > CPUs utilized
> > > > > > > > > > > >   2266   context-switches#  
> > > > > > > > > > > >   0.000 K/sec
> > > > > > > > > > > > 32 cpu-migrations   
> > > > > > > > > > > > #0.000 K/sec
> > > > > > > > > > > >  40846  page-faults 
> > > > > > > > > > > > #0.005 K/sec
> > > > > > > > > > > >  8207292032467  cycles #   
> > > > > > > > > > > > 1.000 GHz
> > > > > > > > > > > >  6035724436440  instructions #
> > > > > > > > > > > > 0.74  insn per cycle
> > > > > > > > > > > >   364415440156   branches #   
> > > > > > > > > > > > 44.401 M/sec
> > > > > > > > > > > >53138327276branch-misses #   
> > > > > > > > > > > > 14.58% of all branches
> > > > 

Re: LTO slows down calculix by more than 10% on aarch64

2020-09-23 Thread Prathamesh Kulkarni via Gcc
On Wed, 23 Sep 2020 at 13:22, Richard Biener  wrote:
>
> On Tue, Sep 22, 2020 at 6:25 PM Prathamesh Kulkarni
>  wrote:
> >
> > On Tue, 22 Sep 2020 at 16:36, Richard Biener  
> > wrote:
> > >
> > > On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Tue, 22 Sep 2020 at 12:56, Richard Biener 
> > > >  wrote:
> > > > >
> > > > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > > > >  wrote:
> > > > > >
> > > > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > > > >
> > > > > > > > > > -O2:
> > > > > > > > > >
> > > > > > > > > > 7856832.692380  task-clock (msec) #
> > > > > > > > > > 1.000 CPUs utilized
> > > > > > > > > >   3758   context-switches  
> > > > > > > > > > #0.000 K/sec
> > > > > > > > > > 40 cpu-migrations   
> > > > > > > > > >   #0.000 K/sec
> > > > > > > > > >  40847  page-faults 
> > > > > > > > > >   #0.005 K/sec
> > > > > > > > > >  7856782413676  cycles   #  
> > > > > > > > > >   1.000 GHz
> > > > > > > > > >  6034510093417  instructions   #
> > > > > > > > > > 0.77  insn per cycle
> > > > > > > > > >   363937274287   branches   #   
> > > > > > > > > > 46.321 M/sec
> > > > > > > > > >48557110132   branch-misses#   
> > > > > > > > > > 13.34% of all branches
> > > > > > > > >
> > > > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a 
> > > > > > > > > minute should be
> > > > > > > > > enough for this kind of code)
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined:
> > > > > > > > > >
> > > > > > > > > > 8319643.114380  task-clock (msec)   #1.000 
> > > > > > > > > > CPUs utilized
> > > > > > > > > >   4285   context-switches # 
> > > > > > > > > >0.001 K/sec
> > > > > > > > > > 28 cpu-migrations   
> > > > > > > > > >  #0.000 K/sec
> > > > > > > > > >  40843  page-faults 
> > > > > > > > > >  #0.005 K/sec
> > > > > > > > > >  8319591038295  cycles  #   
> > > > > > > > > >  1.000 GHz
> > > > > > > > > >  6276338800377  instructions  #
> > > > > > > > > > 0.75  insn per cycle
> > > > > > > > > >   467400726106   branches  #   
> > > > > > > > > > 56.180 M/sec
> > > > > > > > > >45986364011branch-misses  #
> > > > > > > > > > 9.84% of all branches
> > > > > > > > >
> > > > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 
> > > > > > > > > cycles, probably implying
> > > > > > > > > that extra instructions are appearing in this loop nest, but 
> > > > > > > > > not in the innermost
> > > > > > > > > loop. As a reminder for others, the innermost loop has only 3 
> > > > > > > > > iterations.
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the 
> > > > > > > > > > extra branches):
> > > > > > > > > >
> > > > > > > > > >8207331.088040  task-clock (msec)   #1.000 CPUs 
> > > > > > > > > > utilized
> > > > > > > > > >   2266   context-switches#
> > > > > > > > > > 0.000 K/sec
> > > > > > > > > > 32 cpu-migrations   #   
> > > > > > > > > >  0.000 K/sec
> > > > > > > > > >  40846  page-faults #   
> > > > > > > > > >  0.005 K/sec
> > > > > > > > > >  8207292032467  cycles #   
> > > > > > > > > > 1.000 GHz
> > > > > > > > > >  6035724436440  instructions #0.74  
> > > > > > > > > > insn per cycle
> > > > > > > > > >   364415440156   branches #   
> > > > > > > > > > 44.401 M/sec
> > > > > > > > > >53138327276branch-misses #   14.58% 
> > > > > > > > > > of all branches
> > > > > > > > >
> > > > > > > > > This seems to match baseline in terms of instruction count, 
> > > > > > > > > but without PRE
> > > > > > > > > the loop nest may be carrying some dependencies over memory. 
> > > > > > > > > I would simply
> > > > > > > > > check the assembly for the entire 6-level loop nest in 
> > > > > > > > > question, I hope it's
> > > > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > > > >
> > > > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > > > >
> > >

Re: LTO slows down calculix by more than 10% on aarch64

2020-09-22 Thread Prathamesh Kulkarni via Gcc
On Tue, 22 Sep 2020 at 16:36, Richard Biener  wrote:
>
> On Tue, Sep 22, 2020 at 11:37 AM Prathamesh Kulkarni
>  wrote:
> >
> > On Tue, 22 Sep 2020 at 12:56, Richard Biener  
> > wrote:
> > >
> > > On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> > > >  wrote:
> > > > >
> > > > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov  
> > > > > > wrote:
> > > > > > >
> > > > > > > > I obtained perf stat results for following benchmark runs:
> > > > > > > >
> > > > > > > > -O2:
> > > > > > > >
> > > > > > > > 7856832.692380  task-clock (msec) #1.000 
> > > > > > > > CPUs utilized
> > > > > > > >   3758   context-switches  #
> > > > > > > > 0.000 K/sec
> > > > > > > > 40 cpu-migrations # 
> > > > > > > >0.000 K/sec
> > > > > > > >  40847  page-faults   # 
> > > > > > > >0.005 K/sec
> > > > > > > >  7856782413676  cycles   #
> > > > > > > > 1.000 GHz
> > > > > > > >  6034510093417  instructions   #
> > > > > > > > 0.77  insn per cycle
> > > > > > > >   363937274287   branches   #   
> > > > > > > > 46.321 M/sec
> > > > > > > >48557110132   branch-misses#   
> > > > > > > > 13.34% of all branches
> > > > > > >
> > > > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a 
> > > > > > > minute should be
> > > > > > > enough for this kind of code)
> > > > > > >
> > > > > > > > -O2 with orthonl inlined:
> > > > > > > >
> > > > > > > > 8319643.114380  task-clock (msec)   #1.000 CPUs 
> > > > > > > > utilized
> > > > > > > >   4285   context-switches #
> > > > > > > > 0.001 K/sec
> > > > > > > > 28 cpu-migrations#  
> > > > > > > >   0.000 K/sec
> > > > > > > >  40843  page-faults  #  
> > > > > > > >   0.005 K/sec
> > > > > > > >  8319591038295  cycles  #
> > > > > > > > 1.000 GHz
> > > > > > > >  6276338800377  instructions  #0.75 
> > > > > > > >  insn per cycle
> > > > > > > >   467400726106   branches  #   
> > > > > > > > 56.180 M/sec
> > > > > > > >45986364011branch-misses  #9.84% 
> > > > > > > > of all branches
> > > > > > >
> > > > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, 
> > > > > > > probably implying
> > > > > > > that extra instructions are appearing in this loop nest, but not 
> > > > > > > in the innermost
> > > > > > > loop. As a reminder for others, the innermost loop has only 3 
> > > > > > > iterations.
> > > > > > >
> > > > > > > > -O2 with orthonl inlined and PRE disabled (this removes the 
> > > > > > > > extra branches):
> > > > > > > >
> > > > > > > >8207331.088040  task-clock (msec)   #1.000 CPUs 
> > > > > > > > utilized
> > > > > > > >   2266   context-switches#0.000 
> > > > > > > > K/sec
> > > > > > > > 32 cpu-migrations   #
> > > > > > > > 0.000 K/sec
> > > > > > > >  40846  page-faults #
> > > > > > > > 0.005 K/sec
> > > > > > > >  8207292032467  cycles #   1.000 GHz
> > > > > > > >  6035724436440  instructions #0.74  
> > > > > > > > insn per cycle
> > > > > > > >   364415440156   branches #   44.401 
> > > > > > > > M/sec
> > > > > > > >53138327276branch-misses #   14.58% of 
> > > > > > > > all branches
> > > > > > >
> > > > > > > This seems to match baseline in terms of instruction count, but 
> > > > > > > without PRE
> > > > > > > the loop nest may be carrying some dependencies over memory. I 
> > > > > > > would simply
> > > > > > > check the assembly for the entire 6-level loop nest in question, 
> > > > > > > I hope it's
> > > > > > > not very complicated (though Fortran array addressing...).
> > > > > > >
> > > > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > > > >
> > > > > > > >7797265.206850  task-clock (msec) #1.000 
> > > > > > > > CPUs utilized
> > > > > > > >   3139  context-switches  #
> > > > > > > > 0.000 K/sec
> > > > > > > > 20cpu-migrations #  
> > > > > > > >   0.000 K/sec
> > > > > > > >  40846  page-faults  #  
> > > > > > > >   0.005 K/sec
> > > > > > > >  7797221351467  cycles  #
> > > > > > > > 1.000 GHz
> > > > > > >

Re: LTO slows down calculix by more than 10% on aarch64

2020-09-22 Thread Prathamesh Kulkarni via Gcc
On Tue, 22 Sep 2020 at 12:56, Richard Biener  wrote:
>
> On Tue, Sep 22, 2020 at 7:08 AM Prathamesh Kulkarni
>  wrote:
> >
> > On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
> >  wrote:
> > >
> > > On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
> > >  wrote:
> > > >
> > > > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov  
> > > > wrote:
> > > > >
> > > > > > I obtained perf stat results for following benchmark runs:
> > > > > >
> > > > > > -O2:
> > > > > >
> > > > > > 7856832.692380  task-clock (msec) #1.000 CPUs 
> > > > > > utilized
> > > > > >   3758   context-switches  #
> > > > > > 0.000 K/sec
> > > > > > 40 cpu-migrations #
> > > > > > 0.000 K/sec
> > > > > >  40847  page-faults   #
> > > > > > 0.005 K/sec
> > > > > >  7856782413676  cycles   #1.000 
> > > > > > GHz
> > > > > >  6034510093417  instructions   #0.77  
> > > > > > insn per cycle
> > > > > >   363937274287   branches   #   46.321 
> > > > > > M/sec
> > > > > >48557110132   branch-misses#   13.34% of 
> > > > > > all branches
> > > > >
> > > > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute 
> > > > > should be
> > > > > enough for this kind of code)
> > > > >
> > > > > > -O2 with orthonl inlined:
> > > > > >
> > > > > > 8319643.114380  task-clock (msec)   #1.000 CPUs 
> > > > > > utilized
> > > > > >   4285   context-switches #
> > > > > > 0.001 K/sec
> > > > > > 28 cpu-migrations#
> > > > > > 0.000 K/sec
> > > > > >  40843  page-faults  #
> > > > > > 0.005 K/sec
> > > > > >  8319591038295  cycles  #1.000 
> > > > > > GHz
> > > > > >  6276338800377  instructions  #0.75  
> > > > > > insn per cycle
> > > > > >   467400726106   branches  #   56.180 
> > > > > > M/sec
> > > > > >45986364011branch-misses  #9.84% of 
> > > > > > all branches
> > > > >
> > > > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, 
> > > > > probably implying
> > > > > that extra instructions are appearing in this loop nest, but not in 
> > > > > the innermost
> > > > > loop. As a reminder for others, the innermost loop has only 3 
> > > > > iterations.
> > > > >
> > > > > > -O2 with orthonl inlined and PRE disabled (this removes the extra 
> > > > > > branches):
> > > > > >
> > > > > >8207331.088040  task-clock (msec)   #1.000 CPUs utilized
> > > > > >   2266   context-switches#0.000 
> > > > > > K/sec
> > > > > > 32 cpu-migrations   #0.000 
> > > > > > K/sec
> > > > > >  40846  page-faults #0.005 
> > > > > > K/sec
> > > > > >  8207292032467  cycles #   1.000 GHz
> > > > > >  6035724436440  instructions #0.74  insn 
> > > > > > per cycle
> > > > > >   364415440156   branches #   44.401 M/sec
> > > > > >53138327276branch-misses #   14.58% of all 
> > > > > > branches
> > > > >
> > > > > This seems to match baseline in terms of instruction count, but 
> > > > > without PRE
> > > > > the loop nest may be carrying some dependencies over memory. I would 
> > > > > simply
> > > > > check the assembly for the entire 6-level loop nest in question, I 
> > > > > hope it's
> > > > > not very complicated (though Fortran array addressing...).
> > > > >
> > > > > > -O2 with orthonl inlined and hoisting disabled:
> > > > > >
> > > > > >7797265.206850  task-clock (msec) #1.000 CPUs 
> > > > > > utilized
> > > > > >   3139  context-switches  #
> > > > > > 0.000 K/sec
> > > > > > 20cpu-migrations #
> > > > > > 0.000 K/sec
> > > > > >  40846  page-faults  #
> > > > > > 0.005 K/sec
> > > > > >  7797221351467  cycles  #1.000 
> > > > > > GHz
> > > > > >  6187348757324  instructions  #0.79  
> > > > > > insn per cycle
> > > > > >   461840800061   branches  #   59.231 
> > > > > > M/sec
> > > > > >26920311761branch-misses #5.83% of 
> > > > > > all branches
> > > > >
> > > > > There's a 20e9 reduction in branch misses and a 500e9 reduction in 
> > > > > cycle count.
> > > > > I don't think the former fully covers the latter (there's also a 90e9 
> > > > > reduction
> > > > > in insn count).
> > > > >
> > > > > Given that the inner

Re: LTO slows down calculix by more than 10% on aarch64

2020-09-21 Thread Prathamesh Kulkarni via Gcc
On Mon, 21 Sep 2020 at 18:14, Prathamesh Kulkarni
 wrote:
>
> On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
>  wrote:
> >
> > On Fri, 4 Sep 2020 at 17:08, Alexander Monakov  wrote:
> > >
> > > > I obtained perf stat results for following benchmark runs:
> > > >
> > > > -O2:
> > > >
> > > > 7856832.692380  task-clock (msec) #1.000 CPUs 
> > > > utilized
> > > >   3758   context-switches  #0.000 
> > > > K/sec
> > > > 40 cpu-migrations #
> > > > 0.000 K/sec
> > > >  40847  page-faults   #
> > > > 0.005 K/sec
> > > >  7856782413676  cycles   #1.000 GHz
> > > >  6034510093417  instructions   #0.77  insn 
> > > > per cycle
> > > >   363937274287   branches   #   46.321 M/sec
> > > >48557110132   branch-misses#   13.34% of all 
> > > > branches
> > >
> > > (ouch, 2+ hours per run is a lot, collecting a profile over a minute 
> > > should be
> > > enough for this kind of code)
> > >
> > > > -O2 with orthonl inlined:
> > > >
> > > > 8319643.114380  task-clock (msec)   #1.000 CPUs utilized
> > > >   4285   context-switches #0.001 
> > > > K/sec
> > > > 28 cpu-migrations#0.000 
> > > > K/sec
> > > >  40843  page-faults  #0.005 
> > > > K/sec
> > > >  8319591038295  cycles  #1.000 GHz
> > > >  6276338800377  instructions  #0.75  insn 
> > > > per cycle
> > > >   467400726106   branches  #   56.180 M/sec
> > > >45986364011branch-misses  #9.84% of all 
> > > > branches
> > >
> > > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably 
> > > implying
> > > that extra instructions are appearing in this loop nest, but not in the 
> > > innermost
> > > loop. As a reminder for others, the innermost loop has only 3 iterations.
> > >
> > > > -O2 with orthonl inlined and PRE disabled (this removes the extra 
> > > > branches):
> > > >
> > > >8207331.088040  task-clock (msec)   #1.000 CPUs utilized
> > > >   2266   context-switches#0.000 K/sec
> > > > 32 cpu-migrations   #0.000 K/sec
> > > >  40846  page-faults #0.005 K/sec
> > > >  8207292032467  cycles #   1.000 GHz
> > > >  6035724436440  instructions #0.74  insn per 
> > > > cycle
> > > >   364415440156   branches #   44.401 M/sec
> > > >53138327276branch-misses #   14.58% of all 
> > > > branches
> > >
> > > This seems to match baseline in terms of instruction count, but without 
> > > PRE
> > > the loop nest may be carrying some dependencies over memory. I would 
> > > simply
> > > check the assembly for the entire 6-level loop nest in question, I hope 
> > > it's
> > > not very complicated (though Fortran array addressing...).
> > >
> > > > -O2 with orthonl inlined and hoisting disabled:
> > > >
> > > >7797265.206850  task-clock (msec) #1.000 CPUs 
> > > > utilized
> > > >   3139  context-switches  #0.000 
> > > > K/sec
> > > > 20cpu-migrations #0.000 
> > > > K/sec
> > > >  40846  page-faults  #0.005 
> > > > K/sec
> > > >  7797221351467  cycles  #1.000 GHz
> > > >  6187348757324  instructions  #0.79  insn 
> > > > per cycle
> > > >   461840800061   branches  #   59.231 M/sec
> > > >26920311761branch-misses #5.83% of all 
> > > > branches
> > >
> > > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle 
> > > count.
> > > I don't think the former fully covers the latter (there's also a 90e9 
> > > reduction
> > > in insn count).
> > >
> > > Given that the inner loop iterates only 3 times, my main suggestion is to
> > > consider how the profile for the entire loop nest looks like (it's 6 
> > > loops deep,
> > > each iterating only 3 times).
> > >
> > > > Perf profiles for
> > > > -O2 -fno-code-hoisting and inlined orthonl:
> > > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > > >
> > > >   3196866 |1f04:ldur   d1, [x1, #-248]
> > > > 216348301800│addw0, w0, #0x1
> > > > 985098 |addx2, x2, #0x18
> > > > 216215999206│addx1, x1, #0x48
> > > > 215630376504│fmul   d1, d5, d1
> > > > 863829148015│fmul   

Re: LTO slows down calculix by more than 10% on aarch64

2020-09-21 Thread Prathamesh Kulkarni via Gcc
On Mon, 21 Sep 2020 at 15:19, Prathamesh Kulkarni
 wrote:
>
> On Fri, 4 Sep 2020 at 17:08, Alexander Monakov  wrote:
> >
> > > I obtained perf stat results for following benchmark runs:
> > >
> > > -O2:
> > >
> > > 7856832.692380  task-clock (msec) #1.000 CPUs utilized
> > >   3758   context-switches  #0.000 
> > > K/sec
> > > 40 cpu-migrations #0.000 
> > > K/sec
> > >  40847  page-faults   #0.005 
> > > K/sec
> > >  7856782413676  cycles   #1.000 GHz
> > >  6034510093417  instructions   #0.77  insn 
> > > per cycle
> > >   363937274287   branches   #   46.321 M/sec
> > >48557110132   branch-misses#   13.34% of all 
> > > branches
> >
> > (ouch, 2+ hours per run is a lot, collecting a profile over a minute should 
> > be
> > enough for this kind of code)
> >
> > > -O2 with orthonl inlined:
> > >
> > > 8319643.114380  task-clock (msec)   #1.000 CPUs utilized
> > >   4285   context-switches #0.001 K/sec
> > > 28 cpu-migrations#0.000 
> > > K/sec
> > >  40843  page-faults  #0.005 
> > > K/sec
> > >  8319591038295  cycles  #1.000 GHz
> > >  6276338800377  instructions  #0.75  insn per 
> > > cycle
> > >   467400726106   branches  #   56.180 M/sec
> > >45986364011branch-misses  #9.84% of all 
> > > branches
> >
> > So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably 
> > implying
> > that extra instructions are appearing in this loop nest, but not in the 
> > innermost
> > loop. As a reminder for others, the innermost loop has only 3 iterations.
> >
> > > -O2 with orthonl inlined and PRE disabled (this removes the extra 
> > > branches):
> > >
> > >8207331.088040  task-clock (msec)   #1.000 CPUs utilized
> > >   2266   context-switches#0.000 K/sec
> > > 32 cpu-migrations   #0.000 K/sec
> > >  40846  page-faults #0.005 K/sec
> > >  8207292032467  cycles #   1.000 GHz
> > >  6035724436440  instructions #0.74  insn per cycle
> > >   364415440156   branches #   44.401 M/sec
> > >53138327276branch-misses #   14.58% of all branches
> >
> > This seems to match baseline in terms of instruction count, but without PRE
> > the loop nest may be carrying some dependencies over memory. I would simply
> > check the assembly for the entire 6-level loop nest in question, I hope it's
> > not very complicated (though Fortran array addressing...).
> >
> > > -O2 with orthonl inlined and hoisting disabled:
> > >
> > >7797265.206850  task-clock (msec) #1.000 CPUs utilized
> > >   3139  context-switches  #0.000 K/sec
> > > 20cpu-migrations #0.000 
> > > K/sec
> > >  40846  page-faults  #0.005 
> > > K/sec
> > >  7797221351467  cycles  #1.000 GHz
> > >  6187348757324  instructions  #0.79  insn per 
> > > cycle
> > >   461840800061   branches  #   59.231 M/sec
> > >26920311761branch-misses #5.83% of all 
> > > branches
> >
> > There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle 
> > count.
> > I don't think the former fully covers the latter (there's also a 90e9 
> > reduction
> > in insn count).
> >
> > Given that the inner loop iterates only 3 times, my main suggestion is to
> > consider how the profile for the entire loop nest looks like (it's 6 loops 
> > deep,
> > each iterating only 3 times).
> >
> > > Perf profiles for
> > > -O2 -fno-code-hoisting and inlined orthonl:
> > > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> > >
> > >   3196866 |1f04:ldur   d1, [x1, #-248]
> > > 216348301800│addw0, w0, #0x1
> > > 985098 |addx2, x2, #0x18
> > > 216215999206│addx1, x1, #0x48
> > > 215630376504│fmul   d1, d5, d1
> > > 863829148015│fmul   d1, d1, d6
> > > 864228353526│fmul   d0, d1, d0
> > > 864568163014│fmadd  d2, d0, d16, d2
> > > │ cmpw0, #0x4
> > > 216125427594│  ↓ b.eq   1f34
> > > 15010377│ ldur   d0, [x2, #-8]
> > > 143753737468│  ↑ b  1f04
> > >
> > > -O2 with

Re: LTO slows down calculix by more than 10% on aarch64

2020-09-21 Thread Prathamesh Kulkarni via Gcc
On Fri, 4 Sep 2020 at 17:08, Alexander Monakov  wrote:
>
> > I obtained perf stat results for following benchmark runs:
> >
> > -O2:
> >
> > 7856832.692380  task-clock (msec) #1.000 CPUs utilized
> >   3758   context-switches  #0.000 K/sec
> > 40 cpu-migrations #0.000 
> > K/sec
> >  40847  page-faults   #0.005 
> > K/sec
> >  7856782413676  cycles   #1.000 GHz
> >  6034510093417  instructions   #0.77  insn per 
> > cycle
> >   363937274287   branches   #   46.321 M/sec
> >48557110132   branch-misses#   13.34% of all 
> > branches
>
> (ouch, 2+ hours per run is a lot, collecting a profile over a minute should be
> enough for this kind of code)
>
> > -O2 with orthonl inlined:
> >
> > 8319643.114380  task-clock (msec)   #1.000 CPUs utilized
> >   4285   context-switches #0.001 K/sec
> > 28 cpu-migrations#0.000 
> > K/sec
> >  40843  page-faults  #0.005 
> > K/sec
> >  8319591038295  cycles  #1.000 GHz
> >  6276338800377  instructions  #0.75  insn per 
> > cycle
> >   467400726106   branches  #   56.180 M/sec
> >45986364011branch-misses  #9.84% of all 
> > branches
>
> So +100e9 branches, but +240e9 instructions and +480e9 cycles, probably 
> implying
> that extra instructions are appearing in this loop nest, but not in the 
> innermost
> loop. As a reminder for others, the innermost loop has only 3 iterations.
>
> > -O2 with orthonl inlined and PRE disabled (this removes the extra branches):
> >
> >8207331.088040  task-clock (msec)   #1.000 CPUs utilized
> >   2266   context-switches#0.000 K/sec
> > 32 cpu-migrations   #0.000 K/sec
> >  40846  page-faults #0.005 K/sec
> >  8207292032467  cycles #   1.000 GHz
> >  6035724436440  instructions #0.74  insn per cycle
> >   364415440156   branches #   44.401 M/sec
> >53138327276branch-misses #   14.58% of all branches
>
> This seems to match baseline in terms of instruction count, but without PRE
> the loop nest may be carrying some dependencies over memory. I would simply
> check the assembly for the entire 6-level loop nest in question, I hope it's
> not very complicated (though Fortran array addressing...).
>
> > -O2 with orthonl inlined and hoisting disabled:
> >
> >7797265.206850  task-clock (msec) #1.000 CPUs utilized
> >   3139  context-switches  #0.000 K/sec
> > 20cpu-migrations #0.000 
> > K/sec
> >  40846  page-faults  #0.005 
> > K/sec
> >  7797221351467  cycles  #1.000 GHz
> >  6187348757324  instructions  #0.79  insn per 
> > cycle
> >   461840800061   branches  #   59.231 M/sec
> >26920311761branch-misses #5.83% of all 
> > branches
>
> There's a 20e9 reduction in branch misses and a 500e9 reduction in cycle 
> count.
> I don't think the former fully covers the latter (there's also a 90e9 
> reduction
> in insn count).
>
> Given that the inner loop iterates only 3 times, my main suggestion is to
> consider what the profile for the entire loop nest looks like (it's 6 loops
> deep, each iterating only 3 times).
>
> > Perf profiles for
> > -O2 -fno-code-hoisting and inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> >   3196866 |1f04:ldur   d1, [x1, #-248]
> > 216348301800│addw0, w0, #0x1
> > 985098 |addx2, x2, #0x18
> > 216215999206│addx1, x1, #0x48
> > 215630376504│fmul   d1, d5, d1
> > 863829148015│fmul   d1, d1, d6
> > 864228353526│fmul   d0, d1, d0
> > 864568163014│fmadd  d2, d0, d16, d2
> > │ cmpw0, #0x4
> > 216125427594│  ↓ b.eq   1f34
> > 15010377│ ldur   d0, [x2, #-8]
> > 143753737468│  ↑ b  1f04
> >
> > -O2 with inlined orthonl:
> > https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data
> >
> > 359871503840│ 1ef8:   ldur   d15, [x1, #-248]
> > 144055883055│addw0, w0, #0x1
> >   72262104254│addx2, x2, #0x18
> > 143991169721│addx1, x1

Re: LTO slows down calculix by more than 10% on aarch64

2020-09-04 Thread Prathamesh Kulkarni via Gcc
On Mon, 31 Aug 2020 at 16:53, Prathamesh Kulkarni
 wrote:
>
> On Fri, 28 Aug 2020 at 17:33, Alexander Monakov  wrote:
> >
> > On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
> >
> > > I wonder if that's (one of) the main factor(s) behind the slowdown, or
> > > whether it's not too relevant?
> >
> > Probably not. Some advice to make your search more directed:
> >
> > Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
> > when they are computed against different bases, it's much easier to see that
> > a loop is slowing down if it went from 4000 to 4500 in absolute sample count
> > as opposed to 90% to 91% in relative sample ratio.
> >
> > Before diving down 'perf report', be sure to fully account for differences
> > in 'perf stat' output. Do the programs execute the same number of
> > instructions, so the difference is only in scheduling? Do the programs suffer from the same
> > amount of branch mispredictions? Please show output of 'perf stat' on the
> > mailing list too, so everyone is on the same page about that.
> >
> > I also suspect that the dramatic slowdown has to do with the extra branch.
> > Your CPU might have some specialized counters for branch prediction, see
> > 'perf list'.
> Hi Alexander,
> Thanks for the suggestions! I am in the process of doing the
> benchmarking experiments,
> and will post the results soon.
Hi,
I obtained perf stat results for the following benchmark runs:

-O2:

     7856832.692380      task-clock (msec)    #    1.000 CPUs utilized
               3758      context-switches     #    0.000 K/sec
                 40      cpu-migrations       #    0.000 K/sec
              40847      page-faults          #    0.005 K/sec
      7856782413676      cycles               #    1.000 GHz
      6034510093417      instructions         #    0.77  insn per cycle
       363937274287      branches             #   46.321 M/sec
        48557110132      branch-misses        #   13.34% of all branches

-O2 with orthonl inlined:

     8319643.114380      task-clock (msec)    #    1.000 CPUs utilized
               4285      context-switches     #    0.001 K/sec
                 28      cpu-migrations       #    0.000 K/sec
              40843      page-faults          #    0.005 K/sec
      8319591038295      cycles               #    1.000 GHz
      6276338800377      instructions         #    0.75  insn per cycle
       467400726106      branches             #   56.180 M/sec
        45986364011      branch-misses        #    9.84% of all branches

-O2 with orthonl inlined and PRE disabled (this removes the extra branches):

     8207331.088040      task-clock (msec)    #    1.000 CPUs utilized
               2266      context-switches     #    0.000 K/sec
                 32      cpu-migrations       #    0.000 K/sec
              40846      page-faults          #    0.005 K/sec
      8207292032467      cycles               #    1.000 GHz
      6035724436440      instructions         #    0.74  insn per cycle
       364415440156      branches             #   44.401 M/sec
        53138327276      branch-misses        #   14.58% of all branches

-O2 with orthonl inlined and hoisting disabled:

     7797265.206850      task-clock (msec)    #    1.000 CPUs utilized
               3139      context-switches     #    0.000 K/sec
                 20      cpu-migrations       #    0.000 K/sec
              40846      page-faults          #    0.005 K/sec
      7797221351467      cycles               #    1.000 GHz
      6187348757324      instructions         #    0.79  insn per cycle
       461840800061      branches             #   59.231 M/sec
        26920311761      branch-misses        #    5.83% of all branches

Perf profiles for
-O2 -fno-code-hoisting and inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_inline.data

      3196866 │ 1f04:   ldur   d1, [x1, #-248]
 216348301800 │         add    w0, w0, #0x1
       985098 │         add    x2, x2, #0x18
 216215999206 │         add    x1, x1, #0x48
 215630376504 │         fmul   d1, d5, d1
 863829148015 │         fmul   d1, d1, d6
 864228353526 │         fmul   d0, d1, d0
 864568163014 │         fmadd  d2, d0, d16, d2
              │         cmp    w0, #0x4
 216125427594 │       ↓ b.eq   1f34
     15010377 │         ldur   d0, [x2, #-8]
 143753737468 │       ↑ b      1f04

-O2 with inlined orthonl:
https://people.linaro.org/~prathamesh.kulkarni/perf_O2_

Re: LTO slows down calculix by more than 10% on aarch64

2020-08-31 Thread Prathamesh Kulkarni via Gcc
On Fri, 28 Aug 2020 at 17:33, Alexander Monakov  wrote:
>
> On Fri, 28 Aug 2020, Prathamesh Kulkarni via Gcc wrote:
>
> > I wonder if that's (one of) the main factor(s) behind the slowdown, or
> > whether it's not too relevant?
>
> Probably not. Some advice to make your search more directed:
>
> Pass '-n' to 'perf report'. Relative sample ratios are hard to reason about
> when they are computed against different bases, it's much easier to see that
> a loop is slowing down if it went from 4000 to 4500 in absolute sample count
> as opposed to 90% to 91% in relative sample ratio.
>
> Before diving down 'perf report', be sure to fully account for differences
> in 'perf stat' output. Do the programs execute the same number of
> instructions, so the difference is only in scheduling? Do the programs suffer from the same
> amount of branch mispredictions? Please show output of 'perf stat' on the
> mailing list too, so everyone is on the same page about that.
>
> I also suspect that the dramatic slowdown has to do with the extra branch.
> Your CPU might have some specialized counters for branch prediction, see
> 'perf list'.
Hi Alexander,
Thanks for the suggestions! I am in the process of doing the
benchmarking experiments,
and will post the results soon.

Thanks,
Prathamesh
>
> Alexander


Re: LTO slows down calculix by more than 10% on aarch64

2020-08-31 Thread Prathamesh Kulkarni via Gcc
On Fri, 28 Aug 2020 at 17:27, Richard Biener  wrote:
>
> On Fri, Aug 28, 2020 at 1:17 PM Prathamesh Kulkarni
>  wrote:
> >
> > On Wed, 26 Aug 2020 at 16:50, Richard Biener  
> > wrote:
> > >
> > > On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
> > >  wrote:
> > > >
> > > > Hi,
> > > > We're seeing a consistent regression >10% on calculix with -O2 -flto vs 
> > > > -O2
> > > > on aarch64 in our validation CI. I tried to investigate this issue a
> > > > bit, and it seems the regression comes from inlining of orthonl into
> > > > e_c3d. Disabling that brings back the performance. However, inlining
> > > > orthonl into e_c3d increases its size from 3187 to 3837, by around
> > > > 16.9%, which isn't too large.
> > > >
> > > > I have attached two test-cases, e_c3d.f that has orthonl manually
> > > > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > > > which contains the unmodified function.
> > > > (gauss.f is included by e_c3d.f.) To reproduce, passing just -O2 is
> > > > sufficient.
> > > >
> > > > It seems that inlining orthonl causes 20 hoistings into block 181,
> > > > which are then hoisted to block 173, in particular hoistings of
> > > > w(1, 1) ... w(3, 3), which weren't possible without inlining. The
> > > > hoistings happen because the basic block that computes orthonl at
> > > > line 672 uses w(1, 1) ... w(3, 3), as does the following block at
> > > > line 1035 in e_c3d.f:
> > > >
> > > > senergy=
> > > >  &(s11*w(1,1)+s12*(w(1,2)+w(2,1))
> > > >  &+s13*(w(1,3)+w(3,1))+s22*w(2,2)
> > > >  &+s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> > > >
> > > > Disabling hoisting into blocks 173 (and 181) brings back most of the
> > > > performance. However, I am not able to understand why (or whether)
> > > > these hoistings of w(1, 1) ... w(3, 3) are causing the slowdown.
> > > > Looking at the assembly, the hot code path from perf in e_c3d shows
> > > > the following code-gen diff:
> > > > For inlined version:
> > > > .L122:
> > > > ldr d15, [x1, -248]
> > > > add w0, w0, 1
> > > > add x2, x2, 24
> > > > add x1, x1, 72
> > > > fmul    d15, d17, d15
> > > > fmul    d15, d15, d18
> > > > fmul    d14, d15, d14
> > > > fmadd   d16, d14, d31, d16
> > > > cmp w0, 4
> > > > beq .L121
> > > > ldr d14, [x2, -8]
> > > > b   .L122
> > > >
> > > > and for non-inlined version:
> > > > .L118:
> > > > ldr d0, [x1, -248]
> > > > add w0, w0, 1
> > > > ldr d2, [x2, -8]
> > > > add x1, x1, 72
> > > > add x2, x2, 24
> > > > fmul    d0, d3, d0
> > > > fmul    d0, d0, d5
> > > > fmul    d0, d0, d2
> > > > fmadd   d1, d4, d0, d1
> > > > cmp w0, 4
> > > > bne .L118
> > >
> > > I wonder if you have profiles.  The inlined version has a
> > > non-empty latch block (looks like some PRE is happening
> > > there?).  Possibly your uarch does not like the close
> > > (does your assembly show the layout as it is?) branches?
> > Hi Richard,
> > I have uploaded profiles obtained by perf here:
> > -O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
> > -O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data
> >
> > For the above loop, it shows the following:
> > -O2:
> >   0.01 │ f1c:  ldur   d0, [x1, #-248]
> >   3.53 │addw0, w0, #0x1
> >   │ldur   d2, [x2, #-8]
> >   3.54 │addx1, x1, #0x48
> >   │addx2, x2, #0x18
> >   5.89 │fmul   d0, d3, d0
> > 14.12 │fmul   d0, d0, d5
> > 14.14 │fmul   d0, d0, d2
> > 14.13 │fmadd  d1, d4, d0, d1
> >   0.00 │cmpw0, #0x4
> >   3.52 │  ↑ b.ne   f1c
> >
> > -O2 -flto:
> >   5.47  |1124:ldur   d15, [x1, #-248]
> >   2.19  │addw0, w0, #0x1
>

Re: LTO slows down calculix by more than 10% on aarch64

2020-08-28 Thread Prathamesh Kulkarni via Gcc
On Wed, 26 Aug 2020 at 16:50, Richard Biener  wrote:
>
> On Wed, Aug 26, 2020 at 12:34 PM Prathamesh Kulkarni via Gcc
>  wrote:
> >
> > Hi,
> > We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
> > on aarch64 in our validation CI. I tried to investigate this issue a
> > bit, and it seems the regression comes from inlining of orthonl into
> > e_c3d. Disabling that brings back the performance. However, inlining
> > orthonl into e_c3d increases its size from 3187 to 3837, around
> > 16.9%, which isn't too large.
> >
> > I have attached two test-cases, e_c3d.f that has orthonl manually
> > inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
> > which contains the unmodified function.
> > (gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
> > sufficient.
> >
> > It seems that inlining orthonl causes 20 hoistings into block 181,
> > which are then hoisted to block 173, in particular hoistings of
> > w(1, 1) ... w(3, 3), which weren't possible without inlining. The
> > hoistings happen because the basic block that computes orthonl at
> > line 672 uses w(1, 1) ... w(3, 3), as does the following block at
> > line 1035 in e_c3d.f:
> >
> > senergy=
> >  &(s11*w(1,1)+s12*(w(1,2)+w(2,1))
> >  &+s13*(w(1,3)+w(3,1))+s22*w(2,2)
> >  &+s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight
> >
> > Disabling hoisting into blocks 173 (and 181) brings back most of the
> > performance. However, I am not able to understand why (or whether)
> > these hoistings of w(1, 1) ... w(3, 3) are causing the slowdown.
> > Looking at the assembly, the hot code path from perf in e_c3d shows
> > the following code-gen diff:
> > For inlined version:
> > .L122:
> > ldr d15, [x1, -248]
> > add w0, w0, 1
> > add x2, x2, 24
> > add x1, x1, 72
> > fmul    d15, d17, d15
> > fmul    d15, d15, d18
> > fmul    d14, d15, d14
> > fmadd   d16, d14, d31, d16
> > cmp w0, 4
> > beq .L121
> > ldr d14, [x2, -8]
> > b   .L122
> >
> > and for non-inlined version:
> > .L118:
> > ldr d0, [x1, -248]
> > add w0, w0, 1
> > ldr d2, [x2, -8]
> > add x1, x1, 72
> > add x2, x2, 24
> > fmul    d0, d3, d0
> > fmul    d0, d0, d5
> > fmul    d0, d0, d2
> > fmadd   d1, d4, d0, d1
> > cmp w0, 4
> > bne .L118
>
> I wonder if you have profiles.  The inlined version has a
> non-empty latch block (looks like some PRE is happening
> there?).  Possibly your uarch does not like the close
> (does your assembly show the layout as it is?) branches?
Hi Richard,
I have uploaded profiles obtained by perf here:
-O2: https://people.linaro.org/~prathamesh.kulkarni/o2_perf.data
-O2 -flto: https://people.linaro.org/~prathamesh.kulkarni/o2_lto_perf.data

For the above loop, it shows the following:
-O2:
  0.01 │ f1c:  ldur   d0, [x1, #-248]
  3.53 │addw0, w0, #0x1
  │ldur   d2, [x2, #-8]
  3.54 │addx1, x1, #0x48
  │addx2, x2, #0x18
  5.89 │fmul   d0, d3, d0
14.12 │fmul   d0, d0, d5
14.14 │fmul   d0, d0, d2
14.13 │fmadd  d1, d4, d0, d1
  0.00 │cmpw0, #0x4
  3.52 │  ↑ b.ne   f1c

-O2 -flto:
  5.47  |1124:ldur   d15, [x1, #-248]
  2.19  │addw0, w0, #0x1
  1.10  │addx2, x2, #0x18
  2.18  │addx1, x1, #0x48
  4.37  │fmul   d15, d17, d15
 13.13 │fmul   d15, d15, d18
 13.13 │fmul   d14, d15, d14
 13.14 │fmadd  d16, d14, d31, d16
   │cmpw0, #0x4
  3.28  │↓ b.eq   1154
  0.00  │ldur   d14, [x2, #-8]
  2.19  │↑ b  1124

IIUC, the biggest relative difference comes from the load [x1, #-248],
which in the LTO case takes 5.47% of overall samples:
5.47  |1124:   ldur   d15, [x1, #-248]
while in the -O2 case, it's just 0.01%:
 0.01 │ f1c:   ldur   d0, [x1, #-248]

I wonder if that's (one of) the main factor(s) behind the slowdown,
or whether it's not too relevant?

Thanks,
Prathamesh
>
> > which corresponds to the following loop in line 1014.
> > do n1=1,3
> >   s(iii1,jjj1)=s(iii1,jjj1)
> >  &  +anisox(m1,k1,n1,l1)
> >  &

LTO slows down calculix by more than 10% on aarch64

2020-08-26 Thread Prathamesh Kulkarni via Gcc
Hi,
We're seeing a consistent regression >10% on calculix with -O2 -flto vs -O2
on aarch64 in our validation CI. I tried to investigate this issue a
bit, and it seems the regression comes from inlining of orthonl into
e_c3d. Disabling that brings back the performance. However, inlining
orthonl into e_c3d increases its size from 3187 to 3837, around
16.9%, which isn't too large.

I have attached two test-cases, e_c3d.f that has orthonl manually
inlined into e_c3d to "simulate" LTO's inlining, and e_c3d-orig.f,
which contains the unmodified function.
(gauss.f is included by e_c3d.f). For reproducing, just passing -O2 is
sufficient.

It seems that inlining orthonl causes 20 hoistings into block 181,
which are then hoisted to block 173, in particular hoistings of
w(1, 1) ... w(3, 3), which weren't possible without inlining. The
hoistings happen because the basic block that computes orthonl at
line 672 uses w(1, 1) ... w(3, 3), as does the following block at
line 1035 in e_c3d.f:

senergy=
 &(s11*w(1,1)+s12*(w(1,2)+w(2,1))
 &+s13*(w(1,3)+w(3,1))+s22*w(2,2)
 &+s23*(w(2,3)+w(3,2))+s33*w(3,3))*weight

Disabling hoisting into blocks 173 (and 181) brings back most of the
performance. However, I am not able to understand why (or whether)
these hoistings of w(1, 1) ... w(3, 3) are causing the slowdown.
Looking at the assembly, the hot code path from perf in e_c3d shows
the following code-gen diff:
For inlined version:
.L122:
ldr d15, [x1, -248]
add w0, w0, 1
add x2, x2, 24
add x1, x1, 72
fmul    d15, d17, d15
fmul    d15, d15, d18
fmul    d14, d15, d14
fmadd   d16, d14, d31, d16
cmp w0, 4
beq .L121
ldr d14, [x2, -8]
b   .L122

and for non-inlined version:
.L118:
ldr d0, [x1, -248]
add w0, w0, 1
ldr d2, [x2, -8]
add x1, x1, 72
add x2, x2, 24
fmul    d0, d3, d0
fmul    d0, d0, d5
fmul    d0, d0, d2
fmadd   d1, d4, d0, d1
cmp w0, 4
bne .L118

which corresponds to the following loop in line 1014.
do n1=1,3
  s(iii1,jjj1)=s(iii1,jjj1)
 &  +anisox(m1,k1,n1,l1)
 &  *w(k1,l1)*vo(i1,m1)*vo(j1,n1)
 &  *weight

I am not sure why hoisting would have any direct effect on this loop,
except perhaps that hoisting allocated more registers and led to
increased register pressure. Perhaps that's why the inlined version
uses higher-numbered registers in its code-gen? However, disabling
hoisting in blocks 173 and 181 also leads to 6 extra spills overall
(by grepping for str to sp), so hoisting is also helping here? I am
not sure how to proceed further, and would be grateful for
suggestions.

Thanks,
Prathamesh
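
The spill-counting trick mentioned above ("grepping for str to sp") can be sketched like this; the sample assembly content is made up for illustration, and in practice one would point the grep at the .s file GCC emits with -S:

```shell
# Hedged sketch: approximate the spill count in aarch64 assembly by
# counting stores relative to the stack pointer.  Note this also counts
# callee-saved register saves in the prologue, so it overcounts a bit.
# The sample input is made up; point the grep at a real -S output.
cat > /tmp/sample.s <<'EOF'
	str	d8, [sp, 16]
	add	x1, x1, 72
	str	x19, [sp, 24]
EOF
grep -c 'str.*\[sp' /tmp/sample.s   # prints 2
```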


e_c3d.f
Description: Binary data


e_c3d-orig.f
Description: Binary data


gauss.f
Description: Binary data