Hi Josef, hi Martin, hi all,

first, I noticed that the testcase further below does not work,
although something like it seems to be a relatively common
pattern. (Admittedly, the known information is more often '> 0'
or some range than an actual constant, but constants
can occur as well, especially after other
value propagation.)

First, unsurprisingly, gcc.dg/ipa/ipcp-cb1.c works
well, which uses:
--------- gcc.dg/ipa/ipcp-cb1.c --------
  test(100);
...
void test(int c) {
#pragma omp parallel for
  for (int i = 0; i < c; i++) {
--------- end of gcc.dg/ipa/ipcp-cb1.c --------


But now to my testcase:
-----------------------
#include <math.h>
#include <stdio.h>

int main()
{
  double x;
//  #pragma omp target map(from: x)
  {
    x = 1.3547;
    #pragma omp parallel if(0)
      x = sin (x);
  }
  printf ("x = %f\n", x);
}
-----------------------

This still gives:

  x = 1.3547000000000000152766688188421539962291717529296875e+0;
  .omp_data_o.1.x = x;
  __builtin_GOMP_parallel (main._omp_fn.0, &.omp_data_o.1, 1, 0);

and

void main._omp_fn.0 (struct .omp_data_s.0 & restrict .omp_data_i)
{
  double x [value-expr: .omp_data_i->x];
...
  D.5200 = .omp_data_i->x;
  D.5201 = sin (D.5200);

I compiled with:
  -save-temps -fopenmp -flto -Ofast -fdump-tree-{ompexp,cp,optimized} -foffload-options=-fdump-tree-optimized -lm

I originally tried it with the 'omp target' line enabled,
but then commented it out to simplify the testcase.

I also tried: 'double x,y;' together with 'y = sin(x);' to
avoid actually writing to 'x', but it made no difference.
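
As a hand-written illustration (not compiler output) of the result one
might hope for here, the sketch below models the outlined function and a
clone in which the known constant replaces the load from the data block.
All names (region_data, run_region, region_generic, region_specialized)
are hypothetical and are not GCC or libgomp symbols, and 2.0 * v stands
in for sin() only to avoid a libm dependency.

```c
/* Hand-written model; none of these names are GCC or libgomp symbols. */
struct region_data { double x; };

/* Stand-in for GOMP_parallel: runs the callback once on the caller's thread. */
void run_region(void (*fn)(struct region_data *), struct region_data *data)
{
  fn(data);
}

/* Stand-in for sin(), chosen to avoid depending on libm. */
double f(double v) { return 2.0 * v; }

/* Outlined body as in the dump: the input is loaded from the data block. */
void region_generic(struct region_data *d)
{
  d->x = f(d->x);
}

/* Hoped-for clone: the known value 1.3547 replaces the load. */
void region_specialized(struct region_data *d)
{
  d->x = f(1.3547);
}

/* Caller using the generic outlined body. */
double call_generic(void)
{
  struct region_data d = { 1.3547 };
  run_region(region_generic, &d);
  return d.x;
}

/* Caller redirected to the specialized clone. */
double call_specialized(void)
{
  struct region_data d = { 1.3547 };
  run_region(region_specialized, &d);
  return d.x;
}
```

Both callers return the same value; the difference is only that the
specialized body no longer depends on the contents of the data block.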

Any idea?

* * *

Josef Melcr wrote:
ASSUME is on my mind since it was first mentioned when I submitted v1 of this patch.  With the special casing mechanism in place, it would definitely be possible, but the thing I kinda struggle with is priority.  I do have some top priorities, like implementing the summary to save space in the cgraph_edge class, but then there are features like the GOMP_task copy function support, omp target or IFN_ASSUME support and I am not sure which of these is the most important.  Do you have any thoughts on that? 🙂

As mentioned, I think it would be useful to have it working
without -flto for the common cases (cf. previous discussion).

And to have the case above working.

I also think 'target' is of lower priority as it is harder
(see below) and requires more thinking about how to handle it
best.

Handling the propagation of assumptions (and ranges?) could
be useful. For ranges, I was wondering about code like:
  double x = 1.0 + abs(y);
  ...
  if (x > 0.5)
which in principle appears in real-world code, but I am not
sure to what extent the knowledge of, e.g., '>= 1.0' will
really help in real-world code. Likewise for assumptions,
albeit
  if (...)
  else
     __builtin_unreachable ();
is at least somewhat common in GCC's own code ...
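
A minimal sketch of how the two ideas combine: the __builtin_unreachable
idiom states the range 'x >= 1.0', which makes the later 'x > 0.5' test
redundant. The function names are hypothetical, and whether a given GCC
version actually folds the comparison (locally or across a call
boundary) is not claimed here.

```c
/* Tell the optimizer that x >= 1.0 holds; behavior is undefined otherwise. */
static inline void assume_at_least_one(double x)
{
  if (!(x >= 1.0))
    __builtin_unreachable();
}

/* Hypothetical consumer of the range information. */
int classify(double x)
{
  assume_at_least_one(x);
  /* Given x >= 1.0, this test is always true and could be folded away. */
  if (x > 0.5)
    return 1;
  return 0;
}
```

Callers must only pass values that satisfy the stated assumption, e.g.
classify(1.0) or classify(3.5).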

* * *

Back to 'target' - the quote is from Josef's other email
in this thread:

It seems that the decision stage of ipa-cp is executed after outputting the tables, so the tables would have to be updated.  I am not sure how difficult that's going to be, but having the ability to clone the kernels would be useful, so it's definitely worth looking into.  Not sure when I'll have the time to do so, but it's on my radar 🙂

Thanks.

Note: As mentioned in the reply to v5, OpenACC's 'acc parallel'
and 'acc kernels' work likewise; i.e. the following applies to
those as well - and not only to OpenMP's 'omp target'.

Below are some reasons why I suggested above deferring the 'target' handling.


The current workflow is:

lto-streamer-out.cc:
* Stream out the functions that should be on the offload side
  - this includes both functions that are called indirectly via GOMP_target_ext
    and functions that are called only on the offload side.
  - output the table with the offload functions (function output_offload_tables,
    the vector/array itself is offload_funcs).

- On the host, further optimizations are done at this point.
- On the offload side (lto1 offload compiler, although it does not actually do LTO there),
  optimizations are done as well.

As part of this work, both sides eventually output the function-pointer array
'offload_funcs' (via omp-offload.cc's omp_finish_file).


Hence:
* It is important that both the device side and the host side have the respective
  function-pointer entry at the same position in that array, and the host/offload
  arrays must have the same size.
=> This implies that a single cloning + replacement is fine, but multiple clones
  are not.

* All IPA-CP changes will only affect the host fallback, as IPA comes too late
  for the device code, which will already have been streamed out.

And as the GOMP_target_ext code + arguments are only on the host side, there is
nothing the device-side IPA can work on with regards to this callback.

While optimizing the host fallback is fine, what we actually want to have is (also)
an optimized device side. It seems as if we may need to come up with some scheme
that delays writing out the device side to permit more optimizations, whether
just pass reordering or something else.

Actually, before writing out the device code and the offload
table, actual cloning would be fine. This would then lead to
multiple host and device versions, but there is no reason
why that should be a problem as long as there is a one-to-one
relation between host and device versions.
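
The one-to-one requirement can be pictured with a toy model; this is not
GCC's actual offload_funcs representation. Two function-pointer tables
are indexed by the same position, so adding a clone is only safe if it
is added at the same index on both sides. All names (kernel_fn,
host_table, device_table, launch) are hypothetical.

```c
#include <stddef.h>

typedef int (*kernel_fn)(int);

/* Hypothetical kernels: a generic version and a clone specialized for c == 100. */
static int host_generic(int c) { return c + 1; }
static int host_c100(int c)    { (void)c; return 101; }
static int dev_generic(int c)  { return c + 1; }
static int dev_c100(int c)     { (void)c; return 101; }

/* Entry i on the host must correspond to entry i on the device. */
kernel_fn host_table[]   = { host_generic, host_c100 };
kernel_fn device_table[] = { dev_generic,  dev_c100 };

#define TABLE_LEN(t) (sizeof(t) / sizeof((t)[0]))

/* The size invariant from the text: both tables have the same length. */
int tables_match(void)
{
  return TABLE_LEN(host_table) == TABLE_LEN(device_table);
}

/* A launch selects the kernel by table position, not by name. */
int launch(size_t index, int c, int on_device)
{
  kernel_fn *table = on_device ? device_table : host_table;
  return table[index](c);
}
```

If a clone were appended to only one of the two tables, tables_match()
would fail and the shared index would no longer identify the same kernel
on both sides.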

* * *

And, finally, if 'parallel' is optimized, more complex offload
kernels will profit - even if IPA does not work with
GOMP_target_ext.

Tobias
