Josef Melcr wrote:
But also if it weren't static (or had multiple callers), for -O3 I'd
expect the cloning in this case (assuming that it is profitable by
whatever measure the compiler has).
Do you suggest cloning with -O2 or editing the function inline?  I am not sure how difficult either option would be to implement, though just enabling cloning with -O2 shouldn't be too difficult. Cloning makes sense in this case, since the original function will be cleaned up anyway, meaning there would be very little code growth, if any.  Implementing it through a new compilation flag enabled at -O2 by default (as suggested by Martin) would probably be the best.

At least if there is only a single (indirect) call to that function,
the constant propagation should always be profitable (hence, -fipa-cp
is enabled already with -O2, I think).

If that function is called from multiple code paths, it has to be
cloned and then the benefit is less clear (cf. -fipa-cp-clone,
enabled by -O3).

For functions used as callbacks, I think the most common case is a
'static' function that is only called via a single indirect call.
(For OpenMP regions this is always the case.) However, if someone
missed the 'static' or in cases where the same function is passed
to multiple functions, *cloning* must happen.

An example for the latter would be doloop_contained_procedure_code in
gcc/fortran/frontend-passes.cc, which is used twice, but there are
likely more cases in GCC or other real-world code, where a
function might get directly and indirectly called or multiple times
indirectly - or, as mentioned, 'static' has been missed such that
it is only called once indirectly - but the compiler doesn't know
this.
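
To make the two situations concrete, here is a minimal sketch in plain
C. All names (call_with, double_it, negate, once, twice) are made up
for illustration and are not taken from GCC; call_with merely stands
in for a runtime entry point that forwards its data to a callback.

```c
/* Hypothetical stand-in for a runtime entry point such as
   GOMP_target_ext: it only forwards 'data' to the callback 'fn'.  */
static int
call_with (int (*fn) (void *), void *data)
{
  return fn (data);
}

/* Callback that is handed over exactly once (see 'once').  */
static int
double_it (void *data)
{
  int *p = data;
  return *p * 2;
}

/* Callback that is handed over twice with different data (see 'twice').  */
static int
negate (void *data)
{
  int *p = data;
  return -*p;
}

/* Single indirect use: the value behind 'data' is always 21, so it
   could be propagated into 'double_it' itself, no clone required.  */
int
once (void)
{
  int v = 21;
  return call_with (double_it, &v);
}

/* Two indirect uses with different constants: propagating either
   value needs one specialised copy (clone) of 'negate' per use.  */
int
twice (void)
{
  int a = 1, b = 2;
  return call_with (negate, &a) + call_with (negate, &b);
}
```

Whether the compiler actually specialises either callback depends on
the flags discussed above; the sketch only shows the shape of the code.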

Still, I think enabling it always with -O2 - or at least when the
function is 'static' and only called once - makes sense.

Whether cloning or not, I don't know - but I would like it if the
offloading example below could be made to work such that there
is no call to 'sin' on either the host or the device side.

* * *

The following is not for the first round feature landing, but as it
is related to the question of cloning and I mentioned it in the Q&A
part during the Cauldron:

Offloading (OpenMP's 'omp target', OpenACC's 'acc kernels', 'acc parallel')

Assume the following rather stupid code (I should have used 'double' or 'sinf'):

#include <math.h>

float mycalc() {
  float x = 0.123453;
  #pragma omp target map(x)
    x = sin(x);
  return x;
}

This becomes:

...
  <bb 2> :
  x = 1.23452998697757720947265625e-1;
  .omp_data_arr.3.x = &x;
...
  __builtin_GOMP_target_ext (-1, mycalc._omp_fn.0, 1, &.omp_data_arr.3,
                             &.omp_data_sizes.4, &.omp_data_kinds.5, 0, 0B,
                             &.omp_target_args.6);
  .omp_data_arr.3 = {CLOBBER};
  D.4653 = x;
...

and

__attribute__((omp target entrypoint, noclone))
void mycalc._omp_fn.0 (const struct .omp_data_t.2 & restrict .omp_data_i)
{
  float x [value-expr: *.omp_data_i->x];
...
  D.4667 = .omp_data_i->x;
  x.0 = *D.4667;
  D.4669 = (double) x.0;
  D.4670 = sin (D.4669);
...


With offloading, it works as follows:

(A) If no suitable offloading device is available, it is a normal
    callback ("host fallback"): __builtin_GOMP_target_ext
    calls mycalc._omp_fn.0 and it receives .omp_data_arr alias
    .omp_data_i for the arguments.

(B) If the default device (implied by the '-1' argument) is an
    offload device:

__builtin_GOMP_target_ext resolves the required offload device and
for 'mycalc._omp_fn.0' the associated offload function is found.

If it is not a shared-memory device, the 'x' variable has to be
copied to the device.
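
As a rough illustration of case (A), the host fallback is nothing more
than an ordinary callback invocation. The sketch below is plain C and
does not use the real libgomp interface: launch_sketch, data_block,
outlined_region and mycalc_fallback are hypothetical names, and the
region doubles x instead of calling sin so that no math library is
needed.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the '.omp_data_arr' block.  */
struct data_block
{
  float *x;
};

/* Stand-in for the outlined region 'mycalc._omp_fn.0'.  */
static void
outlined_region (void *data)
{
  struct data_block *d = data;
  *d->x = *d->x * 2.0f;
}

/* Hypothetical launcher, NOT the real GOMP_target_ext.  'device_run'
   is NULL when no offload device is available.  Returns 1 if the
   host fallback ran and 0 if the work was handed to a device.  */
static int
launch_sketch (int (*device_run) (void (*) (void *), void *),
               void (*host_fn) (void *), void *data)
{
  if (device_run == NULL)
    {
      host_fn (data);           /* (A) host fallback: a plain callback */
      return 1;
    }
  device_run (host_fn, data);   /* (B) find and run the device function */
  return 0;
}

/* Mirrors the shape of mycalc() for the no-device case.  */
float
mycalc_fallback (void)
{
  float x = 1.5f;
  struct data_block block = { &x };
  launch_sketch (NULL, outlined_region, &block);
  return x;
}
```

Seen this way, the constant stored in the data block travels to the
region through exactly one indirect call.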


Hence:

* The associated device function needs to be found for the host
  function.

* The arguments ('.omp_data_arr') passed to the host and device
  version must be identical; however, if IPA CP propagates the
  values into one version and not the other, that's fine as it
  does not affect the result.

In order to find the associated function, GCC generates on both
the device side and the host side a table with the offload functions
such that the n-th host function is the n-th device function.
The host function is found by function-pointer comparison.

In order that this works, there needs to be a table on both
sides – and it must match.
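
The lookup can be pictured with the following plain-C sketch. It is
only an illustration under my own assumptions: the names (host_table,
device_table, find_offload_index, region_a, ...) are invented, and the
device side is reduced to integer IDs instead of real device entry
points.

```c
#include <stddef.h>

typedef void (*host_fn_t) (void *);

/* Stand-ins for outlined offload regions; the bodies differ so that
   the functions are clearly distinct.  */
static void region_a (void *data) { *(int *) data = 1; }
static void region_b (void *data) { *(int *) data = 2; }

/* A function that was never registered for offloading.  */
static void not_offloaded (void *data) { *(int *) data = 3; }

/* Host-side table: entry n is the n-th offload function.  */
static const host_fn_t host_table[] = { region_a, region_b };

/* Device-side table in the SAME order; the numbers are hypothetical
   IDs standing in for the device functions.  */
static const int device_table[] = { 100, 200 };

#define N_ENTRIES (sizeof host_table / sizeof host_table[0])

/* Return the table index of 'fn', or -1 if it is not an offload
   function.  The host function is found by pointer comparison.  */
int
find_offload_index (host_fn_t fn)
{
  for (size_t i = 0; i < N_ENTRIES; i++)
    if (host_table[i] == fn)
      return (int) i;
  return -1;
}
```

The device function is then device_table[find_offload_index (fn)],
which only gives the right answer as long as both tables keep the
same entries in the same order.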

[For offloading, the offload functions and the mentioned tables
are stored in LTO format but in a different section, i.e.
independent of whether LTO is done on the host side or not.]

Before lto-cgraph.cc's output_offload_tables is called, new
functions might be added – and symtab nodes might be deleted –
but afterward, the offload_funcs vector must contain all
functions used for the GOMP_target_ext call - and the order
must not change.

That's the reason we set node->force_output in output_offload_tables
and that those functions have the "noclone" attribute.

Without LTO ("-flto"), output_offload_tables is called once
per translation unit, with LTO only once during the host's LTO
run.

* * *

It is not completely clear to me which analysis is run before
output_offload_tables and which one afterward. Obviously,
a lot of optimizations have happened before - but I know of
cases where force_output prevents further optimizations,
i.e. some optimizations must be happening later!

For IPA CP optimizations involving callbacks, I hope that those
happen before output_offload_tables – because then everything,
including cloning, should be fine, and the non-referenced node
can be deleted.

If the IPA CP runs after output_offload_tables, it requires
that the changes still refer to the same tree (i.e. offload_funcs
still can find the same function) - and, likely, the changes
would only affect the host fallback as it is too late for the
device side.

Tobias

PS: https://gcc.gnu.org/wiki/Offloading under "Compilation process"
contains the description of how this is handled internally; the
actual writing of the table for host and device side happens in
omp_finish_file – the two tables are combined in the constructor
generated by gcc/config/{gcn,nvptx}/mkoffload.cc. But that's not
really relevant for implementing IPA CP for 'omp target'.
