Re: [patch] libgomp: cuda.h and omp_target_memcpy_rect cleanup (was: [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect)

2023-08-09 Thread Thomas Schwinge
Hi Tobias!

On 2023-07-28T13:51:41+0200, Tobias Burnus  wrote:
> On 27.07.23 23:00, Thomas Schwinge wrote:
>>> +  else if (src_devicep != NULL
>>> +&& (dst_devicep == NULL
>>> +|| (dst_devicep->capabilities
>>> +& GOMP_OFFLOAD_CAP_SHARED_MEM)))
>> Are these 'GOMP_OFFLOAD_CAP_SHARED_MEM' actually reachable, given that
>> 'omp_target_memcpy_check' (via 'omp_target_memcpy_rect_check') clears out
>> the device to 'NULL' for 'GOMP_OFFLOAD_CAP_SHARED_MEM'?
>
> I have now undone this change – I did not dig deep enough into the
> function calls.
>
>
>>> +  else if (dst_devicep == NULL && src_devicep == NULL)
>>> + {
>>> +   memcpy ((char *) dst + dst_off, (const char *) src + src_off,
>>> +   length);
>>> +   ret = 1;
>>> + }
>>> else if (src_devicep == dst_devicep)
>>>ret = src_devicep->dev2dev_func (src_devicep->target_id,
>>> (char *) dst + dst_off,
>>> (const char *) src + src_off,
>>> length);
>> ..., but also left the intra-device case here -- which should now be dead
>> code here?
>
> Why? Unless I missed something, the old, the current, and the proposed
> (= old) code do still run this code.

It is now again reachable, but wasn't in the "intermediate state"
(without your follow-on change) -- at least per my understanding, which
may be gappy.

> I have not added an assert to confirm, but in any case, it is tested for
> in my recently added testcase - thus, we could add a 'printf' to confirm.

We're now back to the original code, which should be fine.

>>> +   else if (*tmp_size < length)
>>> + {
>>> +   *tmp_size = length;
>>> +   *tmp = realloc (*tmp, length);
>>> +   if (*tmp == NULL)
>>> + return ENOMEM;
>> If 'realloc' returns 'NULL', we should 'free' the original '*tmp'?
>>
>> Do we really need here the property here that if the re-allocation can't
>> be done in-place, 'realloc' copies the original content to the new?  In
>> other words, should we just unconditionally 'free' and re-'malloc' here,
>> instead of 'realloc'?
> I have now done so – but I am not really sure what's faster on average.
> If it can be enlarged, 'realloc' is faster, if it cannot free+malloc is
> better.

I have no proof, but would assume that the C library handles as
efficiently as 'realloc' the case of 'free' plus subsequent 'malloc' if
there's space available after the original allocation.

>> I haven't looked whether the re-use of 'tmp' for multiple calls to this
>> is then actually useful, or whether we should just always 'malloc', use,
>> 'free' the buffer here?

..., hence that suggestion.  But I agree what we've got now is good, just
one more idea: couldn't we now actually unify the (original) 'malloc' and
(original) 'realloc' case:

-if (*tmp_size == 0)
-  {
-*tmp_size = length;
-*tmp = malloc (length);
-if (*tmp == NULL)
-  return ENOMEM;
-  }
-else if (*tmp_size < length)
+if (*tmp_size < length)
   {
 *tmp_size = length;
 free (*tmp);
 *tmp = malloc (length);
 if (*tmp == NULL)
   return ENOMEM;
   }

(Untested.)

> Well, it can run in a hot loop – assume a C-array array[1024][1024][2]
> and copying array[:1024,:1024,0:1] (using OpenMP syntax) – i.e. 1048576
> times every other element. And therefore I would like to avoid repeated
> malloc/free in such a case. (But in general, interdevice copying should
> be very rare.)
>
> Actually, I think the realloc case is unreachable: for rectangular
> copies, as implied both by 'target update' with strided access and by
> 'omp_target_memcpy_rect', the size should be constant.

Worth an 'assert'?

Now I "only" don't understand the '__builtin_mul_overflow' computations
and error checks ('return EINVAL;') for the four cases handled in
'libgomp/target.c:omp_target_memcpy_rect_worker': for example, the
generic loop at end of function iterates over all 'num_dims', but the
specific earlier 'num_dims == N' cases don't.  But I suppose that's just
my own problem not understanding this API/code, and I'm not currently
planning on looking into this any further.  ;-)


Grüße
 Thomas
-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: [patch] libgomp: cuda.h and omp_target_memcpy_rect cleanup (was: [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect)

2023-07-29 Thread Tobias Burnus

Now committed as r14-2865-g8b9e559fe7ca5715c74115322af99dbf9137a399

Tobias

On 28.07.23 13:51, Tobias Burnus wrote:

thanks for proof reading and the suggestions! – Do have comments to the
attached patch?

* * *

Crossref: For further optimizations, see also

https://gcc.gnu.org/PR101581 — [OpenMP] omp_target_memcpy – support
inter-device memcpy
https://gcc.gnu.org/PR110813 — [OpenMP] omp_target_memcpy_rect (+
strided 'target update'): Improve GCN performance and contiguous
subranges

and just added based on Thomas' comment:

https://gcc.gnu.org/PR107424 — [OpenMP] Check whether device locking is
really needed for bare memcopy to/from devices (omp_target_memcpy...)

* * *

On 27.07.23 23:00, Thomas Schwinge wrote:

+++ b/include/cuda/cuda.h

I note that you're not actually using everything you're adding here.
(..., but I understand you're simply adding everying that relates to
these 'cuMemcpy[...]' routines -- OK as far as I'm concerned.)


Yes. That was on purpose to make it easier to pick something when needed
– especially as we might want to use some of those later on.

For symmetry, I now also added cuMemcpyPeer + ...Async, which also
remain unused. (But could be used as part of the PRs linked above.)


+  const void *dstHost;

That last one isn't 'const'.  ;-)

Fixed - three times.

A 'cuda.h' that I looked at calls that last one 'reserved0', with
comment
"Must be NULL".

Seems to be unused in real world code and in the documentation. But
let's use this name as it might be exposed in the wild.

--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
+extern int GOMP_OFFLOAD_memcpy2d (int, int, size_t, size_t,
+   void*, size_t, size_t, size_t,
+   const void*, size_t, size_t, size_t);
+extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, size_t, size_t,
void *,
+   size_t, size_t, size_t, size_t, size_t,
+   const void *, size_t, size_t,
size_t, size_t,
+   size_t);

Oh, wow.  ;-)


Maybe this is not the best ABI. We can consider to modify it before the
GCC 14 release. (And in principle also afterwards, given that libgomp
and its plugins should™ be compiled and installed alongside.)

I think once we know how to implement GCN, we will see whether it was
done smartly or whether other arguments should be used or whether the
two functions should be combined.

[Regarding the reserve0/reserve1 values for cuMemcpy3D and whether it
should be NULL or not; quoting the usage in plugin-nvptx.c:]


I note that this doesn't adhere to the two "Must be NULL" remarks from
above -- but I'm confused, because, for example, on
capabilities
+& GOMP_OFFLOAD_CAP_SHARED_MEM)))

Are these 'GOMP_OFFLOAD_CAP_SHARED_MEM' actually reachable, given that
'omp_target_memcpy_check' (via 'omp_target_memcpy_rect_check') clears
out
the device to 'NULL' for 'GOMP_OFFLOAD_CAP_SHARED_MEM'?


I have now undone this change – I did not dig deep enough into the
function calls.



+  else if (dst_devicep == NULL && src_devicep == NULL)
+ {
+   memcpy ((char *) dst + dst_off, (const char *) src + src_off,
+   length);
+   ret = 1;
+ }
else if (src_devicep == dst_devicep)
   ret = src_devicep->dev2dev_func (src_devicep->target_id,
(char *) dst + dst_off,
(const char *) src + src_off,
length);

..., but also left the intra-device case here -- which should now be
dead
code here?


Why? Unless I missed something, the old, the current, and the proposed
(= old) code do still run this code.

I have not added an assert to confirm, but in any case, it is tested for
in my recently added testcase - thus, we could add a 'printf' to confirm.


+   else if (*tmp_size < length)
+ {
+   *tmp_size = length;
+   *tmp = realloc (*tmp, length);
+   if (*tmp == NULL)
+ return ENOMEM;

If 'realloc' returns 'NULL', we should 'free' the original '*tmp'?

Do we really need here the property here that if the re-allocation can't
be done in-place, 'realloc' copies the original content to the new?  In
other words, should we just unconditionally 'free' and re-'malloc' here,
instead of 'realloc'?

I have now done so – but I am not really sure what's faster on average.
If it can be enlarged, 'realloc' is 

[patch] libgomp: cuda.h and omp_target_memcpy_rect cleanup (was: [patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect)

2023-07-28 Thread Tobias Burnus

Hi Thomas,

thanks for proof reading and the suggestions! – Do have comments to the
attached patch?

* * *

Crossref: For further optimizations, see also

https://gcc.gnu.org/PR101581 — [OpenMP] omp_target_memcpy – support
inter-device memcpy
https://gcc.gnu.org/PR110813 — [OpenMP] omp_target_memcpy_rect (+
strided 'target update'): Improve GCN performance and contiguous subranges

and just added based on Thomas' comment:

https://gcc.gnu.org/PR107424 — [OpenMP] Check whether device locking is
really needed for bare memcopy to/from devices (omp_target_memcpy...)

* * *

On 27.07.23 23:00, Thomas Schwinge wrote:

+++ b/include/cuda/cuda.h

I note that you're not actually using everything you're adding here.
(..., but I understand you're simply adding everying that relates to
these 'cuMemcpy[...]' routines -- OK as far as I'm concerned.)


Yes. That was on purpose to make it easier to pick something when needed
– especially as we might want to use some of those later on.

For symmetry, I now also added cuMemcpyPeer + ...Async, which also
remain unused. (But could be used as part of the PRs linked above.)


+  const void *dstHost;

That last one isn't 'const'.  ;-)

Fixed - three times.

A 'cuda.h' that I looked at calls that last one 'reserved0', with comment
"Must be NULL".

Seems to be unused in real world code and in the documentation. But
let's use this name as it might be exposed in the wild.

--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
+extern int GOMP_OFFLOAD_memcpy2d (int, int, size_t, size_t,
+   void*, size_t, size_t, size_t,
+   const void*, size_t, size_t, size_t);
+extern int GOMP_OFFLOAD_memcpy3d (int, int, size_t, size_t, size_t, void *,
+   size_t, size_t, size_t, size_t, size_t,
+   const void *, size_t, size_t, size_t, size_t,
+   size_t);

Oh, wow.  ;-)


Maybe this is not the best ABI. We can consider to modify it before the
GCC 14 release. (And in principle also afterwards, given that libgomp
and its plugins should™ be compiled and installed alongside.)

I think once we know how to implement GCN, we will see whether it was
done smartly or whether other arguments should be used or whether the
two functions should be combined.

[Regarding the reserve0/reserve1 values for cuMemcpy3D and whether it
should be NULL or not; quoting the usage in plugin-nvptx.c:]


I note that this doesn't adhere to the two "Must be NULL" remarks from
above -- but I'm confused, because, for example, on
capabilities
+& GOMP_OFFLOAD_CAP_SHARED_MEM)))

Are these 'GOMP_OFFLOAD_CAP_SHARED_MEM' actually reachable, given that
'omp_target_memcpy_check' (via 'omp_target_memcpy_rect_check') clears out
the device to 'NULL' for 'GOMP_OFFLOAD_CAP_SHARED_MEM'?


I have now undone this change – I did not dig deep enough into the
function calls.



+  else if (dst_devicep == NULL && src_devicep == NULL)
+ {
+   memcpy ((char *) dst + dst_off, (const char *) src + src_off,
+   length);
+   ret = 1;
+ }
else if (src_devicep == dst_devicep)
   ret = src_devicep->dev2dev_func (src_devicep->target_id,
(char *) dst + dst_off,
(const char *) src + src_off,
length);

..., but also left the intra-device case here -- which should now be dead
code here?


Why? Unless I missed something, the old, the current, and the proposed
(= old) code do still run this code.

I have not added an assert to confirm, but in any case, it is tested for
in my recently added testcase - thus, we could add a 'printf' to confirm.


+   else if (*tmp_size < length)
+ {
+   *tmp_size = length;
+   *tmp = realloc (*tmp, length);
+   if (*tmp == NULL)
+ return ENOMEM;

If 'realloc' returns 'NULL', we should 'free' the original '*tmp'?

Do we really need here the property here that if the re-allocation can't
be done in-place, 'realloc' copies the original content to the new?  In
other words, should we just unconditionally 'free' and re-'malloc' here,
instead of 'realloc'?

I have now done so – but I am not really sure what's faster on average.
If it can be enlarged, 'realloc' is faster, if it cannot free+malloc is
better.

I haven't looked whether the re-use of 'tmp' for multiple