[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2021-05-06 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert abandoned this revision.
jdoerfert added a comment.
Herald added a subscriber: yaxunl.
Herald added a reviewer: bollu.

replaced by D101976 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2020-02-25 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert added a comment.

In D59319#1892544 , @JonChesterfield 
wrote:

> I'd like to rebase this on the current deviceRTL, and add any nvptx/amdgcn 
> specific hooks if necessary. Any objections?


No generic objections. Please take a look at the TRegion paper, we want the 
interface to be somewhat like the one described in there (I think). Other than 
that, feel free to commandeer this revision.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2020-02-25 Thread Jon Chesterfield via Phabricator via cfe-commits
JonChesterfield added a comment.

I'd like to rebase this on the current deviceRTL, and add any nvptx/amdgcn 
specific hooks if necessary. Any objections?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-06-03 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
gtbercea added a comment.

Could you check what the difference is between the same kernel in today's SPMD 
mode vs the SPMD mode produced via this method? Number of registers, 
instructions, checking everything gets optimized out as expected. The LLVM-IR 
should be almost identical.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-06-03 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
gtbercea added a comment.

Could you add some tests for this?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-22 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert updated this revision to Diff 191984.
jdoerfert added a comment.

Introduce a ternary mode for parallel regions, fix minor mistakes


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319

Files:
  openmp/libomptarget/deviceRTLs/common/target_region.h
  openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
  openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu

Index: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
===
--- /dev/null
+++ openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
@@ -0,0 +1,210 @@
+//===-- target_region.cu  CUDA impl. of the target region interface -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// This file contains the implementation of the common target region interface.
+//
+//===--===//
+
+// Include the native definitions first as certain defines might be needed in
+// the common interface definition below.
+#include "omptarget-nvptx.h"
+#include "interface.h"
+
+#include "../../common/target_region.h"
+
+EXTERN void *__kmpc_target_region_kernel_get_shared_memory() {
+  return _shared_bytes_buffer_memory.begin();
+}
+EXTERN void *__kmpc_target_region_kernel_get_private_memory() {
+  return ((char *)_shared_bytes_buffer_memory.begin()) +
+ _shared_bytes_buffer_memory.get_offset();
+}
+
+/// Simple generic state machine for worker threads.
+INLINE static void
+__kmpc_target_region_state_machine(ident_t *Ident,
+   bool IsOMPRuntimeInitialized) {
+
+  do {
+void *WorkFn = 0;
+
+// Wait for the signal that we have a new work function.
+__kmpc_barrier_simple_spmd(Ident, 0);
+
+// Retrieve the work function from the runtime.
+bool IsActive = __kmpc_kernel_parallel(, IsOMPRuntimeInitialized);
+
+// If there is nothing more to do, break out of the state machine by
+// returning to the caller.
+if (!WorkFn)
+  return;
+
+if (IsActive) {
+  void *SharedVars = __kmpc_target_region_kernel_get_shared_memory();
+  void *PrivateVars = __kmpc_target_region_kernel_get_private_memory();
+
+  ((ParallelWorkFnTy)WorkFn)(SharedVars, PrivateVars);
+
+  __kmpc_kernel_end_parallel();
+}
+
+__kmpc_barrier_simple_spmd(Ident, 0);
+
+  } while (true);
+}
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.
+/// Master and surplus worker threads will return from this function immediately
+/// while required workers will only return once there is no more work. The
+/// return value indicates if the thread is a master (1), a surplus worker (0),
+/// or a finished required worker released from the state machine (-1).
+INLINE static int8_t
+__kmpc_target_region_thread_filter(ident_t *Ident, unsigned ThreadLimit,
+   bool UseStateMachine,
+   bool IsOMPRuntimeInitialized) {
+
+  unsigned TId = GetThreadIdInBlock();
+  bool IsWorker = TId < ThreadLimit;
+
+  if (IsWorker) {
+if (UseStateMachine)
+  __kmpc_target_region_state_machine(Ident, IsOMPRuntimeInitialized);
+return -1;
+  }
+
+  return TId == GetMasterThreadID();
+}
+
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool UseSPMDMode,
+   bool RequiresOMPRuntime,
+   bool UseStateMachine,
+   bool RequiresDataSharing) {
+  unsigned NumThreads = GetNumberOfThreadsInBlock();
+
+  // Handle the SPMD case first.
+  if (UseSPMDMode) {
+
+__kmpc_spmd_kernel_init(NumThreads, RequiresOMPRuntime,
+RequiresDataSharing);
+
+if (RequiresDataSharing)
+  __kmpc_data_sharing_init_stack_spmd();
+
+return 1;
+  }
+
+  // Reserve one WARP in non-SPMD mode for the masters.
+  unsigned ThreadLimit = NumThreads - WARPSIZE;
+  int8_t FilterVal = __kmpc_target_region_thread_filter(
+  Ident, ThreadLimit, UseStateMachine, RequiresOMPRuntime);
+
+  // If the filter returns 1 the executing thread is a team master which will
+  // initialize the kernel in the following.
+  if (FilterVal == 1) {
+__kmpc_kernel_init(ThreadLimit, RequiresOMPRuntime);
+__kmpc_data_sharing_init_stack();
+_shared_bytes_buffer_memory.init();
+  }
+
+  return FilterVal;
+}
+
+EXTERN void __kmpc_target_region_kernel_deinit(ident_t *Ident, bool UseSPMDMode,
+   bool 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-15 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert marked an inline comment as done.
jdoerfert added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:104
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool 
UseSPMDMode,
+   bool RequiresOMPRuntime,

ABataev wrote:
> jdoerfert wrote:
> > ABataev wrote:
> > > jdoerfert wrote:
> > > > ABataev wrote:
> > > > > If you're using `ident_t` `UseSPMDMode` and `RequiresOMPRuntime` 
> > > > > parameters are not needed anymore. They are passed in `ident_t` 
> > > > > structure.
> > > > > If you're using ident_t UseSPMDMode and RequiresOMPRuntime parameters 
> > > > > are not needed anymore. They are passed in ident_t structure.
> > > > 
> > > > They are not in the TRegion interface, at least not by the TRegion code 
> > > > generation. If required, we can add that or require the 
> > > > `__kmpc_target_region_kernel_init` implementation to store the values 
> > > > in the `ident_t`. Regardless, we do not want to hide the variables in 
> > > > the `ident_t` because that would result in worse analysis results and 
> > > > cause optimizations to be harder. The main point of all these changes 
> > > > is, after all, to make optimizations easy. Given that we expect these 
> > > > functions to be inlined, there is also no harm done wrt. runtime costs.
> > > > 
> > > > 
> > > > 
> > > > 
> > > This is why we used them. Those `ident_t`s  are constant and it allows us 
> > > to perform an additional optimization in the functions, that do not have 
> > > `isSPMDMpde` and `RequiresFullRuntime`. Because of this parameter, we 
> > > gained a significant performance boost. LLVM knows how to deal with the 
> > > structures, don't worry about the optimization.
> > > This is why we used them. Those ident_ts are constant and it allows us to 
> > > perform an additional optimization in the functions, that do not have 
> > > isSPMDMpde and RequiresFullRuntime.
> > 
> > The boolean parameters are (currently) also constant. The main point 
> > however is that in our expected use case, an inlined device RTL, there is 
> > literally no cost to pay by having the flags explicit as parameters.
> > 
> > 
> > > Because of this parameter, we gained a significant performance boost.
> > 
> > Compared to what? Not having information about the execution mode, etc. at 
> > all? How would that become worse? 
> > 
> > 
> > 
> > 
> > > LLVM knows how to deal with the structures, don't worry about the 
> > > optimization.
> > 
> > I am (painfully) aware of LLVM's capability to promote arguments (that is 
> > what is needed if we do not inline or perform IP-SCCP). However, using a 
> > pointer does allow the use of non-constant `ident_t` values, which are 
> > problematic. They might actually be useful for the original purpose of 
> > `ident_t`, namely location information. Think function merging that will 
> > cause a call with one of multiple different `ident_t` pointers. Making sure 
> > we can promote the values in that case is already much harder than checking 
> > if all potential values are the same boolean constant.
> > 
> 1. This is the data duplication.
> 2. Compared to the previous implementation.
> 3. It allows, yes, but the compiler generates constant `ident_t`. This 
> structure used not only for the location information, but it used also for 
> other purposes. There are no problems with the code inlining and optimization 
> for `ident_t`s.
> 1. This is the data duplication.

What is? Having explicit constant boolean parameters? There is no "duplication" 
if they are constant and the functions are inlined. If you //really think 
otherwise//, I'm afraid we will not make progress here without a third opinion.


> 2. Compared to the previous implementation.

I do not know what the previous implementation was. I'm also unsure what the 
point is you are trying to make. If it is different from point 1., could you 
please elaborate?

> 3. It allows, yes, but the compiler generates constant ident_t. This 
> structure used not only for the location information, but it used also for 
> other purposes. There are no problems with the code inlining and optimization 
> for ident_ts.

For now, maybe. I just gave you a very plausible example of how there could be 
performance implications in the near future due to the indirection compared to 
explicit boolean parameters.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-15 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:104
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool 
UseSPMDMode,
+   bool RequiresOMPRuntime,

jdoerfert wrote:
> ABataev wrote:
> > jdoerfert wrote:
> > > ABataev wrote:
> > > > If you're using `ident_t` `UseSPMDMode` and `RequiresOMPRuntime` 
> > > > parameters are not needed anymore. They are passed in `ident_t` 
> > > > structure.
> > > > If you're using ident_t UseSPMDMode and RequiresOMPRuntime parameters 
> > > > are not needed anymore. They are passed in ident_t structure.
> > > 
> > > They are not in the TRegion interface, at least not by the TRegion code 
> > > generation. If required, we can add that or require the 
> > > `__kmpc_target_region_kernel_init` implementation to store the values in 
> > > the `ident_t`. Regardless, we do not want to hide the variables in the 
> > > `ident_t` because that would result in worse analysis results and cause 
> > > optimizations to be harder. The main point of all these changes is, after 
> > > all, to make optimizations easy. Given that we expect these functions to 
> > > be inlined, there is also no harm done wrt. runtime costs.
> > > 
> > > 
> > > 
> > > 
> > This is why we used them. Those `ident_t`s  are constant and it allows us 
> > to perform an additional optimization in the functions, that do not have 
> > `isSPMDMpde` and `RequiresFullRuntime`. Because of this parameter, we 
> > gained a significant performance boost. LLVM knows how to deal with the 
> > structures, don't worry about the optimization.
> > This is why we used them. Those ident_ts are constant and it allows us to 
> > perform an additional optimization in the functions, that do not have 
> > isSPMDMpde and RequiresFullRuntime.
> 
> The boolean parameters are (currently) also constant. The main point however 
> is that in our expected use case, an inlined device RTL, there is literally 
> no cost to pay by having the flags explicit as parameters.
> 
> 
> > Because of this parameter, we gained a significant performance boost.
> 
> Compared to what? Not having information about the execution mode, etc. at 
> all? How would that become worse? 
> 
> 
> 
> 
> > LLVM knows how to deal with the structures, don't worry about the 
> > optimization.
> 
> I am (painfully) aware of LLVM's capability to promote arguments (that is 
> what is needed if we do not inline or perform IP-SCCP). However, using a 
> pointer does allow the use of non-constant `ident_t` values, which are 
> problematic. They might actually be useful for the original purpose of 
> `ident_t`, namely location information. Think function merging that will 
> cause a call with one of multiple different `ident_t` pointers. Making sure 
> we can promote the values in that case is already much harder than checking 
> if all potential values are the same boolean constant.
> 
1. This is the data duplication.
2. Compared to the previous implementation.
3. It allows, yes, but the compiler generates constant `ident_t`. This 
structure used not only for the location information, but it used also for 
other purposes. There are no problems with the code inlining and optimization 
for `ident_t`s.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-15 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert marked an inline comment as done.
jdoerfert added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:104
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool 
UseSPMDMode,
+   bool RequiresOMPRuntime,

ABataev wrote:
> jdoerfert wrote:
> > ABataev wrote:
> > > If you're using `ident_t` `UseSPMDMode` and `RequiresOMPRuntime` 
> > > parameters are not needed anymore. They are passed in `ident_t` structure.
> > > If you're using ident_t UseSPMDMode and RequiresOMPRuntime parameters are 
> > > not needed anymore. They are passed in ident_t structure.
> > 
> > They are not in the TRegion interface, at least not by the TRegion code 
> > generation. If required, we can add that or require the 
> > `__kmpc_target_region_kernel_init` implementation to store the values in 
> > the `ident_t`. Regardless, we do not want to hide the variables in the 
> > `ident_t` because that would result in worse analysis results and cause 
> > optimizations to be harder. The main point of all these changes is, after 
> > all, to make optimizations easy. Given that we expect these functions to be 
> > inlined, there is also no harm done wrt. runtime costs.
> > 
> > 
> > 
> > 
> This is why we used them. Those `ident_t`s  are constant and it allows us to 
> perform an additional optimization in the functions, that do not have 
> `isSPMDMpde` and `RequiresFullRuntime`. Because of this parameter, we gained 
> a significant performance boost. LLVM knows how to deal with the structures, 
> don't worry about the optimization.
> This is why we used them. Those ident_ts are constant and it allows us to 
> perform an additional optimization in the functions, that do not have 
> isSPMDMpde and RequiresFullRuntime.

The boolean parameters are (currently) also constant. The main point however is 
that in our expected use case, an inlined device RTL, there is literally no 
cost to pay by having the flags explicit as parameters.


> Because of this parameter, we gained a significant performance boost.

Compared to what? Not having information about the execution mode, etc. at all? 
How would that become worse? 




> LLVM knows how to deal with the structures, don't worry about the 
> optimization.

I am (painfully) aware of LLVM's capability to promote arguments (that is what 
is needed if we do not inline or perform IP-SCCP). However, using a pointer 
does allow the use of non-constant `ident_t` values, which are problematic. 
They might actually be useful for the original purpose of `ident_t`, namely 
location information. Think function merging that will cause a call with one of 
multiple different `ident_t` pointers. Making sure we can promote the values in 
that case is already much harder than checking if all potential values are the 
same boolean constant.



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-15 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert updated this revision to Diff 190868.
jdoerfert added a comment.

Fix a typo (use of wrong variable) and improve comments


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319

Files:
  openmp/libomptarget/deviceRTLs/common/target_region.h
  openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
  openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu

Index: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
===
--- /dev/null
+++ openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
@@ -0,0 +1,205 @@
+//===-- target_region.cu  CUDA impl. of the target region interface -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// This file contains the implementation of the common target region interface.
+//
+//===--===//
+
+// Include the native definitions first as certain defines might be needed in
+// the common interface definition below.
+#include "interface.h"
+#include "omptarget-nvptx.h"
+
+#include "../../common/target_region.h"
+
+EXTERN void *__kmpc_target_region_kernel_get_shared_memory() {
+  return _shared_bytes_buffer_memory.begin();
+}
+EXTERN void *__kmpc_target_region_kernel_get_private_memory() {
+  return ((char *)_shared_bytes_buffer_memory.begin()) +
+ _shared_bytes_buffer_memory.get_offset();
+}
+
+/// Simple generic state machine for worker threads.
+INLINE static void
+__kmpc_target_region_state_machine(ident_t *Ident,
+   bool IsOMPRuntimeInitialized) {
+
+  do {
+void *WorkFn = 0;
+
+// Wait for the signal that we have a new work function.
+__kmpc_barrier_simple_spmd(Ident, 0);
+
+// Retrieve the work function from the runtime.
+bool IsActive = __kmpc_kernel_parallel(, IsOMPRuntimeInitialized);
+
+// If there is nothing more to do, break out of the state machine by
+// returning to the caller.
+if (!WorkFn)
+  return;
+
+if (IsActive) {
+  void *SharedVars = __kmpc_target_region_kernel_get_shared_memory();
+  void *PrivateVars = __kmpc_target_region_kernel_get_private_memory();
+
+  ((ParallelWorkFnTy)WorkFn)(SharedVars, PrivateVars);
+
+  __kmpc_kernel_end_parallel();
+}
+
+__kmpc_barrier_simple_spmd(Ident, 0);
+
+  } while (true);
+}
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.
+/// Master and surplus worker threads will return from this function immediately
+/// while required workers will only return once there is no more work. The
+/// return value indicates if the thread is a master (1), a surplus worker (0),
+/// or a finished required worker released from the state machine (-1).
+INLINE static int8_t
+__kmpc_target_region_thread_filter(ident_t *Ident, unsigned ThreadLimit,
+   bool UseStateMachine,
+   bool IsOMPRuntimeInitialized) {
+
+  unsigned TId = GetThreadIdInBlock();
+  bool IsWorker = TId < ThreadLimit;
+
+  if (IsWorker) {
+if (UseStateMachine)
+  __kmpc_target_region_state_machine(Ident, IsOMPRuntimeInitialized);
+return -1;
+  }
+
+  return TId == GetMasterThreadID();
+}
+
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool UseSPMDMode,
+   bool UseStateMachine,
+   bool RequiresOMPRuntime,
+   bool RequiresDataSharing) {
+  unsigned NumThreads = GetNumberOfThreadsInBlock();
+
+  // Handle the SPMD case first.
+  if (UseSPMDMode) {
+
+__kmpc_spmd_kernel_init(NumThreads, RequiresOMPRuntime,
+RequiresDataSharing);
+
+if (RequiresDataSharing)
+  __kmpc_data_sharing_init_stack_spmd();
+
+return 1;
+  }
+
+  // Reserve one WARP in non-SPMD mode for the masters.
+  unsigned ThreadLimit = NumThreads - WARPSIZE;
+  int8_t FilterVal = __kmpc_target_region_thread_filter(
+  Ident, ThreadLimit, UseStateMachine, RequiresOMPRuntime);
+
+  // If the filter returns 1 the executing thread is a team master which will
+  // initialize the kernel in the following.
+  if (FilterVal == 1) {
+__kmpc_kernel_init(ThreadLimit, RequiresOMPRuntime);
+__kmpc_data_sharing_init_stack();
+_shared_bytes_buffer_memory.init();
+  }
+
+  return FilterVal;
+}
+
+EXTERN void __kmpc_target_region_kernel_deinit(ident_t *Ident, bool UseSPMDMode,
+   bool RequiredOMPRuntime) {
+  

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-15 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:104
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool 
UseSPMDMode,
+   bool RequiresOMPRuntime,

jdoerfert wrote:
> ABataev wrote:
> > If you're using `ident_t` `UseSPMDMode` and `RequiresOMPRuntime` parameters 
> > are not needed anymore. They are passed in `ident_t` structure.
> > If you're using ident_t UseSPMDMode and RequiresOMPRuntime parameters are 
> > not needed anymore. They are passed in ident_t structure.
> 
> They are not in the TRegion interface, at least not by the TRegion code 
> generation. If required, we can add that or require the 
> `__kmpc_target_region_kernel_init` implementation to store the values in the 
> `ident_t`. Regardless, we do not want to hide the variables in the `ident_t` 
> because that would result in worse analysis results and cause optimizations 
> to be harder. The main point of all these changes is, after all, to make 
> optimizations easy. Given that we expect these functions to be inlined, there 
> is also no harm done wrt. runtime costs.
> 
> 
> 
> 
This is why we used them. Those `ident_t`s  are constant and it allows us to 
perform an additional optimization in the functions, that do not have 
`isSPMDMpde` and `RequiresFullRuntime`. Because of this parameter, we gained a 
significant performance boost. LLVM knows how to deal with the structures, 
don't worry about the optimization.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-15 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert marked an inline comment as done.
jdoerfert added a comment.

> What is this buffer used for? [...]

I'll copy your comment and respond in this review D59424 
.




Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:104
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool 
UseSPMDMode,
+   bool RequiresOMPRuntime,

ABataev wrote:
> If you're using `ident_t` `UseSPMDMode` and `RequiresOMPRuntime` parameters 
> are not needed anymore. They are passed in `ident_t` structure.
> If you're using ident_t UseSPMDMode and RequiresOMPRuntime parameters are not 
> needed anymore. They are passed in ident_t structure.

They are not in the TRegion interface, at least not by the TRegion code 
generation. If required, we can add that or require the 
`__kmpc_target_region_kernel_init` implementation to store the values in the 
`ident_t`. Regardless, we do not want to hide the variables in the `ident_t` 
because that would result in worse analysis results and cause optimizations to 
be harder. The main point of all these changes is, after all, to make 
optimizations easy. Given that we expect these functions to be inlined, there 
is also no harm done wrt. runtime costs.






Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-15 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert updated this revision to Diff 190861.
jdoerfert marked an inline comment as done.
jdoerfert added a comment.

Rebase onto D59424  and fix errors caused by 
the wrong use of ident_t


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319

Files:
  openmp/libomptarget/deviceRTLs/common/target_region.h
  openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
  openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu

Index: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
===
--- /dev/null
+++ openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
@@ -0,0 +1,195 @@
+//===-- target_region.cu  CUDA impl. of the target region interface -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// This file contains the implementation of the common target region interface.
+//
+//===--===//
+
+// Include the native definitions first as certain defines might be needed in
+// the common interface definition below.
+#include "omptarget-nvptx.h"
+#include "interface.h"
+
+#include "../../common/target_region.h"
+
+EXTERN void *__kmpc_target_region_kernel_get_shared_memory() {
+  return _shared_bytes_buffer_memory.begin();
+}
+EXTERN void *__kmpc_target_region_kernel_get_private_memory() {
+  return ((char *)_shared_bytes_buffer_memory.begin()) +
+ _shared_bytes_buffer_memory.get_offset();
+}
+
+/// Simple generic state machine for worker threads.
+INLINE static void
+__kmpc_target_region_state_machine(ident_t *Ident,
+   bool IsOMPRuntimeInitialized) {
+
+  do {
+void *WorkFn = 0;
+
+// Wait for the signal that we have a new work function.
+__kmpc_barrier_simple_spmd(Ident, 0);
+
+// Retrieve the work function from the runtime.
+bool IsActive = __kmpc_kernel_parallel(, IsOMPRuntimeInitialized);
+
+// If there is nothing more to do, break out of the state machine by
+// returning to the caller.
+if (!WorkFn)
+  return;
+
+if (IsActive) {
+  void *SharedVars = __kmpc_target_region_kernel_get_shared_memory();
+  void *PrivateVars = __kmpc_target_region_kernel_get_private_memory();
+
+  ((ParallelWorkFnTy)WorkFn)(SharedVars, PrivateVars);
+
+  __kmpc_kernel_end_parallel();
+}
+
+__kmpc_barrier_simple_spmd(Ident, 0);
+
+  } while (true);
+}
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.
+/// Master and surplus worker threads will return from this function immediately
+/// while required workers will only return once there is no more work. The
+/// return value indicates if the thread is a master (1), a surplus worker (0),
+/// or a finished required worker released from the state machine (-1).
+INLINE static int8_t
+__kmpc_target_region_thread_filter(ident_t *Ident, unsigned ThreadLimit,
+   bool UseStateMachine,
+   bool IsOMPRuntimeInitialized) {
+
+  unsigned TId = GetThreadIdInBlock();
+  bool IsWorker = TId < ThreadLimit;
+
+  if (IsWorker) {
+if (UseStateMachine)
+  __kmpc_target_region_state_machine(Ident, IsOMPRuntimeInitialized);
+return -1;
+  }
+
+  return TId == GetMasterThreadID();
+}
+
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool UseSPMDMode,
+   bool UseStateMachine,
+   bool RequiresOMPRuntime,
+   bool RequiresDataSharing) {
+  unsigned NumThreads = GetNumberOfThreadsInBlock();
+
+  // Handle the SPMD case first.
+  if (UseSPMDMode) {
+
+__kmpc_spmd_kernel_init(NumThreads, RequiresOMPRuntime,
+RequiresDataSharing);
+
+if (RequiresDataSharing)
+  __kmpc_data_sharing_init_stack_spmd();
+
+return 1;
+  }
+
+  // Reserve one WARP in non-SPMD mode for the masters.
+  unsigned ThreadLimit = NumThreads - WARPSIZE;
+  int8_t FilterVal = __kmpc_target_region_thread_filter(
+  Ident, ThreadLimit, UseStateMachine, RequiresOMPRuntime);
+
+  // If the filter returns 1 the executing thread is a team master which will
+  // initialize the kernel in the following.
+  if (FilterVal == 1) {
+__kmpc_kernel_init(ThreadLimit, RequiresOMPRuntime);
+__kmpc_data_sharing_init_stack();
+_shared_bytes_buffer_memory.init();
+  }
+
+  return FilterVal;
+}
+
+EXTERN void __kmpc_target_region_kernel_deinit(ident_t *Ident, bool 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-15 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:104
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool 
UseSPMDMode,
+   bool RequiresOMPRuntime,

If you're using `ident_t` `UseSPMDMode` and `RequiresOMPRuntime` parameters are 
not needed anymore. They are passed in `ident_t` structure.



Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu:70
+
+__device__ __shared__ target_region_shared_buffer _target_region_shared_memory;
+

What is this buffer used for? Transferring pointers to the shread variables to 
the parallel regions? If so, it must be handled by the compiler. There are 
several reasons to do this:
1. You're using malloc/free functions for large buffers. The fact is that the 
size of this buffer is known at the compile time and compiler can generate the 
fixed size buffer in the global memory if required. We already have similar 
implementation for target regions, globalized variables etc. You can take a 
look and adapt it for your purpose.
2. Malloc/free are not very fast on the GPU, so it will get an additional 
performance with the preallocated buffers.
3. Another one problem with malloc/free is that they are using preallocated 
memory and the size of this memory is limited by 8Mb (if I do recall 
correctly). This memory is required for the correct support of the local 
variables globalization and we alredy ran into the situation when malloc could 
not allocate enough memory for it with some previous implementations.
4. You can reused the shared memory buffers already generated by the compiler 
and save  shared memory.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert updated this revision to Diff 190767.
jdoerfert marked 4 inline comments as done.
jdoerfert added a comment.

Add ident_t* to the interface functions


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319

Files:
  openmp/libomptarget/deviceRTLs/common/target_region.h
  openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
  openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
  openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
  openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu

Index: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
===
--- /dev/null
+++ openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
@@ -0,0 +1,198 @@
+//===-- target_region.cu  CUDA impl. of the target region interface -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// This file contains the implementation of the common target region interface.
+//
+//===--===//
+
+// Include the native definitions first as certain defines might be needed in
+// the common interface definition below.
+#include "omptarget-nvptx.h"
+#include "interface.h"
+
+#include "../../common/target_region.h"
+
+/// The pointer used to share memory between team threads.
+extern __device__ __shared__ target_region_shared_buffer
+_target_region_shared_memory;
+
+EXTERN void *__kmpc_target_region_kernel_get_shared_memory() {
+  return _target_region_shared_memory.begin();
+}
+EXTERN void *__kmpc_target_region_kernel_get_private_memory() {
+  return _target_region_shared_memory.begin() +
+ _target_region_shared_memory.get_offset();
+}
+
+/// Simple generic state machine for worker threads.
+INLINE static void
+__kmpc_target_region_state_machine(ident_t *Ident, bool IsOMPRuntimeInitialized) {
+
+  do {
+void *WorkFn = 0;
+
+// Wait for the signal that we have a new work function.
+__kmpc_barrier_simple_spmd(Ident, 0);
+
+// Retrieve the work function from the runtime.
+bool IsActive = __kmpc_kernel_parallel(, IsOMPRuntimeInitialized);
+
+// If there is nothing more to do, break out of the state machine by
+// returning to the caller.
+if (!WorkFn)
+  return;
+
+if (IsActive) {
+  void *SharedVars = __kmpc_target_region_kernel_get_shared_memory();
+  void *PrivateVars = __kmpc_target_region_kernel_get_private_memory();
+
+  ((ParallelWorkFnTy)WorkFn)(SharedVars, PrivateVars);
+
+  __kmpc_kernel_end_parallel();
+}
+
+__kmpc_barrier_simple_spmd(Ident, 0);
+
+  } while (true);
+}
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.
+/// Master and surplus worker threads will return from this function immediately
+/// while required workers will only return once there is no more work. The
+/// return value indicates if the thread is a master (1), a surplus worker (0),
+/// or a finished required worker released from the state machine (-1).
+INLINE static int8_t
+__kmpc_target_region_thread_filter(ident_t *Ident, unsigned ThreadLimit,
+   bool UseStateMachine,
+   bool IsOMPRuntimeInitialized) {
+
+  unsigned TId = GetThreadIdInBlock();
+  bool IsWorker = TId < ThreadLimit;
+
+  if (IsWorker) {
+if (UseStateMachine)
+  __kmpc_target_region_state_machine(Ident, IsOMPRuntimeInitialized);
+return -1;
+  }
+
+  return TId == GetMasterThreadID();
+}
+
+EXTERN int8_t __kmpc_target_region_kernel_init(ident_t *Ident, bool UseSPMDMode,
+   bool UseStateMachine,
+   bool RequiresOMPRuntime,
+   bool RequiresDataSharing) {
+  unsigned NumThreads = GetNumberOfThreadsInBlock();
+
+  // Handle the SPMD case first.
+  if (UseSPMDMode) {
+
+__kmpc_spmd_kernel_init(NumThreads, RequiresOMPRuntime,
+RequiresDataSharing);
+
+if (RequiresDataSharing)
+  __kmpc_data_sharing_init_stack_spmd();
+
+return 1;
+  }
+
+  // Reserve one WARP in non-SPMD mode for the masters.
+  unsigned ThreadLimit = NumThreads - WARPSIZE;
+  int8_t FilterVal = __kmpc_target_region_thread_filter(
+  Ident, ThreadLimit, UseStateMachine, RequiresOMPRuntime);
+
+  // If the filter returns 1 the executing thread is a team master which will
+  // initialize the kernel in the following.
+  if (FilterVal == 1) {
+__kmpc_kernel_init(ThreadLimit, RequiresOMPRuntime);
+

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

jdoerfert wrote:
> ABataev wrote:
> > jdoerfert wrote:
> > > ABataev wrote:
> > > > jdoerfert wrote:
> > > > > ABataev wrote:
> > > > > > jdoerfert wrote:
> > > > > > > ABataev wrote:
> > > > > > > > Better to use `ident_loc` for passing info about execution mode 
> > > > > > > > and full/lightweight runtime.
> > > > > > > Could you please explain why you think that? Adding indirection 
> > > > > > > through a structure does not really seem beneficial to me.
> > > > > > Almost all function from libomp rely on `ident_loc`. The functions, 
> > > > > > which were added for NVPTX without this parameter had a lot of 
> > > > > > problems later and most of them were replaced with the functions 
> > > > > > with this parameter type. Plus, this parameter is used for 
> > > > > > OMPD/OMPT and it may be important for future OMPD/OMPT support.
> > > > > > Almost all function from libomp rely on ident_loc.
> > > > > 
> > > > > If you look at the implementation of this interface for NVPTX you 
> > > > > will see that the called functions do not take `ident_loc` values. 
> > > > > When you create the calls from the existing NVPTX code generation in 
> > > > > clang, the current code **does not use** `ident_loc` for similar 
> > > > > functions, see:
> > > > > `___kmpc_kernel_init(kmp_int32 thread_limit, int16_t 
> > > > > RequiresOMPRuntime)`,
> > > > > `__kmpc_kernel_deinit(int16_t IsOMPRuntimeInitialized)`,
> > > > > `__kmpc_spmd_kernel_init(kmp_int32 thread_limit, int16_t 
> > > > > RequiresOMPRuntime, int16_t RequiresDataSharing)`,
> > > > > `__kmpc_kernel_parallel(void **outlined_function, int16_t 
> > > > > IsOMPRuntimeInitialized)`,
> > > > > ...
> > > > > 
> > > > > 
> > > > > 
> > > > > > Plus, this parameter is used for OMPD/OMPT and it may be important 
> > > > > > for future OMPD/OMPT support.
> > > > > 
> > > > > If we at some point need to make the options permanent in an 
> > > > > `ident_loc` we can simply pass an `ident_loc` and require it to be 
> > > > > initialized by the call. Cluttering the user code with stores and 
> > > > > indirection is exactly what I do want to avoid.
> > > > 1. The new functions rely on `ident_loc`. We had to add those new 
> > > > functions because the old ones did not use it and it was bad design 
> > > > decision. Now we need to fix this. I suggest you do everything right 
> > > > from the very beginning rather than fixing this later by adding extra 
> > > > entry points to support OMPT/OMPD or something else, for example.
> > > > 2. No, you cannot simply change the interface of the library to keep 
> > > > the compatibility with the previous versions of the compiler/library. 
> > > > You will need to add the new entries.  
> > > Let's start this one again because I still haven't understood. Why do we 
> > > need to populate the `ident_loc` again? What information has to be in 
> > > there at which point? I want this to be clear because a lot of other 
> > > "design decisions" of the existing code base are in my opinion not 
> > > necessary and consequently missing here. That includes, for example, 
> > > various global variables. If we have a description of the problem you try 
> > > to solve with the `ident_loc` we might be able to find a way that cuts 
> > > down on state.
> > > 
> > > 
> > > Regarding the "compatibility", this is not a stable interface people can 
> > > rely on. Whatever is committed in this first patch __is not__ set in 
> > > stone. Also, we can _always_ add a `__kmpc_init_ident_loc()` function 
> > > after the fact.
> > Ident_loc holds the data about current source code location, execution mode 
> > and is full runtime required or not. Also, it is used in OMPT/OMPD support.
> > Regarding "compatibility" libraries must be most stable part of the 
> > compiler, because the user migbt need to link the old object file/library 
> > with the new one. Because of this the new versions of libraries must be 
> > compatible with old ones. And you need to maintain the deprecated parts to 
> > keep the compatibility with the previous versions. All these libs already 
> > have a lot of old code that because of the initial poor design and we need 
> > to maintain them. I would like to avoid this situation with this patch.
> > Ident_loc holds the data about current source code location, execution mode 
> > and is full runtime required or not. Also, it is used in OMPT/OMPD support.
> 
> We can store that information through a `__kmpc_init_ident_loc()` call 
> once needed.
> 
> 
> > Regarding "compatibility" libraries must be most stable part of the 
> > compiler, because the user migbt need to link the old object file/library 
> > with the new one. Because of this the new 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert marked 3 inline comments as done.
jdoerfert added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

ABataev wrote:
> jdoerfert wrote:
> > ABataev wrote:
> > > jdoerfert wrote:
> > > > ABataev wrote:
> > > > > jdoerfert wrote:
> > > > > > ABataev wrote:
> > > > > > > Better to use `ident_loc` for passing info about execution mode 
> > > > > > > and full/lightweight runtime.
> > > > > > Could you please explain why you think that? Adding indirection 
> > > > > > through a structure does not really seem beneficial to me.
> > > > > Almost all function from libomp rely on `ident_loc`. The functions, 
> > > > > which were added for NVPTX without this parameter had a lot of 
> > > > > problems later and most of them were replaced with the functions with 
> > > > > this parameter type. Plus, this parameter is used for OMPD/OMPT and 
> > > > > it may be important for future OMPD/OMPT support.
> > > > > Almost all function from libomp rely on ident_loc.
> > > > 
> > > > If you look at the implementation of this interface for NVPTX you will 
> > > > see that the called functions do not take `ident_loc` values. When you 
> > > > create the calls from the existing NVPTX code generation in clang, the 
> > > > current code **does not use** `ident_loc` for similar functions, see:
> > > > `___kmpc_kernel_init(kmp_int32 thread_limit, int16_t 
> > > > RequiresOMPRuntime)`,
> > > > `__kmpc_kernel_deinit(int16_t IsOMPRuntimeInitialized)`,
> > > > `__kmpc_spmd_kernel_init(kmp_int32 thread_limit, int16_t 
> > > > RequiresOMPRuntime, int16_t RequiresDataSharing)`,
> > > > `__kmpc_kernel_parallel(void **outlined_function, int16_t 
> > > > IsOMPRuntimeInitialized)`,
> > > > ...
> > > > 
> > > > 
> > > > 
> > > > > Plus, this parameter is used for OMPD/OMPT and it may be important 
> > > > > for future OMPD/OMPT support.
> > > > 
> > > > If we at some point need to make the options permanent in an 
> > > > `ident_loc` we can simply pass an `ident_loc` and require it to be 
> > > > initialized by the call. Cluttering the user code with stores and 
> > > > indirection is exactly what I do want to avoid.
> > > 1. The new functions rely on `ident_loc`. We had to add those new 
> > > functions because the old ones did not use it and it was bad design 
> > > decision. Now we need to fix this. I suggest you do everything right from 
> > > the very beginning rather than fixing this later by adding extra entry 
> > > points to support OMPT/OMPD or something else, for example.
> > > 2. No, you cannot simply change the interface of the library to keep the 
> > > compatibility with the previous versions of the compiler/library. You 
> > > will need to add the new entries.  
> > Let's start this one again because I still haven't understood. Why do we 
> > need to populate the `ident_loc` again? What information has to be in there 
> > at which point? I want this to be clear because a lot of other "design 
> > decisions" of the existing code base are in my opinion not necessary and 
> > consequently missing here. That includes, for example, various global 
> > variables. If we have a description of the problem you try to solve with 
> > the `ident_loc` we might be able to find a way that cuts down on state.
> > 
> > 
> > Regarding the "compatibility", this is not a stable interface people can 
> > rely on. Whatever is committed in this first patch __is not__ set in stone. 
> > Also, we can _always_ add a `__kmpc_init_ident_loc()` function after 
> > the fact.
> Ident_loc holds the data about current source code location, execution mode 
> and is full runtime required or not. Also, it is used in OMPT/OMPD support.
> Regarding "compatibility" libraries must be most stable part of the compiler, 
> because the user migbt need to link the old object file/library with the new 
> one. Because of this the new versions of libraries must be compatible with 
> old ones. And you need to maintain the deprecated parts to keep the 
> compatibility with the previous versions. All these libs already have a lot 
> of old code that because of the initial poor design and we need to maintain 
> them. I would like to avoid this situation with this patch.
> Ident_loc holds the data about current source code location, execution mode 
> and is full runtime required or not. Also, it is used in OMPT/OMPD support.

We can store that information through a `__kmpc_init_ident_loc()` call once 
needed.


> Regarding "compatibility" libraries must be most stable part of the compiler, 
> because the user migbt need to link the old object file/library with the new 
> one. Because of this the new versions of libraries must be compatible with 
> old ones. And you need to maintain the deprecated parts to keep the 
> compatibility with the previous 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:27
+
+/// The target region _kernel_ interface for GPUs
+///

jdoerfert wrote:
> ABataev wrote:
> > jdoerfert wrote:
> > > ABataev wrote:
> > > > jdoerfert wrote:
> > > > > ABataev wrote:
> > > > > > All exported functions are declared in the `interface.h` file. I 
> > > > > > don't think we need an extra interface file here
> > > > > `interface.h`, or to be more precise for people that do not know, 
> > > > > `deviceRTLs/nvptx/src/interface.h`, is nvptx specific. This file, 
> > > > > `deviceRTLs/common/target_region.h`, is by design target agnostic and 
> > > > > not placed _under_ the nvptx subfolder. If you are willing to move 
> > > > > `interface.h` into a common space and remove the nvptx specific 
> > > > > functions we can merge the two. Otherwise, I have strong reservations 
> > > > > agains that and good reason not to do it.
> > > > I see that currently it is written in Cuda. It means, it targets NVidia 
> > > > GPUs, at least at the moment. I'm fine to put this header file into the 
> > > > common directory, if you're sure that this is really target agnostic. 
> > > > But maybe just for a start we should put it to NVPTX directory? Later, 
> > > > when you or somebody else will add support for other GPUs and he/she 
> > > > will find out that these functions are really target agnostic, they can 
> > > > be moved into the common directory?
> > > > I see that currently it is written in Cuda. It means, it targets NVidia 
> > > > GPUs, at least at the moment
> > > 
> > > How do you see that? (I hope we both talk about this file, correct?)
> > > 
> > > 
> > > > But maybe just for a start we should put it to NVPTX directory?
> > > 
> > > Why? What is the benefit? If we want it to be agnostic, regardless of the 
> > > current state, it should be developed _outside_ of the target specific 
> > > directories.
> > > 
> > I'm not talking about this particular file, just like I said we can put it 
> > into `common` subdirectory. I'm talking about the implementation files. 
> > They all are written in Cuda, no?
> > But it is not proved yet that this solution is target agnostic. Did you 
> > test it for AMD?
> > I'm not talking about this particular file, just like I said we can put it 
> > into common subdirectory.
> 
> OK. It is (the only file in the common folder for now).
> 
> 
> > I'm talking about the implementation files. They all are written in Cuda, 
> > no?
> 
> Yes, Cuda, and placed under the nvptx folder for that reason. That is what 
> you want, correct?
> 
> 
> > But it is not proved yet that this solution is target agnostic. Did you 
> > test it for AMD?
> 
> What do you mean by solution? I do not have a second implementation of the 
> interface but nothing up to the implementation of the interface is target 
> aware. By construction, this means it will work for anything we can implement 
> the interface in. 
> 
> Why do you fight so hard against this? What exactly do you want to change 
> here? Given the last comment, and assuming I understand you correctly, the 
> files are all exactly where you want them to be. That the wording sometimes 
> states "target agnostic" is a sign of intent, even if for some currently 
> unknown reason it would not hold true.
> 
> 
I'm trying to understand what is the best layout for your solution. 



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

jdoerfert wrote:
> ABataev wrote:
> > jdoerfert wrote:
> > > ABataev wrote:
> > > > jdoerfert wrote:
> > > > > ABataev wrote:
> > > > > > Better to use `ident_loc` for passing info about execution mode and 
> > > > > > full/lightweight runtime.
> > > > > Could you please explain why you think that? Adding indirection 
> > > > > through a structure does not really seem beneficial to me.
> > > > Almost all function from libomp rely on `ident_loc`. The functions, 
> > > > which were added for NVPTX without this parameter had a lot of problems 
> > > > later and most of them were replaced with the functions with this 
> > > > parameter type. Plus, this parameter is used for OMPD/OMPT and it may 
> > > > be important for future OMPD/OMPT support.
> > > > Almost all function from libomp rely on ident_loc.
> > > 
> > > If you look at the implementation of this interface for NVPTX you will 
> > > see that the called functions do not take `ident_loc` values. When you 
> > > create the calls from the existing NVPTX code generation in clang, the 
> > > current code **does not use** `ident_loc` for similar functions, see:
> > > `___kmpc_kernel_init(kmp_int32 thread_limit, int16_t RequiresOMPRuntime)`,
> > > `__kmpc_kernel_deinit(int16_t IsOMPRuntimeInitialized)`,
> > > `__kmpc_spmd_kernel_init(kmp_int32 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert updated this revision to Diff 190717.
jdoerfert marked 4 inline comments as done.
jdoerfert added a comment.

Replace more char* with void*


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319

Files:
  openmp/libomptarget/deviceRTLs/common/target_region.h
  openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
  openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
  openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
  openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu

Index: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
===
--- /dev/null
+++ openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
@@ -0,0 +1,197 @@
+//===-- target_region.cu  CUDA impl. of the target region interface -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// This file contains the implementation of the common target region interface.
+//
+//===--===//
+
+// Include the native definitions first as certain defines might be needed in
+// the common interface definition below.
+#include "omptarget-nvptx.h"
+#include "interface.h"
+
+#include "../../common/target_region.h"
+
+/// The pointer used to share memory between team threads.
+extern __device__ __shared__ target_region_shared_buffer
+_target_region_shared_memory;
+
+EXTERN void *__kmpc_target_region_kernel_get_shared_memory() {
+  return _target_region_shared_memory.begin();
+}
+EXTERN void *__kmpc_target_region_kernel_get_private_memory() {
+  return _target_region_shared_memory.begin() +
+ _target_region_shared_memory.get_offset();
+}
+
+/// Simple generic state machine for worker threads.
+INLINE static void
+__kmpc_target_region_state_machine(bool IsOMPRuntimeInitialized) {
+
+  do {
+void *WorkFn = 0;
+
+// Wait for the signal that we have a new work function.
+__kmpc_barrier_simple_spmd(NULL, 0);
+
+// Retrieve the work function from the runtime.
+bool IsActive = __kmpc_kernel_parallel(, IsOMPRuntimeInitialized);
+
+// If there is nothing more to do, break out of the state machine by
+// returning to the caller.
+if (!WorkFn)
+  return;
+
+if (IsActive) {
+  void *SharedVars = __kmpc_target_region_kernel_get_shared_memory();
+  void *PrivateVars = __kmpc_target_region_kernel_get_private_memory();
+
+  ((ParallelWorkFnTy)WorkFn)(SharedVars, PrivateVars);
+
+  __kmpc_kernel_end_parallel();
+}
+
+__kmpc_barrier_simple_spmd(NULL, 0);
+
+  } while (true);
+}
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.
+/// Master and surplus worker threads will return from this function immediately
+/// while required workers will only return once there is no more work. The
+/// return value indicates if the thread is a master (1), a surplus worker (0),
+/// or a finished required worker released from the state machine (-1).
+INLINE static int8_t
+__kmpc_target_region_thread_filter(unsigned ThreadLimit, bool UseStateMachine,
+   bool IsOMPRuntimeInitialized) {
+
+  unsigned TId = GetThreadIdInBlock();
+  bool IsWorker = TId < ThreadLimit;
+
+  if (IsWorker) {
+if (UseStateMachine)
+  __kmpc_target_region_state_machine(IsOMPRuntimeInitialized);
+return -1;
+  }
+
+  return TId == GetMasterThreadID();
+}
+
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool UseStateMachine,
+   bool RequiresOMPRuntime,
+   bool RequiresDataSharing) {
+  unsigned NumThreads = GetNumberOfThreadsInBlock();
+
+  // Handle the SPMD case first.
+  if (UseSPMDMode) {
+
+__kmpc_spmd_kernel_init(NumThreads, RequiresOMPRuntime,
+RequiresDataSharing);
+
+if (RequiresDataSharing)
+  __kmpc_data_sharing_init_stack_spmd();
+
+return 1;
+  }
+
+  // Reserve one WARP in non-SPMD mode for the masters.
+  unsigned ThreadLimit = NumThreads - WARPSIZE;
+  int8_t FilterVal = __kmpc_target_region_thread_filter(
+  ThreadLimit, UseStateMachine, RequiresOMPRuntime);
+
+  // If the filter returns 1 the executing thread is a team master which will
+  // initialize the kernel in the following.
+  if (FilterVal == 1) {
+__kmpc_kernel_init(ThreadLimit, RequiresOMPRuntime);
+__kmpc_data_sharing_init_stack();
+_target_region_shared_memory.init();
+  }
+
+  return FilterVal;
+}
+
+EXTERN void 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert marked 2 inline comments as done.
jdoerfert added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:27
+
+/// The target region _kernel_ interface for GPUs
+///

ABataev wrote:
> jdoerfert wrote:
> > ABataev wrote:
> > > jdoerfert wrote:
> > > > ABataev wrote:
> > > > > All exported functions are declared in the `interface.h` file. I 
> > > > > don't think we need an extra interface file here
> > > > `interface.h`, or to be more precise for people that do not know, 
> > > > `deviceRTLs/nvptx/src/interface.h`, is nvptx specific. This file, 
> > > > `deviceRTLs/common/target_region.h`, is by design target agnostic and 
> > > > not placed _under_ the nvptx subfolder. If you are willing to move 
> > > > `interface.h` into a common space and remove the nvptx specific 
> > > > functions we can merge the two. Otherwise, I have strong reservations 
> > > > agains that and good reason not to do it.
> > > I see that currently it is written in Cuda. It means, it targets NVidia 
> > > GPUs, at least at the moment. I'm fine to put this header file into the 
> > > common directory, if you're sure that this is really target agnostic. But 
> > > maybe just for a start we should put it to NVPTX directory? Later, when 
> > > you or somebody else will add support for other GPUs and he/she will find 
> > > out that these functions are really target agnostic, they can be moved 
> > > into the common directory?
> > > I see that currently it is written in Cuda. It means, it targets NVidia 
> > > GPUs, at least at the moment
> > 
> > How do you see that? (I hope we both talk about this file, correct?)
> > 
> > 
> > > But maybe just for a start we should put it to NVPTX directory?
> > 
> > Why? What is the benefit? If we want it to be agnostic, regardless of the 
> > current state, it should be developed _outside_ of the target specific 
> > directories.
> > 
> I'm not talking about this particular file, just like I said we can put it 
> into `common` subdirectory. I'm talking about the implementation files. They 
> all are written in Cuda, no?
> But it is not proved yet that this solution is target agnostic. Did you test 
> it for AMD?
> I'm not talking about this particular file, just like I said we can put it 
> into common subdirectory.

OK. It is (the only file in the common folder for now).


> I'm talking about the implementation files. They all are written in Cuda, no?

Yes, Cuda, and placed under the nvptx folder for that reason. That is what you 
want, correct?


> But it is not proved yet that this solution is target agnostic. Did you test 
> it for AMD?

What do you mean by solution? I do not have a second implementation of the 
interface but nothing up to the implementation of the interface is target 
aware. By construction, this means it will work for anything we can implement 
the interface in. 

Why do you fight so hard against this? What exactly do you want to change here? 
Given the last comment, and assuming I understand you correctly, the files are 
all exactly where you want them to be. That the wording sometimes states 
"target agnostic" is a sign of intent, even if for some currently unknown 
reason it would not hold true.





Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

ABataev wrote:
> jdoerfert wrote:
> > ABataev wrote:
> > > jdoerfert wrote:
> > > > ABataev wrote:
> > > > > Better to use `ident_loc` for passing info about execution mode and 
> > > > > full/lightweight runtime.
> > > > Could you please explain why you think that? Adding indirection through 
> > > > a structure does not really seem beneficial to me.
> > > Almost all function from libomp rely on `ident_loc`. The functions, which 
> > > were added for NVPTX without this parameter had a lot of problems later 
> > > and most of them were replaced with the functions with this parameter 
> > > type. Plus, this parameter is used for OMPD/OMPT and it may be important 
> > > for future OMPD/OMPT support.
> > > Almost all function from libomp rely on ident_loc.
> > 
> > If you look at the implementation of this interface for NVPTX you will see 
> > that the called functions do not take `ident_loc` values. When you create 
> > the calls from the existing NVPTX code generation in clang, the current 
> > code **does not use** `ident_loc` for similar functions, see:
> > `___kmpc_kernel_init(kmp_int32 thread_limit, int16_t RequiresOMPRuntime)`,
> > `__kmpc_kernel_deinit(int16_t IsOMPRuntimeInitialized)`,
> > `__kmpc_spmd_kernel_init(kmp_int32 thread_limit, int16_t 
> > RequiresOMPRuntime, int16_t RequiresDataSharing)`,
> > `__kmpc_kernel_parallel(void **outlined_function, int16_t 
> > IsOMPRuntimeInitialized)`,
> > ...
> > 
> > 
> > 
> > > Plus, this parameter is used for 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:27
+
+/// The target region _kernel_ interface for GPUs
+///

jdoerfert wrote:
> ABataev wrote:
> > jdoerfert wrote:
> > > ABataev wrote:
> > > > All exported functions are declared in the `interface.h` file. I don't 
> > > > think we need an extra interface file here
> > > `interface.h`, or to be more precise for people that do not know, 
> > > `deviceRTLs/nvptx/src/interface.h`, is nvptx specific. This file, 
> > > `deviceRTLs/common/target_region.h`, is by design target agnostic and not 
> > > placed _under_ the nvptx subfolder. If you are willing to move 
> > > `interface.h` into a common space and remove the nvptx specific functions 
> > > we can merge the two. Otherwise, I have strong reservations agains that 
> > > and good reason not to do it.
> > I see that currently it is written in Cuda. It means, it targets NVidia 
> > GPUs, at least at the moment. I'm fine to put this header file into the 
> > common directory, if you're sure that this is really target agnostic. But 
> > maybe just for a start we should put it to NVPTX directory? Later, when you 
> > or somebody else will add support for other GPUs and he/she will find out 
> > that these functions are really target agnostic, they can be moved into the 
> > common directory?
> > I see that currently it is written in Cuda. It means, it targets NVidia 
> > GPUs, at least at the moment
> 
> How do you see that? (I hope we both talk about this file, correct?)
> 
> 
> > But maybe just for a start we should put it to NVPTX directory?
> 
> Why? What is the benefit? If we want it to be agnostic, regardless of the 
> current state, it should be developed _outside_ of the target specific 
> directories.
> 
I'm not talking about this particular file, just like I said we can put it into 
`common` subdirectory. I'm talking about the implementation files. They all are 
written in Cuda, no?
But it is not proved yet that this solution is target agnostic. Did you test it 
for AMD?



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

jdoerfert wrote:
> ABataev wrote:
> > jdoerfert wrote:
> > > ABataev wrote:
> > > > Better to use `ident_loc` for passing info about execution mode and 
> > > > full/lightweight runtime.
> > > Could you please explain why you think that? Adding indirection through a 
> > > structure does not really seem beneficial to me.
> > Almost all function from libomp rely on `ident_loc`. The functions, which 
> > were added for NVPTX without this parameter had a lot of problems later and 
> > most of them were replaced with the functions with this parameter type. 
> > Plus, this parameter is used for OMPD/OMPT and it may be important for 
> > future OMPD/OMPT support.
> > Almost all function from libomp rely on ident_loc.
> 
> If you look at the implementation of this interface for NVPTX you will see 
> that the called functions do not take `ident_loc` values. When you create the 
> calls from the existing NVPTX code generation in clang, the current code 
> **does not use** `ident_loc` for similar functions, see:
> `___kmpc_kernel_init(kmp_int32 thread_limit, int16_t RequiresOMPRuntime)`,
> `__kmpc_kernel_deinit(int16_t IsOMPRuntimeInitialized)`,
> `__kmpc_spmd_kernel_init(kmp_int32 thread_limit, int16_t RequiresOMPRuntime, 
> int16_t RequiresDataSharing)`,
> `__kmpc_kernel_parallel(void **outlined_function, int16_t 
> IsOMPRuntimeInitialized)`,
> ...
> 
> 
> 
> > Plus, this parameter is used for OMPD/OMPT and it may be important for 
> > future OMPD/OMPT support.
> 
> If we at some point need to make the options permanent in an `ident_loc` we 
> can simply pass an `ident_loc` and require it to be initialized by the call. 
> Cluttering the user code with stores and indirection is exactly what I do 
> want to avoid.
1. The new functions rely on `ident_loc`. We had to add those new functions 
because the old ones did not use it and it was bad design decision. Now we need 
to fix this. I suggest you do everything right from the very beginning rather 
than fixing this later by adding extra entry points to support OMPT/OMPD or 
something else, for example.
2. No, you cannot simply change the interface of the library to keep the 
compatibility with the previous versions of the compiler/library. You will need 
to add the new entries.  



Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu:70
+
+__device__ __shared__ target_region_shared_buffer _target_region_shared_memory;
+

jdoerfert wrote:
> ABataev wrote:
> > It would be good to store it the global memory rather than in the shared to 
> > save th 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:27
+
+/// The target region _kernel_ interface for GPUs
+///

ABataev wrote:
> jdoerfert wrote:
> > ABataev wrote:
> > > All exported functions are declared in the `interface.h` file. I don't 
> > > think we need an extra interface file here
> > `interface.h`, or to be more precise for people that do not know, 
> > `deviceRTLs/nvptx/src/interface.h`, is nvptx specific. This file, 
> > `deviceRTLs/common/target_region.h`, is by design target agnostic and not 
> > placed _under_ the nvptx subfolder. If you are willing to move 
> > `interface.h` into a common space and remove the nvptx specific functions 
> > we can merge the two. Otherwise, I have strong reservations agains that and 
> > good reason not to do it.
> I see that currently it is written in Cuda. It means, it targets NVidia GPUs, 
> at least at the moment. I'm fine to put this header file into the common 
> directory, if you're sure that this is really target agnostic. But maybe just 
> for a start we should put it to NVPTX directory? Later, when you or somebody 
> else will add support for other GPUs and he/she will find out that these 
> functions are really target agnostic, they can be moved into the common 
> directory?
> I see that currently it is written in Cuda. It means, it targets NVidia GPUs, 
> at least at the moment

How do you see that? (I hope we both talk about this file, correct?)


> But maybe just for a start we should put it to NVPTX directory?

Why? What is the benefit? If we want it to be agnostic, regardless of the 
current state, it should be developed _outside_ of the target specific 
directories.




Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

ABataev wrote:
> jdoerfert wrote:
> > ABataev wrote:
> > > Better to use `ident_loc` for passing info about execution mode and 
> > > full/lightweight runtime.
> > Could you please explain why you think that? Adding indirection through a 
> > structure does not really seem beneficial to me.
> Almost all function from libomp rely on `ident_loc`. The functions, which 
> were added for NVPTX without this parameter had a lot of problems later and 
> most of them were replaced with the functions with this parameter type. Plus, 
> this parameter is used for OMPD/OMPT and it may be important for future 
> OMPD/OMPT support.
> Almost all function from libomp rely on ident_loc.

If you look at the implementation of this interface for NVPTX you will see that 
the called functions do not take `ident_loc` values. When you create the calls 
from the existing NVPTX code generation in clang, the current code **does not 
use** `ident_loc` for similar functions, see:
`___kmpc_kernel_init(kmp_int32 thread_limit, int16_t RequiresOMPRuntime)`,
`__kmpc_kernel_deinit(int16_t IsOMPRuntimeInitialized)`,
`__kmpc_spmd_kernel_init(kmp_int32 thread_limit, int16_t RequiresOMPRuntime, 
int16_t RequiresDataSharing)`,
`__kmpc_kernel_parallel(void **outlined_function, int16_t 
IsOMPRuntimeInitialized)`,
...



> Plus, this parameter is used for OMPD/OMPT and it may be important for future 
> OMPD/OMPT support.

If we at some point need to make the options permanent in an `ident_loc` we can 
simply pass an `ident_loc` and require it to be initialized by the call. 
Cluttering the user code with stores and indirection is exactly what I do want 
to avoid.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:124
+/// unpacking code.
+typedef void (*ParallelWorkFnTy)(char * /* SharedValues */,
+ char * /* PrivateValues */);

ABataev wrote:
> We used `void *` for buffers usually, I think it is better to use `void *` 
> here too instead of `char *`.
Thanks, fixed.



Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu:70
+
+__device__ __shared__ target_region_shared_buffer _target_region_shared_memory;
+

ABataev wrote:
> It would be good to store it the global memory rather than in the shared to 
> save th shared memory. Also, we already are using several shared memory 
> buffers for different purposes, it would be good to merge them somehow to 
> reduce pressure on shared memory.
I would have reused your buffer but it is for reasons unclear to me, not a 
byte-wise buffer but an array of `void *` and also used as such. Using it as a 
byte-wise buffer might cause problems or at least confusion. Changing it to a 
byte-wise buffer would be fine with me. I don't need a separate buffer but just 
one with the functionality implemented in this one.



Comment at: 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert updated this revision to Diff 190710.
jdoerfert marked 6 inline comments as done.
jdoerfert added a comment.

Change char* to void*


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319

Files:
  openmp/libomptarget/deviceRTLs/common/target_region.h
  openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
  openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
  openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
  openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu

Index: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
===
--- /dev/null
+++ openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
@@ -0,0 +1,197 @@
+//===-- target_region.cu  CUDA impl. of the target region interface -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// This file contains the implementation of the common target region interface.
+//
+//===--===//
+
+// Include the native definitions first as certain defines might be needed in
+// the common interface definition below.
+#include "omptarget-nvptx.h"
+#include "interface.h"
+
+#include "../../common/target_region.h"
+
+/// The pointer used to share memory between team threads.
+extern __device__ __shared__ target_region_shared_buffer
+_target_region_shared_memory;
+
+EXTERN char *__kmpc_target_region_kernel_get_shared_memory() {
+  return _target_region_shared_memory.begin();
+}
+EXTERN char *__kmpc_target_region_kernel_get_private_memory() {
+  return _target_region_shared_memory.begin() +
+ _target_region_shared_memory.get_offset();
+}
+
+/// Simple generic state machine for worker threads.
+INLINE static void
+__kmpc_target_region_state_machine(bool IsOMPRuntimeInitialized) {
+
+  do {
+void *WorkFn = 0;
+
+// Wait for the signal that we have a new work function.
+__kmpc_barrier_simple_spmd(NULL, 0);
+
+// Retrieve the work function from the runtime.
+bool IsActive = __kmpc_kernel_parallel(, IsOMPRuntimeInitialized);
+
+// If there is nothing more to do, break out of the state machine by
+// returning to the caller.
+if (!WorkFn)
+  return;
+
+if (IsActive) {
+  char *SharedVars = __kmpc_target_region_kernel_get_shared_memory();
+  char *PrivateVars = __kmpc_target_region_kernel_get_private_memory();
+
+  ((ParallelWorkFnTy)WorkFn)(SharedVars, PrivateVars);
+
+  __kmpc_kernel_end_parallel();
+}
+
+__kmpc_barrier_simple_spmd(NULL, 0);
+
+  } while (true);
+}
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.
+/// Master and surplus worker threads will return from this function immediately
+/// while required workers will only return once there is no more work. The
+/// return value indicates if the thread is a master (1), a surplus worker (0),
+/// or a finished required worker released from the state machine (-1).
+INLINE static int8_t
+__kmpc_target_region_thread_filter(unsigned ThreadLimit, bool UseStateMachine,
+   bool IsOMPRuntimeInitialized) {
+
+  unsigned TId = GetThreadIdInBlock();
+  bool IsWorker = TId < ThreadLimit;
+
+  if (IsWorker) {
+if (UseStateMachine)
+  __kmpc_target_region_state_machine(IsOMPRuntimeInitialized);
+return -1;
+  }
+
+  return TId == GetMasterThreadID();
+}
+
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool UseStateMachine,
+   bool RequiresOMPRuntime,
+   bool RequiresDataSharing) {
+  unsigned NumThreads = GetNumberOfThreadsInBlock();
+
+  // Handle the SPMD case first.
+  if (UseSPMDMode) {
+
+__kmpc_spmd_kernel_init(NumThreads, RequiresOMPRuntime,
+RequiresDataSharing);
+
+if (RequiresDataSharing)
+  __kmpc_data_sharing_init_stack_spmd();
+
+return 1;
+  }
+
+  // Reserve one WARP in non-SPMD mode for the masters.
+  unsigned ThreadLimit = NumThreads - WARPSIZE;
+  int8_t FilterVal = __kmpc_target_region_thread_filter(
+  ThreadLimit, UseStateMachine, RequiresOMPRuntime);
+
+  // If the filter returns 1 the executing thread is a team master which will
+  // initialize the kernel in the following.
+  if (FilterVal == 1) {
+__kmpc_kernel_init(ThreadLimit, RequiresOMPRuntime);
+__kmpc_data_sharing_init_stack();
+_target_region_shared_memory.init();
+  }
+
+  return FilterVal;
+}
+
+EXTERN void 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-14 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:27
+
+/// The target region _kernel_ interface for GPUs
+///

jdoerfert wrote:
> ABataev wrote:
> > All exported functions are declared in the `interface.h` file. I don't 
> > think we need an extra interface file here
> `interface.h`, or to be more precise for people that do not know, 
> `deviceRTLs/nvptx/src/interface.h`, is nvptx specific. This file, 
> `deviceRTLs/common/target_region.h`, is by design target agnostic and not 
> placed _under_ the nvptx subfolder. If you are willing to move `interface.h` 
> into a common space and remove the nvptx specific functions we can merge the 
> two. Otherwise, I have strong reservations agains that and good reason not to 
> do it.
I see that currently it is written in Cuda. It means, it targets NVidia GPUs, 
at least at the moment. I'm fine to put this header file into the common 
directory, if you're sure that this is really target agnostic. But maybe just 
for a start we should put it to NVPTX directory? Later, when you or somebody 
else will add support for other GPUs and he/she will find out that these 
functions are really target agnostic, they can be moved into the common 
directory?



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

jdoerfert wrote:
> ABataev wrote:
> > Better to use `ident_loc` for passing info about execution mode and 
> > full/lightweight runtime.
> Could you please explain why you think that? Adding indirection through a 
> structure does not really seem beneficial to me.
Almost all function from libomp rely on `ident_loc`. The functions, which were 
added for NVPTX without this parameter had a lot of problems later and most of 
them were replaced with the functions with this parameter type. Plus, this 
parameter is used for OMPD/OMPT and it may be important for future OMPD/OMPT 
support.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:124
+/// unpacking code.
+typedef void (*ParallelWorkFnTy)(char * /* SharedValues */,
+ char * /* PrivateValues */);

We used `void *` for buffers usually, I think it is better to use `void *` here 
too instead of `char *`.



Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu:70
+
+__device__ __shared__ target_region_shared_buffer _target_region_shared_memory;
+

It would be good to store it the global memory rather than in the shared to 
save th shared memory. Also, we already are using several shared memory buffers 
for different purposes, it would be good to merge them somehow to reduce 
pressure on shared memory.



Comment at: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu:64
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.

What is the criteria for `UseStateMachine`? Under what conditions it can be set 
to `true` and `false`? Also, what if have several parallel regions in non-SPMD 
kernel and `UseStateMachine` is `true`?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-13 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert marked 5 inline comments as done.
jdoerfert added inline comments.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:27
+
+/// The target region _kernel_ interface for GPUs
+///

ABataev wrote:
> All exported functions are declared in the `interface.h` file. I don't think 
> we need an extra interface file here
`interface.h`, or to be more precise for people that do not know, 
`deviceRTLs/nvptx/src/interface.h`, is nvptx specific. This file, 
`deviceRTLs/common/target_region.h`, is by design target agnostic and not 
placed _under_ the nvptx subfolder. If you are willing to move `interface.h` 
into a common space and remove the nvptx specific functions we can merge the 
two. Otherwise, I have strong reservations agains that and good reason not to 
do it.



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

ABataev wrote:
> Better to use `ident_loc` for passing info about execution mode and 
> full/lightweight runtime.
Could you please explain why you think that? Adding indirection through a 
structure does not really seem beneficial to me.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-13 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: 
openmp/libomptarget/cmake/Modules/LibomptargetNVPTXBitcodeLibrary.cmake:81
 # if any of them are not supported, there is no point in finding out which are.
-set(compiler_flags_required -emit-llvm -O1 --cuda-device-only 
--cuda-path=${CUDA_TOOLKIT_ROOT_DIR})
+set(compiler_flags_required -emit-llvm -std=c++11 -O1 --cuda-device-only 
--cuda-path=${CUDA_TOOLKIT_ROOT_DIR})
 set(compiler_flags_required_src "extern \"C\" __device__ int thread() { return 
threadIdx.x; }")

It must be in a separate patch



Comment at: 
openmp/libomptarget/cmake/Modules/LibomptargetNVPTXBitcodeLibrary.cmake:88
 if (NOT LIBOMPTARGET_NVPTX_CUDA_COMPILER_SUPPORTS_FLAGS_REQUIRED)
+  message(ERROR "NO FLAG SUPPORT")
   return()

Same, if you really need it - separate patch



Comment at: 
openmp/libomptarget/cmake/Modules/LibomptargetNVPTXBitcodeLibrary.cmake:105
   if (NOT LIBOMPTARGET_NVPTX_CUDA_COMPILER_SUPPORTS_FCUDA_RDC)
+message(ERROR "NO FCUDA RDC")
 return()

Same here



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:27
+
+/// The target region _kernel_ interface for GPUs
+///

All exported functions are declared in the `interface.h` file. I don't think we 
need an extra interface file here



Comment at: openmp/libomptarget/deviceRTLs/common/target_region.h:100
+///
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool RequiresOMPRuntime,

Better to use `ident_loc` for passing info about execution mode and 
full/lightweight runtime.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319



___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-13 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert updated this revision to Diff 190484.
jdoerfert added a comment.

Simplify the commmit further


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D59319/new/

https://reviews.llvm.org/D59319

Files:
  openmp/libomptarget/deviceRTLs/common/target_region.h
  openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
  openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
  openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
  openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu

Index: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
===
--- /dev/null
+++ openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
@@ -0,0 +1,197 @@
+//===-- target_region.cu  CUDA impl. of the target region interface -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// This file contains the implementation of the common target region interface.
+//
+//===--===//
+
+// Include the native definitions first as certain defines might be needed in
+// the common interface definition below.
+#include "omptarget-nvptx.h"
+#include "interface.h"
+
+#include "../../common/target_region.h"
+
+/// The pointer used to share memory between team threads.
+extern __device__ __shared__ target_region_shared_buffer
+_target_region_shared_memory;
+
+EXTERN char *__kmpc_target_region_kernel_get_shared_memory() {
+  return _target_region_shared_memory.begin();
+}
+EXTERN char *__kmpc_target_region_kernel_get_private_memory() {
+  return _target_region_shared_memory.begin() +
+ _target_region_shared_memory.get_offset();
+}
+
+/// Simple generic state machine for worker threads.
+INLINE static void
+__kmpc_target_region_state_machine(bool IsOMPRuntimeInitialized) {
+
+  do {
+void *WorkFn = 0;
+
+// Wait for the signal that we have a new work function.
+__kmpc_barrier_simple_spmd(NULL, 0);
+
+// Retrieve the work function from the runtime.
+bool IsActive = __kmpc_kernel_parallel(, IsOMPRuntimeInitialized);
+
+// If there is nothing more to do, break out of the state machine by
+// returning to the caller.
+if (!WorkFn)
+  return;
+
+if (IsActive) {
+  char *SharedVars = __kmpc_target_region_kernel_get_shared_memory();
+  char *PrivateVars = __kmpc_target_region_kernel_get_private_memory();
+
+  ((ParallelWorkFnTy)WorkFn)(SharedVars, PrivateVars);
+
+  __kmpc_kernel_end_parallel();
+}
+
+__kmpc_barrier_simple_spmd(NULL, 0);
+
+  } while (true);
+}
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.
+/// Master and surplus worker threads will return from this function immediately
+/// while required workers will only return once there is no more work. The
+/// return value indicates if the thread is a master (1), a surplus worker (0),
+/// or a finished required worker released from the state machine (-1).
+INLINE static int8_t
+__kmpc_target_region_thread_filter(unsigned ThreadLimit, bool UseStateMachine,
+   bool IsOMPRuntimeInitialized) {
+
+  unsigned TId = GetThreadIdInBlock();
+  bool IsWorker = TId < ThreadLimit;
+
+  if (IsWorker) {
+if (UseStateMachine)
+  __kmpc_target_region_state_machine(IsOMPRuntimeInitialized);
+return -1;
+  }
+
+  return TId == GetMasterThreadID();
+}
+
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool UseStateMachine,
+   bool RequiresOMPRuntime,
+   bool RequiresDataSharing) {
+  unsigned NumThreads = GetNumberOfThreadsInBlock();
+
+  // Handle the SPMD case first.
+  if (UseSPMDMode) {
+
+__kmpc_spmd_kernel_init(NumThreads, RequiresOMPRuntime,
+RequiresDataSharing);
+
+if (RequiresDataSharing)
+  __kmpc_data_sharing_init_stack_spmd();
+
+return 1;
+  }
+
+  // Reserve one WARP in non-SPMD mode for the masters.
+  unsigned ThreadLimit = NumThreads - WARPSIZE;
+  int8_t FilterVal = __kmpc_target_region_thread_filter(
+  ThreadLimit, UseStateMachine, RequiresOMPRuntime);
+
+  // If the filter returns 1 the executing thread is a team master which will
+  // initialize the kernel in the following.
+  if (FilterVal == 1) {
+__kmpc_kernel_init(ThreadLimit, RequiresOMPRuntime);
+__kmpc_data_sharing_init_stack();
+_target_region_shared_memory.init();
+  }
+
+  return FilterVal;
+}
+
+EXTERN void __kmpc_target_region_kernel_deinit(bool 

[PATCH] D59319: [OpenMP][Offloading][1/3] A generic and simple target region interface

2019-03-13 Thread Johannes Doerfert via Phabricator via cfe-commits
jdoerfert created this revision.
jdoerfert added reviewers: ABataev, arpith-jacob, guraypp, gtbercea, hfinkel.
Herald added a project: OpenMP.

This patch introduces an alternative OpenMP GPU kernel offloading
interface called target kernel region (or TRegion).

The commit includes the runtime library implementation for the NVPTX
device plugin, implemented mostly in terms of the existing
functionality.

The interface is deliberately simple to be easily analyzable in the
middle end. Design decisions included:

- Hide all (complex) implementation choices in the runtime library but allow 
complete removal of the abstraction once the runtime is inlined.
- Provide all runtime calls with sufficient, easy encoded information.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D59319

Files:
  openmp/libomptarget/cmake/Modules/LibomptargetNVPTXBitcodeLibrary.cmake
  openmp/libomptarget/deviceRTLs/common/target_region.h
  openmp/libomptarget/deviceRTLs/nvptx/CMakeLists.txt
  openmp/libomptarget/deviceRTLs/nvptx/src/omp_data.cu
  openmp/libomptarget/deviceRTLs/nvptx/src/omptarget-nvptx.h
  openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu

Index: openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
===
--- /dev/null
+++ openmp/libomptarget/deviceRTLs/nvptx/src/target_region.cu
@@ -0,0 +1,197 @@
+//===-- target_region.cu  CUDA impl. of the target region interface -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===--===//
+//
+// This file contains the implementation of the common target region interface.
+//
+//===--===//
+
+// Include the native definitions first as certain defines might be needed in
+// the common interface definition below.
+#include "omptarget-nvptx.h"
+#include "interface.h"
+
+#include "../../common/target_region.h"
+
+/// The pointer used to share memory between team threads.
+extern __device__ __shared__ target_region_shared_buffer
+_target_region_shared_memory;
+
+EXTERN char *__kmpc_target_region_kernel_get_shared_memory() {
+  return _target_region_shared_memory.begin();
+}
+EXTERN char *__kmpc_target_region_kernel_get_private_memory() {
+  return _target_region_shared_memory.begin() +
+ _target_region_shared_memory.get_offset();
+}
+
+/// Simple generic state machine for worker threads.
+INLINE static void
+__kmpc_target_region_state_machine(bool IsOMPRuntimeInitialized) {
+
+  do {
+void *WorkFn = 0;
+
+// Wait for the signal that we have a new work function.
+__kmpc_barrier_simple_spmd(NULL, 0);
+
+// Retrieve the work function from the runtime.
+bool IsActive = __kmpc_kernel_parallel(, IsOMPRuntimeInitialized);
+
+// If there is nothing more to do, break out of the state machine by
+// returning to the caller.
+if (!WorkFn)
+  return;
+
+if (IsActive) {
+  char *SharedVars = __kmpc_target_region_kernel_get_shared_memory();
+  char *PrivateVars = __kmpc_target_region_kernel_get_private_memory();
+
+  ((ParallelWorkFnTy)WorkFn)(SharedVars, PrivateVars);
+
+  __kmpc_kernel_end_parallel();
+}
+
+__kmpc_barrier_simple_spmd(NULL, 0);
+
+  } while (true);
+}
+
+/// Filter threads into masters and workers. If \p UseStateMachine is true,
+/// required workers will enter a state machine through and be trapped there.
+/// Master and surplus worker threads will return from this function immediately
+/// while required workers will only return once there is no more work. The
+/// return value indicates if the thread is a master (1), a surplus worker (0),
+/// or a finished required worker released from the state machine (-1).
+INLINE static int8_t
+__kmpc_target_region_thread_filter(unsigned ThreadLimit, bool UseStateMachine,
+   bool IsOMPRuntimeInitialized) {
+
+  unsigned TId = GetThreadIdInBlock();
+  bool IsWorker = TId < ThreadLimit;
+
+  if (IsWorker) {
+if (UseStateMachine)
+  __kmpc_target_region_state_machine(IsOMPRuntimeInitialized);
+return -1;
+  }
+
+  return TId == GetMasterThreadID();
+}
+
+EXTERN int8_t __kmpc_target_region_kernel_init(bool UseSPMDMode,
+   bool UseStateMachine,
+   bool RequiresOMPRuntime,
+   bool RequiresDataSharing) {
+  unsigned NumThreads = GetNumberOfThreadsInBlock();
+
+  // Handle the SPMD case first.
+  if (UseSPMDMode) {
+
+__kmpc_spmd_kernel_init(NumThreads, RequiresOMPRuntime,
+RequiresDataSharing);
+
+if (RequiresDataSharing)
+