Re: [hsa 2/10] Modifications to libgomp proper

2016-01-15 Thread Jakub Jelinek
On Tue, Jan 12, 2016 at 03:23:32PM +0100, Jakub Jelinek wrote:
> But looking at GOMP_PLUGIN_target_task_completion, I see we have a bug in
> there,
>   gomp_mutex_lock (>task_lock);
>   if (ttask->state == GOMP_TARGET_TASK_READY_TO_RUN)
> {
>   ttask->state = GOMP_TARGET_TASK_FINISHED;
>   gomp_mutex_unlock (>task_lock);
> }
>   ttask->state = GOMP_TARGET_TASK_FINISHED;
>   gomp_target_task_completion (team, task);
>   gomp_mutex_unlock (>task_lock);
> there was meant to be I think return; after the first unlock, otherwise
> it doubly unlocks the same lock, and performs gomp_target_task_completion
> without the lock held, which may cause great havoc.

I've bootstrapped/regtested this on x86_64-linux and i686-linux (no
offloading), and regtested on x86_64-linux -> x86_64-intelmicemul-linux
offloading, and then tested also with sleep (1) added in between gomp_mutex_lock
and preceeding gomp_target_task_fn call, both without and with the fix.
Without the fix with the extra sleeps, 3 target-3*.c tests FAILed (crashed,
hanged forever), with the fix everything was ok.

Installed on the trunk.

2016-01-15  Jakub Jelinek  

* task.c (GOMP_PLUGIN_target_task_completion): Add missing return.

--- libgomp/task.c.jj   2016-01-04 14:38:59.0 +0100
+++ libgomp/task.c  2016-01-15 00:11:08.851909133 +0100
@@ -579,6 +579,7 @@ GOMP_PLUGIN_target_task_completion (void
 {
   ttask->state = GOMP_TARGET_TASK_FINISHED;
   gomp_mutex_unlock (>task_lock);
+  return;
 }
   ttask->state = GOMP_TARGET_TASK_FINISHED;
   gomp_target_task_completion (team, task);


Jakub


Re: [hsa 2/10] Modifications to libgomp proper

2016-01-12 Thread Jakub Jelinek
On Tue, Jan 12, 2016 at 02:29:06PM +0100, Martin Jambor wrote:
> GOMP_kernel_launch_attributes should not be there (it is a
> reminiscence from before the device-specific target arguments) and
> should be moved just to the HSA plugin.  I'll prepare a patch today.
> 
> While we do not have to share GOMP_hsa_kernel_dispatch, we actually do
> use them in both the plugin and the compiler, where we only use it in
> an offsetof, so that we only have the structure defined once.

But, even using it in offsetof might be wrong, the compiler could be a
cross-compiler, and you'd use offsetof on the host, while you want it for
the target, and that would be different.
So, IMHO you need (unless you already have) built the structure as a tree
type, lay it out, and then you can use at TYPE_SIZE_UNIT or
DECL_FIELD_OFFSET and the like.

Jakub


Re: [hsa 2/10] Modifications to libgomp proper

2016-01-12 Thread Martin Jambor
Hi,

On Fri, Dec 11, 2015 at 07:05:29PM +0100, Jakub Jelinek wrote:
> On Thu, Dec 10, 2015 at 06:52:23PM +0100, Martin Jambor wrote:
> > > > --- a/libgomp/task.c
> > > > +++ b/libgomp/task.c
> > > > @@ -581,6 +581,7 @@ GOMP_PLUGIN_target_task_completion (void *data)
> > > >gomp_mutex_unlock (>task_lock);
> > > >  }
> > > >ttask->state = GOMP_TARGET_TASK_FINISHED;
> > > > +  free (ttask->firstprivate_copies);
> > > >gomp_target_task_completion (team, task);
> > > >gomp_mutex_unlock (>task_lock);
> > > >  }
> > > 
> > > So, this function should have a special case for the SHARED_MEM case, 
> > > handle
> > > it closely to say how GOMP_taskgroup_end handles the finish_cancelled:
> > > case.  Just note that the target task is missing from certain queues at 
> > > that
> > > point.
> > 
> > I'm afraid I need some help here.  I do not quite understand how is
> > finish_cancelled in GOMP_taskgroup_end similar, it seems to be doing
> > much more than freeing one pointer.  What is exactly the issue with
> > the above?
> > 
> > Nevertheless, after reading through bits of task.c again, I wonder
> > whether any copying (for both shared memory target and the host) in
> > gomp_target_task_fn is actually necessary because it seems to be also
> > done in gomp_create_target_task.  Does that not apply somehow?
> 
> The target task is scheduled for the first action as normal task, and the
> scheduling of it already removes it from some of the queues (each task is
> put into 1-3 queues), i.e. actions performed mostly by
> gomp_task_run_pre.  Then the team task lock is unlocked and the task is run.
> Finally, for normal tasks, gomp_task_run_post_handle_depend,
> gomp_task_run_post_remove_parent, etc. is run.  Now, for async target tasks
> that have something running on some other device at that point, we don't do
> that, but instead make it GOMP_TASK_ASYNC_RUNNING.  And continue with other
> stuff, until gomp_target_task_completion is run.
> For non-shared mem that needs to readd the task again into the queues, so
> that it will be scheduled again.  But you don't need that for shared mem
> target tasks, they can just free the firstprivate_copies and finalize the
> task.
> At the time gomp_target_task_completion is called, the task is pretty much
> in the same state as it is around the finish_cancelled:; label.
> So instead of what the gomp_target_task_completion function does,
> you would for SHARED_MEM do something like:
>   size_t new_tasks
> = gomp_task_run_post_handle_depend (task, team);
>   gomp_task_run_post_remove_parent (task);
>   gomp_clear_parent (>children_queue);
>   gomp_task_run_post_remove_taskgroup (task);
>   team->task_count--;
> do_wake = 0;
>   if (new_tasks > 1)
> {
>   do_wake = team->nthreads - team->task_running_count
> - !task->in_tied_task;
>   if (do_wake > new_tasks)
> do_wake = new_tasks;
> }
> // Unlike other places, the following will be also run with the
> // task_lock held, but I'm afraid there is nothing to do about it.
> // See the comment in gomp_target_task_completion.
> gomp_finish_task (task);
> free (task);
> if (do_wake)
>   gomp_team_barrier_wake (>barrier, do_wake);
> 

I tried the above but libgomp testcase target-33.c always got stuck
within GOMP_taskgroup_end call, more specifically in
gomp_team_barrier_wait_end in config/linux/bar.c where the the first
call to gomp_barrier_handle_tasks left the barrier->generation as
BAR_WAITING_FOR_TASK and then nothing ever happened, even as the
callbacks fired.

After looking into the tasking mechanism for basically the whole day
yesterday, I *think* I fixed it by calling
gomp_team_barrier_set_task_pending from the callback and another hunk
in gomp_barrier_handle_tasks so that it clears that barrier flag even
if it has not picked up any tasks.  Please let me know if you think it
makes sense.

If so, I'll include it in an HSA patch set I hope to generate today.
Otherwise I guess I'd prefer to remove the shared-memory path and
revert to old behavior as a temporary measure until we find out what
was wrong.

Thanks and sorry that it took me so long to resolve this,

Martin


diff --git a/libgomp/task.c b/libgomp/task.c
index ab5df51..828c1fb 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -566,6 +566,14 @@ gomp_target_task_completion (struct gomp_team *team, 
struct gomp_task *task)
 gomp_team_barrier_wake (>barrier, 1);
 }
 
+static inline size_t
+gomp_task_run_post_handle_depend (struct gomp_task *child_task,
+ struct gomp_team *team);
+static inline void
+gomp_task_run_post_remove_parent (struct gomp_task *child_task);
+static inline void
+gomp_task_run_post_remove_taskgroup (struct gomp_task *child_task);
+
 /* Signal that a target task TTASK has completed the asynchronously
running phase 

Re: [hsa 2/10] Modifications to libgomp proper

2016-01-12 Thread Jakub Jelinek
On Tue, Jan 12, 2016 at 02:46:52PM +0100, Martin Jambor wrote:
> diff --git a/libgomp/task.c b/libgomp/task.c
> index ab5df51..828c1fb 100644
> --- a/libgomp/task.c
> +++ b/libgomp/task.c
> @@ -584,8 +592,34 @@ GOMP_PLUGIN_target_task_completion (void *data)
>gomp_mutex_unlock (>task_lock);
>  }
>ttask->state = GOMP_TARGET_TASK_FINISHED;
> -  free (ttask->firstprivate_copies);
> -  gomp_target_task_completion (team, task);
> +
> +  if (ttask->devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)

First of all, I'm surprised you've changed
GOMP_PLUGIN_target_task_completion rather than gomp_target_task_completion.
The difference between those two is that the latter is run nost just from
the async event, but also if GOMP_PLUGIN_target_task_completion happens to
run before the gomp_mutex_lock (>task_lock); is acquired in the
various spots before
child_task->kind = GOMP_TASK_ASYNC_RUNNING;
The point is if the async completion happens too early for the thread
spawning it to notice, we want to complete it only when the spawning thread
is ready for that.

But looking at GOMP_PLUGIN_target_task_completion, I see we have a bug in
there,
  gomp_mutex_lock (>task_lock);
  if (ttask->state == GOMP_TARGET_TASK_READY_TO_RUN)
{
  ttask->state = GOMP_TARGET_TASK_FINISHED;
  gomp_mutex_unlock (>task_lock);
}
  ttask->state = GOMP_TARGET_TASK_FINISHED;
  gomp_target_task_completion (team, task);
  gomp_mutex_unlock (>task_lock);
there was meant to be I think return; after the first unlock, otherwise
it doubly unlocks the same lock, and performs gomp_target_task_completion
without the lock held, which may cause great havoc.

I'll test the obvious change here.

> +{
> +  free (ttask->firstprivate_copies);
> +  size_t new_tasks
> + = gomp_task_run_post_handle_depend (task, team);
> +  gomp_task_run_post_remove_parent (task);
> +  gomp_clear_parent (>children_queue);
> +  gomp_task_run_post_remove_taskgroup (task);
> +  team->task_count--;
> +  int do_wake = 0;
> +  if (new_tasks)
> + {
> +   do_wake = team->nthreads - team->task_running_count
> + - !task->in_tied_task;
> +   if (do_wake > new_tasks)
> + do_wake = new_tasks;
> + }
> +  /* Unlike other places, the following will be also run with the 
> task_lock
> + held, but there is nothing to do about it.  See the comment in
> + gomp_target_task_completion.  */
> +  gomp_finish_task (task);
> +  free (task);
> +  gomp_team_barrier_set_task_pending (>barrier);

This one really looks weird.  I mean, this should be done if we increase the
number of team's tasks, and gomp_task_run_post_handle_depend should do that
if it adds new tasks (IMHO it does), but if new_tasks is 0, then
there is no new task to schedule and therefore it should not be set.

> +  gomp_team_barrier_wake (>barrier, do_wake ? do_wake : 1);
> +}
> +  else
> +gomp_target_task_completion (team, task);
>gomp_mutex_unlock (>task_lock);
>  }
>  
> @@ -1275,7 +1309,12 @@ gomp_barrier_handle_tasks (gomp_barrier_state_t state)
> thr->task = task;
>   }
>else
> - return;
> + {
> +   if (team->task_count == 0
> +   && gomp_team_barrier_waiting_for_tasks (>barrier))
> + gomp_team_barrier_done (>barrier, state);
> +   return;
> + }
>gomp_mutex_lock (>task_lock);
>if (child_task)
>   {

And this hunk looks wrong too.  gomp_team_barrier_done shouldn't be done
outside of the lock held, there is no waking and I don't understand the
rationale for why you think current gomp_barrier_handle_tasks is wrong.

Anyway, if you make the HSA branch work the less efficient way of creating a
task that just frees the firstprivate copies, and post after the merge into
trunk a WIP patch that includes this, plus if there are clear instructions
how to build the HSA stuff on the wiki, my son has a box with AMD Kaveri,
so I'll try to debug it there.

Jakub


Re: [hsa 2/10] Modifications to libgomp proper

2016-01-12 Thread Martin Jambor
Hi,

On Tue, Jan 12, 2016 at 02:38:15PM +0100, Jakub Jelinek wrote:
> On Tue, Jan 12, 2016 at 02:29:06PM +0100, Martin Jambor wrote:
> > GOMP_kernel_launch_attributes should not be there (it is a
> > reminiscence from before the device-specific target arguments) and
> > should be moved just to the HSA plugin.  I'll prepare a patch today.
> > 
> > While we do not have to share GOMP_hsa_kernel_dispatch, we actually do
> > use them in both the plugin and the compiler, where we only use it in
> > an offsetof, so that we only have the structure defined once.
> 
> But, even using it in offsetof might be wrong, the compiler could be a
> cross-compiler, and you'd use offsetof on the host, while you want it for
> the target, and that would be different.
> So, IMHO you need (unless you already have) built the structure as a tree
> type, lay it out, and then you can use at TYPE_SIZE_UNIT or
> DECL_FIELD_OFFSET and the like.
> 

I see. For now I have just put a FIXME there but have talked to Martin
about laying out the type properly.  This is what I have committed to
the branch.

Thanks,

Martin

2016-01-12  Martin Jambor  

include/
* gomp-constants.h (GOMP_kernel_launch_attributes): Removed.
(GOMP_hsa_kernel_dispatch): Likewise.

libgomp/
* plugin/plugin-hsa.c (GOMP_kernel_launch_attributes): Moved here.
(GOMP_hsa_kernel_dispatch): Likewise.

gcc/
* hsa-gen.c (GOMP_hsa_kernel_dispatch): Moved here.
---
 gcc/hsa-gen.c   | 35 +
 include/gomp-constants.h| 44 --
 libgomp/plugin/plugin-hsa.c | 47 +
 3 files changed, 82 insertions(+), 44 deletions(-)

diff --git a/gcc/hsa-gen.c b/gcc/hsa-gen.c
index 1715b57..f633dfd 100644
--- a/gcc/hsa-gen.c
+++ b/gcc/hsa-gen.c
@@ -3747,6 +3747,41 @@ gen_set_num_threads (tree value, hsa_bb *hbb)
   hbb->append_insn (basic);
 }
 
+/* Collection of information needed for a dispatch of a kernel from a
+   kernel.  Keep in sync with libgomp's plugin-hsa.c.
+
+   FIXME: In order to support cross-compilations, we need to lay ot the type as
+   a tree and then use field_decl positions.
+ */
+
+struct GOMP_hsa_kernel_dispatch
+{
+  /* Pointer to a command queue associated with a kernel dispatch agent.  */
+  void *queue;
+  /* Pointer to reserved memory for OMP data struct copying.  */
+  void *omp_data_memory;
+  /* Pointer to a memory space used for kernel arguments passing.  */
+  void *kernarg_address;
+  /* Kernel object.  */
+  uint64_t object;
+  /* Synchronization signal used for dispatch synchronization.  */
+  uint64_t signal;
+  /* Private segment size.  */
+  uint32_t private_segment_size;
+  /* Group segment size.  */
+  uint32_t group_segment_size;
+  /* Number of children kernel dispatches.  */
+  uint64_t kernel_dispatch_count;
+  /* Number of threads.  */
+  uint32_t omp_num_threads;
+  /* Debug purpose argument.  */
+  uint64_t debug;
+  /* Levels-var ICV.  */
+  uint64_t omp_level;
+  /* Kernel dispatch structures created for children kernel dispatches.  */
+  struct GOMP_hsa_kernel_dispatch **children_dispatches;
+};
+
 /* Return an HSA register that will contain number of threads for
a future dispatched kernel.  Instructions are added to HBB.  */
 
diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index 1dae474..a8e7723 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -256,48 +256,4 @@ enum gomp_map_kind
 /* Identifiers of device-specific target arguments.  */
 #define GOMP_TARGET_ARG_HSA_KERNEL_ATTRIBUTES  (1 << 8)
 
-/* Structure describing the run-time and grid properties of an HSA kernel
-   lauch.  */
-
-struct GOMP_kernel_launch_attributes
-{
-  /* Number of dimensions the workload has.  Maximum number is 3.  */
-  uint32_t ndim;
-  /* Size of the grid in the three respective dimensions.  */
-  uint32_t gdims[3];
-  /* Size of work-groups in the respective dimensions.  */
-  uint32_t wdims[3];
-};
-
-/* Collection of information needed for a dispatch of a kernel from a
-   kernel.  */
-
-struct GOMP_hsa_kernel_dispatch
-{
-  /* Pointer to a command queue associated with a kernel dispatch agent.  */
-  void *queue;
-  /* Pointer to reserved memory for OMP data struct copying.  */
-  void *omp_data_memory;
-  /* Pointer to a memory space used for kernel arguments passing.  */
-  void *kernarg_address;
-  /* Kernel object.  */
-  uint64_t object;
-  /* Synchronization signal used for dispatch synchronization.  */
-  uint64_t signal;
-  /* Private segment size.  */
-  uint32_t private_segment_size;
-  /* Group segment size.  */
-  uint32_t group_segment_size;
-  /* Number of children kernel dispatches.  */
-  uint64_t kernel_dispatch_count;
-  /* Number of threads.  */
-  uint32_t omp_num_threads;
-  /* Debug purpose argument.  */
-  uint64_t debug;
-  /* Levels-var ICV.  */
-  uint64_t omp_level;
-  /* Kernel dispatch structures created for 

Re: [hsa 2/10] Modifications to libgomp proper

2016-01-12 Thread Jakub Jelinek
On Tue, Jan 12, 2016 at 04:00:11PM +0300, Alexander Monakov wrote:
> Hello, Martin, Jakub, community,
> 
> This part of the patch:
> 
> On Mon, 7 Dec 2015, Martin Jambor wrote:
> > include/
> > * gomp-constants.h (GOMP_DEVICE_HSA): New macro.
> [snip]
> > (GOMP_kernel_launch_attributes): New type.
> > (GOMP_hsa_kernel_dispatch): New type.
> 
> is going to break build of NVPTX cross-compiler, because it uses uint32_t,
> uint64_t types like below, but those types will not be available when building
> nvptx libgcc.  gomp-constants.h is #include'd in libgcc via tm.h and
> offload.h.
> 
> Note how other files in include/ need to do a special dance with #ifdef
> HAVE_STDINT_H to include  and obtain uint64_t.
> 
> Shall I move the problematic structs into a separate file, gomp-types.h?

Or just move those into libgomp-plugin.h, those type definitions don't have
to be shared between the compiler and libgomp, the compiler has to duplicate
those definitions anyway, as it needs to create the IL of those types and
can't use the host structure type for that purpose.

> > diff --git a/include/gomp-constants.h b/include/gomp-constants.h
> > index dffd631..1dae474 100644
> > --- a/include/gomp-constants.h
> > +++ b/include/gomp-constants.h
> [snip]
> > +/* Structure describing the run-time and grid properties of an HSA kernel
> > +   lauch.  */
> > +
> > +struct GOMP_kernel_launch_attributes
> > +{
> > +  /* Number of dimensions the workload has.  Maximum number is 3.  */
> > +  uint32_t ndim;
> > +  /* Size of the grid in the three respective dimensions.  */
> > +  uint32_t gdims[3];
> > +  /* Size of work-groups in the respective dimensions.  */
> > +  uint32_t wdims[3];
> > +};

Jakub


Re: [hsa 2/10] Modifications to libgomp proper

2016-01-12 Thread Alexander Monakov
Hello, Martin, Jakub, community,

This part of the patch:

On Mon, 7 Dec 2015, Martin Jambor wrote:
> include/
>   * gomp-constants.h (GOMP_DEVICE_HSA): New macro.
[snip]
>   (GOMP_kernel_launch_attributes): New type.
>   (GOMP_hsa_kernel_dispatch): New type.

is going to break build of NVPTX cross-compiler, because it uses uint32_t,
uint64_t types like below, but those types will not be available when building
nvptx libgcc.  gomp-constants.h is #include'd in libgcc via tm.h and
offload.h.

Note how other files in include/ need to do a special dance with #ifdef
HAVE_STDINT_H to include  and obtain uint64_t.

Shall I move the problematic structs into a separate file, gomp-types.h?

Thanks.
Alexander

> diff --git a/include/gomp-constants.h b/include/gomp-constants.h
> index dffd631..1dae474 100644
> --- a/include/gomp-constants.h
> +++ b/include/gomp-constants.h
[snip]
> +/* Structure describing the run-time and grid properties of an HSA kernel
> +   lauch.  */
> +
> +struct GOMP_kernel_launch_attributes
> +{
> +  /* Number of dimensions the workload has.  Maximum number is 3.  */
> +  uint32_t ndim;
> +  /* Size of the grid in the three respective dimensions.  */
> +  uint32_t gdims[3];
> +  /* Size of work-groups in the respective dimensions.  */
> +  uint32_t wdims[3];
> +};


Re: [hsa 2/10] Modifications to libgomp proper

2015-12-11 Thread Jakub Jelinek
On Thu, Dec 10, 2015 at 06:52:23PM +0100, Martin Jambor wrote:
> I see, I prefer the clean approach, even if it is more work, this
> interface looks like it is going to be extended in the future.  But I
> am wondering whether embedding the value into the identifier element
> is actually worth it.  The passed array is going to be a small local
> variable and I wonder whether there is going to be any benefit in it
> having two elements instead of four (or four instead of six for
> gridified kernels), especially if it means introducing control flow on
> the part of the caller.  But if you really want it that way, I will
> implement that.

I'm fine with implementing the two (num_threads and thread_limit) always
as separate argument, or perhaps what you could do is if the argument
is constant and fits into the signed 16 bits on 32-bit arches (or any
constant, that fits into 48 bits), use embedded argument (then there is no
extra runtime cost), and if it is variable and you'd need to shift,
put it as separate argument.

> OK, so if I understand this correctly, I should not be re-queing the
> task in gomp_target_task_completion.  I have left the check to be
> NULLness of ttask->tgt but can test the capability if you prefer (but
> I at least hope the two options are equivalent).

> > > --- a/libgomp/task.c
> > > +++ b/libgomp/task.c
> > > @@ -581,6 +581,7 @@ GOMP_PLUGIN_target_task_completion (void *data)
> > >gomp_mutex_unlock (>task_lock);
> > >  }
> > >ttask->state = GOMP_TARGET_TASK_FINISHED;
> > > +  free (ttask->firstprivate_copies);
> > >gomp_target_task_completion (team, task);
> > >gomp_mutex_unlock (>task_lock);
> > >  }
> > 
> > So, this function should have a special case for the SHARED_MEM case, handle
> > it closely to say how GOMP_taskgroup_end handles the finish_cancelled:
> > case.  Just note that the target task is missing from certain queues at that
> > point.
> 
> I'm afraid I need some help here.  I do not quite understand how is
> finish_cancelled in GOMP_taskgroup_end similar, it seems to be doing
> much more than freeing one pointer.  What is exactly the issue with
> the above?
> 
> Nevertheless, after reading through bits of task.c again, I wonder
> whether any copying (for both shared memory target and the host) in
> gomp_target_task_fn is actually necessary because it seems to be also
> done in gomp_create_target_task.  Does that not apply somehow?

The target task is scheduled for the first action as normal task, and the
scheduling of it already removes it from some of the queues (each task is
put into 1-3 queues), i.e. actions performed mostly by
gomp_task_run_pre.  Then the team task lock is unlocked and the task is run.
Finally, for normal tasks, gomp_task_run_post_handle_depend,
gomp_task_run_post_remove_parent, etc. is run.  Now, for async target tasks
that have something running on some other device at that point, we don't do
that, but instead make it GOMP_TASK_ASYNC_RUNNING.  And continue with other
stuff, until gomp_target_task_completion is run.
For non-shared mem that needs to readd the task again into the queues, so
that it will be scheduled again.  But you don't need that for shared mem
target tasks, they can just free the firstprivate_copies and finalize the
task.
At the time gomp_target_task_completion is called, the task is pretty much
in the same state as it is around the finish_cancelled:; label.
So instead of what the gomp_target_task_completion function does,
you would for SHARED_MEM do something like:
  size_t new_tasks
= gomp_task_run_post_handle_depend (task, team);
  gomp_task_run_post_remove_parent (task);
  gomp_clear_parent (>children_queue);
  gomp_task_run_post_remove_taskgroup (task);
  team->task_count--;
  do_wake = 0;
  if (new_tasks > 1)
{
  do_wake = team->nthreads - team->task_running_count
- !task->in_tied_task;
  if (do_wake > new_tasks)
do_wake = new_tasks;
}
// Unlike other places, the following will be also run with the
// task_lock held, but I'm afraid there is nothing to do about it.
// See the comment in gomp_target_task_completion.
  gomp_finish_task (task);
  free (task);
  if (do_wake)
gomp_team_barrier_wake (>barrier, do_wake);

Jakub


Re: [hsa 2/10] Modifications to libgomp proper

2015-12-10 Thread Martin Jambor
Hi,

thanks for the feedback.  I have incorporated most of it into the
branch (the diff is below) but  also have a few questions.

On Wed, Dec 09, 2015 at 12:35:36PM +0100, Jakub Jelinek wrote:
> On Mon, Dec 07, 2015 at 12:19:57PM +0100, Martin Jambor wrote:
> > +/* Flag set when the subsequent element in the device-specific argument
> > +   values.  */
> > +#define GOMP_TARGET_ARG_SUBSEQUENT_PARAM   (1 << 7)
> > +
> > +/* Bitmask to apply to a target argument to find out the value identifier. 
> >  */
> > +#define GOMP_TARGET_ARG_ID_MASK(((1 << 8) - 1) << 8)
> > +/* Target argument index of NUM_TEAMS.  */
> > +#define GOMP_TARGET_ARG_NUM_TEAMS  (1 << 8)
> > +/* Target argument index of THREAD_LIMIT.  */
> > +#define GOMP_TARGET_ARG_THREAD_LIMIT   (2 << 8)
> 
> I meant that these two would be just special, passed as the first two
> pointers in the array, without the markup.  Because, otherwise you either
> need to use GOMP_TARGET_ARG_SUBSEQUENT_PARAM for these always, or for 32-bit
> arches and for 64-bit ones shift often at runtime.  Having the markup even
> for them is perhaps cleaner, but less efficient, so if you really want to go
> that way, please make sure you handle it properly for 32-bit pointers
> architectures though.  num_teams or thread_limit could be > 32767 or >
> 65535.

I see, I prefer the clean approach, even if it is more work, this
interface looks like it is going to be extended in the future.  But I
am wondering whether embedding the value into the identifier element
is actually worth it.  The passed array is going to be a small local
variable and I wonder whether there is going to be any benefit in it
having two elements instead of four (or four instead of six for
gridified kernels), especially if it means introducing control flow on
the part of the caller.  But if you really want it that way, I will
implement that.

> 
> > -static void
> > -gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
> > -  void **hostaddrs, size_t *sizes,
> > -  unsigned short *kinds)
> > +static void *
> > +gomp_target_unshare_firstprivate (size_t mapnum, void **hostaddrs,
> > + size_t *sizes, unsigned short *kinds)
> >  {
> >size_t i, tgt_align = 0, tgt_size = 0;
> >char *tgt = NULL;
> > @@ -1281,7 +1282,7 @@ gomp_target_fallback_firstprivate (void (*fn) (void 
> > *), size_t mapnum,
> >}
> >if (tgt_align)
> >  {
> > -  tgt = gomp_alloca (tgt_size + tgt_align - 1);
> > +  tgt = gomp_malloc (tgt_size + tgt_align - 1);
> 
> I don't like using gomp_malloc here, either copy/paste the function, or
> create separate inline functions for the two loops, one for the first loop
> which returns you tgt_align and tgt_size, and another for the stuff after
> the allocation.  Then you can use those two inline functions to implement
> both gomp_target_fallback_firstprivate which will use alloca, and
> gomp_target_unshare_firstprivate which will use gomp_malloc instead.

OK, I did that.

> 
> > @@ -1356,6 +1377,11 @@ GOMP_target (int device, void (*fn) (void *), const 
> > void *unused,
> > and several arguments have been added:
> > FLAGS is a bitmask, see GOMP_TARGET_FLAG_* in gomp-constants.h.
> > DEPEND is array of dependencies, see GOMP_task for details.
> > +   ARGS is a pointer to an array consisting of NUM_TEAMS, THREAD_LIMIT and 
> > a
> > +   variable number of device-specific arguments, which always take two 
> > elements
> > +   where the first specifies the type and the second the actual value.  The
> > +   last element of the array is a single NULL.
> 
> Note, here you document NUM_TEAMS and THREAD_LIMIT as special values, not
> encoded.

I have changed the comment but will remember to do it again if
necessary after changing omp-low.c

> 
> > @@ -1473,6 +1508,7 @@ GOMP_target_data (int device, const void *unused, 
> > size_t mapnum,
> >struct gomp_device_descr *devicep = resolve_device (device);
> >  
> >if (devicep == NULL
> > +  || (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> >|| !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
> 
> Would be nice to have some consistency in the order of capabilities checks.
> Usually you check SHARED_MEM after OPENMP_400, so perhaps do it this way
> here too.

Sure.

> 
> > @@ -1741,23 +1784,38 @@ gomp_target_task_fn (void *data)
> >  
> >if (ttask->state == GOMP_TARGET_TASK_FINISHED)
> > {
> > - gomp_unmap_vars (ttask->tgt, true);
> > + if (ttask->tgt)
> > +   gomp_unmap_vars (ttask->tgt, true);
> >   return false;
> > }
> 
> This doesn't make sense.  For the GOMP_OFFLOAD_CAP_SHARED_MEM case, unless
> you want to run the free (ttask->firstprivate_copies); as a task, you
> shouldn't be queing the target task again for further execution, instead
> it should just be handled like a finished task at that point.  The reason
> 

Re: [hsa 2/10] Modifications to libgomp proper

2015-12-09 Thread Jakub Jelinek
On Mon, Dec 07, 2015 at 12:19:57PM +0100, Martin Jambor wrote:
> +/* Flag set when the subsequent element in the device-specific argument
> +   values.  */
> +#define GOMP_TARGET_ARG_SUBSEQUENT_PARAM (1 << 7)
> +
> +/* Bitmask to apply to a target argument to find out the value identifier.  
> */
> +#define GOMP_TARGET_ARG_ID_MASK  (((1 << 8) - 1) << 8)
> +/* Target argument index of NUM_TEAMS.  */
> +#define GOMP_TARGET_ARG_NUM_TEAMS(1 << 8)
> +/* Target argument index of THREAD_LIMIT.  */
> +#define GOMP_TARGET_ARG_THREAD_LIMIT (2 << 8)

I meant that these two would be just special, passed as the first two
pointers in the array, without the markup.  Because, otherwise you either
need to use GOMP_TARGET_ARG_SUBSEQUENT_PARAM for these always, or for 32-bit
arches and for 64-bit ones shift often at runtime.  Having the markup even
for them is perhaps cleaner, but less efficient, so if you really want to go
that way, please make sure you handle it properly for 32-bit pointers
architectures though.  num_teams or thread_limit could be > 32767 or >
65535.

> -static void
> -gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
> -void **hostaddrs, size_t *sizes,
> -unsigned short *kinds)
> +static void *
> +gomp_target_unshare_firstprivate (size_t mapnum, void **hostaddrs,
> +   size_t *sizes, unsigned short *kinds)
>  {
>size_t i, tgt_align = 0, tgt_size = 0;
>char *tgt = NULL;
> @@ -1281,7 +1282,7 @@ gomp_target_fallback_firstprivate (void (*fn) (void *), 
> size_t mapnum,
>}
>if (tgt_align)
>  {
> -  tgt = gomp_alloca (tgt_size + tgt_align - 1);
> +  tgt = gomp_malloc (tgt_size + tgt_align - 1);

I don't like using gomp_malloc here, either copy/paste the function, or
create separate inline functions for the two loops, one for the first loop
which returns you tgt_align and tgt_size, and another for the stuff after
the allocation.  Then you can use those two inline functions to implement
both gomp_target_fallback_firstprivate which will use alloca, and
gomp_target_unshare_firstprivate which will use gomp_malloc instead.

> @@ -1356,6 +1377,11 @@ GOMP_target (int device, void (*fn) (void *), const 
> void *unused,
> and several arguments have been added:
> FLAGS is a bitmask, see GOMP_TARGET_FLAG_* in gomp-constants.h.
> DEPEND is array of dependencies, see GOMP_task for details.
> +   ARGS is a pointer to an array consisting of NUM_TEAMS, THREAD_LIMIT and a
> +   variable number of device-specific arguments, which always take two 
> elements
> +   where the first specifies the type and the second the actual value.  The
> +   last element of the array is a single NULL.

Note, here you document NUM_TEAMS and THREAD_LIMIT as special values, not
encoded.

> @@ -1473,6 +1508,7 @@ GOMP_target_data (int device, const void *unused, 
> size_t mapnum,
>struct gomp_device_descr *devicep = resolve_device (device);
>  
>if (devicep == NULL
> +  || (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
>|| !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))

Would be nice to have some consistency in the order of capabilities checks.
Usually you check SHARED_MEM after OPENMP_400, so perhaps do it this way
here too.

> @@ -1741,23 +1784,38 @@ gomp_target_task_fn (void *data)
>  
>if (ttask->state == GOMP_TARGET_TASK_FINISHED)
>   {
> -   gomp_unmap_vars (ttask->tgt, true);
> +   if (ttask->tgt)
> + gomp_unmap_vars (ttask->tgt, true);
> return false;
>   }

This doesn't make sense.  For the GOMP_OFFLOAD_CAP_SHARED_MEM case, unless
you want to run the free (ttask->firstprivate_copies); as a task, you
shouldn't be queing the target task again for further execution, instead
it should just be handled like a finished task at that point.  The reason
why for XeonPhi or PTX gomp_target_task_fn is run with
GOMP_TARGET_TASK_FINISHED state is that gomp_unmap_vars can perform IO
actions and doing it with the tasking lock held is highly undesirable.
>  
> -  void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
> -  ttask->tgt
> - = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
> -  ttask->sizes, ttask->kinds, true,
> -  GOMP_MAP_VARS_TARGET);
> +  bool shared_mem;
> +  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> + {
> +   shared_mem = true;
> +   ttask->tgt = NULL;
> +   ttask->firstprivate_copies
> + = gomp_target_unshare_firstprivate (ttask->mapnum, ttask->hostaddrs,
> + ttask->sizes, ttask->kinds);
> + }
> +  else
> + {
> +   shared_mem = false;
> +   ttask->tgt = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs,
> +   NULL, ttask->sizes, ttask->kinds, 

[hsa 2/10] Modifications to libgomp proper

2015-12-07 Thread Martin Jambor
Hi,

The patch below contains all changes to libgomp files except for the
hsa plugin (which is in the following patch).

The changes can roughly divided into three categories.  First, it
contains changes I that are necessary to support shared-memory
devices.  In majority of cases this means treating them like the host
fallback because there is no need to copy, host malloc can be used for
allocating etc.  It also means that GOMP_target_ext and
gomp_target_task_fn should not be remapping arguments but should pass
to the plugin the same thing host fallback function would receive.

Second, because GCC HSA backend often does not emit HSAIL for function
it knows it cannot handle, these two functions need to gracefully
handle the case when there is no device implementation of a particular
function available by doing host fallback too.

Third, the patch implements libgomp-part of the device-specific
arguments passed to GOMP_target as requested Jakub (well, some are
actually for all devices but that is what we call them).  Because of
nowait target constructs, the arguments have proliferated into tasking
too, as did firstprivate copies.

Any feedback will be greatly appreciated,

Martin


2015-12-04  Martin Jambor  
Martin Liska  

include/
* gomp-constants.h (GOMP_DEVICE_HSA): New macro.
(GOMP_VERSION_HSA): Likewise.
(GOMP_TARGET_ARG_DEVICE_MASK): Likewise.
(GOMP_TARGET_ARG_DEVICE_ALL): Likewise.
(GOMP_TARGET_ARG_SUBSEQUENT_PARAM): Likewise.
(GOMP_TARGET_ARG_ID_MASK): Likewise.
(GOMP_TARGET_ARG_NUM_TEAMS): Likewise.
(GOMP_TARGET_ARG_THREAD_LIMIT): Likewise.
(GOMP_TARGET_ARG_VALUE_SHIFT): Likewise.
(GOMP_TARGET_ARG_HSA_KERNEL_ATTRIBUTES): Likewise.
(GOMP_kernel_launch_attributes): New type.
(GOMP_hsa_kernel_dispatch): New type.

libgomp/
* libgomp-plugin.h (offload_target_type): New element
OFFLOAD_TARGET_TYPE_HSA.
* libgomp.h (gomp_target_task): New field args.
(bool gomp_create_target_task): Updated.
(gomp_device_descr): Extra parameter of run_func and async_run_func,
new field can_run_func.
* libgomp_g.h (GOMP_target_ext): Change prototype.
* oacc-host.c (host_run): Added a new parameter args.
* target.c (gomp_target_fallback_firstprivate): New function.
(gomp_target_fallback_firstprivate): Use
gomp_target_fallback_firstprivate.
(gomp_get_target_fn_addr): Allow returning NULL for shared memory
devices.
(GOMP_target): Do host fallback for all shared memory devices.  Do not
pass any args to plugins.
(GOMP_target_ext): Add new parameter args.  Allow host fallback if
device shares memory.  Do not remap data if device has shared memory.
(gomp_target_task_fn): Likewise.  Also Treat shared memory devices
like host fallback for mappings.
(GOMP_target_data): Treat shared memory devices like host fallback.
(GOMP_target_data_ext): Likewise.
(GOMP_target_update): Likewise.
(GOMP_target_update_ext): Likewise.  Also pass NULL as args to
gomp_create_target_task.
(GOMP_target_enter_exit_data): Likewise.
(omp_target_alloc): Treat shared memory devices like host fallback.
(omp_target_free): Likewise.
(omp_target_is_present): Likewise.
(omp_target_memcpy): Likewise.
(omp_target_memcpy_rect): Likewise.
(omp_target_associate_ptr): Likewise.
(gomp_load_plugin_for_device): Also load can_run.
* task.c (GOMP_PLUGIN_target_task_completion): Free
firstprivate_copies.
(gomp_create_target_task): Accept new argument args and store it to
ttask.

liboffloadmic/plugin
* libgomp-plugin-intelmic.cpp (GOMP_OFFLOAD_async_run): New unused
parameter.
(GOMP_OFFLOAD_run): Likewise.

diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index dffd631..1dae474 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -176,6 +176,7 @@ enum gomp_map_kind
 #define GOMP_DEVICE_NOT_HOST   4
 #define GOMP_DEVICE_NVIDIA_PTX 5
 #define GOMP_DEVICE_INTEL_MIC  6
+#define GOMP_DEVICE_HSA7
 
 #define GOMP_DEVICE_ICV-1
 #define GOMP_DEVICE_HOST_FALLBACK  -2
@@ -201,6 +202,7 @@ enum gomp_map_kind
 #define GOMP_VERSION   0
 #define GOMP_VERSION_NVIDIA_PTX 1
 #define GOMP_VERSION_INTEL_MIC 0
+#define GOMP_VERSION_HSA 0
 
 #define GOMP_VERSION_PACK(LIB, DEV) (((LIB) << 16) | (DEV))
 #define GOMP_VERSION_LIB(PACK) (((PACK) >> 16) & 0x)
@@ -228,4 +230,74 @@ enum gomp_map_kind
 #define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0x)
 #define GOMP_LAUNCH_OP_MAX 0x
 
+/* Bitmask to apply in order to find out the intended device of a target
+   argument.  */
+#define GOMP_TARGET_ARG_DEVICE_MASK((1 << 7)