Re: Partial Offloading (was: [hsa merge 07/10] IPA-HSA pass)

2016-02-17 Thread Ilya Verbin
On Thu, Jan 28, 2016 at 12:36:19 +0100, Thomas Schwinge wrote:
> I made an attempt to capture the recent discussion (plus my own
> ideas/understanding) in this new section:
> .  Please
> change/extend, as required.

Thanks for summarizing this.


I'm not very happy how -foffload=disable works in GCC 6, here is a testcase:

int main ()
{
  int x = 10;
  #pragma omp target data map (from: x)
#pragma omp target map (alloc: x)
  x = 20;
  if (x != 10 && x != 20)
__builtin_abort ();
}

On the system with non-shared accelerator it will abort, because "#pragma omp
target data" behaves like offloading is enabled, but "#pragma omp target" runs
on the host.  As the result, at the end of the *target data* region, it tries to
receive x from target and receives 0, or crashes.

We can forbid -foffload=disable option, but I think it's very useful, e.g. for
comparing performance of host vs. accelerator using the same compiler, etc.
Or if the system contains 2 different accelerators, someone might want to
compile only for the first, but libgomp will load 2 plugins, and the program
will crash (instead of doing fallback) if it will try to use the second device.

So, maybe we still need something like this patch?
https://gcc.gnu.org/ml/gcc-patches/2015-04/msg01033.html

  -- Ilya


Partial Offloading (was: [hsa merge 07/10] IPA-HSA pass)

2016-01-28 Thread Thomas Schwinge
Hi!

> [...]

I made an attempt to capture the recent discussion (plus my own
ideas/understanding) in this new section:
.  Please
change/extend, as required.


Grüße
 Thomas


signature.asc
Description: PGP signature


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-27 Thread Martin Liška
On 01/26/2016 12:41 AM, Jan Hubicka wrote:
>> On Mon, Jan 25, 2016 at 04:21:50PM +0100, Martin Liška wrote:
>>> On 01/16/2016 11:00 AM, Jan Hubicka wrote:
 Can't it be represented via explicit REF_ADDR or something like that?

 Honza
>>>
>>> Hi.
>>>
>>> Sure, I've just done a patch that can do that. However, as we're currently 
>>> in stage4,
>>> that change would probably require explicit permission of a release manager?
>>
>> If Honza is fine with it and you've tested it, this is ok for trunk.
> 
> It looks fine to me.
> 
> Honza
> 

I've just bootregtested the patch on x86_64-linux-pc and I'm going to install
it to trunk.

Martin


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-25 Thread Martin Liška
On 01/16/2016 11:00 AM, Jan Hubicka wrote:
> Can't it be represented via explicit REF_ADDR or something like that?
> 
> Honza

Hi.

Sure, I've just done a patch that can do that. However, as we're currently in 
stage4,
that change would probably require explicit permission of a release manager?

Thanks,
Martin
>From 9639fff94d043c55b55bfb12bb086032db565f0a Mon Sep 17 00:00:00 2001
From: marxin 
Date: Mon, 25 Jan 2016 16:11:00 +0100
Subject: [PATCH] HSA: simplify partitioning of HSA kernels and host impls.

gcc/lto/ChangeLog:

2016-01-25  Martin Liska  

	* lto-partition.c (add_symbol_to_partition_1): Remove usage
	of hsa_summaries.

gcc/ChangeLog:

2016-01-25  Martin Liska  

	* hsa.c (hsa_summary_t::link_functions): Create IPA_REF_ADDR
	reference for an HSA kernel and its host function.
---
 gcc/hsa.c   |  5 +
 gcc/lto/lto-partition.c | 19 ---
 2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/gcc/hsa.c b/gcc/hsa.c
index ec23f81..f0b3205 100644
--- a/gcc/hsa.c
+++ b/gcc/hsa.c
@@ -781,6 +781,11 @@ hsa_summary_t::link_functions (cgraph_node *gpu, cgraph_node *host,
   TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
   TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
   DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
+
+  /* Create reference between a kernel and a corresponding host implementation
+ to quarantee LTO streaming to a same LTRANS.  */
+  if (kind == HSA_KERNEL)
+gpu->create_reference (host, IPA_REF_ADDR);
 }
 
 /* Add a HOST function to HSA summaries.  */
diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
index eb28fed..9eb63c2 100644
--- a/gcc/lto/lto-partition.c
+++ b/gcc/lto/lto-partition.c
@@ -34,7 +34,6 @@ along with GCC; see the file COPYING3.  If not see
 #include "ipa-prop.h"
 #include "ipa-inline.h"
 #include "lto-partition.h"
-#include "hsa.h"
 
 vec ltrans_partitions;
 
@@ -171,24 +170,6 @@ add_symbol_to_partition_1 (ltrans_partition part, symtab_node *node)
 	 Therefore put it into the same partition.  */
   if (cnode->instrumented_version)
 	add_symbol_to_partition_1 (part, cnode->instrumented_version);
-
-  /* Add an HSA associated with the symbol.  */
-  if (hsa_summaries != NULL)
-	{
-	  hsa_function_summary *s = hsa_summaries->get (cnode);
-	  if (s->m_kind == HSA_KERNEL)
-	{
-	  /* Add binded function.  */
-	  bool added = add_symbol_to_partition_1 (part,
-		  s->m_binded_function);
-	  gcc_assert (added);
-	  if (symtab->dump_file)
-		fprintf (symtab->dump_file,
-			 "adding an HSA function (host/gpu) to the "
-			 "partition: %s\n",
-			 s->m_binded_function->name ());
-	}
-	}
 }
 
   add_references_to_partition (part, node);
-- 
2.7.0



Re: [hsa merge 07/10] IPA-HSA pass

2016-01-25 Thread Jakub Jelinek
On Mon, Jan 25, 2016 at 04:21:50PM +0100, Martin Liška wrote:
> On 01/16/2016 11:00 AM, Jan Hubicka wrote:
> > Can't it be represented via explicit REF_ADDR or something like that?
> > 
> > Honza
> 
> Hi.
> 
> Sure, I've just done a patch that can do that. However, as we're currently in 
> stage4,
> that change would probably require explicit permission of a release manager?

If Honza is fine with it and you've tested it, this is ok for trunk.

> >From 9639fff94d043c55b55bfb12bb086032db565f0a Mon Sep 17 00:00:00 2001
> From: marxin 
> Date: Mon, 25 Jan 2016 16:11:00 +0100
> Subject: [PATCH] HSA: simplify partitioning of HSA kernels and host impls.
> 
> gcc/lto/ChangeLog:
> 
> 2016-01-25  Martin Liska  
> 
>   * lto-partition.c (add_symbol_to_partition_1): Remove usage
>   of hsa_summaries.
> 
> gcc/ChangeLog:
> 
> 2016-01-25  Martin Liska  
> 
>   * hsa.c (hsa_summary_t::link_functions): Create IPA_REF_ADDR
>   reference for an HSA kernel and its host function.
> ---
>  gcc/hsa.c   |  5 +
>  gcc/lto/lto-partition.c | 19 ---
>  2 files changed, 5 insertions(+), 19 deletions(-)
> 
> diff --git a/gcc/hsa.c b/gcc/hsa.c
> index ec23f81..f0b3205 100644
> --- a/gcc/hsa.c
> +++ b/gcc/hsa.c
> @@ -781,6 +781,11 @@ hsa_summary_t::link_functions (cgraph_node *gpu, 
> cgraph_node *host,
>TREE_OPTIMIZATION (fn_opts)->x_flag_tree_loop_vectorize = false;
>TREE_OPTIMIZATION (fn_opts)->x_flag_tree_slp_vectorize = false;
>DECL_FUNCTION_SPECIFIC_OPTIMIZATION (gdecl) = fn_opts;
> +
> +  /* Create reference between a kernel and a corresponding host 
> implementation
> + to quarantee LTO streaming to a same LTRANS.  */
> +  if (kind == HSA_KERNEL)
> +gpu->create_reference (host, IPA_REF_ADDR);
>  }
>  
>  /* Add a HOST function to HSA summaries.  */
> diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
> index eb28fed..9eb63c2 100644
> --- a/gcc/lto/lto-partition.c
> +++ b/gcc/lto/lto-partition.c
> @@ -34,7 +34,6 @@ along with GCC; see the file COPYING3.  If not see
>  #include "ipa-prop.h"
>  #include "ipa-inline.h"
>  #include "lto-partition.h"
> -#include "hsa.h"
>  
>  vec ltrans_partitions;
>  
> @@ -171,24 +170,6 @@ add_symbol_to_partition_1 (ltrans_partition part, 
> symtab_node *node)
>Therefore put it into the same partition.  */
>if (cnode->instrumented_version)
>   add_symbol_to_partition_1 (part, cnode->instrumented_version);
> -
> -  /* Add an HSA associated with the symbol.  */
> -  if (hsa_summaries != NULL)
> - {
> -   hsa_function_summary *s = hsa_summaries->get (cnode);
> -   if (s->m_kind == HSA_KERNEL)
> - {
> -   /* Add binded function.  */
> -   bool added = add_symbol_to_partition_1 (part,
> -   s->m_binded_function);
> -   gcc_assert (added);
> -   if (symtab->dump_file)
> - fprintf (symtab->dump_file,
> -  "adding an HSA function (host/gpu) to the "
> -  "partition: %s\n",
> -  s->m_binded_function->name ());
> - }
> - }
>  }
>  
>add_references_to_partition (part, node);
> -- 
> 2.7.0
> 


Jakub


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-25 Thread Jan Hubicka
> On Mon, Jan 25, 2016 at 04:21:50PM +0100, Martin Liška wrote:
> > On 01/16/2016 11:00 AM, Jan Hubicka wrote:
> > > Can't it be represented via explicit REF_ADDR or something like that?
> > > 
> > > Honza
> > 
> > Hi.
> > 
> > Sure, I've just done a patch that can do that. However, as we're currently 
> > in stage4,
> > that change would probably require explicit permission of a release manager?
> 
> If Honza is fine with it and you've tested it, this is ok for trunk.

It looks fine to me.

Honza


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-22 Thread Jakub Jelinek
On Wed, Jan 20, 2016 at 09:53:30PM +0300, Ilya Verbin wrote:
> If you're OK with this, I'll install this patch:
> 
> 
> libgomp/
>   * target.c (gomp_get_target_fn_addr): Allow host fallback if target
>   function wasn't mapped to the device with non-shared memory.

Ok, thanks.

> diff --git a/libgomp/target.c b/libgomp/target.c
> index f1f5849..96fe3d5 100644
> --- a/libgomp/target.c
> +++ b/libgomp/target.c
> @@ -1436,12 +1436,7 @@ gomp_get_target_fn_addr (struct gomp_device_descr 
> *devicep,
>splay_tree_key tgt_fn = splay_tree_lookup (>mem_map, );
>gomp_mutex_unlock (>lock);
>if (tgt_fn == NULL)
> - {
> -   if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> - return NULL;
> -   else
> - gomp_fatal ("Target function wasn't mapped");
> - }
> + return NULL;
>  
>return (void *) tgt_fn->tgt_offset;
>  }
> 
>   -- Ilya

Jakub


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-21 Thread Alexander Monakov
On Wed, 20 Jan 2016, Ilya Verbin wrote:
> I agree that OpenMP doesn't guarantee that all target regions must be executed
> on the device, but in this case a user can't be sure that some library 
> function
> always will offload (because the library might be replaced by fallback 
> version),
> and he/she will have to write something like:

I think there should be a way to allow the OpenMP runtime deduce what data
needs to be resynced on target region entries/exits in presence of fallback
execution; explicit copying via map(from/to:...) is a too big hammer for that.
I wonder if it was discussed.

It would be nice to be able to apply the idea of "debug counters" to target
region offloading in order to automatically bisect offload miscompilations:
force fallback execution for target region entries for Nth and next
executions; bisect by N to find the first incorrectly executing offload
region.  If the implementation cannot count on source program fully handling
arbitrary fallbacks, this idea doesn't work in general.

Alexander


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-20 Thread Ilya Verbin
On Fri, Jan 15, 2016 at 21:05:47 +0300, Ilya Verbin wrote:
> On Fri, Jan 15, 2016 at 17:45:22 +0100, Jakub Jelinek wrote:
> > On Fri, Jan 15, 2016 at 07:38:14PM +0300, Ilya Verbin wrote:
> > > On Fri, Jan 15, 2016 at 17:09:54 +0100, Jakub Jelinek wrote:
> > > > On Fri, Jan 15, 2016 at 05:02:34PM +0100, Martin Jambor wrote:
> > > > > How do other accelerators cope with the situation when half of the
> > > > > application is compiled with the accelerator disabled?  (Would some of
> > > > > their calls to GOMP_target_ext lead to abort?)
> > > > 
> > > > GOMP_target_ext should never abort (unless internal error), worst case 
> > > > it
> > > > just falls back into the host fallback.
> > > 
> > > Wouldn't that lead to hard-to-find problems in case of nonshared memory?
> > > I mean when someone expects that all target regions are executed on the 
> > > device,
> > > but in fact some of them are silently executed on the host with different 
> > > data
> > > environment.
> > 
> > E.g. for HSA it really shouldn't matter, as it is shared memory accelerator.
> > For XeonPhi we hopefully can offload anything.
> 
> As you said, if compilation of target image fails with ICE or somehow, host
> fallback and offloading to other targets should still work:
> https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00951.html
> That patch was not applied, but it can be simulated by -foffload=disable,

I agree that OpenMP doesn't guarantee that all target regions must be executed
on the device, but in this case a user can't be sure that some library function
always will offload (because the library might be replaced by fallback version),
and he/she will have to write something like:

{
  map_data_to_target ();
  some_library1_fn_with_offload ();
  get_data_from_target ();   /* ! */
  send_data_to_target ();/* ! */
  some_library2_fn_with_offload ();
  get_data_from_target ();   /* ! */
  send_data_to_target ();/* ! */
  some_library3_fn_with_offload ();
  unmap_data_from_target ();
}

If you're OK with this, I'll install this patch:


libgomp/
* target.c (gomp_get_target_fn_addr): Allow host fallback if target
function wasn't mapped to the device with non-shared memory.

diff --git a/libgomp/target.c b/libgomp/target.c
index f1f5849..96fe3d5 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1436,12 +1436,7 @@ gomp_get_target_fn_addr (struct gomp_device_descr 
*devicep,
   splay_tree_key tgt_fn = splay_tree_lookup (>mem_map, );
   gomp_mutex_unlock (>lock);
   if (tgt_fn == NULL)
-   {
- if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
-   return NULL;
- else
-   gomp_fatal ("Target function wasn't mapped");
-   }
+   return NULL;
 
   return (void *) tgt_fn->tgt_offset;
 }

  -- Ilya


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-16 Thread Jan Hubicka
> On 01/15/2016 10:52 AM, Jan Hubicka wrote:
> > Do we really need to look that up in the hsa summary? Why these can not be 
> > partitioned the
> > usual way?
> 
> Hi.
> 
> Yes, it's needed as hsa-brig.c uses host function declaration of a kernel as 
> a key for libgomp.
> That's why we want to put the pair to a LTO partition.

Can't it be represented via explicit REF_ADDR or something like that?

Honza
> 
> Martin


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Martin Liška
On 01/15/2016 10:52 AM, Jan Hubicka wrote:
> Do we really need to look that up in the hsa summary? Why these can not be 
> partitioned the
> usual way?

Hi.

Yes, it's needed as hsa-brig.c uses host function declaration of a kernel as a 
key for libgomp.
That's why we want to put the pair to a LTO partition.

Martin


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Martin Liška
On 01/14/2016 01:58 PM, Jakub Jelinek wrote:
> Does it really need to be enabled whenever in_lto_p?
> I mean, if HSA is not configured in, I think the gate should be false too.

Sure, it can be removed, change will incorporated in final installed version of 
the file.

Thanks,
Martin


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Jakub Jelinek
On Fri, Jan 15, 2016 at 03:53:23PM +0100, Martin Jambor wrote:
> @@ -317,7 +319,7 @@ public:
>  bool
>  pass_ipa_hsa::gate (function *)
>  {
> -  return hsa_gen_requested_p () || in_lto_p;
> +  return hsa_gen_requested_p ();
>  }
>  
>  } // anon namespace

I actually didn't mean this, I mean more of:
  return (hsa_gen_requested_p ()
#ifdef ENABLE_HSA
  || in_lto_p
#endif
 );
or so.  Unless you arrange in lto-wrapper or where that if
HSA is enabled in any LTO input source, then it is enabled also in
lto1.  If you do that, your change is fine.

Jakub


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Martin Jambor
On Thu, Jan 14, 2016 at 01:58:58PM +0100, Jakub Jelinek wrote:
> Otherwise LGTM.
> 
>   Jakub

Thanks Jakub, I have committed the following patch from Martin Liska
that addresses your comments.

Martin

2016-01-15  Martin Liska  

* ipa-hsa.c (process_hsa_functions): Fixed coding style.
(ipa_hsa_read_section): Likewise.
(ipa_hsa_read_section): Likewise.
(pass_ipa_hsa::gate): Removed in_lto_p from the condition.
---
 gcc/ipa-hsa.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/gcc/ipa-hsa.c b/gcc/ipa-hsa.c
index dd47995..769657f 100644
--- a/gcc/ipa-hsa.c
+++ b/gcc/ipa-hsa.c
@@ -86,8 +86,9 @@ process_hsa_functions (void)
{
  if (!check_warn_node_versionable (node))
continue;
- cgraph_node *clone = node->create_virtual_clone
-   (vec  (), NULL, NULL, "hsa");
+ cgraph_node *clone
+   = node->create_virtual_clone (vec  (),
+ NULL, NULL, "hsa");
  TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
 
  clone->force_output = true;
@@ -102,8 +103,9 @@ process_hsa_functions (void)
{
  if (!check_warn_node_versionable (node))
continue;
- cgraph_node *clone = node->create_virtual_clone
-   (vec  (), NULL, NULL, "hsa");
+ cgraph_node *clone
+   = node->create_virtual_clone (vec  (),
+ NULL, NULL, "hsa");
  TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
 
  if (!cgraph_local_p (node))
@@ -209,8 +211,8 @@ static void
 ipa_hsa_read_section (struct lto_file_decl_data *file_data, const char *data,
   size_t len)
 {
-  const struct lto_function_header *header =
-(const struct lto_function_header *) data;
+  const struct lto_function_header *header
+= (const struct lto_function_header *) data;
   const int cfg_offset = sizeof (struct lto_function_header);
   const int main_offset = cfg_offset + header->cfg_size;
   const int string_offset = main_offset + header->main_size;
@@ -221,9 +223,9 @@ ipa_hsa_read_section (struct lto_file_decl_data *file_data, 
const char *data,
   lto_input_block ib_main ((const char *) data + main_offset,
   header->main_size, file_data->mode_table);
 
-  data_in =
-lto_data_in_create (file_data, (const char *) data + string_offset,
-   header->string_size, vNULL);
+  data_in
+= lto_data_in_create (file_data, (const char *) data + string_offset,
+ header->string_size, vNULL);
   count = streamer_read_uhwi (_main);
 
   for (i = 0; i < count; i++)
@@ -317,7 +319,7 @@ public:
 bool
 pass_ipa_hsa::gate (function *)
 {
-  return hsa_gen_requested_p () || in_lto_p;
+  return hsa_gen_requested_p ();
 }
 
 } // anon namespace
-- 
2.6.4





Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Jakub Jelinek
On Fri, Jan 15, 2016 at 05:02:34PM +0100, Martin Jambor wrote:
> How do other accelerators cope with the situation when half of the
> application is compiled with the accelerator disabled?  (Would some of
> their calls to GOMP_target_ext lead to abort?)

GOMP_target_ext should never abort (unless internal error), worst case it
just falls back into the host fallback.

Jakub


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Jakub Jelinek
On Fri, Jan 15, 2016 at 07:38:14PM +0300, Ilya Verbin wrote:
> On Fri, Jan 15, 2016 at 17:09:54 +0100, Jakub Jelinek wrote:
> > On Fri, Jan 15, 2016 at 05:02:34PM +0100, Martin Jambor wrote:
> > > How do other accelerators cope with the situation when half of the
> > > application is compiled with the accelerator disabled?  (Would some of
> > > their calls to GOMP_target_ext lead to abort?)
> > 
> > GOMP_target_ext should never abort (unless internal error), worst case it
> > just falls back into the host fallback.
> 
> Wouldn't that lead to hard-to-find problems in case of nonshared memory?
> I mean when someone expects that all target regions are executed on the 
> device,
> but in fact some of them are silently executed on the host with different data
> environment.

E.g. for HSA it really shouldn't matter, as it is shared memory accelerator.
For XeonPhi we hopefully can offload anything.  NVPTX is problematic,
because it can't offload all the code, but if it can be e.g. compile time
detected that it will not be possible, it can just provide offloaded code
for the target.

Jakub


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Martin Jambor
Hi,

On Fri, Jan 15, 2016 at 04:01:49PM +0100, Jakub Jelinek wrote:
> On Fri, Jan 15, 2016 at 03:53:23PM +0100, Martin Jambor wrote:
> > @@ -317,7 +319,7 @@ public:
> >  bool
> >  pass_ipa_hsa::gate (function *)
> >  {
> > -  return hsa_gen_requested_p () || in_lto_p;
> > +  return hsa_gen_requested_p ();
> >  }
> >  
> >  } // anon namespace
> 
> I actually didn't mean this, I mean more of:
>   return (hsa_gen_requested_p ()
> #ifdef ENABLE_HSA
> || in_lto_p
> #endif
>);
> or so.  Unless you arrange in lto-wrapper or where that if
> HSA is enabled in any LTO input source, then it is enabled also in
> lto1.  If you do that, your change is fine.
> 

This pass only creates HSA specific clones of ungridified target and
parallel regions and functions marked with declare target.  Whether or
not any HSAIL is emitted is then controlled in the hsa-gen pass gate.
The in_lto_p part was in fact a relict of a previous implementation.

So while I agree that making such a change to lto-wrapper would be
beneficial (although then we should limit its activity only to those
nodes which come from enabled units), the change above does not make
the current situation worse.  I will make sure to look into
lto-wrapper but meanwhile I still prefer the new condition.

We have tested the new change and LTO compiled code with HSA enabled
and LTO linked it with HSA disabled and:
  1) if there was no gridified loop, the result was like HSA was
 disabled from the start

  2) if there was a gridified kernel, the compiler compiled the kernel
 for the host but did not register it with libgomp and it ended up
 as an unreachable function.

How do other accelerators cope with the situation when half of the
application is compiled with the accelerator disabled?  (Would some of
their calls to GOMP_target_ext lead to abort?)

Martin


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Ilya Verbin
On Fri, Jan 15, 2016 at 17:09:54 +0100, Jakub Jelinek wrote:
> On Fri, Jan 15, 2016 at 05:02:34PM +0100, Martin Jambor wrote:
> > How do other accelerators cope with the situation when half of the
> > application is compiled with the accelerator disabled?  (Would some of
> > their calls to GOMP_target_ext lead to abort?)
> 
> GOMP_target_ext should never abort (unless internal error), worst case it
> just falls back into the host fallback.

Wouldn't that lead to hard-to-find problems in case of nonshared memory?
I mean when someone expects that all target regions are executed on the device,
but in fact some of them are silently executed on the host with different data
environment.

  -- Ilya


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Jakub Jelinek
On Fri, Jan 15, 2016 at 10:19:13PM +0300, Alexander Monakov wrote:
> Sorry, can you clarify -- what do you mean by "can't offload"?

I meant stuff like setjmp/longjmp, exceptions?, alloca (I know your changes
might fix this one), computed goto, non-local goto, and the like, which I
believe nvptx doesn't support.

Jakub


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Ilya Verbin
On Fri, Jan 15, 2016 at 17:45:22 +0100, Jakub Jelinek wrote:
> On Fri, Jan 15, 2016 at 07:38:14PM +0300, Ilya Verbin wrote:
> > On Fri, Jan 15, 2016 at 17:09:54 +0100, Jakub Jelinek wrote:
> > > On Fri, Jan 15, 2016 at 05:02:34PM +0100, Martin Jambor wrote:
> > > > How do other accelerators cope with the situation when half of the
> > > > application is compiled with the accelerator disabled?  (Would some of
> > > > their calls to GOMP_target_ext lead to abort?)
> > > 
> > > GOMP_target_ext should never abort (unless internal error), worst case it
> > > just falls back into the host fallback.
> > 
> > Wouldn't that lead to hard-to-find problems in case of nonshared memory?
> > I mean when someone expects that all target regions are executed on the 
> > device,
> > but in fact some of them are silently executed on the host with different 
> > data
> > environment.
> 
> E.g. for HSA it really shouldn't matter, as it is shared memory accelerator.
> For XeonPhi we hopefully can offload anything.

As you said, if compilation of target image fails with ICE or somehow, host
fallback and offloading to other targets should still work:
https://gcc.gnu.org/ml/gcc-patches/2015-02/msg00951.html
That patch was not applied, but it can be simulated by -foffload=disable,
I've created a testcase:

$ cat main.c

#pragma omp declare target
int x;
#pragma omp end declare target
extern int foo ();

int main ()
{
  int shared_mem = 0;
  #pragma omp target map (alloc: x, shared_mem)
{
  x = 10;
  shared_mem = 1;
}

  x = 20;
  int r = foo ();
  if (!shared_mem && r != 100)
__builtin_abort ();
  return 0;
}


$ cat liba.c 

#pragma omp declare target
extern int x;
#pragma omp end declare target

int foo ()
{
  int r;
  #pragma omp target map (from: r) map (alloc: x)
r = x * x;
  return r;
}


$ gcc -fopenmp -fPIC -shared liba.c -o liba.so -foffload=disable
$ gcc -fopenmp -L. -la main.c


Currently it prints "libgomp: Target function wasn't mapped", but after this
change:

--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1390,7 +1390,7 @@ gomp_get_target_fn_addr (struct gomp_device_descr 
*devicep,
   splay_tree_key tgt_fn = splay_tree_lookup (>mem_map, );
   gomp_mutex_unlock (>lock);
   if (tgt_fn == NULL)
-   gomp_fatal ("Target function wasn't mapped");
+   return NULL;

... it will fail at __builtin_abort, but without -foffload=disable it will pass.

  -- Ilya


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Alexander Monakov
On Fri, 15 Jan 2016, Jakub Jelinek wrote:
> On Fri, Jan 15, 2016 at 07:38:14PM +0300, Ilya Verbin wrote:
> > On Fri, Jan 15, 2016 at 17:09:54 +0100, Jakub Jelinek wrote:
> > > On Fri, Jan 15, 2016 at 05:02:34PM +0100, Martin Jambor wrote:
> > > > How do other accelerators cope with the situation when half of the
> > > > application is compiled with the accelerator disabled?  (Would some of
> > > > their calls to GOMP_target_ext lead to abort?)
> > > 
> > > GOMP_target_ext should never abort (unless internal error), worst case it
> > > just falls back into the host fallback.

Agreed -- the way it aborts today rather than using host fallback looks rather
surprising to me.

> > Wouldn't that lead to hard-to-find problems in case of nonshared memory?
> > I mean when someone expects that all target regions are executed on the 
> > device,
> > but in fact some of them are silently executed on the host with different 
> > data
> > environment.
> 
> E.g. for HSA it really shouldn't matter, as it is shared memory accelerator.
> For XeonPhi we hopefully can offload anything.  NVPTX is problematic,
> because it can't offload all the code, 

Sorry, can you clarify -- what do you mean by "can't offload"?

> but if it can be e.g. compile time detected that it will not be possible, it
> can just provide offloaded code for the target.

(as a result of previous confusion I can't follow this part either)

Thanks.
Alexander


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Alexander Monakov
On Fri, 15 Jan 2016, Jakub Jelinek wrote:

> On Fri, Jan 15, 2016 at 10:19:13PM +0300, Alexander Monakov wrote:
> > Sorry, can you clarify -- what do you mean by "can't offload"?
> 
> I meant stuff like setjmp/longjmp, exceptions?, alloca (I know your changes
> might fix this one), computed goto, non-local goto, and the like, which I
> believe nvptx doesn't support.

Right, but such issues are diagnosed as a compile-time error; the run-time
stage is simply not reached.

Did you mean that eventually GCC might change and somehow allow compilation to
run to completion even though offloaded code cannot be fully generated?

Thanks.
Alexander


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Jakub Jelinek
On Fri, Jan 15, 2016 at 10:46:32PM +0300, Alexander Monakov wrote:
> On Fri, 15 Jan 2016, Jakub Jelinek wrote:
> 
> > On Fri, Jan 15, 2016 at 10:19:13PM +0300, Alexander Monakov wrote:
> > > Sorry, can you clarify -- what do you mean by "can't offload"?
> > 
> > I meant stuff like setjmp/longjmp, exceptions?, alloca (I know your changes
> > might fix this one), computed goto, non-local goto, and the like, which I
> > believe nvptx doesn't support.
> 
> Right, but such issues are diagnosed as a compile-time error; the run-time
> stage is simply not reached.
> 
> Did you mean that eventually GCC might change and somehow allow compilation to
> run to completion even though offloaded code cannot be fully generated?

Yeah, at least depending on some option, either
downgrade all errors in the offloading compiler into warnings that just
result in the offloading image for the particular accelerator not being
created, or issue errors, but still allow the linking.

Jakub


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-15 Thread Jan Hubicka
> 2016-01-13  Martin Liska  
>   Martin Jambor  
> 
>   * ipa-hsa.c: New file.
>   * lto-section-in.c (lto_section_name): Add hsa section name.
>   * lto-streamer.h (lto_section_type): Add hsa section.
>   * lto-partition.c: Include "hsa.h"
>   (add_symbol_to_partition_1): Put hsa implementations into the
>   same partition as host implementations.
>   * timevar.def (TV_IPA_HSA): New.
> 
> diff --git a/gcc/lto/lto-partition.c b/gcc/lto/lto-partition.c
> index 81a63a5..0a56170 100644
> --- a/gcc/lto/lto-partition.c
> +++ b/gcc/lto/lto-partition.c
> @@ -34,6 +34,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "ipa-prop.h"
>  #include "ipa-inline.h"
>  #include "lto-partition.h"
> +#include "hsa.h"
>  
>  vec ltrans_partitions;
>  
> @@ -170,6 +171,24 @@ add_symbol_to_partition_1 (ltrans_partition part, 
> symtab_node *node)
>Therefore put it into the same partition.  */
>if (cnode->instrumented_version)
>   add_symbol_to_partition_1 (part, cnode->instrumented_version);
> +
> +  /* Add an HSA associated with the symbol.  */
> +  if (hsa_summaries != NULL)
> + {
> +   hsa_function_summary *s = hsa_summaries->get (cnode);
> +   if (s->m_kind == HSA_KERNEL)
> + {
> +   /* Add binded function.  */
> +   bool added = add_symbol_to_partition_1 (part,
> +   s->m_binded_function);
> +   gcc_assert (added);
> +   if (symtab->dump_file)
> + fprintf (symtab->dump_file,
> +  "adding an HSA function (host/gpu) to the "
> +  "partition: %s\n",
> +  s->m_binded_function->name ());
> + }
> + }

Do we really need to look that up in the hsa summary? Why these can not be 
partitioned the
usual way?

The patch looks OK for me modulo Jakub's comments.

Honza
>  }
>  
>add_references_to_partition (part, node);
> diff --git a/gcc/timevar.def b/gcc/timevar.def
> index 2765179..d9a5066 100644
> --- a/gcc/timevar.def
> +++ b/gcc/timevar.def
> @@ -97,6 +97,7 @@ DEFTIMEVAR (TV_WHOPR_WPA_IO  , "whopr wpa I/O")
>  DEFTIMEVAR (TV_WHOPR_PARTITIONING, "whopr partitioning")
>  DEFTIMEVAR (TV_WHOPR_LTRANS  , "whopr ltrans")
>  DEFTIMEVAR (TV_IPA_REFERENCE , "ipa reference")
> +DEFTIMEVAR (TV_IPA_HSA, "ipa HSA")
>  DEFTIMEVAR (TV_IPA_PROFILE   , "ipa profile")
>  DEFTIMEVAR (TV_IPA_AUTOFDO   , "auto profile")
>  DEFTIMEVAR (TV_IPA_PURE_CONST, "ipa pure const")


Re: [hsa merge 07/10] IPA-HSA pass

2016-01-14 Thread Jakub Jelinek
On Wed, Jan 13, 2016 at 06:39:32PM +0100, Martin Jambor wrote:

> +   cgraph_node *clone = node->create_virtual_clone
> + (vec  (), NULL, NULL, "hsa");

Nicer formatting would be
  cgraph_node *clone
= node->create_virtual_clone (vec  (),
  NULL, NULL, "hsa");

> +   cgraph_node *clone = node->create_virtual_clone
> + (vec  (), NULL, NULL, "hsa");

Ditto.

> +  const struct lto_function_header *header =
> +(const struct lto_function_header *) data;

= goes on the next line.

> +  const int cfg_offset = sizeof (struct lto_function_header);
> +  const int main_offset = cfg_offset + header->cfg_size;
> +  const int string_offset = main_offset + header->main_size;
> +  struct data_in *data_in;
> +  unsigned int i;
> +  unsigned int count;
> +
> +  lto_input_block ib_main ((const char *) data + main_offset,
> +header->main_size, file_data->mode_table);
> +
> +  data_in =

Ditto.

> +bool
> +pass_ipa_hsa::gate (function *)
> +{
> +  return hsa_gen_requested_p () || in_lto_p;

Does it really need to be enabled whenever in_lto_p?
I mean, if HSA is not configured in, I think the gate should be false too.

Otherwise LGTM.

Jakub


[hsa merge 07/10] IPA-HSA pass

2016-01-13 Thread Martin Jambor
Hi,

this patch contains IPA-related changes that we need to bring about
for HSA.

The patch is a re-post of
https://gcc.gnu.org/ml/gcc-patches/2015-12/msg00720.html but so far we
have not received any feedback.  Let me quote the original
accompanying email here for reference:

When a target construct is gridified, the HSA GPU function is
associated with the CPU function throughout the compilation, so that
they can be registered as a pair in libgomp.

Ungridified target constructs and, more importantly, "pragma omp
declare target" marked functions emerge out of OMP expansion as one
gimple function for both the host and the accelerator. However, at
some point we need to create a special HSA function representation so
that we can modify behavior of a (very) few optimization passes for
them.

Both is done by the following new IPA pass, which creates new HSA
clones in these cases.  Moreover, it redirects the appropriate call
graph edges to be in between HSA implementations, marks HSA clones
with the flatten attribute to minimize any call overhead (which is
much more significant on GPUs) and makes sure both the CPU and GPU
functions are coupled together and remain in the same LTO partition so
that they can b registered together to libgomp.

Thanks,

Martin


2016-01-13  Martin Liska  
Martin Jambor  

* ipa-hsa.c: New file.
* lto-section-in.c (lto_section_name): Add hsa section name.
* lto-streamer.h (lto_section_type): Add hsa section.
* lto-partition.c: Include "hsa.h"
(add_symbol_to_partition_1): Put hsa implementations into the
same partition as host implementations.
* timevar.def (TV_IPA_HSA): New.

diff --git a/gcc/ipa-hsa.c b/gcc/ipa-hsa.c
new file mode 100644
index 000..dd47995
--- /dev/null
+++ b/gcc/ipa-hsa.c
@@ -0,0 +1,329 @@
+/* Callgraph based analysis of static variables.
+   Copyright (C) 2015-2016 Free Software Foundation, Inc.
+   Contributed by Martin Liska 
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+.  */
+
+/* Interprocedural HSA pass is responsible for creation of HSA clones.
+   For all these HSA clones, we emit HSAIL instructions and pass processing
+   is terminated.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "is-a.h"
+#include "hash-set.h"
+#include "vec.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "function.h"
+#include "basic-block.h"
+#include "gimple.h"
+#include "dumpfile.h"
+#include "gimple-pretty-print.h"
+#include "tree-streamer.h"
+#include "stringpool.h"
+#include "cgraph.h"
+#include "print-tree.h"
+#include "symbol-summary.h"
+#include "hsa.h"
+
+namespace {
+
+/* If NODE is not versionable, warn about not emiting HSAIL and return false.
+   Otherwise return true.  */
+
+static bool
+check_warn_node_versionable (cgraph_node *node)
+{
+  if (!node->local.versionable)
+{
+  warning_at (EXPR_LOCATION (node->decl), OPT_Whsa,
+ "could not emit HSAIL for function %s: function cannot be "
+ "cloned", node->name ());
+  return false;
+}
+  return true;
+}
+
+/* The function creates HSA clones for all functions that were either
+   marked as HSA kernels or are callable HSA functions.  Apart from that,
+   we redirect all edges that come from an HSA clone and end in another
+   HSA clone to connect these two functions.  */
+
+static unsigned int
+process_hsa_functions (void)
+{
+  struct cgraph_node *node;
+
+  if (hsa_summaries == NULL)
+hsa_summaries = new hsa_summary_t (symtab);
+
+  FOR_EACH_DEFINED_FUNCTION (node)
+{
+  hsa_function_summary *s = hsa_summaries->get (node);
+
+  /* A linked function is skipped.  */
+  if (s->m_binded_function != NULL)
+   continue;
+
+  if (s->m_kind != HSA_NONE)
+   {
+ if (!check_warn_node_versionable (node))
+   continue;
+ cgraph_node *clone = node->create_virtual_clone
+   (vec  (), NULL, NULL, "hsa");
+ TREE_PUBLIC (clone->decl) = TREE_PUBLIC (node->decl);
+
+ clone->force_output = true;
+ hsa_summaries->link_functions (clone, node, s->m_kind, false);
+
+ if (dump_file)
+   fprintf (dump_file, "Created a new HSA clone: %s, type: %s\n",
+clone->name (),
+s->m_kind == HSA_KERNEL ? "kernel"