Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers

2017-02-17 Thread Alexander Monakov
On Fri, 17 Feb 2017, Cesar Philippidis wrote:
> > And then, I don't specifically have a problem with discontinuing CUDA 5.5
> > support, and requiring 6.5, for example, but that should be a conscious
> > decision.
> 
> We should probably ditch CUDA 5.5. In fact, according to trunk's cuda.h,
> it requires version 8.0.

No, the define in the cuda.h substitute header does not imply a requirement.

> Alex, are you using CUDA 5.5 in your environment?

No.

Alexander


Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers

2017-02-17 Thread Cesar Philippidis
On 02/15/2017 01:29 PM, Thomas Schwinge wrote:
> On Mon, 13 Feb 2017 08:58:39 -0800, Cesar Philippidis wrote:

>> @@ -952,25 +958,30 @@ nvptx_exec (void (*fn), size_t mapnum, void 
>> **hostaddrs, void **devaddrs,
>>CUdevice dev = nvptx_thread()->ptx_dev->dev;
>>/* 32 is the default for known hardware.  */
>>int gang = 0, worker = 32, vector = 32;
>> -  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
>> +  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
>>  
>>cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
>>cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
>>cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
>>cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
>> +  cu_rf  = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
>> +  cu_sm  = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>>  
>>if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
>>&& cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
>>&& cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
>> -  && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev)  == CUDA_SUCCESS)
>> +  && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
>> +  && cuDeviceGetAttribute (&rf_size, cu_rf, dev)  == CUDA_SUCCESS
>> +  && cuDeviceGetAttribute (&sm_size, cu_sm, dev)  == CUDA_SUCCESS)
> 
> Trying to compile this on CUDA 5.5/331.113, I run into:
> 
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c: In function 'nvptx_exec':
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: error: 
> 'CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR' undeclared (first use 
> in this function)
>cu_rf  = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
> ^~~~
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: note: each 
> undeclared identifier is reported only once for each function it appears in
> [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:971:16: error: 
> 'CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR' undeclared (first 
> use in this function)
>cu_sm  = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
> ^~~~
> 
> For reference, please see the code handling
> CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in the trunk version
> of the nvptx_open_device function.

ACK. While this change is fairly innocuous, it might be too invasive for
GCC 7. Maybe we can backport it to 7.1?

> And then, I don't specifically have a problem with discontinuing CUDA 5.5
> support, and requiring 6.5, for example, but that should be a conscious
> decision.

We should probably ditch CUDA 5.5. In fact, according to trunk's cuda.h,
it requires version 8.0.
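
(For reference, the define being referred to here is presumably the
CUDA_VERSION macro in libgomp's cuda.h substitute header, along these
lines; as Alexander points out elsewhere in this thread, it does not by
itself imply a run-time requirement:

/* From the cuda.h substitute header on trunk (illustrative).  */
#define CUDA_VERSION 8000
)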

Alex, are you using CUDA 5.5 in your environment?

>> @@ -980,8 +991,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, 
>> void **devaddrs,
>>   matches the hardware configuration.  Logical gangs are
>>   scheduled onto physical hardware.  To maximize usage, we
>>   should guess a large number.  */
>> -  if (default_dims[GOMP_DIM_GANG] < 1)
>> -default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;
> 
> That's "bad", because a non-zero "default_dims[GOMP_DIM_GANG]" (also
> known as "default_dims[0]") is used to decide whether to enter this whole
> code block, and with that assignment removed, every call of the
> nvptx_exec function will now re-do all this GOMP_OPENACC_DIM parsing,
> cuDeviceGetAttribute calls, computations, and so on.  (See "GOMP_DEBUG=1"
> output.)

Good point. I now treat neutral values (e.g. '-' arguments) as negative one.

> I think this whole code block should be moved into the nvptx_open_device
> function, to have it executed once when the device is opened -- after
> all, all these are per-device attributes.  (So, it's actually
> conceptually incorrect to have this done only once in the nvptx_exec
> function, given that this data then is used in the same process by/for
> potentially different hardware devices.)

Yeah, that's a better place. All of those hardware attributes are now
stored in ptx_device.
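
(A minimal sketch, not the actual gomp-4_0-branch code, of querying the
per-device attributes once at device-open time; the struct and function
names here are illustrative, and the last two attributes need CUDA 6.5+
headers, which is exactly the build failure reported above:

#include <stdbool.h>
#include <stddef.h>
#include <cuda.h>

/* Hypothetical per-device cache of the dimension-related attributes.  */
struct dev_dims
{
  int threads_per_block, warp_size, num_sms;
  int threads_per_sm, regs_per_sm, shared_per_sm;
};

static bool
query_dev_dims (CUdevice dev, struct dev_dims *d)
{
  struct { int *slot; CUdevice_attribute attr; } q[] = {
    { &d->threads_per_block, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK },
    { &d->warp_size, CU_DEVICE_ATTRIBUTE_WARP_SIZE },
    { &d->num_sms, CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT },
    { &d->threads_per_sm, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR },
    { &d->regs_per_sm, CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR },
    { &d->shared_per_sm, CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR },
  };

  /* Query each attribute once; fail as a group, like the original code.  */
  for (size_t i = 0; i < sizeof q / sizeof q[0]; i++)
    if (cuDeviceGetAttribute (q[i].slot, q[i].attr, dev) != CUDA_SUCCESS)
      return false;
  return true;
}
)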

> And, one could argue that the GOMP_OPENACC_DIM parsing conceptually
> belongs into generic libgomp code, instead of the nvptx plugin.  (But
> that aspect can be cleaned up later: currently, the nvptx plugin is the
> only one supporting/using it.)
> 
>>/* The worker size must not exceed the hardware.  */
>>if (default_dims[GOMP_DIM_WORKER] < 1
>>|| (default_dims[GOMP_DIM_WORKER] > worker && gang))
>> @@ -998,9 +1007,56 @@ nvptx_exec (void (*fn), size_t mapnum, void 
>> **hostaddrs, void **devaddrs,
>>  }
>>pthread_mutex_unlock (&ptx_dev_lock);
>>  
>> +  int reg_used = -1;  /* Dummy value.  */
>> +  cuFuncGetAttribute (&reg_used, CU_FUNC_ATTRIBUTE_NUM_REGS, 

Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers

2017-02-15 Thread Thomas Schwinge
Hi Cesar!

On Mon, 13 Feb 2017 08:58:39 -0800, Cesar Philippidis wrote:
> This patch does the following:
> 
>  * Adjusts the default num_gangs to utilize more of the GPU hardware.
>  * Teaches libgomp to emit a diagnostic when num_workers isn't supported.
> 
> [...]

Thanks!

> This patch has been applied to gomp-4_0-branch.

For easier review, I'm quoting here your r245393 commit with whitespace
changes ignored:

> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -917,10 +918,15 @@ nvptx_exec (void (*fn), size_t mapnum, void 
> **hostaddrs, void **devaddrs,
>   seen_zero = 1;
>  }
>  
> -  if (seen_zero)
> -{
> -  /* See if the user provided GOMP_OPENACC_DIM environment
> -  variable to specify runtime defaults. */
> +  /* Both reg_granularity and warp_granularity were extracted from
> + the "Register Allocation Granularity" data in Nvidia's CUDA Occupancy
> + Calculator spreadsheet.  Specifically, these values apply to SM_30+
> + targets.  */
> +  const int reg_granularity = 256;

That is per warp, so a granularity of 256 / 32 = 8 registers per thread.
(Would be strange otherwise.)

> +  const int warp_granularity = 4;
> +
> +  /* See if the user provided GOMP_OPENACC_DIM environment variable to
> + specify runtime defaults. */
>static int default_dims[GOMP_DIM_MAX];
>  
>pthread_mutex_lock (&ptx_dev_lock);
> @@ -952,25 +958,30 @@ nvptx_exec (void (*fn), size_t mapnum, void 
> **hostaddrs, void **devaddrs,
>CUdevice dev = nvptx_thread()->ptx_dev->dev;
>/* 32 is the default for known hardware.  */
>int gang = 0, worker = 32, vector = 32;
> -   CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
> +  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
>  
>cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
>cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
>cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
>cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
> +  cu_rf  = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
> +  cu_sm  = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>  
>if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
> && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
> && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
> -   && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev)  == CUDA_SUCCESS)
> +   && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
> +   && cuDeviceGetAttribute (&rf_size, cu_rf, dev)  == CUDA_SUCCESS
> +   && cuDeviceGetAttribute (&sm_size, cu_sm, dev)  == CUDA_SUCCESS)

Trying to compile this on CUDA 5.5/331.113, I run into:

[...]/source-gcc/libgomp/plugin/plugin-nvptx.c: In function 'nvptx_exec':
[...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: error: 
'CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR' undeclared (first use in 
this function)
   cu_rf  = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
^~~~
[...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: note: each 
undeclared identifier is reported only once for each function it appears in
[...]/source-gcc/libgomp/plugin/plugin-nvptx.c:971:16: error: 
'CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR' undeclared (first 
use in this function)
   cu_sm  = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
^~~~

For reference, please see the code handling
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in the trunk version
of the nvptx_open_device function.

And then, I don't specifically have a problem with discontinuing CUDA 5.5
support, and requiring 6.5, for example, but that should be a conscious
decision.

> @@ -980,8 +991,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, 
> void **devaddrs,
>matches the hardware configuration.  Logical gangs are
>scheduled onto physical hardware.  To maximize usage, we
>should guess a large number.  */
> -   if (default_dims[GOMP_DIM_GANG] < 1)
> - default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;

That's "bad", because a non-zero "default_dims[GOMP_DIM_GANG]" (also
known as "default_dims[0]") is used to decide whether to enter this whole
code block, and with that assignment removed, every call of the
nvptx_exec function will now re-do all this GOMP_OPENACC_DIM parsing,
cuDeviceGetAttribute calls, computations, and so on.  (See "GOMP_DEBUG=1"
output.)
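
(To make the sentinel concrete, here is a minimal sketch, not the plugin
code itself, of how a non-zero default_dims[0] guards the one-time setup:

#include <pthread.h>

static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
static int default_dims[3];   /* GOMP_DIM_MAX entries in the real code.  */

static void
ensure_default_dims (void)
{
  pthread_mutex_lock (&ptx_dev_lock);
  if (!default_dims[0])   /* Zero gang dimension means "not yet set up".  */
    {
      /* ...parse GOMP_OPENACC_DIM, call cuDeviceGetAttribute, etc...  */

      /* The assignment the patch removed is what made this sentinel
         non-zero; without it, every later call repeats the block above.  */
      default_dims[0] = 1024;
    }
  pthread_mutex_unlock (&ptx_dev_lock);
}
)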

I think this whole code block should be moved into the nvptx_open_device
function, to have it executed once when the device is opened -- after
all, all these are per-device attributes.  (So, it's actually
conceptually incorrect to have this done only once in the nvptx_exec
function, given that this data then is used in the same process by/for
potentially different hardware devices.)

[gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers

2017-02-13 Thread Cesar Philippidis
This patch does the following:

 * Adjusts the default num_gangs to utilize more of the GPU hardware.
 * Teaches libgomp to emit a diagnostic when num_workers isn't supported.

According to the confusing CUDA literature, it appears that the previous
num_gangs default wasn't fully utilizing the GPU hardware. The previous
strategy was to set num_gangs to the number of SMs. However, each SM
can execute multiple CUDA blocks concurrently. In this patch,
I'm using a relaxed version of the formulas from Nvidia's CUDA Occupancy
Calculator spreadsheet to determine num_gangs. More specifically, since
we're not using shared memory that extensively right now, I've omitted
that constraint from the formula.
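
To illustrate the kind of computation involved, here is a hedged
reconstruction, not the exact formula from the patch, of an
occupancy-style num_gangs estimate using the two granularity constants
(reg_granularity is per warp, i.e. 256 / 32 = 8 registers per thread):

/* Hedged reconstruction; parameter names are illustrative.  reg_used is
   the kernel's registers per thread (CU_FUNC_ATTRIBUTE_NUM_REGS), rf_size
   the SM register file size, num_workers the warps per gang.  */
static int
guess_num_gangs (int reg_used, int warp_size, int rf_size,
                 int threads_per_sm, int num_sms, int num_workers)
{
  const int reg_granularity = 256;   /* Registers, allocated per warp.  */
  const int warp_granularity = 4;    /* Warps, per scheduling unit.  */

  if (reg_used < 1)
    reg_used = 1;

  /* Registers one warp consumes, rounded up to the allocation unit.  */
  int regs_per_warp = ((reg_used * warp_size + reg_granularity - 1)
                       / reg_granularity) * reg_granularity;

  /* Warps that fit in one SM's register file, rounded down to the warp
     granularity and capped by the SM's thread limit.  */
  int warps_per_sm = rf_size / regs_per_warp / warp_granularity
                     * warp_granularity;
  if (warps_per_sm > threads_per_sm / warp_size)
    warps_per_sm = threads_per_sm / warp_size;

  /* Each gang (CUDA block) holds num_workers warps; run that many
     blocks on each of the num_sms multiprocessors.  */
  int gangs_per_sm = warps_per_sm / num_workers;
  return (gangs_per_sm > 0 ? gangs_per_sm : 1) * num_sms;
}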

While I was at it, I also taught the nvptx plugin how to emit a
diagnostic when the hardware doesn't have enough registers to support
the requested num_workers at run time. There are two problems here. 1)
The register file is a resource shared among all of the threads in an
SM. The more registers each thread uses, the fewer threads
the CUDA blocks can contain. 2) In order to eliminate MIN_EXPRs in the
for-loop branches in worker-partitioned loops, the nvptx BE is currently
hard-coding the default num_workers to 32 for any parallel region that
doesn't contain an explicit num_workers. When I disabled that
optimization, I observed a 2.5x slow down in CloverLeaf. So rather than
disabling that optimization, I taught the runtime to give the end user
some performance hints. E.g., recompile your program with
-fopenacc-dim=-:num_workers.
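
(A sketch of the run-time diagnostic described above; GOMP_PLUGIN_fatal
is the plugin's normal error path, but the formula and message wording
here are assumptions, not the committed code:

/* From libgomp-plugin.h.  */
extern void GOMP_PLUGIN_fatal (const char *, ...);

static void
validate_num_workers (int reg_used, int rf_size, int warp_size,
                      int vector_length, int num_workers)
{
  const int reg_granularity = 256;   /* Registers per warp, SM_30+.  */

  if (reg_used < 1)
    reg_used = 1;
  /* Registers one warp consumes, rounded up to the allocation unit.  */
  int regs_per_warp = ((reg_used * warp_size + reg_granularity - 1)
                       / reg_granularity) * reg_granularity;
  /* Warps, and hence workers, that the register file can sustain.  */
  int max_workers = rf_size / regs_per_warp * warp_size / vector_length;

  if (num_workers > max_workers)
    GOMP_PLUGIN_fatal ("num_workers (%d) is not supported by this hardware;"
                       " recompile the program with -fopenacc-dim=-:%d",
                       num_workers, max_workers);
}
)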

This patch has been applied to gomp-4_0-branch.

Cesar
2017-02-13  Cesar Philippidis  

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_exec): Adjust the default num_gangs.
	Add diagnostic when the hardware cannot support the requested
	num_workers.


diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index d1261b4..8c696eb 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -899,6 +899,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   CUdeviceptr dp;
   struct nvptx_thread *nvthd = nvptx_thread ();
   const char *maybe_abort_msg = "(perhaps abort was called)";
+  static int warp_size, block_size, dev_size, cpu_size, rf_size, sm_size;
 
   function = targ_fn->fn;
 
@@ -917,90 +918,145 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	seen_zero = 1;
 }
 
-  if (seen_zero)
-{
-  /* See if the user provided GOMP_OPENACC_DIM environment
-	 variable to specify runtime defaults. */
-  static int default_dims[GOMP_DIM_MAX];
+  /* Both reg_granularity and warp_granularity were extracted from
+ the "Register Allocation Granularity" data in Nvidia's CUDA Occupancy
+ Calculator spreadsheet.  Specifically, these values apply to SM_30+
+ targets.  */
+  const int reg_granularity = 256;
+  const int warp_granularity = 4;
 
-  pthread_mutex_lock (&ptx_dev_lock);
-  if (!default_dims[0])
+  /* See if the user provided GOMP_OPENACC_DIM environment variable to
+ specify runtime defaults. */
+  static int default_dims[GOMP_DIM_MAX];
+
+  pthread_mutex_lock (&ptx_dev_lock);
+  if (!default_dims[0])
+{
+  /* We only read the environment variable once.  You can't
+	 change it in the middle of execution.  The syntax  is
+	 the same as for the -fopenacc-dim compilation option.  */
+  const char *env_var = getenv ("GOMP_OPENACC_DIM");
+  if (env_var)
 	{
-	  /* We only read the environment variable once.  You can't
-	 change it in the middle of execution.  The syntax  is
-	 the same as for the -fopenacc-dim compilation option.  */
-	  const char *env_var = getenv ("GOMP_OPENACC_DIM");
-	  if (env_var)
-	{
-	  const char *pos = env_var;
+	  const char *pos = env_var;
 
-	  for (i = 0; *pos && i != GOMP_DIM_MAX; i++)
+	  for (i = 0; *pos && i != GOMP_DIM_MAX; i++)
+	{
+	  if (i && *pos++ != ':')
+		break;
+	  if (*pos != ':')
 		{
-		  if (i && *pos++ != ':')
+		  const char *eptr;
+
+		  errno = 0;
+	  long val = strtol (pos, (char **) &eptr, 10);
+		  if (errno || val < 0 || (unsigned)val != val)
 		break;
-		  if (*pos != ':')
-		{
-		  const char *eptr;
-
-		  errno = 0;
-	  long val = strtol (pos, (char **) &eptr, 10);
-		  if (errno || val < 0 || (unsigned)val != val)
-			break;
-		  default_dims[i] = (int)val;
-		  pos = eptr;
-		}
+		  default_dims[i] = (int)val;
+		  pos = eptr;
 		}
 	}
+	}
 
-	  int warp_size, block_size, dev_size, cpu_size;
-	  CUdevice dev = nvptx_thread()->ptx_dev->dev;
-	  /* 32 is the default for known hardware.  */
-	  int gang = 0, worker = 32, vector = 32;
-	  CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
-
-	  cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
-	  cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
-	  cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
-	  cu_tpm  = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
-
-	  if