Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers
On Fri, 17 Feb 2017, Cesar Philippidis wrote:
> > And then, I don't specifically have a problem with discontinuing CUDA 5.5
> > support, and requiring 6.5, for example, but that should be a conscious
> > decision.
>
> We should probably ditch CUDA 5.5. In fact, according to trunk's cuda.h,
> it requires version 8.0.

No, the define in the cuda.h substitute header does not imply a requirement.

> Alex, are you using CUDA 5.5 in your environment?

No.

Alexander
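(For context on "the define": libgomp's substitute header pins the driver-API
version the plugin is written against with a plain preprocessor define,
roughly as below. This is an approximation from memory, not an exact quote of
trunk; the point is that nothing compares it against the CUDA toolkit that
happens to be installed.)

    /* Approximate sketch of the version define in the cuda.h substitute
       header: it selects which driver-API declarations the plugin is
       compiled against; it is not a build- or run-time requirement
       check.  */
    #ifndef CUDA_VERSION
    # define CUDA_VERSION 8000
    #endif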
Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers
On 02/15/2017 01:29 PM, Thomas Schwinge wrote:
> On Mon, 13 Feb 2017 08:58:39 -0800, Cesar Philippidis wrote:
>> @@ -952,25 +958,30 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>>        CUdevice dev = nvptx_thread()->ptx_dev->dev;
>>        /* 32 is the default for known hardware.  */
>>        int gang = 0, worker = 32, vector = 32;
>> -      CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
>> +      CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
>>
>>        cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
>>        cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
>>        cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
>>        cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
>> +      cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
>> +      cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>>
>>        if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
>>            && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
>>            && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
>> -          && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS)
>> +          && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
>> +          && cuDeviceGetAttribute (&rf_size, cu_rf, dev) == CUDA_SUCCESS
>> +          && cuDeviceGetAttribute (&sm_size, cu_sm, dev) == CUDA_SUCCESS)
>
> Trying to compile this on CUDA 5.5/331.113, I run into:
>
>     [...]/source-gcc/libgomp/plugin/plugin-nvptx.c: In function 'nvptx_exec':
>     [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: error:
>     'CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR' undeclared (first use
>     in this function)
>        cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
>                ^~~~
>     [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: note: each
>     undeclared identifier is reported only once for each function it appears in
>     [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:971:16: error:
>     'CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR' undeclared (first
>     use in this function)
>        cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>                ^~~~
>
> For reference, please see the code handling
> CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in the trunk version
> of the nvptx_open_device function.

ACK. While this change is fairly innocuous, it might be too invasive for
GCC 7.1. Maybe we can backport it to 7.1?

> And then, I don't specifically have a problem with discontinuing CUDA 5.5
> support, and requiring 6.5, for example, but that should be a conscious
> decision.

We should probably ditch CUDA 5.5. In fact, according to trunk's cuda.h,
it requires version 8.0.

Alex, are you using CUDA 5.5 in your environment?

>> @@ -980,8 +991,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>>           matches the hardware configuration.  Logical gangs are
>>           scheduled onto physical hardware.  To maximize usage, we
>>           should guess a large number.  */
>> -      if (default_dims[GOMP_DIM_GANG] < 1)
>> -        default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;
>
> That's "bad", because a non-zero "default_dims[GOMP_DIM_GANG]" (also
> known as "default_dims[0]") is used to decide whether to enter this whole
> code block, and with that assignment removed, every call of the
> nvptx_exec function will now re-do all this GOMP_OPENACC_DIM parsing,
> cuDeviceGetAttribute calls, computations, and so on.  (See "GOMP_DEBUG=1"
> output.)

Good point. I now treat the neutral values (e.g. '-' arguments) as
negative one.

> I think this whole code block should be moved into the nvptx_open_device
> function, to have it executed once when the device is opened -- after
> all, all these are per-device attributes.  (So, it's actually
> conceptually incorrect to have this done only once in the nvptx_exec
> function, given that this data then is used in the same process by/for
> potentially different hardware devices.)

Yeah, that's a better place. All of those hardware attributes are now
stored in ptx_device.

> And, one could argue that the GOMP_OPENACC_DIM parsing conceptually
> belongs into generic libgomp code, instead of the nvptx plugin.  (But
> that aspect can be cleaned up later: currently, the nvptx plugin is the
> only one supporting/using it.)
>
>>        /* The worker size must not exceed the hardware.  */
>>        if (default_dims[GOMP_DIM_WORKER] < 1
>>            || (default_dims[GOMP_DIM_WORKER] > worker && gang))
>> @@ -998,9 +1007,56 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>>      }
>>        pthread_mutex_unlock (&ptx_dev_lock);
>>
>> +      int reg_used = -1;  /* Dummy value.  */
>> +      cuFuncGetAttribute (&reg_used, CU_FUNC_ATTRIBUTE_NUM_REGS,
Re: [gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers
Hi Cesar!

On Mon, 13 Feb 2017 08:58:39 -0800, Cesar Philippidis wrote:
> This patch does the following:
>
> * Adjusts the default num_gangs to utilize more of the GPU hardware.
> * Teaches libgomp to emit a diagnostic when num_workers isn't supported.
>
> [...]

Thanks!

> This patch has been applied to gomp-4_0-branch.

For easier review, I'm quoting here your r245393 commit with whitespace
changes ignored:

> --- libgomp/plugin/plugin-nvptx.c
> +++ libgomp/plugin/plugin-nvptx.c
> @@ -917,10 +918,15 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>  	seen_zero = 1;
>      }
>
> -  if (seen_zero)
> -    {
> -      /* See if the user provided GOMP_OPENACC_DIM environment
> -	 variable to specify runtime defaults. */
> +  /* Both reg_granularity and warp_granularity were extracted from
> +     the "Register Allocation Granularity" in Nvidia's CUDA Occupancy
> +     Calculator spreadsheet.  Specifically, this required SM_30+
> +     targets.  */
> +  const int reg_granularity = 256;

That is per warp, so a granularity of 256 / 32 = 8 registers per thread.
(Would be strange otherwise.)

> +  const int warp_granularity = 4;
> +
> +  /* See if the user provided GOMP_OPENACC_DIM environment variable to
> +     specify runtime defaults.  */
>    static int default_dims[GOMP_DIM_MAX];
>
>    pthread_mutex_lock (&ptx_dev_lock);
> @@ -952,25 +958,30 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>        CUdevice dev = nvptx_thread()->ptx_dev->dev;
>        /* 32 is the default for known hardware.  */
>        int gang = 0, worker = 32, vector = 32;
> -      CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
> +      CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm, cu_rf, cu_sm;
>
>        cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
>        cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
>        cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
>        cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
> +      cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
> +      cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
>
>        if (cuDeviceGetAttribute (&block_size, cu_tpb, dev) == CUDA_SUCCESS
>            && cuDeviceGetAttribute (&warp_size, cu_ws, dev) == CUDA_SUCCESS
>            && cuDeviceGetAttribute (&dev_size, cu_mpc, dev) == CUDA_SUCCESS
> -          && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS)
> +          && cuDeviceGetAttribute (&cpu_size, cu_tpm, dev) == CUDA_SUCCESS
> +          && cuDeviceGetAttribute (&rf_size, cu_rf, dev) == CUDA_SUCCESS
> +          && cuDeviceGetAttribute (&sm_size, cu_sm, dev) == CUDA_SUCCESS)

Trying to compile this on CUDA 5.5/331.113, I run into:

    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c: In function 'nvptx_exec':
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: error:
    'CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR' undeclared (first use
    in this function)
       cu_rf = CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR;
               ^~~~
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:970:16: note: each
    undeclared identifier is reported only once for each function it appears in
    [...]/source-gcc/libgomp/plugin/plugin-nvptx.c:971:16: error:
    'CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR' undeclared (first
    use in this function)
       cu_sm = CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR;
               ^~~~

For reference, please see the code handling
CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_MULTIPROCESSOR in the trunk version
of the nvptx_open_device function.

And then, I don't specifically have a problem with discontinuing CUDA 5.5
support, and requiring 6.5, for example, but that should be a conscious
decision.

> @@ -980,8 +991,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
>           matches the hardware configuration.  Logical gangs are
>           scheduled onto physical hardware.  To maximize usage, we
>           should guess a large number.  */
> -      if (default_dims[GOMP_DIM_GANG] < 1)
> -        default_dims[GOMP_DIM_GANG] = gang ? gang : 1024;

That's "bad", because a non-zero "default_dims[GOMP_DIM_GANG]" (also
known as "default_dims[0]") is used to decide whether to enter this whole
code block, and with that assignment removed, every call of the
nvptx_exec function will now re-do all this GOMP_OPENACC_DIM parsing,
cuDeviceGetAttribute calls, computations, and so on.  (See "GOMP_DEBUG=1"
output.)

I think this whole code block should be moved into the nvptx_open_device
function, to have it executed once when the device is opened -- after
all, all these are per-device attributes.  (So, it's actually
conceptually incorrect to have this done only once in the nvptx_exec
function, given that this data then is used in the same process by/for
potentially different hardware devices.)
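(To make the granularity arithmetic concrete, here is a self-contained C
sketch of the occupancy-style computation being discussed. The helper name
max_threads_per_sm is made up for illustration, and the rounding scheme is a
plausible reading of the quoted patch rather than the exact branch code;
reg_used, warp_size, rf_size and cpu_size correspond to the identifiers
quoted above, and the granularity constants are the SM_30+ spreadsheet
values.)

    #include <stdio.h>

    /* How many threads one SM can run, given the kernel's per-thread
       register count.  reg_granularity is per warp, i.e. 256 / 32 = 8
       registers per thread, as noted above.  */
    static int
    max_threads_per_sm (int reg_used, int warp_size, int rf_size,
                        int cpu_size)
    {
      const int reg_granularity = 256;   /* Registers, allocated per warp.  */
      const int warp_granularity = 4;    /* Warps, allocated per block.  */

      /* Registers consumed by one warp, rounded up to the allocation
         granularity.  */
      int regs_per_warp = ((reg_used * warp_size + reg_granularity - 1)
                           / reg_granularity) * reg_granularity;
      if (regs_per_warp == 0)
        return cpu_size;   /* No register pressure; only the thread cap applies.  */

      /* Warps the register file can hold, rounded down to the warp
         allocation granularity, then converted to threads.  */
      int warps = rf_size / regs_per_warp / warp_granularity * warp_granularity;
      int threads = warps * warp_size;

      /* Never exceed the hardware's thread limit per multiprocessor.  */
      return threads < cpu_size ? threads : cpu_size;
    }

    int
    main (void)
    {
      /* Example: 48 registers per thread, Kepler-like limits.  */
      printf ("%d\n", max_threads_per_sm (48, 32, 65536, 2048));
      return 0;
    }

(For example, a kernel using 48 registers per thread on a device with a
65536-register file per SM needs 1536 registers per warp, so 40 warps fit
after rounding 42 down to a multiple of 4 -- that is, 1280 threads, well
under the 2048-thread hardware cap; the program prints 1280.)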
[gomp4] adjust num_gangs and add a diagnostic for unsupported num_workers
This patch does the following:

* Adjusts the default num_gangs to utilize more of the GPU hardware.
* Teaches libgomp to emit a diagnostic when num_workers isn't supported.

According to the confusing CUDA literature, it appears that the previous
num_gangs wasn't fully utilizing the GPU hardware. The previous strategy
was to set num_gangs to the number of SM processors. However, SM
processors can execute multiple CUDA blocks concurrently. In this patch,
I'm using a relaxed version of the formulas from Nvidia's CUDA Occupancy
Calculator spreadsheet to determine num_gangs. More specifically, since
we're not using shared memory that extensively right now, I've omitted
that constraint from the formula.

While I was at it, I also taught the nvptx plugin how to emit a
diagnostic when the hardware doesn't have enough registers to support
the requested num_workers at run time. There are two problems here:

1) The register file is a shared resource between all of the threads in
   an SM. The more registers each thread in the SM utilizes, the fewer
   threads the CUDA blocks can contain.

2) In order to eliminate MIN_EXPRs in the for-loop branches of
   worker-partitioned loops, the nvptx BE is currently hard-coding the
   default num_workers to 32 for any parallel region that doesn't
   contain an explicit num_workers. When I disabled that optimization,
   I observed a 2.5x slowdown in CloverLeaf.

So rather than disabling that optimization, I taught the runtime to give
the end user some performance hints, e.g., recompile your program with
-fopenacc-dim=-:num_workers.

This patch has been applied to gomp-4_0-branch.

Cesar

2017-02-13  Cesar Philippidis

	libgomp/
	* plugin/plugin-nvptx.c (nvptx_exec): Adjust the default
	num_gangs.  Add diagnostic when the hardware cannot support
	the requested num_workers.

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index d1261b4..8c696eb 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -899,6 +899,7 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   CUdeviceptr dp;
   struct nvptx_thread *nvthd = nvptx_thread ();
   const char *maybe_abort_msg = "(perhaps abort was called)";
+  static int warp_size, block_size, dev_size, cpu_size, rf_size, sm_size;
 
   function = targ_fn->fn;
 
@@ -917,90 +918,145 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 	seen_zero = 1;
     }
 
-  if (seen_zero)
-    {
-      /* See if the user provided GOMP_OPENACC_DIM environment
-	 variable to specify runtime defaults. */
-      static int default_dims[GOMP_DIM_MAX];
+  /* Both reg_granularity and warp_granularity were extracted from
+     the "Register Allocation Granularity" in Nvidia's CUDA Occupancy
+     Calculator spreadsheet.  Specifically, this required SM_30+
+     targets.  */
+  const int reg_granularity = 256;
+  const int warp_granularity = 4;
 
-      pthread_mutex_lock (&ptx_dev_lock);
-      if (!default_dims[0])
+  /* See if the user provided GOMP_OPENACC_DIM environment variable to
+     specify runtime defaults.  */
+  static int default_dims[GOMP_DIM_MAX];
+
+  pthread_mutex_lock (&ptx_dev_lock);
+  if (!default_dims[0])
+    {
+      /* We only read the environment variable once.  You can't
+	 change it in the middle of execution.  The syntax is
+	 the same as for the -fopenacc-dim compilation option.  */
+      const char *env_var = getenv ("GOMP_OPENACC_DIM");
+      if (env_var)
 	{
-	  /* We only read the environment variable once.  You can't
-	     change it in the middle of execution.  The syntax is
-	     the same as for the -fopenacc-dim compilation option.  */
-	  const char *env_var = getenv ("GOMP_OPENACC_DIM");
-	  if (env_var)
-	    {
-	      const char *pos = env_var;
+	  const char *pos = env_var;
 
-	      for (i = 0; *pos && i != GOMP_DIM_MAX; i++)
+	  for (i = 0; *pos && i != GOMP_DIM_MAX; i++)
+	    {
+	      if (i && *pos++ != ':')
+		break;
+	      if (*pos != ':')
 		{
-		  if (i && *pos++ != ':')
+		  const char *eptr;
+
+		  errno = 0;
+		  long val = strtol (pos, (char **)&eptr, 10);
+		  if (errno || val < 0 || (unsigned)val != val)
 		    break;
-		  if (*pos != ':')
-		    {
-		      const char *eptr;
-
-		      errno = 0;
-		      long val = strtol (pos, (char **)&eptr, 10);
-		      if (errno || val < 0 || (unsigned)val != val)
-			break;
-		      default_dims[i] = (int)val;
-		      pos = eptr;
-		    }
+		  default_dims[i] = (int)val;
+		  pos = eptr;
 		}
 	    }
+	}
 
-      int warp_size, block_size, dev_size, cpu_size;
-      CUdevice dev = nvptx_thread()->ptx_dev->dev;
-      /* 32 is the default for known hardware.  */
-      int gang = 0, worker = 32, vector = 32;
-      CUdevice_attribute cu_tpb, cu_ws, cu_mpc, cu_tpm;
-
-      cu_tpb = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK;
-      cu_ws = CU_DEVICE_ATTRIBUTE_WARP_SIZE;
-      cu_mpc = CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT;
-      cu_tpm = CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_MULTIPROCESSOR;
-
-      if
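(Since both the patch and the review keep returning to the GOMP_OPENACC_DIM
syntax, here is a standalone demo of the same parsing loop outside libgomp.
parse_dims and the example string are made up for illustration; the loop body
mirrors the quoted hunk. The accepted form is up to three ':'-separated
non-negative integers, gang:worker:vector, where an element may be left
empty to keep the runtime default.)

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define GOMP_DIM_MAX 3

    /* Mirror of the quoted parsing loop: up to GOMP_DIM_MAX non-negative
       integers separated by ':', where an empty element leaves the
       corresponding default dimension untouched.  */
    static void
    parse_dims (const char *env_var, int dims[GOMP_DIM_MAX])
    {
      const char *pos = env_var;

      for (int i = 0; *pos && i != GOMP_DIM_MAX; i++)
        {
          if (i && *pos++ != ':')
            break;
          if (*pos != ':')
            {
              char *eptr;

              errno = 0;
              long val = strtol (pos, &eptr, 10);
              if (errno || val < 0 || (unsigned) val != val)
                break;
              dims[i] = (int) val;
              pos = eptr;
            }
        }
    }

    int
    main (void)
    {
      int dims[GOMP_DIM_MAX] = { 0, 0, 0 };

      /* As if the user had set GOMP_OPENACC_DIM=1024::32, leaving the
         worker dimension at its default (0 means "unset" here).  */
      parse_dims ("1024::32", dims);
      printf ("gang=%d worker=%d vector=%d\n", dims[0], dims[1], dims[2]);
      return 0;
    }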