This patch reduces the amount of LDS (Local Data Store) memory requested
for offload kernels. This allows more teams/gangs to run on the same
compute unit, increasing potential data throughput.
For OpenMP we can reduce the allocation to almost nothing. This means we
can have up to 40 single-thread teams per CU.
For OpenACC we need enough LDS to broadcast data between workers, and
the algorithm is not particularly memory efficient. This means we cannot
yet achieve the maximum thread count, but we can at least double the
current thread count, to 32, by halving the LDS usage and relying on
having 16 workers. (Note that I'm assuming Julian's multi-worker
support patches will be committed soon. Without those we can allocate no
LDS and have 40 single-worker teams. With the patches the same can also
be true, but that's still on the to-do list.)
LDS allocation remains unchanged for non-offload compiles (this is only
really used for running the testsuite).
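
For reference, the arithmetic behind these numbers can be sketched roughly
as below. This is only an illustration and not part of the patch: the
helper names teams_per_cu and m0_low_part are hypothetical, and it assumes
that the CU has 65536 bytes of LDS in total and that LDS is the only
limiting resource apart from the 40-per-CU limit quoted above.

```c
#include <assert.h>

/* Sizes taken from the patch.  */
#define OMP_LDS_SIZE   0x600
#define ACC_LDS_SIZE   32768
#define OTHER_LDS_SIZE 65536

/* Assumptions for illustration only: the CU has OTHER_LDS_SIZE bytes of
   LDS in total, and 40 is the per-CU limit quoted in the description.  */
#define TOTAL_LDS  OTHER_LDS_SIZE
#define MAX_PER_CU 40

/* Hypothetical helper: the number of teams/gangs that fit on one CU if
   LDS is the only limiting resource, capped at MAX_PER_CU.  */
static unsigned
teams_per_cu (unsigned lds_per_team)
{
  assert (lds_per_team > 0 && lds_per_team <= TOTAL_LDS);
  unsigned fit = TOTAL_LDS / lds_per_team;
  return fit < MAX_PER_CU ? fit : MAX_PER_CU;
}

/* Hypothetical helper: the low part of m0 as set in the prologue, i.e.
   the address of the topmost addressable LDS byte.  */
static unsigned
m0_low_part (unsigned lds_size)
{
  return lds_size - 1;
}
```

Under those assumptions teams_per_cu (OMP_LDS_SIZE) is 40 (42 would fit by
LDS alone), and teams_per_cu (ACC_LDS_SIZE) is 2, so two gangs of 16
workers give the 32 mentioned above.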
--
Andrew Stubbs
CodeSourcery / Mentor Graphics
Limit LDS usage.
2019-11-22 Andrew Stubbs <a...@codesourcery.com>
gcc/
* config/gcn/gcn.c (OMP_LDS_SIZE): Define.
(ACC_LDS_SIZE): Define.
(OTHER_LDS_SIZE): Define.
(LDS_SIZE): Redefine using above.
(gcn_expand_prologue): Initialize m0 with LDS_SIZE-1.
diff --git a/gcc/config/gcn/gcn.c b/gcc/config/gcn/gcn.c
index 3a8c10ed8b4..f85d84bbe95 100644
--- a/gcc/config/gcn/gcn.c
+++ b/gcc/config/gcn/gcn.c
@@ -70,10 +70,15 @@ int gcn_isa = 3; /* Default to GCN3. */
worker-single mode to worker-partitioned mode), per workgroup. Global
analysis could calculate an exact bound, but we don't do that yet.
- We reserve the whole LDS, which also prevents any other workgroup
- sharing the Compute Unit. */
+ We want to permit full occupancy, so size accordingly. */
-#define LDS_SIZE 65536
+#define OMP_LDS_SIZE 0x600 /* 0x600 is 1/40 total, rounded down. */
+#define ACC_LDS_SIZE 32768 /* Half of the total should be fine. */
+#define OTHER_LDS_SIZE 65536 /* If in doubt, reserve all of it. */
+
+#define LDS_SIZE (flag_openacc ? ACC_LDS_SIZE \
+ : flag_openmp ? OMP_LDS_SIZE \
+ : OTHER_LDS_SIZE)
/* The number of registers usable by normal non-kernel functions.
The SGPR count includes any special extra registers such as VCC. */
@@ -2876,8 +2881,11 @@ gcn_expand_prologue ()
/* Ensure that the scheduler doesn't do anything unexpected. */
emit_insn (gen_blockage ());
+ /* m0 is initialized for the usual LDS DS and FLAT memory case.
+ The low-part is the address of the topmost addressable byte, which is
+ size-1. The high-part is an offset and should be zero. */
emit_move_insn (gen_rtx_REG (SImode, M0_REG),
- gen_int_mode (LDS_SIZE, SImode));
+ gen_int_mode (LDS_SIZE-1, SImode));
emit_insn (gen_prologue_use (gen_rtx_REG (SImode, M0_REG)));