Module: Mesa
Branch: main
Commit: 4420251947443e5f29ecc702900e560e66e73f0e
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=4420251947443e5f29ecc702900e560e66e73f0e

Author: Francisco Jerez <[email protected]>
Date:   Wed Oct 19 16:13:24 2022 -0700

intel/rt: Fix L3 bank performance bottlenecks due to SW stack stride alignment.

Power-of-two SW stack sizes are prone to causing collisions in the
hashing function used by the L3 to map memory addresses to banks,
which can cause stack accesses from most DSSes to bottleneck on a
single L3 bank.  Fix it by padding the SW stack stride by a single
cacheline if it was a power of two.  This has been reported by Felix
DeGrood to improve Quake2 RTX performance by ~30% on DG2-512 in
combination with other RT patches Lionel Landwerlin has been working
on.

Many thanks to Felix DeGrood for doing much of the legwork and
providing several iterations of Q2RTX performance counter dumps which
eventually prompted me to consider the hash collision theory and
motivated this patch, and for providing additional performance counter
dumps confirming that there is no longer an appreciable imbalance in
traffic across L3 banks after this change.

Reviewed-by: Lionel Landwerlin <[email protected]>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/21461>

---

 src/intel/compiler/brw_rt.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/src/intel/compiler/brw_rt.h b/src/intel/compiler/brw_rt.h
index d03187636f6..15c024072f1 100644
--- a/src/intel/compiler/brw_rt.h
+++ b/src/intel/compiler/brw_rt.h
@@ -230,6 +230,18 @@ brw_rt_compute_scratch_layout(struct brw_rt_scratch_layout *layout,
    assert(size % 64 == 0);
    layout->sw_stack_start = size;
    layout->sw_stack_size = ALIGN(sw_stack_size, 64);
+
+   /* Currently it's always the case that sw_stack_size is a power of
+    * two, but power-of-two SW stack sizes are prone to causing
+    * collisions in the hashing function used by the L3 to map memory
+    * addresses to banks, which can cause stack accesses from most
+    * DSSes to bottleneck on a single L3 bank.  Fix it by padding the
+    * SW stack by a single cacheline if it was a power of two.
+    */
+   if (layout->sw_stack_size > 64 &&
+       util_is_power_of_two_nonzero(layout->sw_stack_size))
+      layout->sw_stack_size += 64;
+
    size += num_stack_ids * layout->sw_stack_size;
 
    layout->total_size = size;
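
For readers who want to try the stride adjustment outside of Mesa, the
following is a minimal standalone sketch of the logic in the hunk above.
The names align_u32(), is_pow2_nonzero() and padded_sw_stack_stride() are
hypothetical stand-ins introduced here for illustration (for ALIGN(),
util_is_power_of_two_nonzero() and the in-place update of
layout->sw_stack_size respectively); they are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for Mesa's ALIGN(): round v up to a multiple of a.
 * Assumes a is a power of two. */
static uint32_t align_u32(uint32_t v, uint32_t a)
{
   return (v + a - 1) & ~(a - 1);
}

/* Stand-in for util_is_power_of_two_nonzero(). */
static bool is_pow2_nonzero(uint32_t v)
{
   return v != 0 && (v & (v - 1)) == 0;
}

/* Hypothetical helper mirroring the patch: align the per-stack size to a
 * 64-byte cacheline, then add one extra cacheline if the result is a
 * power of two larger than 64 bytes. */
static uint32_t padded_sw_stack_stride(uint32_t sw_stack_size)
{
   uint32_t stride = align_u32(sw_stack_size, 64);

   if (stride > 64 && is_pow2_nonzero(stride))
      stride += 64;

   return stride;
}
```

Under these assumptions a 1024-byte stack gets a 1088-byte stride, a
100-byte request is first aligned to 128 and then padded to 192, while
sizes of 64 and 192 bytes are left unchanged.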
