On 24/07/2025 14:25, Tobias Burnus wrote:
Hello Andrew, hello world,

some instructions take a bit longer but permit that the next
instruction is processed while finishing the work. That works
well - and speeds up the calculation - unless the next instruction
relies on those being available.
Which combinations instructions are affected are documented in the
respective ISA manual. The solution is either to move some other
(later) instruction between them - or the simple solution: adding
waiting do-nothing 'NOP' instructions.

The AMD GPU require 0 to 5 nops to be inserted; while MI200 had
already several, MI300 added some more. (See MI300's ISA manual
under "4.5. Manually Inserted Wait States (NOPs)".)

This patch is supposed to handle all missing ones - and, thus,
fix random fails with MI300 where depending on timing
the instruction result was already there or not. (But see below.)

OK for mainline?

* * *

Note that even with the patch applied, about 1.5% of the
'execution test' tests of 'make check-target-libgomp'
fail ([not all are offloading tests] or 0.5% of all tests).

When trying it brute force, by adding 5 nop between every
instruction, the number of fails was lower - about 1/3. Thus,
it seems as if some additional 's_nop' are required but that
there is also some other issue; it seems as if that's also a
timing issue - possibly locking or atomic or scope related
like sc0/sc1/nt and glc, but it could be something else.
In any case, something to debug after the s_nop patch has landed.

To put some numbers: in libgomp, Iget 126 FAIL and PASS 22202
(0.56% fail) or, looking only at 'execution test' lines, 98 fails
and 6492 pass (1.49% fails).

Tobias

+/* Device requires CDNA1-style manually inserted wait states for AVGPRs.  */
+#define TARGET_AVGPR_CDNA3_NOPS TARGET_CDNA3

This is not for CDNA1, and not for AVGPRS.

-         /* VALU writes SGPR/VCC followed by v_{read,write}lane using
-            SGPR/VCC as lane select requires 4 wait states.  */
+         /* VALU writes SGPR/VCC followed by
+            - v_{read,write}lane using SGPR/VCC as lane select requires
+              4 wait states
+            - [CDNA3] VALU reads SGPR as constant requires 1 wait state
+            - [CDNA3] VALU reads SGPR as carry-in requires no wait states  */
          if ((prev_insn->age + nops_rqd) < 4
              && prev_insn->unit == UNIT_VECTOR
-             && get_attr_laneselect (insn) == LANESELECT_YES
+             && get_attr_laneselect (insn) != LANESELECT_NO
              && (hard_reg_set_intersect_p
                    (depregs, reg_class_contents[(int) SGPR_REGS])
                  || hard_reg_set_intersect_p
                       (depregs, reg_class_contents[(int) 
VCC_CONDITIONAL_REG])))
            nops_rqd = 4 - prev_insn->age;
+         else if ((prev_insn->age + nops_rqd) < 1
+                  && prev_insn->unit == UNIT_VECTOR
+                  && iunit == UNIT_VECTOR
+                  && hard_reg_set_intersect_p
+                       (depregs, reg_class_contents[(int) SGPR_REGS]))
+           nops_rqd = 1 - prev_insn->age;

Should this conditional not depend on CDNA3?

+         /* VALU writes EXEC followed by VALU DPP op requires 5 nop.  */
+         if ((prev_insn->age + nops_rqd) < 5
+             && itype == TYPE_VOP_DPP
+             && prev_insn->unit == UNIT_VECTOR
+             && TEST_HARD_REG_BIT (prev_insn->writes, EXECZ_REG))
+           nops_rqd = 5 - prev_insn->age;

Likewise?

I don't see any other issues.

Andrew

Reply via email to