================
@@ -42,6 +43,34 @@ COMPILER_RT_VISIBILITY void INSTR_PROF_INSTRUMENT_GPU_FUNC(uint64_t *counter,
   }
 }
 
+// Block-level sampling for offload PGO. For GPU kernels with stationary
+// behavior (where all thread blocks execute the same code paths regardless of
+// block ID), partial sampling significantly reduces instrumentation overhead
+// without losing PGO performance gains.
+//
+// Returns 1 if this block should be instrumented, 0 to skip. Samples by
+// matching lower bits of the x-dimension block ID to zero.
+//   sampling_bits=0: all blocks (100%)
+//   sampling_bits=3: every 8th block in x (12.5%, default)
+//
+// Note: We use only block_id_x rather than a fully linearized 3D block ID.
+// The 3D linearization requires __gpu_num_blocks_x/y which expands to
+// __builtin_amdgcn_workgroup_size_x/y. With -mcode-object-version=none (used
+// to build compiler-rt profile runtime), the compiler emits a load of
+// __oclc_ABI_version to select the correct ABI path. Since the profile runtime
+// is linked after device libs are internalized, __oclc_ABI_version is no longer
+// available. Using block_id_x directly avoids this dependency. For typical
+// kernels with large 1D or x-dominant grids this is sufficient; blocks sharing
+// the same x-index are sampled together in 3D grids (minor uniformity loss).
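
For readers following the quoted comment, the sampling rule it describes can be sketched in plain host-side C. This is only an illustration under stated assumptions, not the patch's code: the names `should_sample_block` and `count_sampled_blocks` are hypothetical, the block ID is passed in as an ordinary argument instead of being read from a GPU builtin, and the handling of `sampling_bits >= 32` is an assumption added to keep the shift well defined.

```c
#include <stdint.h>

/* Hypothetical helper (not from the patch): returns 1 if the block with the
 * given x-dimension ID should be instrumented, 0 otherwise. A block is
 * sampled when the low `sampling_bits` bits of its x ID are all zero. */
static int should_sample_block(uint32_t block_id_x, uint32_t sampling_bits) {
  if (sampling_bits == 0)
    return 1; /* mask is empty: every block is sampled (100%) */
  if (sampling_bits >= 32)
    return block_id_x == 0; /* assumption: avoid an undefined 32-bit shift */
  uint32_t mask = (UINT32_C(1) << sampling_bits) - 1;
  return (block_id_x & mask) == 0;
}

/* Hypothetical helper: counts the sampled blocks in a grid_x * grid_y grid.
 * Only the x index is tested, so every block in a column that shares a
 * sampled x index is sampled together, as the quoted comment notes. */
static uint32_t count_sampled_blocks(uint32_t grid_x, uint32_t grid_y,
                                     uint32_t sampling_bits) {
  uint32_t count = 0;
  for (uint32_t y = 0; y < grid_y; ++y)
    for (uint32_t x = 0; x < grid_x; ++x)
      count += (uint32_t)should_sample_block(x, sampling_bits);
  return count;
}
```

With `sampling_bits=3`, a 64-block 1D grid samples the 8 blocks whose x ID is a multiple of 8 (12.5%). A 64x4 grid samples 32 blocks, the same fraction, but they fall in 8 whole columns instead of being spread evenly across the grid.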
----------------
arsenm wrote:

Really the runtime library should be built separately for each incompatible ABI version

https://github.com/llvm/llvm-project/pull/190708
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
