================
@@ -42,6 +43,34 @@ COMPILER_RT_VISIBILITY void INSTR_PROF_INSTRUMENT_GPU_FUNC(uint64_t *counter,
   }
 }
+// Block-level sampling for offload PGO. For GPU kernels with stationary
+// behavior (where all thread blocks execute the same code paths regardless of
+// block ID), partial sampling significantly reduces instrumentation overhead
+// without losing PGO performance gains.
+//
+// Returns 1 if this block should be instrumented, 0 to skip. Samples by
+// matching lower bits of the x-dimension block ID to zero.
+//   sampling_bits=0: all blocks (100%)
+//   sampling_bits=3: every 8th block in x (12.5%, default)
+//
+// Note: We use only block_id_x rather than a fully linearized 3D block ID.
+// The 3D linearization requires __gpu_num_blocks_x/y which expands to
+// __builtin_amdgcn_workgroup_size_x/y. With -mcode-object-version=none (used
+// to build compiler-rt profile runtime), the compiler emits a load of
+// __oclc_ABI_version to select the correct ABI path. Since the profile runtime
+// is linked after device libs are internalized, __oclc_ABI_version is no longer
+// available. Using block_id_x directly avoids this dependency. For typical
+// kernels with large 1D or x-dominant grids this is sufficient; blocks sharing
+// the same x-index are sampled together in 3D grids (minor uniformity loss).

----------------
arsenm wrote:

Really the runtime library should be built separately for each incompatible ABI version

https://github.com/llvm/llvm-project/pull/190708
_______________________________________________
cfe-commits mailing list
[email protected]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
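
For illustration only, the sampling rule described in the quoted comment (keep a block when the low sampling_bits bits of its x-dimension block ID are zero) can be sketched in plain host-side C. This is a minimal sketch, not the code from the patch: the names should_sample_block and count_sampled are hypothetical, and block IDs are passed in as ordinary integers rather than read through any GPU builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (name not taken from the patch).
 * Returns 1 if the block with this x-dimension ID would be sampled,
 * 0 otherwise. A block is sampled when the low `sampling_bits` bits
 * of block_id_x are all zero. */
static int should_sample_block(uint32_t block_id_x, uint32_t sampling_bits) {
  /* Assumption of this sketch: keep the shift within 32 bits. */
  assert(sampling_bits < 32);
  uint32_t mask = (UINT32_C(1) << sampling_bits) - 1u;
  return (block_id_x & mask) == 0;
}

/* Hypothetical helper: counts how many blocks of a grid_x * grid_y grid
 * would be sampled. The y index is ignored by the rule, so every block
 * that shares a sampled x index is sampled together. */
static uint32_t count_sampled(uint32_t grid_x, uint32_t grid_y,
                              uint32_t sampling_bits) {
  uint32_t sampled = 0;
  for (uint32_t y = 0; y < grid_y; ++y)
    for (uint32_t x = 0; x < grid_x; ++x)
      sampled += (uint32_t)should_sample_block(x, sampling_bits);
  return sampled;
}
```

Under these assumptions, a 64 x 4 grid with sampling_bits=3 samples 32 of its 256 blocks (12.5%): the eight x indices 0, 8, ..., 56, each across all four y rows. This is the "sampled together" behavior the comment calls a minor uniformity loss.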
