The SMI event interface used one ID, KFD_SMI_EVENT_QUEUE_RESTORE, for both kfd_smi_event_queue_restore() (queues were actually resumed) and kfd_smi_event_queue_restore_rescheduled() (the restore_work cmpxchg failed and the work was requeued). The only differentiator was a trailing '0' vs 'R' character in the QUEUE_RESTORE payload format, which userspace consumers commonly do not parse: rocr-runtime's svm_profiler.cpp reads only %x for the gpuid in its QUEUE_RESTORE case and discards the trailing byte. As a result, real restores and reschedule retries appear identical in HSA_SVM_PROFILE logs, making it impossible to tell whether a high QUEUE_RESTORE count reflects actual queue resume work or workqueue retry churn from a stream of mmu notifier evictions racing svm_range_restore_work.
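
The conflation can be reproduced with a small sketch. This is not the svm_profiler.cpp code; `struct restore_fields` and `parse_restore_ignoring_flag()` are hypothetical names, and the only assumption taken from the old format macro is the payload shape "%lld -%d %x %c\n". A consumer that stops scanning at the gpuid gets identical fields for a real restore ('0') and a reschedule ('R'):

```c
#include <stdio.h>

/* Hypothetical consumer-side record; not a kernel or ROCr type. */
struct restore_fields {
	long long ns;
	int pid;
	unsigned int gpuid;
};

/*
 * Parse an old-format QUEUE_RESTORE payload ("%lld -%d %x %c\n") the way a
 * consumer that ignores the trailing discriminator would.
 * Returns 0 on success, -1 if the three leading fields are not all present.
 */
static int parse_restore_ignoring_flag(const char *payload,
				       struct restore_fields *out)
{
	int n = sscanf(payload, "%lld -%d %x",
		       &out->ns, &out->pid, &out->gpuid);

	return n == 3 ? 0 : -1;
}
```

Feeding it "1000 -42 abcd 0\n" and "1000 -42 abcd R\n" fills both records with the same ns, pid and gpuid, so nothing downstream can tell the two cases apart.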

Encoding semantically different events as one event ID plus a trailing discriminator byte also makes the QUEUE_RESTORE payload inconsistent with the other per-process queue events, such as QUEUE_EVICTION, which use a (timestamp, pid, gpuid [, trigger]) layout. This complicates parsers that handle multiple event types.

Add a new KFD_SMI_EVENT_QUEUE_RESTORE_RESCHEDULED = 14 enum value and switch kfd_smi_event_queue_restore_rescheduled() to emit it. Drop the trailing %c discriminator from KFD_EVENT_FMT_QUEUE_RESTORE so both emitters produce the same (ns, pid, node) payload.

New consumers can subscribe to the new event index via KFD_SMI_EVENT_MASK_FROM_INDEX(KFD_SMI_EVENT_QUEUE_RESTORE_RESCHEDULED) to receive only reschedule notifications.

Old userspace that does not subscribe to the new bit will stop seeing reschedule events in its QUEUE_RESTORE stream, which is the desired behavior. Old userspace that parsed the trailing %c byte in the QUEUE_RESTORE payload will now find nothing after the gpuid.

This is a uAPI-additive change for the enum (no existing index is renumbered) and a payload-format simplification for QUEUE_RESTORE.

The companion ROCr patch updates svm_profiler.cpp to subscribe to the new event index and surface it as a distinct log string.

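A minimal consumer-side sketch of the new scheme follows. It assumes that KFD_SMI_EVENT_QUEUE_RESTORE is 10, that KFD_SMI_EVENT_MASK_FROM_INDEX(i) expands to 1ULL << ((i) - 1), and that each line read from the SMI fd is the event ID in hex followed by the payload; all three should be checked against kfd_ioctl.h and kfd_smi_events.c. The SMI_* names, `struct smi_restore_event`, and both helpers are local stand-ins for illustration, not uAPI:

```c
#include <stdint.h>
#include <stdio.h>

/* Local stand-ins for the uAPI values (assumed, see the note above). */
enum {
	SMI_QUEUE_RESTORE = 10,
	SMI_QUEUE_RESTORE_RESCHEDULED = 14,
};
#define SMI_MASK_FROM_INDEX(i) (1ULL << ((i) - 1))

/* Hypothetical consumer-side record. */
struct smi_restore_event {
	unsigned int id;
	long long ns;
	int pid;
	unsigned int node;
};

/* Event mask selecting both real restores and reschedules. */
static uint64_t restore_event_mask(void)
{
	return SMI_MASK_FROM_INDEX(SMI_QUEUE_RESTORE) |
	       SMI_MASK_FROM_INDEX(SMI_QUEUE_RESTORE_RESCHEDULED);
}

/*
 * Parse one event line of the assumed form "<id> <ns> -<pid> <node>\n".
 * Returns 0 for either restore event, -1 for anything else.
 */
static int parse_restore_line(const char *line, struct smi_restore_event *ev)
{
	if (sscanf(line, "%x %lld -%d %x",
		   &ev->id, &ev->ns, &ev->pid, &ev->node) != 4)
		return -1;
	if (ev->id != SMI_QUEUE_RESTORE &&
	    ev->id != SMI_QUEUE_RESTORE_RESCHEDULED)
		return -1;
	return 0;
}
```

With this shape the consumer distinguishes the two cases by ev->id alone, and a tool that only cares about retry churn can subscribe to the RESCHEDULED bit by itself.
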
Signed-off-by: Amir Shetaia <[email protected]>
---
 drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c |  6 +++---
 include/uapi/linux/kfd_ioctl.h              | 12 +++++++++---
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
index d2bc169e84b0..a7870fe81ace 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_smi_events.c
@@ -312,7 +312,7 @@ void kfd_smi_event_queue_restore(struct kfd_node *node, pid_t pid)
 {
 	kfd_smi_event_add(pid, node, KFD_SMI_EVENT_QUEUE_RESTORE,
 			  KFD_EVENT_FMT_QUEUE_RESTORE(ktime_get_boottime_ns(), pid,
-			  node->id, '0'));
+			  node->id));
 }
 
 void kfd_smi_event_queue_restore_rescheduled(struct mm_struct *mm)
@@ -328,9 +328,9 @@ void kfd_smi_event_queue_restore_rescheduled(struct mm_struct *mm)
 		struct kfd_process_device *pdd = p->pdds[i];
 
 		kfd_smi_event_add(p->lead_thread->pid, pdd->dev,
-				  KFD_SMI_EVENT_QUEUE_RESTORE,
+				  KFD_SMI_EVENT_QUEUE_RESTORE_RESCHEDULED,
 				  KFD_EVENT_FMT_QUEUE_RESTORE(ktime_get_boottime_ns(),
-				  p->lead_thread->pid, pdd->dev->id, 'R'));
+				  p->lead_thread->pid, pdd->dev->id));
 	}
 	kfd_unref_process(p);
 }
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 9584b5aab727..e911edf1911e 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -545,6 +545,7 @@ enum kfd_smi_event {
 	KFD_SMI_EVENT_UNMAP_FROM_GPU = 11,
 	KFD_SMI_EVENT_PROCESS_START = 12,
 	KFD_SMI_EVENT_PROCESS_END = 13,
+	KFD_SMI_EVENT_QUEUE_RESTORE_RESCHEDULED = 14,
 
 	/*
 	 * max event number, as a flag bit to get events from all processes,
@@ -623,8 +624,13 @@ struct kfd_ioctl_smi_events_args {
  *    stops during suspend
  *    migrate_update: GPU page fault is recovered by 'M' for migrate, 'U' for update
  *    rw: 'W' for write page fault, 'R' for read page fault
- *    rescheduled: 'R' if the queue restore failed and rescheduled to try again
  *    error_code: migrate failure error code, 0 if no error
+ *
+ * KFD_SMI_EVENT_QUEUE_RESTORE indicates queues were resumed after eviction.
+ * KFD_SMI_EVENT_QUEUE_RESTORE_RESCHEDULED indicates the restore work
+ * failed validation and the workqueue was requeued. Subscribers wanting to
+ * distinguish the two should subscribe to both event indices; the payload
+ * format is identical (ns, pid, node).
  */
 #define KFD_EVENT_FMT_UPDATE_GPU_RESET(reset_seq_num, reset_cause)\
 		"%x %s\n", (reset_seq_num), (reset_cause)
@@ -653,8 +659,8 @@ struct kfd_ioctl_smi_events_args {
 #define KFD_EVENT_FMT_QUEUE_EVICTION(ns, pid, node, evict_trigger)\
 		"%lld -%d %x %d\n", (ns), (pid), (node), (evict_trigger)
 
-#define KFD_EVENT_FMT_QUEUE_RESTORE(ns, pid, node, rescheduled)\
-		"%lld -%d %x %c\n", (ns), (pid), (node), (rescheduled)
+#define KFD_EVENT_FMT_QUEUE_RESTORE(ns, pid, node)\
+		"%lld -%d %x\n", (ns), (pid), (node)
 
 #define KFD_EVENT_FMT_UNMAP_FROM_GPU(ns, pid, addr, size, node, unmap_trigger)\
 		"%lld -%d @%lx(%lx) %x %d\n", (ns), (pid), (addr), (size),\
-- 
2.43.0
