I've included some additional information on the ticket and have been
discussing this with the upstream developers. I'll summarize the
information here.

On 2023-11-18 00:39, Cordell Bloor wrote:
> Each time a HIP application is executed, the rocr-runtime prints the message:
>
>     KFD does not support xnack mode query.
>     ROCr must assume xnack is disabled.
>
> It is unclear to me whether something is actually wrong or not. This
> message is emitted from a debug_print statement in amd_topology.cpp. An
> example of this message can be found in the CI logs [1].

This is a debug message. The debug_print call is guarded by NDEBUG, so
it is compiled out, and the message is never printed, when rocr is built
in Release mode. There is some discussion upstream as to whether the
debug_print should be guarded by an environment variable checked at
runtime rather than by a preprocessor definition.
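To illustrate the two options under discussion, here is a minimal sketch. It is not the actual rocr code: the macro name, the function names, and the EXAMPLE_DEBUG_PRINT variable are all hypothetical, chosen only to show the difference between a compile-time guard and a runtime guard.

```cpp
#include <cstdio>
#include <cstdlib>

// Compile-time guard: when NDEBUG is defined (typical for Release builds),
// the macro expands to nothing and the message can never be printed.
// EXAMPLE_DEBUG_PRINT_CT is a hypothetical name, not rocr's macro.
#ifndef NDEBUG
#define EXAMPLE_DEBUG_PRINT_CT(...) std::fprintf(stderr, __VA_ARGS__)
#else
#define EXAMPLE_DEBUG_PRINT_CT(...) ((void)0)
#endif

// Runtime guard: the check is made on every call, so the same binary can
// be switched between quiet and verbose without rebuilding.
// EXAMPLE_DEBUG_PRINT is a hypothetical environment variable.
inline bool example_debug_enabled() {
  const char* value = std::getenv("EXAMPLE_DEBUG_PRINT");
  // Treat unset, empty, and values starting with '0' as disabled.
  return value != nullptr && value[0] != '\0' && value[0] != '0';
}

inline void example_debug_print_rt(const char* message) {
  if (example_debug_enabled()) {
    std::fputs(message, stderr);
  }
}
```

With the compile-time guard, a distribution that ships a non-Release build always shows the message. With the runtime guard, the message stays hidden unless the user opts in.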

> If there's something wrong with KFD, then that problem should be
> reported to the kernel developers. If there's nothing wrong with KFD,
> then this message should be suppressed.

The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
the Kconfig option for "Enable HMM-based shared virtual memory manager",
which is required for xnack+ operation. The xnack feature allows some AMD
GPUs to retry memory accesses that fail due to a page fault, which is
used as a mechanism for migrating managed memory automatically from host
to device. With xnack disabled, page faults in device code are not
recoverable [1].
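For anyone who wants to check their own kernel, here is a small sketch of how the option could be looked up in a kernel config file. It relies on the usual config file format, where an enabled boolean option appears as a line such as CONFIG_HSA_AMD_SVM=y and a disabled one as a comment of the form "# CONFIG_HSA_AMD_SVM is not set". The function name is hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns true if the kernel config text contains the line "<option>=y".
// Lines such as "# CONFIG_FOO is not set" do not match, so a disabled or
// absent option yields false.
bool kconfig_option_enabled(std::istream& config, const std::string& option) {
  const std::string wanted = option + "=y";
  std::string line;
  while (std::getline(config, line)) {
    if (line == wanted) {
      return true;
    }
  }
  return false;
}
```

To use it on a real system, open the config file of the running kernel with std::ifstream and pass the stream along with "CONFIG_HSA_AMD_SVM". I am assuming the file is installed as /boot/config-<kernel release>, which may not hold everywhere.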

Sincerely,
Cory Bloor

[1]: https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/