Hey Cory,

On 2023-11-21 21:01, Cordell Bloor wrote:
> On 2023-11-18 00:39, Cordell Bloor wrote:
>> Each time a HIP application is executed, the rocr-runtime prints the message:
>>
>>     KFD does not support xnack mode query.
>>     ROCr must assume xnack is disabled.
>>
>> It is unclear to me whether something is actually wrong or not. This
>> message is emitted from a debug_print statement in amd_topology.cpp. An
>> example of this message can be found in the CI logs [1].
> 
> This is a debug message. It is guarded by NDEBUG, so it would not be
> printed if rocr were built in Release mode. There is a bit of discussion
> upstream as to whether the debug_print should instead be guarded by an
> environment variable rather than a preprocessor definition.

> The Linux kernel on Debian is built without HSA_AMD_SVM enabled. That is
> the KConfig for "Enable HMM-based shared virtual memory manager", which
> is required for xnack+ operation. The xnack feature allows some AMD GPUs
> to retry memory accesses that fail due to a page fault, which is used as
> a mechanism for migrating managed memory automatically from host to
> device. With xnack disabled, page faults in device code are not
> recoverable [1].

I've rebuilt our kernel with this option enabled, and the message indeed
went away. Great!

Enabling HSA_AMD_SVM also required DEVICE_PRIVATE (which in turn
suggests HMM_MIRROR). I don't see any downside to these; should we
request them from the Kernel Team?
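
In case it helps with checking other machines, here is a minimal Python
sketch that reports whether these options are set in a kernel config. It
assumes the config is installed as /boot/config-<release>, as on Debian.
The function names and the option list are my own, purely for
illustration.

```python
import os
import re

# Options discussed in this thread; the list is only illustrative.
OPTIONS = ("HSA_AMD_SVM", "DEVICE_PRIVATE", "HMM_MIRROR")


def parse_kconfig(text, options=OPTIONS):
    """Map each option to its value ('y', 'm', ...), or 'n' if absent or not set."""
    state = {opt: "n" for opt in options}
    for line in text.splitlines():
        # "# CONFIG_FOO is not set" lines start with '#' and never match.
        m = re.match(r"CONFIG_(\w+)=(.*)$", line.strip())
        if m and m.group(1) in state:
            state[m.group(1)] = m.group(2)
    return state


def check_running_kernel():
    # Assumption: the running kernel's config lives in /boot/config-<release>.
    path = "/boot/config-" + os.uname().release
    with open(path) as f:
        return parse_kconfig(f.read())
```

Running print(check_running_kernel()) on a given machine shows the state
of all three options at once.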

That did remind me of another message I've seen in dmesg, repeated a
few dozen times, when some (but not all) tests are run:

    amdgpu: init_user_pages: Failed to get user pages: -1

rocrand is a good example where these occur.

Despite the failure, I did not observe any negative side effects.
Enabling HSA_AMD_SVM did not make the message go away, either. Have you
seen this message in dmesg as well?

Best,
Christian
