Apologies if this is not the right list for this question. The kernel MAINTAINERS file suggests it is, but please let me know if I should repost elsewhere.

I have a custom OpenCL application running under Ubuntu 16.04.4 with the HWE kernel 4.13 and the amdgpu-pro 17.50 drivers. It runs on a Fujitsu D3313-S6 industrial mainboard (http://www.fujitsu.com/fts/products/computing/peripheral/mainboards/industrial-mainboards/d3313s.html).

After a period of running, anywhere from 5 minutes to 48 hours, we begin to see these kernel traces. At some point after these errors appear, the application fails.

[   99.348774] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
[   99.355041] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00103042
[   99.362509] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
[   99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
[  100.437547] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
[  100.443811] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00103042
[  100.451288] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
[  100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
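
The numbers in these lines can be cross-checked by hand. The following is only a sanity-check sketch, not driver code. It assumes 4 KiB pages, and it assumes the status register holds the protection bits in bits 0-7, the memory client ID in bits 12-19, the read/write flag in bit 24 and the VMID in bits 25-28. Those bit positions are my assumption; they do reproduce the values printed in the trace. The function name decode_vm_fault is made up for illustration.

```python
def decode_vm_fault(addr_reg, status_reg, client_reg):
    """Decode the three values printed alongside an amdgpu VM fault.

    The bit positions below are assumptions chosen to match the trace.
    """
    client_bytes = client_reg.to_bytes(4, "big")
    return {
        "page": addr_reg,                    # the log prints this value as a page number
        "byte_address": addr_reg << 12,      # assuming 4 KiB pages
        "protections": status_reg & 0xFF,
        "client_id": (status_reg >> 12) & 0xFF,
        "access": "write" if (status_reg >> 24) & 1 else "read",
        "vmid": (status_reg >> 25) & 0xF,
        # The client value is four ASCII bytes, padded with NUL.
        "client_name": client_bytes.rstrip(b"\x00").decode("ascii"),
    }


fault = decode_vm_fault(0x00103042, 0x09020014, 0x54433000)
print(fault)
```

With the values from the trace this gives page 1060930, VMID 4, a write access, client ID 32 and the client name TC0, which matches the summary line the driver prints.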

I know from searching the web that this error can appear if there are errors in the OpenCL program. However, we have run the exact same program on multiple other hardware configurations without problems. On Linux it has run successfully on every machine tested with a discrete AMD GPU, just not on the GX-424CC APU. On Windows the code has run on a large number of platforms, including the GX-424CC, without issues.

I'm prepared to believe we have an error in our OpenCL code, but have no clue where to start looking. What does the error actually mean, and why does it happen? Is it related to buffer transfers between host and device, or does it occur during execution of a kernel?
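
One possible starting point is to check whether every fault hits the same page. The sketch below tallies the "VM fault" summary lines from saved dmesg output; the regular expression is written against the line format shown in the trace above and is an assumption for other driver versions. The name tally_faults is hypothetical.

```python
import re
from collections import Counter

# Matches the one-line summary the driver prints for each fault.
FAULT_RE = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)'"
)


def tally_faults(log_text):
    """Count faults per (vmid, page, access, client) tuple."""
    counts = Counter()
    for match in FAULT_RE.finditer(log_text):
        _, vmid, page, access, client = match.groups()
        counts[(int(vmid), int(page), access, client)] += 1
    return counts


sample = """\
[   99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
[  100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
"""
print(tally_faults(sample))
```

If a long run shows only one or two distinct pages, that might point at a single buffer rather than at scattered corruption, although it does not prove either.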

Whilst attempting to investigate the problem I have tried a number of kernel arguments for the driver. If I reduce the amount of memory assigned to VRAM with vramlimit=64, the error appears to take longer to occur.

If I run it with the arguments vm_debug=1 vm_fault_stop=1, the error no longer appears. I would have expected it to occur at least once because of vm_fault_stop=1, but it does not. Instead I occasionally get this error:
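
To rule out the arguments simply not being applied, the values the loaded module reports can be read back from sysfs. This is a minimal sketch that assumes the parameters are exposed under /sys/module/amdgpu/parameters/; some of them may be readable only by root, in which case the helper returns None for them. The name read_module_params is made up for illustration.

```python
from pathlib import Path


def read_module_params(names, module="amdgpu", sysfs_root="/sys/module"):
    """Return {name: value} for module parameters exposed in sysfs.

    Parameters that are missing or unreadable map to None.
    """
    param_dir = Path(sysfs_root) / module / "parameters"
    values = {}
    for name in names:
        try:
            values[name] = (param_dir / name).read_text().strip()
        except OSError:
            # Covers a missing file as well as a permission error.
            values[name] = None
    return values


if __name__ == "__main__":
    print(read_module_params(["vramlimit", "vm_debug", "vm_fault_stop"]))
```

A None here does not by itself mean the argument was ignored; it may only mean the file needs root to read.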

[ 7612.741693] amdgpu 0000:00:01.0: IH ring buffer overflow (0x00000010, 0x00000000, 0x00000020)


So is this a bug in the driver or in the OpenCL code? How can I make progress debugging this issue?

Thanks
Will

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
