[amdgpu] Errors with amdgpu-pro 17.50 running on GX-424CC SOC

Will Wagner Fri, 09 Mar 2018 03:45:43 -0800

Apologies if this is not the right list for this question. The kernel MAINTAINERS file suggests it is, but please let me know if I should repost elsewhere.

I have a custom OpenCL application running under Ubuntu 16.04.4 with the HWE kernel 4.13 and amdgpu-pro 17.50 drivers. This is running on a Fujitsu D3313-S6 industrial mainboard (http://www.fujitsu.com/fts/products/computing/peripheral/mainboards/industrial-mainboards/d3313s.html).

After a period of running, anywhere from 5 minutes to 48 hours, we begin to see these kernel traces. At some point after seeing these errors the application fails.


[   99.348774] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
[   99.355041] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00103042
[   99.362509] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
[   99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)

[  100.437547] amdgpu 0000:00:01.0: GPU fault detected: 146 0x08492014
[  100.443811] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00103042
[  100.451288] amdgpu 0000:00:01.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x09020014
[  100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
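For reference, the numbers in these traces can be cross-checked by hand. The sketch below is only an illustration: the bit positions used for the status register are an assumption (chosen because they reproduce the vmid, client id and write flag printed in the "VM fault" line), the 4 KiB page size is likewise assumed, and `decode_fault` is a hypothetical helper, not part of any driver tooling.

```python
def decode_fault(addr: int, status: int, mc_client: int, page_size: int = 4096) -> dict:
    """Decode the three values printed with a VM protection fault.

    Bit layout of `status` is an ASSUMPTION that happens to match the trace:
      bits 0-7   protection flags
      bits 12-19 memory client id
      bit  24    1 = write, 0 = read
      bits 25-28 vmid
    """
    return {
        "protections": status & 0xFF,
        "mc_id": (status >> 12) & 0xFF,
        "is_write": bool((status >> 24) & 0x1),
        "vmid": (status >> 25) & 0xF,
        # FAULT_ADDR is a page number, not a byte address.
        "page": addr,
        # Byte address, assuming 4 KiB GPU pages.
        "gpu_va": addr * page_size,
        # The client word is four ASCII bytes, NUL padded ('TC0' here).
        "client": mc_client.to_bytes(4, "big").rstrip(b"\x00").decode("ascii"),
    }


# Values copied from the trace above.
fault = decode_fault(addr=0x00103042, status=0x09020014, mc_client=0x54433000)
for key, value in fault.items():
    print(f"{key:12} {value}")
```

Under those assumptions the fault address 0x00103042 is simply page 1060930 written in hex, and 0x54433000 is the ASCII string 'TC0', so the three lines of each trace describe one and the same event.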

I know from searching the web that this error can appear if there are errors in the OpenCL program. However, we have run the exact same program on multiple other hardware configurations and have not seen problems. On Linux we have had success running on all machines tested with a discrete AMD GPU, just not on the GX-424CC APU. On Windows we have had the code running on a large number of platforms, including the GX-424CC, without issues.

I'm prepared to believe we have an error in our OpenCL code, but have no clue where to start looking. What does the error actually mean and why does it happen? Is it to do with buffer transfers between host and device, or does it happen during execution of a kernel?

Whilst attempting to investigate the problem I have tried a number of kernel arguments for the driver. If I reduce the amount of memory assigned to vram with vramlimit=64 then it appears to take longer for the error to occur.

If I run it with the arguments vm_debug=1 vm_fault_stop=1 the error no longer appears. I would have expected it to occur at least once due to vm_fault_stop=1, but it does not. Instead I occasionally get this error:
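One thing worth ruling out is that the arguments never reached the driver. A minimal sketch for reading the values back from sysfs follows; it assumes the loaded module is named `amdgpu` and exposes its parameters under `/sys/module/amdgpu/parameters/`. The helper `read_module_params` is hypothetical, and `vram_limit` is an assumed spelling of the vram parameter that should be checked against `modinfo amdgpu`.

```python
from pathlib import Path


def read_module_params(names, module="amdgpu", sys_root="/sys/module"):
    """Return {name: value-or-None} for the given module parameters.

    None means the file could not be read, e.g. the module is not
    loaded or the parameter name is not one the module exposes.
    """
    base = Path(sys_root) / module / "parameters"
    values = {}
    for name in names:
        try:
            values[name] = (base / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    # Parameter names are assumptions; adjust to what modinfo reports.
    wanted = ["vm_debug", "vm_fault_stop", "vram_limit"]
    for name, value in read_module_params(wanted).items():
        shown = value if value is not None else "(not exposed)"
        print(f"{name} = {shown}")
```

If a parameter still shows its default value, the argument was probably not passed in a form the module recognises, and the change in behaviour would then have some other cause.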

[ 7612.741693] amdgpu 0000:00:01.0: IH ring buffer overflow (0x00000010, 0x00000000, 0x00000020)

So is this a bug in the driver or in the OpenCL code? How can I make progress debugging this issue?
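One step that may help narrow this down is to check whether every fault hits the same page and vmid, or whether the addresses vary between runs. The sketch below counts "VM fault" lines by (page, vmid, access, client). It is only an illustration: `summarize_faults` is a hypothetical helper, and the regular expression is written against the message format shown in this post, which may differ on other kernel versions.

```python
import re
from collections import Counter

# Matches e.g.:
#   VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
VM_FAULT = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(read|write) from '([^']*)'"
)


def summarize_faults(log_text: str) -> Counter:
    """Count VM faults grouped by (page, vmid, access, client)."""
    counts = Counter()
    for match in VM_FAULT.finditer(log_text):
        _prot, vmid, page, access, client = match.groups()
        counts[(int(page), int(vmid), access, client)] += 1
    return counts


# The two fault lines quoted in this post.
sample = """\
[   99.369980] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
[  100.458758] VM fault (0x14, vmid 4) at page 1060930, write from 'TC0' (0x54433000) (32)
"""

counts = summarize_faults(sample)
for (page, vmid, access, client), n in counts.most_common():
    print(f"page {page} (0x{page:x}) vmid {vmid} {access} from {client}: {n}x")
```

To use it on a real log, feed it the saved output of `dmesg` instead of `sample`. A single page repeating, as in the two traces here, would point in a different direction than faults scattered across many addresses.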


Thanks
Will

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
