On Tue, 17 Dec 2019 11:18:58 +0100
"Michael Wibral" <michael.wib...@web.de> wrote:

> Hi,
>  
> I am a user/developer of the IDTxl toolbox 
> (https://github.com/pwollstadt/IDTxl/).
> We have the following issue and I am looking for pointers on how to further 
> debug my problem.
> We have an OpenCL kernel that computes neighbour distances between all 
> points in a set and also looks for neighbours in a certain range.
> 
> This code used to run on our older AMD and NVIDIA cards (Hawaii, Lexa models, 
> GTX 1080), but we encountered errors on newer models.
> 
> The situation now is:
> The code runs on CPU via POCL.
> The code runs on Hawaii and Lexa XT chips using the AMD fglrx and ROCm drivers.
> The code fails on AMD's Vega chips using the ROCm driver; more specifically, 
> the kernel starts and runs, and then (as indicated by the time elapsed, 
> measured with the Linux time command) it fails either in the very last 
> computation or when trying to return to the host. The error I get on the Vega GPUs is:
> (AMD) Memory access fault by GPU node-1 (Agent handle: 0x562731f06a00) on 
> address 0xa06200000. Reason: Page not present or supervisor privilege.
>  
> On NVIDIA GPUs we don't use subbuffer alignment (which seems to be connected 
> to the problem) as it is not required there, but if we do, we get this error 
> before the computation starts:
> (NVIDIA) clEnqueueReadBuffer failed: OUT_OF_RESOURCES
>  
>  
> From the pattern of errors I would tentatively conclude that:
> (a) The OpenCL kernel itself is OK as it runs without problems in POCL.
> (b) The error is related to the use of subbuffers or to the padding we use 
> for subbuffer alignment, but it does not seem to matter for all architectures 
>  (which is weird).
>  
> I am wondering whether this is an OpenCL 1.2 versus 2.0 issue (where 2.0 
> fails for us)?
> Can I enforce a certain OpenCL version to be used by PyOpenCL?
> Are there known issues or tricks when using OpenCL 1.2 and 2.0 devices in the 
> same system?
> Any other ideas on how to get more hints?

Use a CPU driver: pocl is one of them, and Intel has another.
I suspect either a structure that is padded differently (never assume
structures are compact!) or some other error in pointer calculation.
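
To illustrate the padding point, here is a minimal sketch using only the Python standard library. The record layout (one unsigned byte followed by one double) is a hypothetical example, not taken from the IDTxl kernel. It compares the size of the record when the fields are laid out back to back with the size the host C compiler's native alignment rules give. OpenCL C aligns each built-in type to its own size, so a device-side struct with these two fields would place the double at offset 8.

```python
import struct

# Hypothetical record: one unsigned char ("B") followed by one double ("d").
FIELDS = "Bd"

# "=" uses standard sizes with no alignment padding between fields.
packed_size = struct.calcsize("=" + FIELDS)

# "@" uses the native C alignment of the host platform, so padding may be
# inserted before the double.
native_size = struct.calcsize("@" + FIELDS)

# Number of padding bytes that a "compact" host-side buffer would be missing.
padding = native_size - packed_size

print("packed:", packed_size, "native:", native_size, "padding:", padding)
```

If the host fills a buffer assuming the compact layout while the kernel indexes it with the padded layout, every element after the first is read from the wrong offset, and the last elements fall outside the buffer.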

Use valgrind!
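
Alongside valgrind, a cheap host-side check of the sub-buffer offsets can rule out one class of pointer errors. The sketch below is an assumption about how such a check could look, not the IDTxl code: the helper names `align_up`, `plan_regions` and `check_regions` are hypothetical, and the alignment of 2048 bits is only an example value. In PyOpenCL the real value should come from the device (`device.mem_base_addr_align`, which is reported in bits, not bytes).

```python
def align_up(nbytes, alignment):
    """Round nbytes up to the next multiple of alignment (both in bytes)."""
    return -(-nbytes // alignment) * alignment


def plan_regions(chunk_nbytes, n_chunks, align_bits):
    """Return (origin, size) pairs for n_chunks sub-buffers plus the parent size.

    align_bits is the device's base address alignment in BITS.
    """
    align_bytes = align_bits // 8
    stride = align_up(chunk_nbytes, align_bytes)
    regions = [(i * stride, chunk_nbytes) for i in range(n_chunks)]
    return regions, stride * n_chunks


def check_regions(regions, parent_nbytes, align_bits):
    """Raise ValueError if a region is misaligned or leaves the parent buffer."""
    align_bytes = align_bits // 8
    for origin, size in regions:
        if origin % align_bytes != 0:
            raise ValueError("misaligned sub-buffer origin: %d" % origin)
        if origin + size > parent_nbytes:
            raise ValueError("sub-buffer at %d exceeds parent buffer" % origin)


# Example values (assumed): three chunks of 1000 bytes, 2048-bit alignment.
regions, parent_nbytes = plan_regions(1000, 3, 2048)
check_regions(regions, parent_nbytes, 2048)
print(regions, parent_nbytes)
```

The parent buffer has to be allocated with the padded size (`parent_nbytes`), not with the sum of the unpadded chunk sizes; otherwise the last sub-buffer extends past the end of the allocation.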

-- 
Jérôme Kieffer
_______________________________________________
PyOpenCL mailing list -- pyopencl@tiker.net
To unsubscribe send an email to pyopencl-le...@tiker.net
