On Tue, 17 Dec 2019 11:18:58 +0100 "Michael Wibral" <michael.wib...@web.de> wrote:
> Hi,
>
> I am a user/developer of the IDTxl toolbox
> (https://github.com/pwollstadt/IDTxl/).
> We have the following issue and I am looking for pointers on how to further
> debug my problem.
> We have an OpenCL kernel that computes neighbour distances between all
> points in a set and also looks for neighbours in a certain range.
>
> This code used to run on our older AMD and NVIDIA cards (Hawaii, Lexa models,
> GTX 1080), but we encountered errors on newer models.
>
> The situation now is:
> The code runs on CPU via POCL.
> The code runs on Hawaii and Lexa XT chips using the AMD fglrx and ROCm drivers.
> The code fails on AMD's Vega chips using the ROCm driver; more specifically,
> the kernel starts and runs, and then (as indicated by the elapsed time,
> measured with Linux time) it fails either in the very last computation or when
> trying to return to the host. The error I get on the Vega GPUs is:
> (AMD) Memory access fault by GPU node-1 (Agent handle: 0x562731f06a00) on
> address 0xa06200000. Reason: Page not present or supervisor privilege.
>
> On NVIDIA GPUs we don't use sub-buffer alignment (which seems to be connected
> to the problem) as it is not required there, but if we do, we get this error
> before the computation starts:
> (NVIDIA) clEnqueueReadBuffer failed: OUT_OF_RESOURCES
>
>
> From the pattern of errors I would tentatively conclude that:
> (a) The OpenCL kernel itself is OK, as it runs without problems in POCL.
> (b) The error is related to the use of sub-buffers or to the padding we use
> for sub-buffer alignment, but it does not seem to matter on all architectures
> (which is weird).
>
> I am wondering whether this is an OpenCL 1.2 versus 2.0 issue (2.0 being the
> version that fails for us)?
> Can I enforce a certain OpenCL version to be used by PyOpenCL?
> Are there known issues or tricks when using OpenCL 1.2 and 2.0 devices in the
> same system?
> Any other ideas on how to get more hints?

Use a CPU driver: pocl is one of them, Intel has another.
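A minimal sketch of the sub-buffer bookkeeping, assuming the data is split into consecutive chunks of one parent buffer. In OpenCL, a sub-buffer's origin must be a multiple of the device's CL_DEVICE_MEM_BASE_ADDR_ALIGN, which is reported in bits, not bytes. The helper names `aligned_regions` and `check_regions` and the chunk sizes are hypothetical. One thing worth ruling out: if the parent buffer (or any offset arithmetic in the kernel) was sized from the unpadded total, only the last region overruns, which could fit a failure in the very last computation.

```python
def aligned_regions(chunk_sizes, align_bits):
    """Lay out chunks back to back in one parent buffer (hypothetical helper).

    Each origin is rounded up to the device base-address alignment.
    Returns the (origin, size) pairs in bytes and the minimum parent size.
    """
    align_bytes = align_bits // 8  # the device reports this alignment in bits
    if align_bytes <= 0:
        raise ValueError("alignment must be at least 8 bits")
    regions = []
    end = 0
    for size in chunk_sizes:
        origin = -(-end // align_bytes) * align_bytes  # round up to a multiple
        regions.append((origin, size))
        end = origin + size
    return regions, end


def check_regions(regions, parent_nbytes, align_bits):
    """Raise ValueError if a region is misaligned or overruns the parent."""
    align_bytes = align_bits // 8
    for origin, size in regions:
        if origin % align_bytes:
            raise ValueError(f"origin {origin} is not a multiple of {align_bytes} bytes")
        if origin + size > parent_nbytes:
            raise ValueError(
                f"region at {origin} (+{size} bytes) overruns a "
                f"{parent_nbytes}-byte parent buffer")


# Illustrative numbers: three 1000-byte chunks, 4096-bit (512-byte) alignment.
regions, needed = aligned_regions([1000, 1000, 1000], align_bits=4096)
print(regions)  # [(0, 1000), (1024, 1000), (2048, 1000)]
print(needed)   # 3048
check_regions(regions, needed, 4096)  # passes
# check_regions(regions, 3000, 4096) would raise: the unpadded total is too small
```

With PyOpenCL, `align_bits` would come from `device.mem_base_addr_align`, and each `(origin, size)` pair could be passed to `parent.get_sub_region(origin, size)`; the parent must then be allocated with at least `needed` bytes.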
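On the version question: the OpenCL runtime version a device exposes is fixed by its driver, but the OpenCL C language version used to compile a kernel can be pinned with the standard `-cl-std=CL1.2` build option. The sketch below, with hypothetical helper names, parses a version string in the spec's "OpenCL major.minor vendor-info" format and returns matching build options.

```python
import re


def opencl_version(version_string):
    """Extract (major, minor) from a string such as 'OpenCL 2.0 <vendor info>'.

    The OpenCL spec formats CL_DEVICE_VERSION as
    'OpenCL<space><major>.<minor><space><vendor-specific information>'.
    """
    m = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if m is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(m.group(1)), int(m.group(2))


def pin_cl12_options(device_version):
    """Build options pinning the OpenCL C language version to 1.2 (hypothetical).

    -cl-std only affects how the kernel source is compiled; it does not
    change which OpenCL runtime version the driver implements.
    """
    if opencl_version(device_version) >= (1, 2):
        return ["-cl-std=CL1.2"]
    return []  # older devices: leave the compiler default alone


print(pin_cl12_options("OpenCL 2.0 example-vendor-info"))  # ['-cl-std=CL1.2']
print(pin_cl12_options("OpenCL 1.1 example-vendor-info"))  # []
```

With PyOpenCL this could be used as `cl.Program(ctx, src).build(options=pin_cl12_options(dev.version))`. Per the OpenCL 2.0 specification, the compiler already defaults to the highest OpenCL C 1.x version when `-cl-std` is omitted, so pinning mainly makes the choice explicit rather than changing the default.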
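A host-side sketch of why a shared structure should not be assumed compact. It uses NumPy structured dtypes with a hypothetical three-field record: `align=True` applies the usual C alignment rules, which is typically, but not necessarily, what the device compiler does, while the default dtype packs fields back to back.

```python
import numpy as np


def describe_layout(dtype):
    """Return (itemsize, {field: byte offset}) for a structured dtype."""
    return dtype.itemsize, {name: dtype.fields[name][1] for name in dtype.names}


# Hypothetical record; the field names are illustrative only.
fields = [("flag", np.int8), ("dist", np.float32), ("idx", np.int32)]

packed = np.dtype(fields)               # no padding between fields
aligned = np.dtype(fields, align=True)  # C-style alignment of each field

print(describe_layout(packed))   # (9, {'flag': 0, 'dist': 1, 'idx': 5})
print(describe_layout(aligned))  # (12, {'flag': 0, 'dist': 4, 'idx': 8})
```

If the host writes the packed layout while the kernel reads the aligned one, every field after `flag` is read from the wrong offset and the array stride differs (9 versus 12 bytes). PyOpenCL also provides `pyopencl.tools.match_dtype_to_c_struct`, which determines the layout the device actually uses for a struct declaration; that is the more reliable check when host and kernel share a struct.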
I suspect either a structure padded differently (never assume structures are compact!) or some other error in pointer calculation. Use valgrind!

-- 
Jérôme Kieffer
_______________________________________________
PyOpenCL mailing list -- pyopencl@tiker.net
To unsubscribe send an email to pyopencl-le...@tiker.net