I get "pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access
was encountered" errors when I use PyCUDA with matrices above a certain size.
Only a restart of Spyder remedies the issue. The matrix sizes are still well
below what I believe my graphics card (a GeForce GTX 1060, 3 GB) should be able
to handle. Is there a PyCUDA-imposed limit?
I've created a fairly simple example which computes the cross products of
pairs of 3D vectors.
The code works fine for up to approximately N = 35000 vectors. Above that, I
get the following error:
Traceback (most recent call last):
  File "C:\owncloud\Python\float3_example.py", line 68, in <module>
    dest = c_gpu.get()
  File  line 271, in get
    _memcpy_discontig(ary, self, async=async, stream=stream)
  line 1190, in _memcpy_discontig
    drv.memcpy_dtoh(dst, src.gpudata)
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
Assuming the problem lies with my code rather than PyCUDA: is there a problem
with my usage of the float3 vector type inside but not outside the CUDA
kernel? (The results are correct for small matrices.) I couldn't find a succinct
example of best practice for passing lists of 3D vectors (or float3s) to a
kernel using PyCUDA. Or is the problem the way I have set up the blocks and
grid (I tried many configurations)?
Many thanks!
Here's the very simple example:
from __future__ import print_function
from __future__ import absolute_import
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
from pycuda import gpuarray

mod = SourceModule("""
__global__ void cross_products(float3* vCs, float3* vAs, float3* vBs, int w, int h)
{
  const int c = blockIdx.x * blockDim.x + threadIdx.x;
  const int r = blockIdx.y * blockDim.y + threadIdx.y;
  int i = r * w + c; // 1D flat index

  // Check if within array bounds.
  if ((c >= w) || (r >= h))
  {
    return;
  }

  float3 vA = vAs[i];
  float3 vB = vBs[i];

  float3 vC = make_float3(vA.y*vB.z - vA.z*vB.y,
                          vA.z*vB.x - vA.x*vB.z,
                          vA.x*vB.y - vA.y*vB.x);

  vCs[i] = vC;
}
""")

cross_products = mod.get_function("cross_products")

N = 32000 # on my machine, this fails if N > 36000
M = 3
a = numpy.ndarray((N,M), dtype = numpy.float32)
b = numpy.ndarray((N,M), dtype = numpy.float32)
for i in range(0,N):
    a[i] = [1,0,0]
    b[i] = [0,1,0]

c = numpy.zeros((N,M), dtype = numpy.float32)

print("a x b")
print(numpy.cross(a,b))

M_gpu = numpy.int32(M)
N_gpu = numpy.int32(N)
a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)
c_gpu = gpuarray.to_gpu(c)

bx = 32 #256
by = 32 #1
gdimX = (int)((M + bx-1) / bx)
gdimY = (int)((N + by-1) / by)
print("grid")
print(gdimX)
print(gdimY)

cross_products(c_gpu, a_gpu, b_gpu, M_gpu, N_gpu, block=(bx,by,1), grid = (gdimX, gdimY))

dest = c_gpu.get()
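
One check that does not need the GPU is to compute the largest flat index that the kernel would touch for this launch and compare it with the number of float3 elements in each buffer. The sketch below is plain Python and rests on an assumption: an (N, 3) float32 array passed as a float3* holds N float3 values. The helper name max_flat_index is hypothetical and only mirrors the kernel's i = r * w + c for the threads that pass the bounds check.

```python
def max_flat_index(w, h):
    """Largest i = r * w + c among threads with c < w and r < h."""
    return (h - 1) * w + (w - 1)

N = 32000  # number of 3D vectors, as in the example
M = 3      # components per vector; passed to the kernel as w

# Assumption: an (N, 3) float32 buffer read through float3* has N elements.
float3_elements = N
largest_index = max_flat_index(M, N)

print("float3 elements per buffer:", float3_elements)
print("largest index used by the kernel:", largest_index)
print("in bounds:", largest_index < float3_elements)
```

If the assumption holds, the largest index is roughly three times the buffer length, so part of the launch would read and write past the end of the arrays. That might explain why small N appears to work while larger N fails. Passing w = 1 (one thread per vector) would be one thing to try.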
