On 2010-09-28, Tue at 23:56 +0200, Tomasz Rybak wrote:
> On 2010-09-28, Tue at 00:29 -0700, jmcarval wrote:
> > Thanks for your reply.
> > I've read the first thread you mention, which ends without a solution:
> > http://pycuda.2962900.n2.nabble.com/PyCUDA-pycuda-test-failures-tp5320194p5320194.html
> >
> > Maybe I'm making a huge mistake, but it does not seem to be a precision
> > issue.
> > The following code (a simplification of test_gpuarray) returns 30 from the
> > CPU and 14 from the GTX 480, whether with integer, float32, or float64.
> > I don't get it. Can anybody explain what I'm doing wrong, please?
> > Thanks
> >
> > import pycuda.autoinit
> > import numpy
> > import pycuda.gpuarray as gpuarray
> > from pycuda.curandom import rand as curand
> >
> > a = numpy.array([1,2,3,4])#.astype(numpy.float32)
> > a_gpu = gpuarray.to_gpu(a)
> > b = a
> > b_gpu = gpuarray.to_gpu(b)
> >
> > dot_ab = numpy.dot(a, b)
> >
> > dot_ab_gpu = gpuarray.dot(a_gpu, b_gpu).get()
> >
> > print "CPU dot product:", dot_ab
> > print "GPU dot product:", dot_ab_gpu
>
> I have an idea for (maybe) checking whether the problem is with PyCUDA,
> the CUDA toolkit, or the driver.
> Can you force PyCUDA to generate not sm_20 code, but sm_1x?
> I have found that it is determined in line 190 of the file
> pycuda/compiler.py:
>     arch = "sm_%d%d" % Context.get_device().compute_capability()
> Try changing it to
>     arch = "sm_10"
> and so on, and check whether you still get the incorrect 14 in that
> case.
>
> If there is a simpler way of changing the architecture for which
> PyCUDA generates code, feel free to use it and share that
> information.
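As a side note (my own addition, not from the quoted thread): the CPU side of the reproduction above can be checked with plain NumPy, and it is worth noting that numpy.array([1, 2, 3, 4]) without an explicit dtype produces a platform-dependent integer type, which the generated GPU kernel must match. A minimal sketch:

```python
import numpy

a = numpy.array([1, 2, 3, 4])

# Without an explicit dtype this is a platform-default integer type
# (typically 64-bit on 64-bit Linux). A type mismatch between this
# array and the generated kernel is one plausible source of a wrong
# GPU result, so it is worth printing when debugging.
print(a.dtype)

# The reference result the GPU should reproduce:
# 1*1 + 2*2 + 3*3 + 4*4 = 30
print(numpy.dot(a, a))
```

Passing an explicit .astype(numpy.int32) or .astype(numpy.float32) removes that ambiguity when comparing CPU and GPU results.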
Unrelated to the previous analysis, but it might also be important. While looking at the reduction kernels I noticed that for most of the reduction steps threads are synchronised to ensure that all values have been computed and stored in shared memory (the array sdata) - lines 106-121. Then the reduction over the last 64 values follows, but there is no synchronisation in lines 123-132. Can someone check whether the attached synchronize.diff helps?

I have read that Fermi has a more aggressive cache and that threads can be scheduled slightly differently (I do not remember the details). Another change in Fermi is that the same memory can serve as either L1 cache or shared memory, and ReductionKernel uses shared memory to store intermediate results of reductions. Maybe (just an intuition) more aggressive synchronisation can help here.

I have also just found, in section 1.2.2 of the Fermi compatibility guide, that optimisation is much more aggressive there; I am attaching the patch volatile.diff (made according to the guide) - please check whether it solves the problem. If it does, then since it is based on the official guide I would propose using it instead of synchronize.diff.

Regards and good night.
-- 
Tomasz Rybak <[email protected]> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
diff --git a/pycuda/reduction.py b/pycuda/reduction.py
index 0a17508..92f48d1 100644
--- a/pycuda/reduction.py
+++ b/pycuda/reduction.py
@@ -123,11 +123,17 @@ def get_reduction_module(out_type, block_size,
         if (tid < 32)
         {
           if (BLOCK_SIZE >= 64) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 32]);
+          __syncthreads();
           if (BLOCK_SIZE >= 32) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 16]);
+          __syncthreads();
           if (BLOCK_SIZE >= 16) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 8]);
+          __syncthreads();
           if (BLOCK_SIZE >= 8) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 4]);
+          __syncthreads();
           if (BLOCK_SIZE >= 4) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 2]);
+          __syncthreads();
           if (BLOCK_SIZE >= 2) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 1]);
+          __syncthreads();
         }
         if (tid == 0) out[blockIdx.x] = sdata[0];
diff --git a/pycuda/reduction.py b/pycuda/reduction.py
index 0a17508..0e1eb3c 100644
--- a/pycuda/reduction.py
+++ b/pycuda/reduction.py
@@ -122,12 +122,14 @@ def get_reduction_module(out_type, block_size,
         if (tid < 32)
         {
-          if (BLOCK_SIZE >= 64) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 32]);
-          if (BLOCK_SIZE >= 32) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 16]);
-          if (BLOCK_SIZE >= 16) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 8]);
-          if (BLOCK_SIZE >= 8) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 4]);
-          if (BLOCK_SIZE >= 4) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 2]);
-          if (BLOCK_SIZE >= 2) sdata[tid] = REDUCE(sdata[tid], sdata[tid + 1]);
+          // Fermi compatibility guide 1.2.2
+          volatile out_type *smem = sdata;
+          if (BLOCK_SIZE >= 64) smem[tid] = REDUCE(smem[tid], smem[tid + 32]);
+          if (BLOCK_SIZE >= 32) smem[tid] = REDUCE(smem[tid], smem[tid + 16]);
+          if (BLOCK_SIZE >= 16) smem[tid] = REDUCE(smem[tid], smem[tid + 8]);
+          if (BLOCK_SIZE >= 8) smem[tid] = REDUCE(smem[tid], smem[tid + 4]);
+          if (BLOCK_SIZE >= 4) smem[tid] = REDUCE(smem[tid], smem[tid + 2]);
+          if (BLOCK_SIZE >= 2) smem[tid] = REDUCE(smem[tid], smem[tid + 1]);
         }
         if (tid == 0) out[blockIdx.x] = sdata[0];
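To make the unrolled last-warp stage concrete, here is a host-side Python sketch (my own addition, not part of either patch), with REDUCE taken to be addition and BLOCK_SIZE = 64. The sequential loop over tid stands in for the lockstep execution of one warp; it is exactly this lockstep assumption that the volatile qualifier is meant to preserve under Fermi's more aggressive optimisation.

```python
# Host-side model of the unrolled final reduction stage, assuming
# REDUCE(a, b) = a + b and BLOCK_SIZE = 64. On the GPU, threads
# 0..31 of one warp execute each step in lockstep; here a sequential
# loop over tid plays that role.
BLOCK_SIZE = 64
sdata = list(range(BLOCK_SIZE))  # 0, 1, ..., 63

for stride in (32, 16, 8, 4, 2, 1):
    if BLOCK_SIZE >= 2 * stride:
        # All 32 threads of the warp run each step, just as in the
        # kernel; only the lower `stride` results are used later.
        for tid in range(32):
            sdata[tid] = sdata[tid] + sdata[tid + stride]

# Thread 0 ends up with the reduction over all 64 values.
print(sdata[0])  # sum(range(64)) == 2016
```

If the compiler caches sdata values in registers instead of re-reading shared memory between these steps (which volatile forbids), threads would combine stale partial results and produce a wrong total, which matches the kind of error reported in the thread.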
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
