Alright, I've updated the code in my repository. It now passes all of the
unit tests in test_gpuarray (at least, the ones that were passing before).
The old behavior is the default, so hopefully that should fix compatibility
with existing code (still largely untested). To create a noncontiguous
kernel, pass use_strides=True into get_elwise_kernel, and pass
into func.prepared_async_call. I've updated all of the methods in
to check for any noncontiguous inputs/outputs, and call the appropriate

Something that I don't think I made clear before: the kernels generated by
get_elwise_module_noncontig are modified using regular expressions, so that
you don't need to change your code downstream to get strided array support.
I'm not convinced yet that this is the best approach, but it works okay so

So, per example...

Code like this:

def get_binary_op_kernel(dtype_x, dtype_y, dtype_z, operator,
    return get_elwise_kernel(
            "%(tp_x)s *x, %(tp_y)s *y, %(tp_z)s *z" % {
                "tp_x": dtype_to_ctype(dtype_x),
                "tp_y": dtype_to_ctype(dtype_y),
                "tp_z": dtype_to_ctype(dtype_z),
            "z[i] = x[i] %s y[i]" % operator,
            "multiply", use_strides=use_strides)

Generates kernels like this:

    #include <pycuda-complex.hpp>

    typedef struct
        unsigned n[2];
        long stride[2];
    } dim;

    __device__ unsigned i2m(unsigned i, dim d)
        unsigned m = 0;
        unsigned j = i;
        for(int k = 0; k < 2; k++)
            m += d.stride[k] * (j%d.n[k]);
            j = j / d.n[k];
        return m;

        __global__ void multiply(double *x, double *y, double *z,
unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2)

          unsigned tid = threadIdx.x;
          unsigned total_threads = gridDim.x*blockDim.x;
          unsigned cta_start = blockDim.x*blockIdx.x;
          unsigned i;


          for (i = cta_start + tid; i < n; i += total_threads)
            z[i2m(i,*__dim2)] = x[i2m(i,*__dim0)] / y[i2m(i,*__dim1)];


and you use them like this:

def is_noncontig(*vecs):
    r = False
    for v in vecs: r = r or not v.flags.forc
    return r

def _div(self, other, out, stream=None):
    """Divides an array by another array."""

    assert self.shape == other.shape

    noncontig = is_noncontig(self, other, out)

    func = elementwise.get_binary_op_kernel(self.dtype, other.dtype,
            out.dtype, "/", use_strides=noncontig)

    if noncontig:
        func.prepared_async_call(self._grid, self._block, stream,
                self.gpudata, other.gpudata,
                out.gpudata, self.mem_size,
        func.prepared_async_call(self._grid, self._block, stream,
                self.gpudata, other.gpudata,
                out.gpudata, self.mem_size)

In principle, similar updates can be made to modules that depend on
pycuda.elementwise down the road, to also support strided arrays without
big rewrites.

I'm open to suggestions for improving performance, but I'd like to focus on
compatibility for now. Frédéric's suggestion is a good one, but I suspect
there's another source of slowdowns somewhere, because the slowdowns I'm
seeing don't scale with array size.


On Thu, Dec 1, 2016 at 11:54 AM, Keegan Owsley <> wrote:

> It seems striding in kernels can indeed incur a large penalty, though the
> result is dependent on the array size. I've just updated the code to only
> use the noncontiguous kernel if necessary.
> Using a GTX 970, I get the following results using the pow kernel (I'm
> using ipython's %timeit here, because for some reason python appears to
> hang on my machine if I use the standard timeit module):
> >>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray
> >>> b1 = gpuarray.arange(100000, dtype='float64').reshape(1000, 100)
> >>> b2 = b1[::2, :-1].copy()
> >>> # force CUDA to compile the kernels before we time
> >>> b1[::2,:-1]**2
> >>> b2**2
> >>> %timeit b1[::2,:-1]**2
> 1000 loops, best of 3: 654 µs per loop
> >>> %timeit b2**2
> 10000 loops, best of 3: 147 µs per loop
> >>> b1 = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
> >>> b2 = b1[::2, :-1].copy()
> >>> %timeit b1[::2,:-1]**2
> 100 loops, best of 3: 2.05 ms per loop
> >>> %timeit b2**2
> 1000 loops, best of 3: 1.66 ms per loop
> I'll try clean up the code and get it into my repo tpday.
> Keegan
> On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien <
>> wrote:
>> I can share our experience in Theano related to that.
>> I added code to "fuse" dimensions that are contiguous. So if you have a
>> 3d tensor, and only 1 dimensions is non contiguous, you won't pay the
>> indexing of a 3d tensor, but only of a 2d tensor.
>> I saw 2x speed up in such cases. It was old gpu and old cuda version,
>> maybe this changed.
>> In the new Theano back-end, we have rewriting that in a more readable
>> version:
>> ay_util.c#L153
>> This also take into account the broadcasting.
>> This could be done before call the kernel.
>> On Wed, Nov 30, 2016 at 8:24 PM, Andreas Kloeckner <
>>> wrote:
>>> Keegan,
>>> Keegan Owsley <> writes:
>>> > I've just slapped together a patch to pycuda that makes most
>>> elementwise
>>> > operations work with noncontiguous arrays. There are a bunch of hacks
>>> in
>>> > there, and the code needs some reorg before it's ready to be
>>> considered for
>>> > upstream (I made these changes while learning the pycuda codebase, so
>>> > there's a bunch of crud that can be cleaned out), but I figure I might
>>> as
>>> > well put it out there in its current state and see what you guys think.
>>> > It's also not extremely well-tested (I have no idea if it interferes
>>> with
>>> > skcuda, for example), but all of the main functions appear to work.
>>> >
>>> > You can check out the code at
>>> omega/pycuda.
>>> >
>>> > Briefly, this works by adding new parameters into elementwise kernels
>>> that
>>> > describe the stride and shape of your arrays, then using a function
>>> that
>>> > computes the location in memory from the stride, shape, and index.
>>> > Elementwise kernel ops are modified so that they use the proper
>>> indexing.
>>> > See an example of a kernel that's generated below:
>>> Thanks for putting this together and sharing it! I have one main
>>> question about this, regarding performance:
>>> Modulo (especially variable-denominator modulo) has a habit of being
>>> fantastically slow on GPUs. Could you time contiguous
>>> vs. noncontiguous for various levels of "gappiness" and number of
>>> axes? I'm asking this because I'd be OK with a 50% slowdown, but not
>>> necessarily a factor of 5 slowdown on actual GPU hardware.
>>> Thanks!
>>> Andreas
>>> _______________________________________________
>>> PyCUDA mailing list
PyCUDA mailing list

Reply via email to