P.S.: I'd like to emphasize that I am particularly interested in discussing how to coordinate this integration with future plans for pycuda.

Note that I compute stride information inside the kernel now, but supporting arbitrary strides does not seem that hard to me. I mean, I do not plan to embark on a complete rewrite of gpuarray myself, especially not if others are working on the same thing, but that is one of the things we should think about coordinating.

By the way, enabling basic broadcasting for kernels over contiguous arrays seems quite simple; I am thinking of a syntax where a broadcasting dim is explicitly signalled in the declaration (it could be deduced at runtime, but that would mess with the const-ness of the flood of boilerplate one would hope the compiler eliminates, I think?). Something like:

(dtype[10,10] output) << [10,10] << (dtype[10, 0] input):
    output(i0,i1) = input(i0,i1);

Here the zero dimension explicitly signals broadcasting, which means the runtime dimension is asserted to be one and the emitted stride is overloaded as zero. That should 'just work', no? We could enable broadcasting for all dimensions of shape one as well, but I'd rather have an explicit signal.
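To make that concrete, here is a minimal Python sketch (a hypothetical helper, not from my current code) of how the generator could turn such a declaration into emitted strides:

def emit_strides(declared_shape, runtime_shape):
    """Compute C-contiguous strides, overloading broadcast dims (declared as 0) with stride 0."""
    assert len(declared_shape) == len(runtime_shape)
    strides = [0] * len(runtime_shape)
    stride = 1
    for axis in reversed(range(len(runtime_shape))):
        if declared_shape[axis] == 0:         # explicit broadcast signal in the declaration
            assert runtime_shape[axis] == 1   # runtime dimension is asserted to be one
            strides[axis] = 0                 # reads repeat along this axis
        else:
            strides[axis] = stride
            stride *= runtime_shape[axis]
    return strides

print emit_strides((10, 0), (10, 1)) #the broadcast 'input' above; gives [1, 0]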


I suppose the first thing to do is to decide on an interface for the pycuda nd_kernel generator. It needs a function body, presumably in plain old CUDA, and a signature declaration. As a minimum degree of fanciness, I'd argue the pycuda nd_kernel signature needs to include shape and type information, plus kernel shape information. That is already pretty fancy though; either you need some hideous kludge of helper-object constructors, or you pack that information into a string somehow, using some kind of DSL. If you are making that step, you might as well go all the way and pack in all the features I implemented for the signature; compile-time and runtime shape arguments are not exactly exotic luxuries either.
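To make the discussion concrete, here is a rough sketch of what such an interface could look like; the class name NDKernel and the exact argument structure are hypothetical, just a strawman to argue about:

class NDKernel(object):
    def __init__(self, signature, body, axes=None):
        """
        signature : str
            declaration of outputs, kernel shape and inputs, e.g.
            '(float32[n,m] result) << [n,m] << (float32[n] x0, float32[m] x1)'
        body : str
            plain old CUDA C for the per-thread work
        axes : list or None
            optional override of the parallel/serial axis partition
        """
        self.signature, self.body, self.axes = signature, body, axes
        # parsing, boilerplate emission and compilation would happen here

    def __call__(self, *args, **kwargs):
        # check types/shapes, allocate outputs, pick a launch config, run
        raise NotImplementedError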

As for the body of code: one can easily imagine a base class that takes only plain old CUDA as its body argument, but calls an overridable transform method so that derived classes can apply their own transformations. That way we don't have to fight over whether this or that syntax is confusing or helpful.
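Something along these lines, as a sketch; the names NDKernelBase, SugaredNDKernel and desugar are made up for illustration:

def desugar(body):
    # placeholder for the pyparsing-based transformation of the sugared syntax
    return body

class NDKernelBase(object):
    def __init__(self, signature, body):
        self.signature = signature
        self.cuda_body = self.transform(body) #subclass hook
        # ...parse signature, emit boilerplate, compile, etc.

    def transform(self, body):
        """Identity by default: the body is assumed to be plain CUDA."""
        return body

class SugaredNDKernel(NDKernelBase):
    def transform(self, body):
        # rewrite nd-indexing, 'for (i in a:b:c)' loops, and so on, into plain CUDA
        return desugar(body)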

I'm not sure yet how to fit something like a stencil_kernel into this design. Stride and shape calculations are different for padded arrays, so you'd have to hook into the base nd_kernel code at a pretty low level to emit correct code. Not impossible, but something to think about... One could imagine a pre-transform call to the base class for the signature, so one can harvest PADDED keywords from the signature string, for instance. Or always harvest all extra keywords upon parsing, and raise an error if these are not in an allowed_keywords set of the nd_kernel subclass. If our array class supported nd slicing, this whole padding problem would be moot, of course. We could mark arrays as (not necessarily) contiguous in the signature, and pass in all strides as arguments for those (it would be somewhat suboptimal to do this for all arrays; for the typical static use case we do not have to pass any runtime info or use any registers; or am I prematurely optimizing here?)
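The keyword check itself could be as simple as the following sketch (again, all names here are made up, and the 'parser' is just a stand-in):

def check_signature_keywords(kernel_class, signature):
    # stand-in for the real parser: pull recognized keywords out of the signature string
    keywords = [w for w in ("PADDED",) if w in signature]
    unknown = set(keywords) - kernel_class.allowed_keywords
    if unknown:
        raise ValueError("signature keywords not supported by %s: %s"
                         % (kernel_class.__name__, ", ".join(sorted(unknown))))
    return keywords

class StencilKernel(NDKernelBase): #NDKernelBase as sketched above
    # a stencil kernel knows what to do with padded arrays
    allowed_keywords = frozenset(["PADDED"])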

Anyway, would love to hear from the other guys working on this!



On 29-Aug-12 02:01, Eelco Hoogendoorn wrote:

Hi Andreas,

Thank you for your reply, and glad to hear there are more people working in this direction; it seems I have been having some private conversations with people due to forgetting to reply-all... Either way, I've been working on this stuff rather obsessively the past few days, so a lot has changed since I last posted to this list.

Instead of diving into the jargon, here are some examples from my latest concoction.


#2d meshgrid function
mesh_grid_2 = nd_kernel(
    """
    (int32[2,n,m] result) << [n,m] << ():
        result(0, i0,i1) = i0;
        result(1, i0,i1) = i1;
    """,
)
print mesh_grid_2(shape=(4,4)) #we have no input args; kernel shape needs to be explicitly specified

outer_product = nd_kernel(
    """
    (float32[n,m] result = 0) << [n,m] << (float32[n] x0, float32[m] x1):
        result(i0, i1) = x0(i0) * x1(i1);
    """,
    )
x0 = gpuarray.empty((10), np.float32); x0.fill(1)
x1 = gpuarray.empty((20), np.float32); x1.fill(1)
print outer_product(x0, x1) #kernel shape is deduced from input arguments


funky_product = nd_kernel(
    """
    (float32[n,m] result, uint32 count=0) << [n,m] << (uint32[:,n,m] foo, uint32[8,n,m] bar, float32 scalar):
        float32 r = 0;
        for (i in 0:foo.shape[0])
            for (j in 0:bar.shape[0]:2)
                r += foo(i,i0,i1) * bar(j,i0,i1) * scalar;
        result(i0, i1) = r;
        atomicAdd(count, 1);
    """,
    axes = [1,1] #override the default axis parallelism; gives the same result
    )

foo = gpuarray.zeros((3,100,99), np.uint32); foo.fill(1)
bar = gpuarray.zeros((8,100,99), np.uint32); bar.fill(1)
print funky_product(foo, bar, 1.1 )


And here is the source (save the first file as numycuda_parsing.py). Note that this is not my whole codebase for this stuff, just my recent rewrites.
http://pastebin.com/urW0kEcq
http://pastebin.com/68j1q5UB


Note that I have gone a little crazy with pyparsing, to create some syrupy-sweet syntactic sugar. I don't expect everybody to like this kind of stuff, but note that it is completely optional; you are free to use or ignore as much of this syntax as you please.

Anyway, formally, the syntax is:
(output arg list) << [kernel shape] << (input arg list):
    functionbody
This reads as 'the input arguments, mapped over a virtual kernel of a certain shape, give the stated output'. The details are likely subject to change (as they already were five times today), but I am happy with the key features; particularly, the formal decoupling of the shape of the kernel from the shape of any of its arguments is a crucial step; these things need not be the same!

Some key points to note about the generated code:
- it is completely n-d, not restricted by max threadblock dimensions
- all memory access is coalesced; this is not guaranteed by the code, but easy to track by the end-user
- all memory is checked for types and shapes

Further syntactical gimmicks:
- all array shapes are available in the code as arrayidentifier.shape[dimension]
- types can be specified as numpy names
- arrays can be nd-indexed a la weave.blitz
- array shapes can be declared static (int), dynamic (:), or constrained using dummy variables in the declaration
- there is an alternative looping syntax of the form for (id in start:stop:step); that is not very exciting in itself, but it's a basis for the upcoming special syntax 'for (id in STENCIL)', which emits an (optionally) unrolled loop, with macros of the form id.weight (or whatever else you might want to know about your currently active stencil voxel) available in the loop (a rough sketch of the unrolling idea follows below)
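As a very rough, hypothetical sketch of that unrolling idea (the stencil representation and template substitution here are just illustrative, not how my current code works):

def unroll_stencil(stencil, body_template):
    """stencil: list of (offset, weight) pairs; body_template uses id.offset / id.weight."""
    lines = []
    for offset, weight in stencil:
        lines.append(body_template
                     .replace("id.offset", str(offset))
                     .replace("id.weight", repr(weight)))
    return "\n".join(lines)

laplacian_1d = [(-1, 1.0), (0, -2.0), (1, 1.0)]
print unroll_stencil(laplacian_1d, "o += in(i0 + id.offset) * id.weight;")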

Furthermore, there is a bunch of stuff going on at runtime:
- types and static shapes are checked at runtime
- output args are allocated transparently if they are not passed in and their shape can be deduced from the constraints (a rough sketch of this runtime machinery follows below)
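A minimal sketch of what that runtime machinery amounts to, assuming parsed argument specs with name/dtype/shape attributes (the attribute names are hypothetical):

from pycuda import gpuarray

def check_args(arg_specs, args, dummies):
    """arg_specs: parsed declarations; shape entries are ints, ':' or dummy names like 'n'."""
    for spec, arg in zip(arg_specs, args):
        if arg.dtype != spec.dtype:
            raise TypeError("%s: expected %s, got %s" % (spec.name, spec.dtype, arg.dtype))
        for axis, declared in enumerate(spec.shape):
            if declared == ":":
                continue                        # fully dynamic axis, nothing to check
            if isinstance(declared, str):       # dummy variable such as 'n' or 'm'
                declared = dummies.setdefault(declared, arg.shape[axis])
            if declared != arg.shape[axis]:
                raise ValueError("%s: shape mismatch on axis %d" % (spec.name, axis))

def allocate_output(spec, dummies):
    """Allocate an output whose shape is fully determined by static dims and dummies."""
    shape = tuple(dummies[s] if isinstance(s, str) else s for s in spec.shape)
    return gpuarray.zeros(shape, spec.dtype)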

What I am most pleased with is how easy nd-kernels turned out to be, in retrospect. I thought that would be a mess to pull off, but it is really quite simple. Once you have decided which axes to loop over serially and which to launch in parallel, emitting the required code is a breeze. How to make this partition into parallel and serial axes can be rather subtle, but a functioning default is easily cooked up, and this choice can be overridden for fine control.
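For example, one plausible default (just an illustration, not exactly what my code does) would be to make the fastest-varying axes parallel, up to the thread-block limits, and loop over the rest serially; the limits used here are plausible numbers, not queried from the device:

def partition_axes(kernel_shape, max_parallel_dims=3, max_threads=1024):
    parallel, serial = [], []
    threads = 1
    for axis in reversed(range(len(kernel_shape))):
        extent = kernel_shape[axis]
        if len(parallel) < max_parallel_dims and threads * min(extent, 32) <= max_threads:
            parallel.append(axis)              # launch this axis over threads/blocks
            threads *= min(extent, 32)
        else:
            serial.append(axis)                # loop over this axis inside the kernel
    return parallel[::-1], serial[::-1]

print partition_axes((3, 100, 99)) #axes 1 and 2 parallel, axis 0 serial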

How much of this should make it into pycuda is a good question; a lot of it probably does not belong there. The syntactic sugar, for instance, seems like something that would be better off as its own project. But the mechanism for launching a kernel over a virtual nd-threadblock seems like something that should definitely be part of gpuarray 2.0; similarly, emitting the boilerplate to pass in array shape information and strides appears to be quite general to me. The two files linked above are an early attempt at converging on such a split.

Anyway, all feedback is welcome!



For those interested, this is what the latter kernel transforms into (yuck!). It looks like a lot of register pressure, but note that most of these terms are just aliases, and many of them are completely unused. Note also that array dimensions constrained to be identical using a dummy variable are passed in as such; that is, the CUDA compiler is aware that these runtime values are identical, which should lead to a lot of simplifications in the array indexing code in particular.


    __global__ void nd_kernel(
            const unsigned* const __restrict__ foo, unsigned const foo_shape_0,
            const unsigned* const __restrict__ bar,
            const float const scalar,
            float* const __restrict__ result,
            unsigned* const __restrict__ count,
            unsigned const dummy_m, unsigned const dummy_n)
    {
        unsigned const kernel_shape_0 = dummy_n;
        unsigned const kernel_shape_1 = dummy_m;

        unsigned const foo_shape_1 = dummy_n;
        unsigned const foo_shape_2 = dummy_m;
        unsigned const bar_shape_0 = 8;
        unsigned const bar_shape_1 = dummy_n;
        unsigned const bar_shape_2 = dummy_m;
        unsigned const result_shape_0 = dummy_n;
        unsigned const result_shape_1 = dummy_m;
        unsigned const count_shape_0 = 1;

        unsigned const foo_stride_2 = 1;
        unsigned const foo_stride_1 = foo_stride_2 * foo_shape_2;
        unsigned const foo_stride_0 = foo_stride_1 * foo_shape_1;
        unsigned const foo_size = foo_stride_0 * foo_shape_0;
        unsigned const bar_stride_2 = 1;
        unsigned const bar_stride_1 = bar_stride_2 * bar_shape_2;
        unsigned const bar_stride_0 = bar_stride_1 * bar_shape_1;
        unsigned const bar_size = bar_stride_0 * bar_shape_0;
        unsigned const result_stride_1 = 1;
        unsigned const result_stride_0 = result_stride_1 * result_shape_1;
        unsigned const result_size = result_stride_0 * result_shape_0;
        unsigned const count_stride_0 = 1;
        unsigned const count_size = count_stride_0 * count_shape_0;
        //kernel body

        //launch threads for parallel axis
        unsigned const i1 = blockDim.y*blockIdx.y + threadIdx.y;
        if (i1 >= kernel_shape_1) return;

        //launch threads for parallel axis
        unsigned const i0 = blockDim.x*blockIdx.x + threadIdx.x;
        if (i0 >= kernel_shape_0) return;
        float r = 0;
        for (int i=0; i < foo_shape_0; i+=1)
            for (int j=0; j < bar_shape_0; j+=2)
                r += foo[foo_stride_0*i+foo_stride_1*i0+foo_stride_2*i1] * bar[bar_stride_0*j+bar_stride_1*i0+bar_stride_2*i1] * scalar;
        result[result_stride_0*i0+result_stride_1*i1] = r;
        atomicAdd(count, 1);

    }


On 28-Aug-12 17:08, Andreas Kloeckner wrote:
Hi Eelco,

Eelco Hoogendoorn <[email protected]> writes:
I have some code that I would like to contribute to pycuda. What would
the preferred way of doing so be? Create a branch in git?

I have been working on some generalizations of the 'elementwise'
functionality recently. Specifically, I have made a GriddedKernel class,
that operates on nd-arrays, and handles computation of grid indices. As
a simple example, this allows one to write an outer product kernel along
the lines of:

GriddedKernel(
arguments = 'float32[:,:] out, float32[:] x, float32[:] y',
body = 'out[i] = x[xi] * y[yi];')

Furthermore, I have created a StencilKernel class. It takes an arbitrary
stencil as input, and creates an unrolled kernel from that. I am using
it for a watershed segmentation algorithm at the moment, but a simpler
example would be something like a laplacian:

StencilKernel(
stencil = laplacian3x3,
arguments = 'float32[512,512] out, float32[512,512] in',
loop_start = 'float32 o = 0',
loop_body = 'o += in[j] * weight[j];',
loop_end = 'out[i] = o')

The StencilKernel contains convenience functions for allocating and
transferring padded memory, so the stencil can always read safely, and
one can easily implement boundary conditions of one's choice.

As you might have noticed, I have deviated from the elementwise way of
specifying arguments. numpy dtypes get translated into c-types (I hate
the implicitness of c-types), plus, one can specify size and
dimensionality constraints on arguments, which by default are translated
into runtime type/shape checks.

However, there is still lots of work to be done. I am quite new to all
this cuda stuff, and I bet my code could be greatly improved, and
expanded to use cases that have not yet occurred to me. For instance, my
kernel code is still quite naive, and does not use shared memory. Also,
I have only covered 2d and 3d use cases so far, but by looping over the
larger strides, arbitrary nd-kernels could be created. My template code
is a mess and I should probably use codepy instead. And so on...

If there is anybody willing to help me advance this, I'd love to create a
git branch for it and try my best to document and clean up my code, and
integrate it with pycuda style conventions (perhaps create an
elementwise branch that adheres to the same interface, and so on?). But
if it's just me being excited about this, I probably won't bother. Even if you don't want to help directly, your thoughts and comments are most welcome.

I agree that in the medium term, we need something more nd-aware than
the current ElementwiseKernel, and your stuff looks like a good first
step. Frédéric Bastien and Arnaud Bergeron have been hard at work
creating an array object that does just that, from (at least) Python and
C, and for CUDA and OpenCL.

So, in some sense this is great--we're spoiled for choice to pick the
best approach, and I'm looking forward to having this discussion. In the
meantime, it would certainly be good if we could take a look at your
code.

Given how easy it is to ship add-on packages in Python, I would suggest
that you set up your code to be separate from PyCUDA (i.e. not as a
branch, but as a separate package). For example, I'd be hesitant to
merge something like your StencilKernel--while it's certainly commonly
used, it's still a bit too special purpose to be in a supposedly
general-purpose package.

Thanks for getting the discussion started--looking forward to seeing
what you've got. :)

Andreas


