P.S.: I'd like to emphasize that I am particularly interested in discussing how to coordinate this integration with future plans for pycuda.

Note that I compute stride information inside the kernel now, but supporting arbitrary strides does not seem that hard to me. I mean, I do not plan to embark on a complete rewrite of gpuarray myself, especially not if others are working on the same thing, but that is one of the things we should think about coordinating.

By the way, enabling basic broadcasting for kernels over contiguous arrays seems quite simple; I am thinking of a syntax where a broadcasting dim is explicitly signalled in the declaration (it could be deduced at runtime, but that would mess with the const-ness of the flood of boilerplate one would hope the compiler eliminates, I think?). Something like:

(dtype[10,10] output) << [10,10] << (dtype[10, 0] input):
    output(i0,i1) = input(i0,i1);

Here the zero dimension explicitly signals broadcasting, which means the runtime dimension is asserted to be one and the emitted stride is overloaded as zero. That should 'just work', no? We could enable broadcasting for all dimensions of shape one as well, but I'd rather have an explicit signal.
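To make that concrete, here is a minimal Python sketch (a hypothetical helper, not from my current code) of how the generator could turn such a declaration into emitted strides:

def emit_strides(declared_shape, runtime_shape):
    """Compute C-contiguous strides, overloading broadcast dims (declared as 0) with stride 0."""
    assert len(declared_shape) == len(runtime_shape)
    strides = [0] * len(runtime_shape)
    stride = 1
    for axis in reversed(range(len(runtime_shape))):
        if declared_shape[axis] == 0:         # explicit broadcast signal in the declaration
            assert runtime_shape[axis] == 1   # runtime dimension is asserted to be one
            strides[axis] = 0                 # reads repeat along this axis
        else:
            strides[axis] = stride
            stride *= runtime_shape[axis]
    return strides

print emit_strides((10, 0), (10, 1)) #the broadcast 'input' above; gives [1, 0]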


I suppose the first thing to do is to decide on an interface for the pycuda nd_kernel generator. It needs a function body, presumably in plain old CUDA, and a signature declaration. As a minimum degree of fanciness, I'd argue the pycuda nd_kernel signature needs to include shape and type information, plus kernel shape information. That is already pretty fancy though; either you need some hideous kludge of helper-object constructors, or you pack that information into a string somehow, using some kind of DSL. If you are making that step, you might as well go all the way and pack in all the features I implemented for the signature; compile-time and runtime shape arguments are not exactly exotic luxuries either.
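To make the discussion concrete, here is a rough sketch of what such an interface could look like; the class name NDKernel and the exact argument structure are hypothetical, just a strawman to argue about:

class NDKernel(object):
    def __init__(self, signature, body, axes=None):
        """
        signature : str
            declaration of outputs, kernel shape and inputs, e.g.
            '(float32[n,m] result) << [n,m] << (float32[n] x0, float32[m] x1)'
        body : str
            plain old CUDA C for the per-thread work
        axes : list or None
            optional override of the parallel/serial axis partition
        """
        self.signature, self.body, self.axes = signature, body, axes
        # parsing, boilerplate emission and compilation would happen here

    def __call__(self, *args, **kwargs):
        # check types/shapes, allocate outputs, pick a launch config, run
        raise NotImplementedError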

As for the body of code: one can easily imagine a base class that takes only plain old CUDA as its body argument, but calls an overridable transform method so that derived classes can apply their own transformations. That way we don't have to fight over whether this or that syntax is confusing or helpful.
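Something along these lines, as a sketch; the names NDKernelBase, SugaredNDKernel and desugar are made up for illustration:

def desugar(body):
    # placeholder for the pyparsing-based transformation of the sugared syntax
    return body

class NDKernelBase(object):
    def __init__(self, signature, body):
        self.signature = signature
        self.cuda_body = self.transform(body) #subclass hook
        # ...parse signature, emit boilerplate, compile, etc.

    def transform(self, body):
        """Identity by default: the body is assumed to be plain CUDA."""
        return body

class SugaredNDKernel(NDKernelBase):
    def transform(self, body):
        # rewrite nd-indexing, 'for (i in a:b:c)' loops, and so on, into plain CUDA
        return desugar(body)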

I'm not sure yet how to fit something like a stencil_kernel into this design. Stride and shape calculations are different for padded arrays, so you'd have to hook into the base nd_kernel code at a pretty low level to emit correct code. Not impossible, but something to think about... One could imagine a pre-transform call to the base class for the signature, so one can harvest PADDED keywords from the signature string, for instance. Or always harvest all extra keywords upon parsing, and raise an error if these are not in an allowed_keywords set of the nd_kernel subclass. If our array class supported nd slicing, this whole padding problem would be moot, of course. We could mark arrays as (not necessarily) contiguous in the signature, and pass in all strides as arguments for those (it would be somewhat suboptimal to do this for all arrays; for the typical static use case we do not have to pass any runtime info or use any registers; or am I prematurely optimizing here?)
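The keyword check itself could be as simple as the following sketch (again, all names here are made up, and the 'parser' is just a stand-in):

def check_signature_keywords(kernel_class, signature):
    # stand-in for the real parser: pull recognized keywords out of the signature string
    keywords = [w for w in ("PADDED",) if w in signature]
    unknown = set(keywords) - kernel_class.allowed_keywords
    if unknown:
        raise ValueError("signature keywords not supported by %s: %s"
                         % (kernel_class.__name__, ", ".join(sorted(unknown))))
    return keywords

class StencilKernel(NDKernelBase): #NDKernelBase as sketched above
    # a stencil kernel knows what to do with padded arrays
    allowed_keywords = frozenset(["PADDED"])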

Anyway, would love to hear from the other guys working on this!



On 29-Aug-12 02:01, Eelco Hoogendoorn wrote:

Hi Andreas,

Thank you for your reply, and glad to hear there are more people working in this direction; it seems I have been having some private conversations with people due to forgetting to reply-all... Either way, I've been working on this stuff rather obsessively the past few days, so a lot has changed since I last posted to this list.

Instead of diving into the jargon, here are some examples from my latest concoction.


#2d meshgrid function
mesh_grid_2 = nd_kernel(
    """
    (int32[2,n,m] result) << [n,m] << ():
        result(0, i0,i1) = i0;
        result(1, i0,i1) = i1;
    """,
)
print mesh_grid_2(shape=(4,4)) #we have no input args; kernel shape needs to be explicitly specified

outer_product = nd_kernel(
    """
    (float32[n,m] result = 0) << [n,m] << (float32[n] x0, float32[m] x1):
        result(i0, i1) = x0(i0) * x1(i1);
    """,
    )
x0 = gpuarray.empty((10), np.float32); x0.fill(1)
x1 = gpuarray.empty((20), np.float32); x1.fill(1)
print outer_product(x0, x1) #kernel shape is deduced from input arguments


funky_product = nd_kernel(
    """
    (float32[n,m] result, uint32 count=0) << [n,m] << (uint32[:,n,m] foo, uint32[8,n,m] bar, float32 scalar):
        float32 r = 0;
        for (i in 0:foo.shape[0])
            for (j in 0:bar.shape[0]:2)
                r += foo(i,i0,i1) * bar(j,i0,i1) * scalar;
        result(i0, i1) = r;
        atomicAdd(count, 1);
    """,
    axes = [1,1] #override the default axis parallelism; gives the same result
    )

foo = gpuarray.zeros((3,100,99), np.uint32); foo.fill(1)
bar = gpuarray.zeros((8,100,99), np.uint32); bar.fill(1)
print funky_product(foo, bar, 1.1 )


And here is the source (save the first file as numycuda_parsing.py). Note that this is not my whole codebase for this stuff, just my recent rewrites.
http://pastebin.com/urW0kEcq
http://pastebin.com/68j1q5UB


Note that I have gone a little crazy with pyparsing, to create some syrupy-sweet syntactic sugar. I don't expect everybody to like this kind of stuff, but note that it is completely optional; you are free to use or ignore as much of this syntax as you please.

Anyway, formally, the syntax is:
(output arg list) << [kernel shape] << (input arg list):
    functionbody
This reads as 'the input arguments, mapped over a virtual kernel of a certain shape, give the stated output'. The details are likely subject to change (as they already were five times today), but I am happy with the key features; particularly, the formal decoupling of the shape of the kernel from the shape of any of its arguments is a crucial step; these things need not be the same!

Some key points to note about the generated code:
- it is completely n-d, not restricted by max threadblock dimensions
- all memory access is coalesced; this is not guaranteed by the code, but easy to track by the end-user
- all memory is checked for types and shapes

Further syntactical gimmicks:
- all array shapes are available in the code as arrayidentifier.shape[dimension]
- types can be specified as numpy names
- arrays can be nd-indexed a la weave.blitz
- array shapes can be declared static (int), dynamic (:), or constrained using dummy variables in the declaration
- there is an alternative looping syntax of the form for (id in start:stop:step); that is not very exciting in itself, but it's a basis for the upcoming special syntax 'for (id in STENCIL)', which emits an (optionally) unrolled loop, with macros of the form id.weight (or whatever else you might want to know about your currently active stencil voxel) available in the loop (a rough sketch of the unrolling idea follows below)
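As a very rough, hypothetical sketch of that unrolling idea (the stencil representation and template substitution here are just illustrative, not how my current code works):

def unroll_stencil(stencil, body_template):
    """stencil: list of (offset, weight) pairs; body_template uses id.offset / id.weight."""
    lines = []
    for offset, weight in stencil:
        lines.append(body_template
                     .replace("id.offset", str(offset))
                     .replace("id.weight", repr(weight)))
    return "\n".join(lines)

laplacian_1d = [(-1, 1.0), (0, -2.0), (1, 1.0)]
print unroll_stencil(laplacian_1d, "o += in(i0 + id.offset) * id.weight;")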

Furthermore, there is a bunch of stuff going on at runtime:
- types and static shapes are checked at runtime
- output args are allocated transparently if they are not passed in and their shape can be deduced from the constraints (a rough sketch of this runtime machinery follows below)
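A minimal sketch of what that runtime machinery amounts to, assuming parsed argument specs with name/dtype/shape attributes (the attribute names are hypothetical):

from pycuda import gpuarray

def check_args(arg_specs, args, dummies):
    """arg_specs: parsed declarations; shape entries are ints, ':' or dummy names like 'n'."""
    for spec, arg in zip(arg_specs, args):
        if arg.dtype != spec.dtype:
            raise TypeError("%s: expected %s, got %s" % (spec.name, spec.dtype, arg.dtype))
        for axis, declared in enumerate(spec.shape):
            if declared == ":":
                continue                        # fully dynamic axis, nothing to check
            if isinstance(declared, str):       # dummy variable such as 'n' or 'm'
                declared = dummies.setdefault(declared, arg.shape[axis])
            if declared != arg.shape[axis]:
                raise ValueError("%s: shape mismatch on axis %d" % (spec.name, axis))

def allocate_output(spec, dummies):
    """Allocate an output whose shape is fully determined by static dims and dummies."""
    shape = tuple(dummies[s] if isinstance(s, str) else s for s in spec.shape)
    return gpuarray.zeros(shape, spec.dtype)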

What I am most pleased with is how easy nd-kernels turned out to be, in retrospect. I thought that would be a mess to pull off, but it is really quite simple. Once you have decided which axes to loop over serially and which to launch in parallel, emitting the required code is a breeze. How to make this partition into parallel and serial axes can be rather subtle, but a functioning default is easily cooked up, and this choice can be overridden for fine control.
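For example, one plausible default (just an illustration, not exactly what my code does) would be to make the fastest-varying axes parallel, up to the thread-block limits, and loop over the rest serially; the limits used here are plausible numbers, not queried from the device:

def partition_axes(kernel_shape, max_parallel_dims=3, max_threads=1024):
    parallel, serial = [], []
    threads = 1
    for axis in reversed(range(len(kernel_shape))):
        extent = kernel_shape[axis]
        if len(parallel) < max_parallel_dims and threads * min(extent, 32) <= max_threads:
            parallel.append(axis)              # launch this axis over threads/blocks
            threads *= min(extent, 32)
        else:
            serial.append(axis)                # loop over this axis inside the kernel
    return parallel[::-1], serial[::-1]

print partition_axes((3, 100, 99)) #axes 1 and 2 parallel, axis 0 serial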

How much of this should make it into pycuda is a good question; a lot of it probably does not belong there. The syntactic sugar, for instance, seems like something that would be better off as its own project. But the mechanism for launching a kernel over a virtual nd-threadblock seems like something that should definitely be part of gpuarray 2.0; similarly, emitting the boilerplate to pass in array shape information and strides appears to be quite general to me. The two files linked above are an early attempt at converging on such a split.

Anyway, all feedback is welcome!



For those interested, this is what the latter kernel transforms into (yuck!). It looks like a lot of register pressure, but note that most of these terms are just aliases, and many of them are completely unused. Note also that array dimensions constrained to be identical using a dummy variable are passed in as such; that is, the CUDA compiler is aware that these runtime values are identical, which should lead to a lot of simplifications in the array indexing code in particular.


    __global__ void nd_kernel(
            const unsigned* const __restrict__ foo, unsigned const foo_shape_0,
            const unsigned* const __restrict__ bar,
            const float const scalar,
            float* const __restrict__ result,
            unsigned* const __restrict__ count,
            unsigned const dummy_m, unsigned const dummy_n)
    {
        unsigned const kernel_shape_0 = dummy_n;
        unsigned const kernel_shape_1 = dummy_m;

        unsigned const foo_shape_1 = dummy_n;
        unsigned const foo_shape_2 = dummy_m;
        unsigned const bar_shape_0 = 8;
        unsigned const bar_shape_1 = dummy_n;
        unsigned const bar_shape_2 = dummy_m;
        unsigned const result_shape_0 = dummy_n;
        unsigned const result_shape_1 = dummy_m;
        unsigned const count_shape_0 = 1;

        unsigned const foo_stride_2 = 1;
        unsigned const foo_stride_1 = foo_stride_2 * foo_shape_2;
        unsigned const foo_stride_0 = foo_stride_1 * foo_shape_1;
        unsigned const foo_size = foo_stride_0 * foo_shape_0;
        unsigned const bar_stride_2 = 1;
        unsigned const bar_stride_1 = bar_stride_2 * bar_shape_2;
        unsigned const bar_stride_0 = bar_stride_1 * bar_shape_1;
        unsigned const bar_size = bar_stride_0 * bar_shape_0;
        unsigned const result_stride_1 = 1;
        unsigned const result_stride_0 = result_stride_1 * result_shape_1;
        unsigned const result_size = result_stride_0 * result_shape_0;
        unsigned const count_stride_0 = 1;
        unsigned const count_size = count_stride_0 * count_shape_0;
        //kernel body

        //launch threads for parallel axis
        unsigned const i1 = blockDim.y*blockIdx.y + threadIdx.y;
        if (i1 >= kernel_shape_1) return;

        //launch threads for parallel axis
        unsigned const i0 = blockDim.x*blockIdx.x + threadIdx.x;
        if (i0 >= kernel_shape_0) return;
        float r = 0;
        for (int i=0; i < foo_shape_0; i+=1)
            for (int j=0; j < bar_shape_0; j+=2)
                r += foo[foo_stride_0*i+foo_stride_1*i0+foo_stride_2*i1] * bar[bar_stride_0*j+bar_stride_1*i0+bar_stride_2*i1] * scalar;
        result[result_stride_0*i0+result_stride_1*i1] = r;
        atomicAdd(count, 1);

    }


On 28-Aug-12 17:08, Andreas Kloeckner wrote:
Hi Eelco,

Eelco Hoogendoorn <[email protected]> writes:
I have some code that I would like to contribute to pycuda. What would
the preferred way of doing so be? Create a branch in git?

I have been working on some generalizations of the 'elementwise'
functionality recently. Specifically, I have made a GriddedKernel class,
that operates on nd-arrays, and handles computation of grid indices. As
a simple example, this allows one to write an outer product kernel along
the lines of:

GriddedKernel(
arguments = 'float32[:,:] out, float32[:] x, float32[:] y',
body = 'out[i] = x[xi] * y[yi];')

Furthermore, I have created a StencilKernel class. It takes an arbitrary
stencil as input, and creates an unrolled kernel from that. I am using
it for a watershed segmentation algorithm at the moment, but a simpler
example would be something like a laplacian:

StencilKernel(
stencil = laplacian3x3,
arguments = 'float32[512,512] out, float32[512,512] in',
loop_start = 'float32 o = 0',
loop_body = 'o += in[j] * weight[j];',
loop_end = 'out[i] = o')

The StencilKernel contains convenience functions for allocating and
transferring padded memory, so the stencil can always read safely, and
one can easily implement boundary conditions of one's choice.

As you might have noticed, I have deviated from the elementwise way of
specifying arguments. numpy dtypes get translated into c-types (I hate
the implicitness of c-types), plus, one can specify size and
dimensionality constraints on arguments, which by default are translated
into runtime type/shape checks.

However, there is still lots of work to be done. I am quite new to all
this cuda stuff, and I bet my code could be greatly improved, and
expanded to use cases that have not yet occurred to me. For instance, my
kernel code is still quite naive, and does not use shared memory. Also,
I have only covered 2d and 3d use cases so far, but by looping over the
larger strides, arbitrary nd-kernels could be created. My template code
is a mess and I should probably use codepy instead. And so on...

If there is anybody willing to help me advance this, I'd love to create a
git branch for it and try my best to document and clean up my code, and
integrate it with pycuda style conventions (perhaps create an
elementwise branch that adheres to the same interface, and so on?). But
if it's just me being excited about this, I probably won't bother. Even if you don't want to help directly, your thoughts and comments are most welcome.

I agree that in the medium term, we need something more nd-aware than
the current ElementwiseKernel, and your stuff looks like a good first
step. Frédéric Bastien and Arnaud Bergeron have been hard at work
creating an array object that does just that, from (at least) Python and
C, and for CUDA and OpenCL.

So, in some sense this is great--we're spoiled for choice to pick the
best approach, and I'm looking forward to having this discussion. In the
meantime, it would certainly be good if we could take a look at your
code.

Given how easy it is to ship add-on packages in Python, I would suggest
that you set up your code to be separate from PyCUDA (i.e. not as a
branch, but as a separate package). For example, I'd be hesitant to
merge something like your StencilKernel--while it's certainly commonly
used, it's still a bit too special purpose to be in a supposedly
general-purpose package.

Thanks for getting the discussion started--looking forward to seeing
what you've got. :)

Andreas


