Hi Andreas,
Thank you for your reply, and glad to hear there are more people
working in this direction; it seems I have been having some private
conversations with people due to forgetting to reply-all... Either
way, I've been working on this stuff rather obsessively over the past few
days, so a lot has changed since I last posted to this list.
Instead of diving into the jargon, here are some examples from my
latest concoction.
#2d meshgrid function
mesh_grid_2 = nd_kernel(
    """
    (int32[2,n,m] result) << [n,m] << ():
        result(0, i0, i1) = i0;
        result(1, i0, i1) = i1;
    """,
)
#we have no input args; the kernel shape needs to be specified explicitly
print mesh_grid_2(shape=(4,4))
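For reference, this should produce the same index grid numpy gives you; a quick cross-check (assuming the kernel hands back a gpuarray, so .get() pulls it to the host):
import numpy as np
#plain-numpy equivalent of the kernel above: a (2, 4, 4) array with
#reference[0][i0, i1] == i0 and reference[1][i0, i1] == i1
reference = np.indices((4, 4), dtype=np.int32)
assert (mesh_grid_2(shape=(4, 4)).get() == reference).all()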
outer_product = nd_kernel(
    """
    (float32[n,m] result = 0) << [n,m] << (float32[n] x0, float32[m] x1):
        result(i0, i1) = x0(i0) * x1(i1);
    """,
)
x0 = gpuarray.empty((10,), np.float32); x0.fill(1)
x1 = gpuarray.empty((20,), np.float32); x1.fill(1)
#the kernel shape is deduced from the input arguments
print outer_product(x0, x1)
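As a sanity check, the result should match numpy's outer product (again assuming a gpuarray comes back):
import numpy as np
#host-side reference: a (10, 20) array of ones in this case
reference = np.outer(x0.get(), x1.get())
assert np.allclose(outer_product(x0, x1).get(), reference)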
funky_product = nd_kernel(
    """
    (float32[n,m] result, uint32 count=0) << [n,m] << (uint32[:,n,m] foo, uint32[8,n,m] bar, float32 scalar):
        float32 r = 0;
        for (i in 0:foo.shape[0])
            for (j in 0:bar.shape[0]:2)
                r += foo(i,i0,i1) * bar(j,i0,i1) * scalar;
        result(i0, i1) = r;
        atomicAdd(count, 1);
    """,
    axes = [1,1]    #override the default axis parallelism; gives the same result
)
foo = gpuarray.zeros((3,100,99), np.uint32); foo.fill(1)
bar = gpuarray.zeros((8,100,99), np.uint32); bar.fill(1)
print funky_product(foo, bar, 1.1)
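In plain numpy terms, the result works out to the following (just a restatement of what the kernel body computes, to make the example less opaque; the double sum is separable since each term factorizes):
import numpy as np
f = foo.get().astype(np.float32)
b = bar.get().astype(np.float32)
#sum over i of foo[i], times the sum over every second j of bar[j], times the scalar;
#with all-ones inputs every entry comes out to 3 * 4 * 1.1
reference = 1.1 * f.sum(axis=0) * b[::2].sum(axis=0)
And since every thread performs exactly one atomicAdd, count should come back as 100*99.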
And here is the source (save the first file as numycuda_parsing.py). Note that
this is not my whole codebase for this stuff, just my recent rewrites.
http://pastebin.com/urW0kEcq
http://pastebin.com/68j1q5UB
Note that I have gone a little crazy with pyparsing, to create some
syrupy-sweet syntactic sugar. I don't expect everybody to like this
kind of stuff, but it is completely optional; you are free
to use or ignore as much of this syntax as you please.
Anyway, formally, the syntax is:
(output arg list) << [kernel shape] << (input arg list):
functionbody
which reads as 'the input arguments, mapped over a virtual kernel of a
certain shape, give the stated output'. This is still likely to change
(it has already changed five times today), but I am happy with the key
features; in particular, the formal decoupling of the shape of the
kernel from the shape of any of its arguments is a crucial step; these
things need not be the same!
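To make that decoupling concrete, here is a matrix-vector product in the same notation (a sketch I have not run in exactly this form): the kernel shape is just [n], while the matrix argument is [n,m] and the m-axis is looped over serially inside each thread.
matrix_vector = nd_kernel(
    """
    (float32[n] y) << [n] << (float32[n,m] A, float32[m] x):
        float32 acc = 0;
        for (j in 0:A.shape[1])
            acc += A(i0, j) * x(j);
        y(i0) = acc;
    """,
)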
Some key points to note about the generated code:
- it is completely n-d, not restricted by the maximum threadblock dimensions
- all memory access is coalesced; strictly speaking this is not guaranteed
by the generated code, but it is easy for the end user to keep track of
- all arrays are checked for type and shape
Further syntactical gimmicks:
- all array shapes are available inside the code as
arrayidentifier.shape[dimension]
- types can be specified by their numpy names
- arrays can be nd-indexed, a la weave.blitz
- array shapes can be declared static (an int), dynamic (:), or
constrained using dummy variables in the declaration
- there is an alternative looping syntax of the form for (id in
start:stop:step)
- that is not very exciting by itself, but it is the basis for the
upcoming special syntax 'for (id in STENCIL)', which emits an
(optionally) unrolled loop, with macros of the form id.weight (or
whatever else you might want to know about the currently active stencil
voxel) available inside the loop; a sketch of what I have in mind
follows below
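Purely hypothetical, since none of this exists yet, and the names (s.weight, s.offset, the stencil keyword argument) are placeholders, but something along these lines:
laplace = nd_kernel(
    """
    (float32[n,m] result) << [n,m] << (float32[n,m] image):
        float32 acc = 0;
        for (s in STENCIL)
            acc += s.weight * image(i0 + s.offset[0], i1 + s.offset[1]);
        result(i0, i1) = acc;
    """,
    stencil = laplacian_3x3,    #hypothetical: the stencil object handed to the kernel factory
)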
Furthermore, there is a bunch of stuff going on at runtime:
- types and static shapes are checked at runtime
- output args are allocated transparently if they are not passed in and
their shape can be deduced from the constraints
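With funky_product from above, that runtime behaviour looks roughly like this (what exactly gets raised on a violation is up to the wrapper, so take the details with a grain of salt):
#result (100, 99) and count are allocated on the fly, since their shapes
#follow from the constraints on foo and bar
out = funky_product(foo, bar, 1.1)
#bar is declared uint32[8,n,m], so a leading dimension of 7 should be
#rejected at call time, before any kernel is launched
bad_bar = gpuarray.zeros((7, 100, 99), np.uint32)
funky_product(foo, bad_bar, 1.1)    #expect a shape error here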
What I am most pleased with is how easy nd-kernels are, in retrospect.
I thought that would be a mess to pull off, but it is really quite
simple. Once you have decided which axes to loop over serially, and
which to launch in parallel, emitting the required code is a breeze.
Deciding how to partition the axes into parallel and serial ones can be
rather subtle, but a functioning default is easily cooked up, and this
parameter can be overridden for fine control.
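For the curious, the default partition boils down to something like the following (a much-simplified sketch of the idea, not the actual code; the 16x16 block edge is just an example value): the parallel axes get mapped onto the CUDA grid/block, and whatever is left over becomes a serial loop inside the kernel.
import numpy as np

def launch_config(kernel_shape, parallel_axes, block_edge=16):
    #map the parallel axes of the kernel shape onto a grid/block pair;
    #the remaining axes are returned so they can be emitted as for-loops
    par = [kernel_shape[a] for a in parallel_axes]
    block = tuple([block_edge] * len(par) + [1] * (3 - len(par)))
    grid = tuple(int(np.ceil(float(n) / block_edge)) for n in par)
    serial = [a for a in range(len(kernel_shape)) if a not in parallel_axes]
    return grid, block, serial

#e.g. for funky_product: kernel shape (100, 99), both axes parallel
grid, block, serial = launch_config((100, 99), parallel_axes=(0, 1))
#grid == (7, 7), block == (16, 16, 1), serial == []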
How much of this should make it into pycuda is a good question; a lot
of it probably does not belong there. The syntactic sugar, for
instance, seems like something that would be better off as its own
project. But the mechanism for launching a kernel over a virtual
nd-threadblock seems like something that should definitely be part of
gpuarray 2.0; similarly, emitting the boilerplate to pass in array
shape and stride information strikes me as quite general. The two
files linked above are an early attempt at
converging on such a split.
Anyway, all feedback is welcome!
For those interested, this is what the latter kernel transforms into
(yuck!). It looks like a lot of register pressure, but note that
most of these terms are just aliases, and many of them are completely
unused. Note also that array dimensions constrained to be identical
using a dummy variable are passed in as such; that is, the CUDA
compiler is aware that these runtime values are identical, which should
lead to a lot of simplification in the array indexing code in particular.
__global__ void nd_kernel(
    const unsigned* const __restrict__ foo, unsigned const foo_shape_0,
    const unsigned* const __restrict__ bar,
    float const scalar,
    float* const __restrict__ result,
    unsigned* const __restrict__ count,
    unsigned const dummy_m, unsigned const dummy_n)
{
    unsigned const kernel_shape_0 = dummy_n;
    unsigned const kernel_shape_1 = dummy_m;
    unsigned const foo_shape_1 = dummy_n;
    unsigned const foo_shape_2 = dummy_m;
    unsigned const bar_shape_0 = 8;
    unsigned const bar_shape_1 = dummy_n;
    unsigned const bar_shape_2 = dummy_m;
    unsigned const result_shape_0 = dummy_n;
    unsigned const result_shape_1 = dummy_m;
    unsigned const count_shape_0 = 1;
    unsigned const foo_stride_2 = 1;
    unsigned const foo_stride_1 = foo_stride_2 * foo_shape_2;
    unsigned const foo_stride_0 = foo_stride_1 * foo_shape_1;
    unsigned const foo_size = foo_stride_0 * foo_shape_0;
    unsigned const bar_stride_2 = 1;
    unsigned const bar_stride_1 = bar_stride_2 * bar_shape_2;
    unsigned const bar_stride_0 = bar_stride_1 * bar_shape_1;
    unsigned const bar_size = bar_stride_0 * bar_shape_0;
    unsigned const result_stride_1 = 1;
    unsigned const result_stride_0 = result_stride_1 * result_shape_1;
    unsigned const result_size = result_stride_0 * result_shape_0;
    unsigned const count_stride_0 = 1;
    unsigned const count_size = count_stride_0 * count_shape_0;
    //kernel body
    //launch threads for parallel axis
    unsigned const i1 = blockDim.y*blockIdx.y + threadIdx.y;
    if (i1 >= kernel_shape_1) return;
    //launch threads for parallel axis
    unsigned const i0 = blockDim.x*blockIdx.x + threadIdx.x;
    if (i0 >= kernel_shape_0) return;
    float r = 0;
    for (int i=0; i < foo_shape_0; i+=1)
        for (int j=0; j < bar_shape_0; j+=2)
            r += foo[foo_stride_0*i + foo_stride_1*i0 + foo_stride_2*i1] *
                 bar[bar_stride_0*j + bar_stride_1*i0 + bar_stride_2*i1] * scalar;
    result[result_stride_0*i0 + result_stride_1*i1] = r;
    atomicAdd(count, 1);
}
On 28-Aug-12 17:08, Andreas Kloeckner wrote:
Hi Eelco,
Eelco Hoogendoorn <[email protected]> writes:
I have some code that I would like to contribute to pycuda. What would
the preferred way of doing so be? Create a branch in git?
I have been working on some generalizations of the 'elementwise'
functionality recently. Specifically, I have made a GriddedKernel class
that operates on nd-arrays and handles computation of grid indices. As
a simple example, this allows one to write an outer product kernel
along the lines of:
GriddedKernel(
arguments = 'float32[:,:] out, float32[:] x, float32[:] y',
body = 'out[i] = x[xi] * y[yi];')
Furthermore, I have created a StencilKernel class. It takes an
arbitrary stencil as input, and creates an unrolled kernel from that. I am using
it for a watershed segmentation algorithm at the moment, but a simpler
example would be something like a laplacian:
StencilKernel(
stencil = laplacian3x3,
arguments = 'float32[512,512] out, float32[512,512] in',
loop_start = 'float32 o = 0',
loop_body = 'o += in[j] * weight[j];',
loop_end = 'out[i] = o')
The StencilKernel contains convenience functions for allocating and
transferring padded memory, so the stencil can always read safely, and
one can easily implement boundary conditions of one's choice.
As you might have noticed, I have deviated from the elementwise way of
specifying arguments. numpy dtypes get translated into c-types (I hate
the implicitness of c-types); plus, one can specify size and
dimensionality constraints on arguments, which by default are
translated into runtime type/shape checks.
However, there is still lots of work to be done. I am quite new to all
this cuda stuff, and I bet my code could be greatly improved, and
expanded to use cases that have not yet occurred to me. For
instance, my
kernel code is still quite naive, and does not use shared memory. Also,
I have only covered 2d and 3d use cases so far, but by looping over the
larger strides, arbitrary nd-kernels could be created. My template code
is a mess and I should probably use codepy instead. And so on...
If there is anybody willing to help me advance this, I'd love to
create a git branch for it and try my best to document and clean up my
code, and integrate it with pycuda style conventions (perhaps create an
elementwise branch that adheres to the same interface, and so on?). But
if it's just me being excited about this, I probably won't bother.
Even if you don't want to help directly, your thoughts and comments are
most welcome.
I agree that in the medium term, we need something more nd-aware than
the current ElementwiseKernel, and your stuff looks like a good first
step. Frédéric Bastien and Arnaud Bergeron have been hard at work
creating an array object that does just that, from (at least) Python and
C, and for CUDA and OpenCL.
So, in some sense this is great--we're spoiled for choice to pick the
best approach, and I'm looking forward to having this discussion. In the
meantime, it would certainly be good if we could take a look at your
code.
Given how easy it is to ship add-on packages in Python, I would suggest
that you set up your code to be separate from PyCUDA (i.e. not as a
branch, but as a separate package). For example, I'd be hesitant to
merge something like your StencilKernel--while it's certainly commonly
used, it's still a bit too special purpose to be in a supposedly
general-purpose package.
Thanks for getting the discussion started--looking forward to seeing
what you've got. :)
Andreas