Sometimes we need some divergence in kernels, such as in the random number generation example I just posted. Technically (but not practically), we could have a thread executing forever.

It's fine to discourage writing these loops, and with the proposed signatures there still won't be any temptation to do data-parallel work inside them, but they're necessary for many real-world tasks onto which parallel algorithms don't map cleanly. Once this is all said and done, I think I'll write an initial implementation of a parallel Monte Carlo library on top of accelerate. It seems really promising. I especially love that I can test my programs without an Nvidia GPU, then switch backends to performance-test and run real-world code.

Thanks,
  - Clark

P.S. Don't forget about do-while! =)
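(On that note: if `while` lands as proposed below, do-while is a one-liner on top of it rather than a separate primitive. A sketch, not tested against anything; the `while` here is just a placeholder for the proposed combinator so the snippet typechecks:)

    import Data.Array.Accelerate (Exp)

    -- Placeholder for the combinator proposed below; not implemented yet.
    while :: (Exp a -> Exp Bool) -> (Exp a -> Exp a) -> Exp a -> Exp a
    while = error "proposed combinator, not implemented yet"

    -- With `while` in hand, do-while needs no separate primitive:
    -- run the body once up front, then loop on the condition.
    doWhile :: (Exp a -> Exp Bool) -> (Exp a -> Exp a) -> Exp a -> Exp a
    doWhile p f x = while p f (f x)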
On Thu, Dec 13, 2012 at 7:47 AM, Trevor L. McDonell <[email protected]> wrote:

> Hi Clark,
>
> The question of sequential loops in Accelerate has come up a few times in
> the past. The main sticking point is knowing how to implement them in a
> way that encourages efficient programs; avoiding irregular arrays
> (iteration depths), discouraging scalar versions of collective
> combinators, etc. Basically we need to avoid thread divergence where
> possible, else the GPU SIMD hardware won't be well utilised.
>
> I've experimented a little with this. If you check the generated code of
> the mandelbrot program [1], you'll see it has recovered the _fixed_
> iteration depth into a scalar loop.
>
> I'll hack up something like a while loop for value iteration and see what
> happens. Tentative proposal (perhaps just the second):
>
>   iterate :: Exp Int            -- fixed iteration count (maybe just a regular Int)
>           -> (Exp a -> Exp a)   -- function to repeatedly apply
>           -> Exp a              -- initial (seed) value
>           -> Exp a
>
>   while :: (Exp a -> Exp Bool)
>         -> (Exp a -> Exp a)
>         -> Exp a
>         -> Exp a
>
> It would be good if we could collect some additional concrete applications
> and see if there are better ways to map certain classes of sequential
> iteration to efficient GPU code.
>
> Additional thoughts/comments welcome!
>
> Cheers,
> -Trev
>
> [1]: https://github.com/AccelerateHS/accelerate-examples/tree/master/examples/mandelbrot
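To make the fixed-depth case concrete: a loop like the one below is what I'd expect `iterate` to express. Nothing here exists yet; `iterate` is a placeholder with the proposed signature, and `sqrtNewton`/`step` are just an example body I made up.

    import Prelude hiding (iterate)
    import Data.Array.Accelerate (Exp)

    -- Placeholder for the proposed combinator; not implemented yet.
    iterate :: Exp Int -> (Exp a -> Exp a) -> Exp a -> Exp a
    iterate = error "proposed combinator, not implemented yet"

    -- A fixed-depth loop: twenty Newton steps for a square root. This is
    -- the sort of iteration the mandelbrot example already recovers into
    -- a scalar loop. Untested sketch.
    sqrtNewton :: Exp Float -> Exp Float
    sqrtNewton x = iterate 20 step x
      where
        step y = 0.5 * (y + x / y)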
> On 12/12/2012, at 4:34 PM, Clark Gaebel <[email protected]> wrote:
>
> Hi Trevor (and cafe),
>
> I've been playing more and more with accelerate, and I find it quite
> annoying that there are no loops. It makes implementing many algorithms
> much harder than it should be.
>
> For example, I would love to submit a patch to fix issue #52 [0] on github
> by implementing MWC64X [1], but it's very hard to port the OpenCL code on
> that page when it's impossible to write kernel expressions with loops.
> Also, that means none of the high-level combinators I'm used to from my
> sequential code (such as map and fold) work inside an accelerate CUDA
> kernel.
>
> As a nice strawman example, how would one implement the following kernel
> in accelerate, assuming 'rand_next', 'rand_get', and 'rand_skip' can all
> be implemented cheaply?
>
>   #include <stdint.h>
>
>   typedef uint64_t rand_state;
>
>   __device__ rand_state rand_next(rand_state s);
>   __device__ uint32_t   rand_get(rand_state s);
>   __device__ rand_state rand_skip(rand_state s, uint64_t distance);
>   __device__ uint32_t   round_to_next_pow2(uint32_t n);
>
>   // Fills an array with random numbers given a random seed, a maximum
>   // random number to generate, and an output array to put the result
>   // in. The output will be in the range [0, rand_max).
>   __global__ void fill_random(rand_state start_state, uint32_t rand_max,
>                               uint32_t* out) {
>     rand_state current_state = start_state;
>     int i = blockDim.x*blockIdx.x + threadIdx.x;
>     // assumes we skip less than 1 million times per element...
>     current_state = rand_skip(current_state, (uint64_t)i * 1000000);
>     // mask to the next power of two so most draws are accepted
>     uint32_t mask = round_to_next_pow2(rand_max) - 1;
>     uint32_t result;
>     do {
>       result = rand_get(current_state) & mask;
>       current_state = rand_next(current_state);
>     } while (result >= rand_max);   // rejection sampling
>
>     out[i] = result;
>   } // note: code was neither debugged, run, nor compiled.
>
> Thanks,
> - Clark
>
> [0] https://github.com/AccelerateHS/accelerate/issues/52
> [1] http://cas.ee.ic.ac.uk/people/dt10/research/rngs-gpu-mwc64x.html
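And for the strawman itself, here is roughly the shape I'd hope for once `while` exists. Everything below is a sketch: `while` is a placeholder for the proposed combinator, and the rand_* functions are stand-ins (a plain 64-bit LCG and a fake skip-ahead, not MWC64X) with just enough arithmetic to typecheck.

    import Data.Word
    import qualified Data.Array.Accelerate as A
    import Data.Array.Accelerate
           ( Acc, Exp, Vector, generate, index1, unindex1, lift, constant, (>=*) )

    -- Placeholder for the proposed combinator; not implemented yet.
    while :: A.Elt a => (Exp a -> Exp Bool) -> (Exp a -> Exp a) -> Exp a -> Exp a
    while = error "proposed combinator, not implemented yet"

    -- Stand-ins for the rand_* primitives: a plain 64-bit LCG, *not*
    -- MWC64X. Just enough to give the right types.
    randNext :: Exp Word64 -> Exp Word64
    randNext s = 6364136223846793005 * s + 1442695040888963407

    randGet :: Exp Word64 -> Exp Word32
    randGet s = A.fromIntegral s            -- truncate to the low 32 bits

    randSkip :: Exp Word64 -> Exp Word64 -> Exp Word64
    randSkip s d = s + d                    -- placeholder, not a real skip-ahead

    -- One element per thread: skip to a per-element substream, then
    -- rejection-sample until the draw lands in [0, rand_max). The
    -- power-of-two mask trick from the CUDA version is omitted for
    -- brevity, so small rand_max values would reject often.
    fillRandom :: Exp Word64 -> Exp Word32 -> Exp Int -> Acc (Vector Word32)
    fillRandom seed randMax n =
      generate (index1 n) $ \ix ->
        let i     = unindex1 ix
            s0    = randSkip seed (A.fromIntegral i * 1000000)
            step :: Exp (Word64, Word32) -> Exp (Word64, Word32)
            step st = let s = A.fst st
                      in  lift (randNext s, randGet s)
            start = step (lift (s0, constant 0 :: Exp Word32))
        in  A.snd $ while (\st -> A.snd st >=* randMax) step start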
_______________________________________________
Haskell-Cafe mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/haskell-cafe
