On Wednesday, 18 February 2015 at 15:15:21 UTC, Russel Winder wrote:

The issue is to create a GPGPU kernel (usually C code with bizarre data structures and calling conventions), set it running, and then pipe data in and collect data out. This is currently very slow, but the next generation of Intel chips will fix this (*). And then there is the OpenCL/CUDA debate.

Personally I favour OpenCL, for all its deficiencies, because it is vendor neutral; CUDA binds you to NVIDIA. In any case, there is an NVIDIA back end for OpenCL. With a system like PyOpenCL, the infrastructure, data, and process handling is abstracted, but you still have to write the kernels in C. They really ought to do a Python DSL for that, but… So with D, can we write D kernels and have them compiled and loaded using a combination of CTFE, D → C translation, a C compiler call, and other magic?

I'd like to talk about the kernel languages (having done both OpenCL and CUDA).

A big speed-up factor is the multiple levels of parallelism exposed in OpenCL C and CUDA C:

- context parallelism (e.g. several GPUs)
- command parallelism (based on a futures model)
- block parallelism
- warp/sub-block parallelism
- in each sub-block, N threads (typically 32 or 64)

All of that is supported by appropriate barrier semantics; a sketch of the hierarchy follows below. Typical C-like code only has threads as parallelism, and a less complex cache hierarchy.
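
Here is a minimal CUDA sketch of that hierarchy (every name is illustrative, not taken from any real codebase): the grid splits into blocks, blocks into warps of warpSize threads, each thread knows its position at every level, and __syncthreads() is the per-block barrier.

__global__ void hierarchyDemo(float *data)
{
    __shared__ float tile[256];               // block-local memory, shared by the whole block

    int tid  = threadIdx.x;                   // index within the block
    int gid  = blockIdx.x * blockDim.x + tid; // index within the whole grid
    int warp = tid / warpSize;                // warp index within the block (warpSize is 32 on NVIDIA)

    tile[tid] = data[gid];                    // each thread stages one element
    __syncthreads();                          // block-wide barrier: tile[] is fully populated past this line

    data[gid] = tile[tid] + (float)warp;      // placeholder use of the staged data
}

int main()
{
    const int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    hierarchyDemo<<<n / 256, 256>>>(d);       // grid of 4 blocks, 256 threads (8 warps) each
    cudaDeviceSynchronize();
    cudaFree(d);
}

Context parallelism (several GPUs) and command parallelism (streams/queues of asynchronous launches) sit above this, on the host side.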

Also, most algorithms don't translate all that well to SIMD threads working in lockstep.

Example: instead of looping over the 2D image and performing a horizontal blur over 15 pixels per output, perform the operation on 32x16 blocks simultaneously, while caching data in block-local memory (a sketch follows).
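
A hedged CUDA sketch of that blur, assuming a row-major grayscale float image whose width and height are multiples of the block size (all names here are illustrative): each 32x16 block stages a 46-wide tile, apron included, into shared memory, then every thread averages its 15-pixel window from that fast block-local copy.

#define RADIUS 7   // 15-pixel window: RADIUS left + centre + RADIUS right
#define BX 32
#define BY 16

__global__ void hblur(const float *in, float *out, int width)
{
    // Tile with a RADIUS-wide apron on each side, in block-local memory.
    __shared__ float tile[BY][BX + 2 * RADIUS];

    int x = blockIdx.x * BX + threadIdx.x;
    int y = blockIdx.y * BY + threadIdx.y;

    // Stage the centre pixel.
    tile[threadIdx.y][threadIdx.x + RADIUS] = in[y * width + x];

    // Threads at the left edge of the block also stage the two aprons,
    // clamping at the image borders.
    if (threadIdx.x < RADIUS) {
        int lx = max(x - RADIUS, 0);
        int rx = min(x + BX, width - 1);
        tile[threadIdx.y][threadIdx.x] = in[y * width + lx];
        tile[threadIdx.y][threadIdx.x + BX + RADIUS] = in[y * width + rx];
    }
    __syncthreads();   // barrier: the whole tile is staged

    // Each thread averages its 15-pixel window out of shared memory.
    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k)
        sum += tile[threadIdx.y][threadIdx.x + RADIUS + k];
    out[y * width + x] = sum / (2 * RADIUS + 1);
}

Launched as hblur<<<dim3(width / BX, height / BY), dim3(BX, BY)>>>(d_in, d_out, width), each input pixel is read from global memory roughly once per block instead of 15 times.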

It is much like an auto-vectorization problem, and auto-vectorization is hard.



