I don't think we should be looking at either CUDA or OpenCL directly.
We should be looking for a generic library that can target either and
is well maintained and actively developed. Any GPU code we write
ourselves would rapidly be overtaken by changes in the hardware and
innovations in parallel algorithms. If we find a library that provides
a sorting API and adapt our code to use it, then we'll get the benefit
of any new hardware features as the library adds support for them.

I think one option is to make the sort function pluggable with a shared
library/dll. I see several benefits from this:

- It could be in the interest of the hardware vendor to provide the most powerful sort implementation (I'm sure, for example, that the TBB sort implementation is faster than pg_sort).

- It lets people "play" with it without being deeply involved in pg development.

- It can relieve the postgres core group of having to choose the right language/tool/implementation to use.

- It also lets people who are not willing (or not able, for that matter) to
upgrade the postgres engine change the sort function instead upon a hardware upgrade.
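To make the idea a bit more concrete, here is a minimal sketch of what a pluggable sort entry point could look like. Everything in it is an assumption made for illustration: sort_fn, set_sort_plugin, engine_sort and toy_insertion_sort are hypothetical names, not existing postgres symbols, and the plug-in signature is simply copied from the C library qsort(). Loading the function from a shared library/dll is left out; the sketch only shows the indirection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical plug-in signature, deliberately identical to qsort(). */
typedef void (*sort_fn)(void *base, size_t nmemb, size_t size,
                        int (*cmp)(const void *, const void *));

/* Currently installed implementation; the C library qsort() is the default. */
static sort_fn current_sort = qsort;

/* Install a replacement sort; passing NULL restores the default. */
static void
set_sort_plugin(sort_fn fn)
{
    current_sort = (fn != NULL) ? fn : qsort;
}

/* Single entry point the engine would call instead of a hard-wired sort. */
static void
engine_sort(void *base, size_t nmemb, size_t size,
            int (*cmp)(const void *, const void *))
{
    assert(current_sort != NULL);
    current_sort(base, nmemb, size, cmp);
}

/* Example comparator for int keys. */
static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Toy stand-in for a vendor-supplied sort: a byte-swapping insertion sort. */
static void
toy_insertion_sort(void *base, size_t nmemb, size_t size,
                   int (*cmp)(const void *, const void *))
{
    char *a = base;

    for (size_t i = 1; i < nmemb; i++)
    {
        /* Move element i left until its left neighbour is not greater. */
        for (size_t j = i; j > 0 && cmp(a + (j - 1) * size, a + j * size) > 0; j--)
        {
            for (size_t k = 0; k < size; k++)
            {
                char t = a[(j - 1) * size + k];

                a[(j - 1) * size + k] = a[j * size + k];
                a[j * size + k] = t;
            }
        }
    }
}
```

A caller would do set_sort_plugin(toy_insertion_sort) once and keep calling engine_sort(array, n, sizeof(int), cmp_int) everywhere else, so the sort can be swapped without touching the call sites.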

Of course, if this happens the postgres engine has to make some sort of
sanity check (for example, that the function actually sorts) before it can "thrust" the plugged sort.
The engine can even have multiple sort implementations available and
use the most proficient one (imagine that some sorts perform better on
a certain range of values or on a certain element size).
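As a rough sketch of the sanity check and of picking the most proficient implementation, something along these lines could work. All names here (sort_fn, sort_is_sane, pick_sort, broken_sort) are hypothetical and only for illustration. The check compares a candidate against the C library qsort() on sample data, and the selection simply times each sane candidate once with clock(), which is a crude measure; a real engine would need something more careful.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical plug-in signature, identical to qsort(). */
typedef void (*sort_fn)(void *base, size_t nmemb, size_t size,
                        int (*cmp)(const void *, const void *));

static int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/* Fill v with deterministic pseudo-random keys in 0..999 (simple LCG). */
static void
fill_sample(int *v, size_t n)
{
    unsigned int s = 12345u;

    for (size_t i = 0; i < n; i++)
    {
        s = s * 1103515245u + 12345u;
        v[i] = (int) ((s >> 16) % 1000u);
    }
}

/*
 * Sanity check: the candidate must produce the same result as qsort() on
 * the sample. This catches wrong ordering as well as lost or duplicated
 * elements.
 */
static bool
sort_is_sane(sort_fn candidate, size_t n)
{
    assert(candidate != NULL && n > 0);

    int *got = malloc(n * sizeof(int));
    int *want = malloc(n * sizeof(int));
    bool ok = (got != NULL && want != NULL);

    if (ok)
    {
        fill_sample(got, n);
        fill_sample(want, n);
        candidate(got, n, sizeof(int), cmp_int);
        qsort(want, n, sizeof(int), cmp_int);
        for (size_t i = 0; i < n && ok; i++)
            ok = (got[i] == want[i]);
    }
    free(got);
    free(want);
    return ok;
}

/* Return the fastest sane candidate for n elements, or qsort() if none passes. */
static sort_fn
pick_sort(const sort_fn *candidates, size_t count, size_t n)
{
    sort_fn best = qsort;
    double best_time = -1.0;
    int *buf = malloc(n * sizeof(int));

    if (buf == NULL)
        return best;
    for (size_t i = 0; i < count; i++)
    {
        if (!sort_is_sane(candidates[i], n))
            continue;           /* never trust a sort that fails the check */
        fill_sample(buf, n);
        clock_t t0 = clock();
        candidates[i](buf, n, sizeof(int), cmp_int);
        double elapsed = (double) (clock() - t0) / CLOCKS_PER_SEC;

        if (best_time < 0.0 || elapsed < best_time)
        {
            best = candidates[i];
            best_time = elapsed;
        }
    }
    free(buf);
    return best;
}

/* Deliberately broken plug-in that leaves its input untouched. */
static void
broken_sort(void *base, size_t nmemb, size_t size,
            int (*cmp)(const void *, const void *))
{
    (void) base; (void) nmemb; (void) size; (void) cmp;
}
```

The n argument is there so the choice can be repeated for different input sizes; the same idea could be extended to different key ranges or element sizes by varying what fill_sample produces.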

