2010/8/20 Jorge Timón <[email protected]>:
> Hi, I'm just curious about the feasibility of running python code in a gpu
> by extending pypy.
Disclaimer: I am not a PyPy developer, even though I've been following
the project with interest. Nor am I a GPU expert - I just provide links
to the literature I've read.
Still, I believe that such an attempt is unlikely to be worthwhile.
Quoting Wikipedia's synthesis:
"Unlike CPUs however, GPUs have a parallel throughput architecture
that emphasizes executing many concurrent threads slowly, rather than
executing a single thread very fast."
Significant optimizations are needed anyway to get performance out of
GPU code (and if you don't need every last bit of performance, why
bother with a GPU?), so I think that having to use a C-like language
is the smallest of the problems.

> I don't have the time (and probably the knowledge neither) to develop that
> pypy extension, but I just want to know if it's possible.
> I'm interested in languages like openCL and nvidia's CUDA because I think
> the future of supercomputing is going to be GPGPU.

I would like to point out that, while this may be true in some cases,
the importance of GPGPU is often exaggerated:

http://portal.acm.org/citation.cfm?id=1816021

Researchers in the field are mostly aware that GPGPU is the way to go
only for a very restricted category of code. For that code, fine.
So, instead of running Python code on a GPU, it is better to design
from scratch an easy way to program a GPU efficiently for those tasks,
and projects for that already exist (e.g. the ones you cite).

Additionally, it would probably take a different kind of JIT to
exploit GPUs: no branch prediction, very small non-coherent caches, no
efficient synchronization primitives, as I read in that paper... I'm
no expert, but I guess you'd need to rearchitect the needed
optimizations from scratch.
And it took 20-30 years to get from the first, slow Lisp (1958) to,
say, Self (1991), a landmark in performant high-level languages
derived from Smalltalk. Much of that work would have to be redone.

So, I guess that the effort to compile Python code for a GPU is not
worth it. There might be further problems with the kind of code a JIT
generates, since a GPU has no branch predictor, no conventional caches,
and so on, but I'm no GPU expert and I would have to check again.

Finally, for general-purpose code, exploiting the large number of CPU
cores we expect on our desktop systems is already a challenge.

> There's people working in
> bringing GPGPU to python:
>
> http://mathema.tician.de/software/pyopencl
> http://mathema.tician.de/software/pycuda
>
> Would it be possible to run python code in parallel without the need (for
> the developer) of actively parallelizing the code?

I would say that Python is not yet the language to use to write
efficient parallel code, because of the Global Interpreter Lock
(Google for "Python GIL"). The two implementations having no GIL are
IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL,
and the current focus is not on removing it.
Scientific computing uses external libraries (like NumPy) - for the
supported algorithms, one could introduce parallelism at that level.
If that's enough for your application, good.
If you want to write a parallel algorithm in Python, we're not there yet.
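
To illustrate what "parallelism at the library level" means, here is a
minimal sketch (assuming NumPy is installed; whether the multiplication
actually runs in parallel depends on the BLAS library NumPy was built
against, e.g. a multithreaded ATLAS or MKL build):

    # The heavy lifting is delegated to compiled BLAS code, outside the
    # interpreter, so the GIL is not the bottleneck here.
    import numpy as np

    n = 1000
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    c = np.dot(a, b)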

> I'm not talking about code of hard concurrency, but of code with intrinsic
> parallelism (let's say matrix multiplication).

Automatic parallelization is hard, see:
http://en.wikipedia.org/wiki/Automatic_parallelization

Lots of scientists have tried, lots of money has been invested, but
it's still hard.
The only practical approaches still require the programmer to
introduce parallelism, but in ways much simpler than using
multithreading directly. Google for OpenMP and Cilk.
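
OpenMP and Cilk live at the C/C++ level, but the idea - the programmer
marks the parallel loop, the runtime distributes the work - has a rough
Python analogue in the standard multiprocessing module. A hedged sketch
(the worker function is made up purely for illustration):

    from multiprocessing import Pool

    def square(x):
        # stand-in for some real per-element work
        return x * x

    if __name__ == "__main__":
        data = range(1000000)
        pool = Pool()                     # one worker per CPU core by default
        results = pool.map(square, data)  # the "parallel for"
        pool.close()
        pool.join()

Note that this sidesteps the GIL by using processes rather than threads,
at the cost of copying the data to the workers.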

> Would a JIT compilation be capable of detecting parallelism?
Summing up what is above, probably not.

Moreover, matrix multiplication may not be as easy as one might think.
I do not know how to write it for a GPU, but at the end I quote some
suggestions from that paper (where it is one of the benchmarks).
Here, I explain why writing it for a CPU is complicated. You can
multiply two matrices with a triply nested for loop, but such an
algorithm performs poorly on big matrices because of bad cache
locality. GPUs, according to the above-mentioned paper, provide no
caches and hide latency in other ways.
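
For concreteness, the naive algorithm looks like this in pure Python
(an untested sketch; pure Python adds interpreter overhead on top of
the cache problem):

    # Naive matrix multiplication: three nested loops, O(n^3) operations.
    # The inner loop reads b[k][j] down a column, which is exactly the
    # access pattern with poor cache locality.
    def matmul_naive(a, b):
        n, m, p = len(a), len(b), len(b[0])
        c = [[0.0] * p for _ in range(n)]
        for i in range(n):
            for j in range(p):
                s = 0.0
                for k in range(m):
                    s += a[i][k] * b[k][j]
                c[i][j] = s
        return c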

See here for the two main ideas for solving this problem, i.e. for
writing an efficient matrix multiplication algorithm:
http://en.wikipedia.org/wiki/Cache_blocking
http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
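
A hedged sketch of the first idea (cache blocking): reorganize the loops
to work on small sub-blocks that fit in cache. The block size of 64 is
an arbitrary illustrative choice, not a tuned value:

    BS = 64  # block size; would need tuning for a real cache

    def matmul_blocked(a, b):
        n, m, p = len(a), len(b), len(b[0])
        c = [[0.0] * p for _ in range(n)]
        for ii in range(0, n, BS):
            for kk in range(0, m, BS):
                for jj in range(0, p, BS):
                    # multiply the (ii, kk) block of a by the (kk, jj)
                    # block of b, accumulating into the (ii, jj) block of c
                    for i in range(ii, min(ii + BS, n)):
                        for k in range(kk, min(kk + BS, m)):
                            aik = a[i][k]
                            for j in range(jj, min(jj + BS, p)):
                                c[i][j] += aik * b[k][j]
        return c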

Then, you need to parallelize the resulting code yourself, which might
or might not be easy (depending on the interactions between the
parallel blocks that result).
In that paper, where matrix multiplication appears as SGEMM (the BLAS
routine implementing it), they suggest using a cache-blocked version of
matrix multiplication for both CPUs and GPUs, and argue that
parallelization is then easy.
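
To show why parallelization becomes easy once the code is blocked: each
band of rows of the result can be computed independently, so a sketch
along these lines would work (again using multiprocessing only as an
illustration; matmul_blocked is the sketch above, and the helper names
are made up):

    from multiprocessing import Pool

    def row_band(args):
        a_rows, b = args
        # each band of rows of the result is independent of the others
        return matmul_blocked(a_rows, b)

    def matmul_parallel(a, b, bands=4):
        step = (len(a) + bands - 1) // bands
        chunks = [(a[i:i + step], b) for i in range(0, len(a), step)]
        pool = Pool(bands)
        parts = pool.map(row_band, chunks)
        pool.close()
        pool.join()
        return [row for part in parts for row in part]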

Cheers,
-- 
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/