2010/8/20 Paolo Giarrusso <[email protected]>:
> 2010/8/20 Jorge Timón <[email protected]>:
>> Hi, I'm just curious about the feasibility of running python code in a gpu
>> by extending pypy.
> Disclaimer: I am not a PyPy developer, even though I've been following the
> project with interest. Nor am I a GPU expert - I provide links to the
> literature I've read.
> Yet, I believe that such an attempt is unlikely to be interesting.
> Quoting Wikipedia's synthesis:
> "Unlike CPUs however, GPUs have a parallel throughput architecture
> that emphasizes executing many concurrent threads slowly, rather than
> executing a single thread very fast."
> And significant optimizations are needed anyway to get performance from
> GPU code (and if you don't need the last bit of performance, why
> bother with a GPU?), so I think that the need to use a C-like language
> is the smallest problem.
>
>> I don't have the time (and probably not the knowledge either) to develop
>> that pypy extension, but I just want to know if it's possible.
>> I'm interested in languages like openCL and nvidia's CUDA because I think
>> the future of supercomputing is going to be GPGPU.
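For concreteness, this is roughly what driving a GPU from Python looks
like today with PyCUDA (an untested sketch written from memory of the
PyCUDA examples; the kernel name, sizes and launch configuration are made
up). Note that the part that actually runs on the GPU is CUDA C embedded
as a string, not Python:

import numpy
import pycuda.autoinit            # picks a device and creates a context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# The GPU kernel: plain CUDA C, with explicit types and explicit
# thread indexing - no Python semantics anywhere.
mod = SourceModule("""
__global__ void add_them(float *dest, float *a, float *b)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    dest[i] = a[i] + b[i];
}
""")
add_them = mod.get_function("add_them")

n = 1024
a = numpy.random.randn(n).astype(numpy.float32)
b = numpy.random.randn(n).astype(numpy.float32)
dest = numpy.zeros_like(a)

# One thread per element; the data is copied explicitly to and from
# the card around the kernel launch.
add_them(drv.Out(dest), drv.In(a), drv.In(b),
         block=(256, 1, 1), grid=(n // 256, 1))

assert numpy.allclose(dest, a + b)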
Python is a very different language than CUDA or openCL, hence it's not
completely trivial to map Python's semantics to something that will make
sense for a GPU.

>
> I would like to point out that while for some cases it might be right,
> the importance of GPGPU is probably often exaggerated:
>
> http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1#
>
> Researchers in the field are mostly aware of the fact that GPGPU is
> the way to go only for a very restricted category of code. For that
> code, fine.
> Thus, instead of running Python code on a GPU, designing from scratch
> an easy way to program a GPU efficiently for those tasks is better,
> and projects for that already exist (i.e. what you cite).
>
> Additionally, it would probably take a different kind of JIT to
> exploit GPUs. No branch prediction, very small non-coherent caches, no
> efficient synchronization primitives, as I read from this paper... I'm
> no expert, but I guess you'd need to re-architect the needed
> optimizations from scratch.
> And it took 20-30 years to get from the first, slow Lisp (1958) to,
> say, Self (1991), a landmark in performant high-level languages,
> derived from Smalltalk. Most of that would have to be redone.
>
> So, I guess that the effort to compile Python code for a GPU is not
> worth it. There might be further reasons due to the kind of code a JIT
> generates, since a GPU has no branch predictor, no caches, and so on,
> but I'm no GPU expert and I would have to check again.
>
> Finally, for general-purpose code, exploiting the big expected number
> of CPUs on our desktop systems is already a challenge.
>
>> There are people working on bringing GPGPU to python:
>>
>> http://mathema.tician.de/software/pyopencl
>> http://mathema.tician.de/software/pycuda
>>
>> Would it be possible to run python code in parallel without the need (for
>> the developer) of actively parallelizing the code?
>
> I would say that Python is not yet the language to use to write
> efficient parallel code, because of the Global Interpreter Lock
> (Google for "Python GIL"). The two implementations having no GIL are
> IronPython (as slow as CPython) and Jython (slower). PyPy has a GIL,
> and the current focus is not on removing it.
> Scientific computing uses external libraries (like NumPy) - for the
> supported algorithms, one could introduce parallelism at that level.
> If that's enough for your application, good.
> If you want to write a parallel algorithm in Python, we're not there yet.
>
>> I'm not talking about code with hard concurrency, but about code with
>> intrinsic parallelism (let's say matrix multiplication).
>
> Automatic parallelization is hard, see:
> http://en.wikipedia.org/wiki/Automatic_parallelization
>
> Lots of scientists have tried, lots of money has been invested, but
> it's still hard.
> The only practical approaches still require the programmer to
> introduce parallelism, but in ways much simpler than using
> multithreading directly. Google for OpenMP and Cilk.
>
>> Would JIT compilation be capable of detecting parallelism?
> Summing up what is above, probably not.
>
> Moreover, matrix multiplication may not be as easy as one might think.
> I do not know how to write it for a GPU, but at the end I reference
> some suggestions from that paper (where it is one of the benchmarks).
> But here, I explain why writing it for a CPU is complicated. You can
> multiply two matrices with a triply nested for loop, but such an
> algorithm has poor performance for big matrices because of bad cache
> locality.
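For concreteness, here are the naive triple loop and the cache-blocked
variant that the links below describe (a pure-Python sketch just to show
the memory access pattern; the function names and block size are made up,
and any real implementation would call numpy.dot or a tuned BLAS instead):

def matmul_naive(A, B, C, n):
    # Triply nested loop over lists of lists. The innermost loop walks
    # B column-wise (B[k][j] with k varying), so for large n almost
    # every access to B touches a different row and misses the cache.
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s

def matmul_blocked(A, B, C, n, bs=64):
    # Cache-blocked version: C must start zeroed. Working on bs x bs
    # tiles keeps the three tiles currently in use small enough to stay
    # in cache, so each element is reused many times before eviction.
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                for i in range(i0, min(i0 + bs, n)):
                    for j in range(j0, min(j0 + bs, n)):
                        s = C[i][j]
                        for k in range(k0, min(k0 + bs, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s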
> GPUs, according to the above-mentioned paper, provide no caches and
> hide latency in other ways.
>
> See here for the two main alternative ideas which allow solving this
> problem of writing an efficient matrix multiplication algorithm:
> http://en.wikipedia.org/wiki/Cache_blocking
> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
>
> Then, you need to parallelize the resulting code yourself, which might
> or might not be easy (depending on the interactions between the
> parallel blocks that are found there).
> In that paper, where matrix multiplication is called SGEMM (after the
> BLAS routine implementing it), they suggest using a cache-blocked
> version of matrix multiplication for both CPUs and GPUs, and argue
> that parallelization is then easy.

What's interesting in combining a GPU and a JIT is optimizing numpy
vectorized operations, to speed up things like big_array_a + big_array_b
using SSE and the GPU (there is a rough sketch of what I mean at the end
of this mail). However, I don't think anyone plans to work on it in the
near future, so if you don't have time this stays as a topic of interest
only :)

>
> Cheers,
> --
> Paolo Giarrusso - Ph.D. Student
> http://www.informatik.uni-marburg.de/~pgiarrusso/
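Here is a rough, untested sketch of the kind of thing I mean, using the
pycuda.gpuarray API from memory (array sizes made up). Today you have to
move the arrays to the card by hand, and then the same '+' runs as a CUDA
kernel; the purely hypothetical JIT part would be deciding automatically
when the transfer pays off and fusing several such operations instead of
materializing temporaries:

import numpy
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

big_array_a = numpy.random.randn(10**7).astype(numpy.float32)
big_array_b = numpy.random.randn(10**7).astype(numpy.float32)

# CPU: numpy already runs this loop in C (and possibly with SSE).
cpu_result = big_array_a + big_array_b

# GPU: same expression, but the operands live on the card and the '+'
# launches an elementwise CUDA kernel under the hood.
a_gpu = gpuarray.to_gpu(big_array_a)
b_gpu = gpuarray.to_gpu(big_array_b)
gpu_result = (a_gpu + b_gpu).get()   # .get() copies back to host memory

assert numpy.allclose(cpu_result, gpu_result)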
