2010/8/20 Jim Baker <[email protected]>:
> The Unladen Swallow doc, which was derived from a PEP that Jeff proposed,
> seems to be a fair descriptive outline of Python memory models in general,
> and Jython's in specific.
> Obviously the underlying implementation in the JVM is happens-before
> consistency; everything else derives from there. The CHM provides
> additional consistency constraints that should imply sequential consistency
> for a (vast) subset of Python programs. However, I can readily construct a
> program that violates sequential consistency: maybe it uses slots (stored
> in a Java array), or the array module (which also just wraps Java arrays),
> or by accesses local variables in a frame from another thread (same
> storage, same problem). Likewise I can also create Python programs that
> access Java classes (since this is Jython!), and they too will only see
> happens-before consistency.

OK, I guess that the volatile semantics for fields were just a side effect.
As far as I can see, you get sequential consistency only in practice, not
in theory: you have happens-before edges only when a reader and a writer
touch the same field. In practice, the few cases where the difference
matters can't arise here, as far as I can tell, because a hash function
decides to which of the submaps a mapping belongs - so a reader and a
writer of the same attribute always end up touching the same fields.
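To make the slots case concrete, here is a minimal toy program of the kind
you describe (my own sketch, nothing taken from Jython). With __slots__ the
two attributes live in a plain Java array, so under happens-before
consistency alone the reader may in principle observe the writes out of
order; with a CHM-backed __dict__ the stores get the volatile-style
behaviour instead:

    # Toy example of a racy program, as discussed above. With __slots__,
    # Jython stores the attributes in a plain (non-volatile) Java array,
    # so only happens-before consistency applies: the reader below may,
    # in principle, spin forever or print 0 instead of 42. With an
    # ordinary __dict__ (backed by ConcurrentHashMap) the writes get
    # volatile-like semantics.
    import threading

    class Box(object):
        __slots__ = ('data', 'ready')
        def __init__(self):
            self.data = 0
            self.ready = False

    box = Box()

    def writer():
        box.data = 42
        box.ready = True       # intentionally unsynchronized

    def reader():
        while not box.ready:   # may never see the write, or...
            pass
        print(box.data)        # ...may print a stale 0 instead of 42

    t1 = threading.Thread(target=reader)
    t2 = threading.Thread(target=writer)
    t1.start(); t2.start()
    t2.join(); t1.join()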
Your mention of slots is very cool! You made me recall that once you get
shadow classes in Python, you can not only do inline caching, you also get
the _same_ object layout as with slots, because adding a member causes a
hidden-class transition, getting rid of any kind of dictionary _after
compilation_. A few exceptions:
* an immutable dictionary mapping field names to offsets is still used,
  both during JIT compilation and when inline caching fails;
* a fallback case for when __dict__ is used is, I guess, still needed. It
  does not necessarily have to be a dictionary, though: one could also make
  __dict__ usage just cause class transitions;
* beyond a certain member count, i.e. if __dict__ is used as a
  general-purpose dictionary, one might want to switch back to a dictionary
  representation. This only applies if that actually happens in Python code
  (I guess it doesn't) - I remember this case from V8, for JavaScript, where
  the expected usage is different.
(A toy sketch of what I mean by these transitions follows below, after the
quoted part.)

> Naturally, the workarounds I mentioned for improving performance in object
> allocation all rely on not using CHM and its (modestly) expensive semantics.
> So this would mean using a Java class in some way, possibly a HashMap
> (especially one that's been exposed through our type expose mechanism to
> avoid reflection overhead), or directly using a Java class of some kind
> (again exposing is best, much like are builtin types like PyInteger),
> possibly with all fields marked as volatile.
> Hope this helps! If you are interested in studying this problem in more
> depth for Jython, or other implementations, and the implications of our
> hybrid model, it would certainly be most welcome. Unfortunately, it's not
> something that Jython development itself will be working on (standard time
> constraints apply here).

Such constraints apply to me too - but I hope to work on that.

> - Jim
> 2010/8/20 Paolo Giarrusso <[email protected]>
>>
>> 2010/8/20 Jim Baker <[email protected]>:
>> > Jython single-threaded performance has little to do with a lack of the
>> > GIL.
>>
>> Never implied that - I do believe that a GIL-less fast Python is
>> possible. I just meant we don't have one yet.
>>
>> > Probably the only direct manifestation is seen in the overhead of
>> > allocating __dict__ (or dict) objects because Python attributes have
>> > volatile memory semantics
>> Uh? "Jython memory model" doesn't seem to find anything. Is there any
>> docs on this, with the rationale for the choice you describe?
>>
>> I've only found the Unladen Swallow proposals for a memory model:
>> http://code.google.com/p/unladen-swallow/wiki/MemoryModel (and
>> python-safethread, which I don't like).
>>
>> As a Java programmer using Jython, I wouldn't expect to have any
>> volatile field ever, but I would expect to be able to act on different
>> fields indipendently - the race conditions we have to protect from are
>> the ones on structual modification (unless the table uses open
>> addressing).
>> _This_ can be implemented through ConcurrentHashMap (which also makes
>> all fields volatile), but an implementation not guaranteeing volatile
>> semantics (if possible) would have been equally valid.
>> I am interested because I want to experiment with alternatives.
>>
>> Of course, you can offer stronger semantics, but then you should also
>> advertise that fields are volatile, thus I don't need a lock to pass a
>> reference.
>>
>> > , which is ensured by the backing of a ConcurrentHashMap, which can
>> > be expensive to allocate. There are workarounds.
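Here is the toy sketch of the shadow-class transitions I promised above. It
is purely my own illustration with made-up names (HiddenClass, Instance,
EMPTY), not Jython or PyPy code; the immutable name-to-offset dictionary
from the first bullet shows up as the `offsets` member, and a real JIT
would compile the lookup away at call sites whose hidden class is stable:

    # Toy model of hidden classes: a hidden class is an immutable mapping
    # from field names to offsets; adding a new field does not mutate it
    # but transitions the object to a successor class, so attribute access
    # becomes an array index, as with slots.
    class HiddenClass(object):
        def __init__(self, offsets=()):
            self.offsets = dict(offsets)   # name -> offset, treated as immutable
            self.transitions = {}          # name -> successor HiddenClass

        def with_field(self, name):
            # Cache the successor so objects built the same way share a layout.
            if name not in self.transitions:
                self.transitions[name] = HiddenClass(
                    list(self.offsets.items()) + [(name, len(self.offsets))])
            return self.transitions[name]

    EMPTY = HiddenClass()

    class Instance(object):
        def __init__(self):
            self.hidden_class = EMPTY
            self.storage = []              # flat storage, like a slots array

        def setattr(self, name, value):
            offset = self.hidden_class.offsets.get(name)
            if offset is None:             # new field: hidden-class transition
                self.hidden_class = self.hidden_class.with_field(name)
                self.storage.append(value)
            else:                          # known field: plain indexed store
                self.storage[offset] = value

        def getattr(self, name):
            return self.storage[self.hidden_class.offsets[name]]

    # Two objects that gain fields in the same order share a hidden class,
    # so a JIT can cache "name -> offset" per call site (inline caching).
    a, b = Instance(), Instance()
    for obj in (a, b):
        obj.setattr('x', 1)
        obj.setattr('y', 2)
    assert a.hidden_class is b.hidden_class
    assert b.getattr('y') == 2

The __dict__ fallback and the "too many members" case from the other two
bullets are left out to keep the sketch short.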
>>
>> I'm also curious about such workarounds - are they currently
>> implemented or speculations?
>>
>> > 2010/8/20 Paolo Giarrusso <[email protected]>
>> >>
>> >> 2010/8/20 Jorge Timón <[email protected]>:
>> >> > Hi, I'm just curious about the feasibility of running python code
>> >> > in a gpu by extending pypy.
>> >> Disclaimer: I am not a PyPy developer, even if I've been following the
>> >> project with interest. Nor am I an expert of GPU - I provide links to
>> >> the literature I've read.
>> >> Yet, I believe that such an attempt is unlikely to be interesting.
>> >> Quoting Wikipedia's synthesis:
>> >> "Unlike CPUs however, GPUs have a parallel throughput architecture
>> >> that emphasizes executing many concurrent threads slowly, rather than
>> >> executing a single thread very fast."
>> >> And significant optimizations are needed anyway to get performance for
>> >> GPU code (and if you don't need the last bit of performance, why
>> >> bother with a GPU?), so I think that the need to use a C-like language
>> >> is the smallest problem.
>> >>
>> >> > I don't have the time (and probably the knowledge neither) to
>> >> > develop that pypy extension, but I just want to know if it's
>> >> > possible.
>> >> > I'm interested in languages like openCL and nvidia's CUDA because I
>> >> > think the future of supercomputing is going to be GPGPU.
>> >>
>> >> I would like to point out that while for some cases it might be right,
>> >> the importance of GPGPU is probably often exaggerated:
>> >>
>> >> http://portal.acm.org/citation.cfm?id=1816021&coll=GUIDE&dl=GUIDE&CFID=11111111&CFTOKEN=2222222&ret=1#
>> >>
>> >> Researchers in the field are mostly aware of the fact that GPGPU is
>> >> the way to go only for a very restricted category of code. For that
>> >> code, fine.
>> >> Thus, instead of running Python code in a GPU, designing from scratch
>> >> an easy way to program a GPU efficiently, for those task, is better,
>> >> and projects for that already exist (i.e. what you cite).
>> >>
>> >> Additionally, it would take probably a different kind of JIT to
>> >> exploit GPUs. No branch prediction, very small non-coherent caches, no
>> >> efficient synchronization primitives, as I read from this paper... I'm
>> >> no expert, but I guess you'd need to rearchitecture from scratch the
>> >> needed optimizations.
>> >> And it took 20-30 years to get from the first, slow Lisp (1958) to,
>> >> say, Self (1991), a landmark in performant high-level languages,
>> >> derived from SmallTalk. Most of that would have to be redone.
>> >>
>> >> So, I guess that the effort to compile Python code for a GPU is not
>> >> worth it. There might be further reasons due to the kind of code a JIT
>> >> generates, since a GPU has no branch predictor, no caches, and so on,
>> >> but I'm no GPU expert and I would have to check again.
>> >>
>> >> Finally, for general purpose code, exploiting the big expected number
>> >> of CPUs on our desktop systems is already a challenge.
>> >>
>> >> > There's people working in bringing GPGPU to python:
>> >> >
>> >> > http://mathema.tician.de/software/pyopencl
>> >> > http://mathema.tician.de/software/pycuda
>> >> >
>> >> > Would it be possible to run python code in parallel without the need
>> >> > (for the developer) of actively parallelizing the code?
>> >>
>> >> I would say that Python is not yet the language to use to write
>> >> efficient parallel code, because of the Global Interpreter Lock
>> >> (Google for "Python GIL").
>> >> The two implementations having no GIL are IronPython (as slow as
>> >> CPython) and Jython (slower). PyPy has a GIL, and the current focus
>> >> is not on removing it.
>> >> Scientific computing uses external libraries (like NumPy) - for the
>> >> supported algorithms, one could introduce parallelism at that level.
>> >> If that's enough for your application, good.
>> >> If you want to write a parallel algorithm in Python, we're not there
>> >> yet.
>> >>
>> >> > I'm not talking about code of hard concurrency, but of code with
>> >> > intrinsic parallelism (let's say matrix multiplication).
>> >>
>> >> Automatic parallelization is hard, see:
>> >> http://en.wikipedia.org/wiki/Automatic_parallelization
>> >>
>> >> Lots of scientists have tried, lots of money has been invested, but
>> >> it's still hard.
>> >> The only practical approaches still require the programmer to
>> >> introduce parallelism, but in ways much simpler than using
>> >> multithreading directly. Google OpenMP and Cilk.
>> >>
>> >> > Would a JIT compilation be capable of detecting parallelism?
>> >> Summing up what is above, probably not.
>> >>
>> >> Moreover, matrix multiplication may not be so easy as one might think.
>> >> I do not know how to write it for a GPU, but in the end I reference
>> >> some suggestions from that paper (where it is one of the benchmarks).
>> >> But here, I explain why writing it for a CPU is complicated. You can
>> >> multiply two matrixes with a triply nested for, but such an algorithm
>> >> has poor performance for big matrixes because of bad cache locality.
>> >> GPUs, according to the above mentioned paper, provide no caches and
>> >> hides latency in other ways.
>> >>
>> >> See here for the two main alternative ideas which allow solving this
>> >> problem of writing an efficient matrix multiplication algorithm:
>> >> http://en.wikipedia.org/wiki/Cache_blocking
>> >> http://en.wikipedia.org/wiki/Cache-oblivious_algorithm
>> >>
>> >> Then, you need to parallelize the resulting code yourself, which might
>> >> or might not be easy (depending on the interactions between the
>> >> parallel blocks that are found there).
>> >> In that paper, where matrix multiplication is called as SGEMM (the
>> >> BLAS routine implementing it), they suggest using a cache-blocked
>> >> version of matrix multiplication for both CPUs and GPUs, and argue
>> >> that parallelization is then easy.
>> >>
>> >> Cheers,
>> >> --
>> >> Paolo Giarrusso - Ph.D. Student
>> >> http://www.informatik.uni-marburg.de/~pgiarrusso/
>> >
>>
>> --
>> Paolo Giarrusso - Ph.D. Student
>> http://www.informatik.uni-marburg.de/~pgiarrusso/
>

--
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
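PS: since cache-blocked matrix multiplication comes up in the quoted part
above, here is a tiny pure-Python sketch of the loop-tiling idea. It is my
own toy illustration, not code from the paper, and in real code you would
of course call NumPy/BLAS instead; the point is only the loop structure:

    # Toy illustration of cache blocking (loop tiling) for C = A * B on
    # square list-of-lists matrices. Each (ii, kk, jj) step works on
    # BLOCK x BLOCK tiles that fit in cache.
    BLOCK = 64  # tile size; would be tuned to the cache in real code

    def blocked_matmul(A, B):
        n = len(A)
        C = [[0.0] * n for _ in range(n)]
        for ii in range(0, n, BLOCK):
            for kk in range(0, n, BLOCK):
                for jj in range(0, n, BLOCK):
                    # Multiply the (ii, kk) tile of A by the (kk, jj) tile
                    # of B, accumulating into the (ii, jj) tile of C.
                    for i in range(ii, min(ii + BLOCK, n)):
                        Ci, Ai = C[i], A[i]
                        for k in range(kk, min(kk + BLOCK, n)):
                            a_ik, Bk = Ai[k], B[k]
                            for j in range(jj, min(jj + BLOCK, n)):
                                Ci[j] += a_ik * Bk[j]
        return C

    # Tiny usage example (2x2, just to show the call):
    A = [[1.0, 2.0], [3.0, 4.0]]
    B = [[5.0, 6.0], [7.0, 8.0]]
    assert blocked_matmul(A, B) == [[19.0, 22.0], [43.0, 50.0]]

The iterations of the outermost tile loop are independent, which is why the
cache-blocked version is also the easy one to parallelize, as argued above.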
