On Thu, Jul 29, 2010 at 15:15, William Leslie <[email protected]> wrote:
> On 29 July 2010 18:55, Maciej Fijalkowski <[email protected]> wrote:
>> On Thu, Jul 29, 2010 at 10:50 AM, William Leslie
>> <[email protected]> wrote:
>>> If task X expects that task Y will mutate some object it has, it needs
>>> to go back to the source for every read. This means that if you do use
>>> mutation of some shared object for communication, it needs to be
>>> synchronised before every access. What this means for us is that every
>>> read from a possibly mutable object requires an acquire, and every
>>> write requires a release. It's as if every reference in the program is
>>> implemented with a volatile pointer. Even if the object is never
>>> mutated, there can be a lot of unnecessary bus chatter waiting for
>>> MESI to tell us so.
>>>
>>
>> I do agree there is an overhead. Can you provide some data how much
>> this overhead is? Python is not a very simple language and a lot of
>> things are complex and time consuming, so I wonder how it compares to
>> locking per object.
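To make the quoted point concrete first: below is a minimal sketch, with
invented struct and field names (this is not PyPy code), of what "an acquire
on every read and a release on every write" means for a single possibly
shared field, written with C11 atomics:

    #include <stdatomic.h>

    struct shared_obj {
        _Atomic long counter;   /* a field some other task may mutate */
    };

    long read_counter(struct shared_obj *o)
    {
        /* every read of a possibly shared field synchronises: acquire */
        return atomic_load_explicit(&o->counter, memory_order_acquire);
    }

    void write_counter(struct shared_obj *o, long v)
    {
        /* every write synchronises: release */
        atomic_store_explicit(&o->counter, v, memory_order_release);
    }

Every access to such a field is ordered with respect to the surrounding
memory traffic, which is exactly the "volatile pointer" feeling described
above.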
Below I try to prove that locking is still too expensive, even for an
interpreter. Also, for many things the clever optimizations you do allow
keeping those costs small, at least for the average case / fast path, and I
have been taught to consider such clever optimizations as required. With JIT
compilation, specialization and shadow classes, are method calls much more
expensive than a guard plus (if no inlining is done, as might happen in PyPy
in the worst case for big functions) an assembler 'call' opcode, and possibly
some stack shuffling? How many cycles is that? How much more expensive is
that than optimized JavaScript (which is not far from C, the only difference
being the guard)? You can assume the case of plain calls without keyword
arguments and so on (and with inlining, keyword arguments should pay no
runtime cost).

Also, the free threading patches which tried to remove the GIL gave an
unacceptable (IIRC 2x) slowdown to CPython in the old days of CPython 1.5,
and I don't think they tried to lock every object, just what needed to be
locked (which included refcounts).

> It *is* locking per object, but you also spend time looking for the
> data if someone else has invalidated your cache line.

That overhead is already there with locking per object, I think (and locking
can be much more expensive than a cache miss, see below). However, locking
per object does not prevent race conditions unless you make the atomic
regions as big as they actually need to be (locking per statement does not
work); it only prevents data races (a conflict between a write and another
memory operation that are not synchronized with each other). And you can't
extend atomic regions indefinitely, as that implies starvation. Even software
transactional memory requires the programmer to specify which regions have to
be atomic. Given the additional cost (discussed elsewhere in this mail), and
given that there is not much benefit, I think locking-per-object is not worth
it (but I'd still love to know more about why the effort on python-safethread
was halted).

> Come to think of it, that isn't as bad as it first seemed to me. If
> the sender never mutates the object, it will Just Work on any machine
> with a fairly flat cache architecture.

You first wrote: "The alternative, implicitly writing updates back to memory
as soon as possible and reading them out of memory every time, can be
hundreds or more times slower." This is not "locking per object"; it is just
semantically close to it, and becomes equivalent if only one thread has a
reference at any time. Performance-wise, though, the two are very different,
and each is better for some usages. In the Linux kernel (which I consider
quite authoritative here on what you can do in C) both are used, for valid
performance reasons, and a JIT compiler could choose between them. Below I
first describe the two alternatives mentioned, then turn to the combination
needed for the "unshared" case.

- What you first described (memory barriers or uncached R/Ws) can be faster
for small updates, depending on the access pattern. An uncached memory area
does not disturb other memory traffic, unlike memory barriers, which are
global, but I don't think an unprivileged process is allowed to obtain one
(that would mean modifying MSRs or the PAT, on x86). Cost: each memory op
goes to main memory and is thus as slow as a cache miss (hundreds of clock
cycles). A naive Python field read can involve many such memory ops, but a
JIT compiler can bring it down to the equivalent of a single C access with
shadow classes and specialization, and that would pay off even more here (V8
does it for JavaScript and I think PyPy already does most or all of it); see
the sketch right below.
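For concreteness, here is a hedged sketch of what such a specialized read can
compile down to: one guard on the object's shadow class (map), then a plain
load at an offset known at compile time. The names and the layout are
invented for illustration; this is not PyPy's actual object model.

    struct map;                 /* a shadow class: describes one object layout */

    struct py_obj {
        struct map *map;        /* which layout this instance currently has */
        long        fields[4];  /* attribute storage; offsets fixed by the map */
    };

    /* generic slow path: dictionary-style attribute lookup (not shown) */
    extern long generic_getattr(struct py_obj *obj, const char *name);

    /* what a JIT might emit for "obj.x" after specialising on expected_map */
    static inline long read_x(struct py_obj *obj, struct map *expected_map)
    {
        if (obj->map != expected_map)          /* the guard */
            return generic_getattr(obj, "x");  /* deoptimise to the slow path */
        return obj->fields[0];                 /* offset of "x" fixed by the map */
    }

If the guard holds, the fast path is a compare, a branch and one load; only
when it fails do we fall back to the generic lookup.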
- Locking per object (monitors): slow up front, but you can then do each r/w
out of your cache, so if the object is kept locked for some time this is more
efficient. How slow? A system call to perform locking can cost tens of
thousands of cycles. But Java locks, and nowadays even Linux futexes (and
Windows locks), do everything in userspace in as many cases as possible (the
slow path is taken only when there is actual contention on the lock, which is
uncommon with locking-per-object). I won't sum up the literature on this
here.

- Since no contention is expected here, a simple pair of memory barriers is
enough on send/receive (a write barrier for send, a read barrier for receive,
IIRC).

Allowing read-only access from another thread already brings us back to a
mixture of the two solutions above. However, in the first solution, using
memory barriers, you'd need a write barrier for every write, but you could
save on the read barriers.

--
Paolo Giarrusso - Ph.D. Student
http://www.informatik.uni-marburg.de/~pgiarrusso/
