On Mon, Jan 2, 2012 at 10:29 AM, Yaroslav <[email protected]> wrote:
> (2.1) At what moments exactly do these synchronizations occur? Is it on
> every assembler instruction, or on every write to memory (i.e. on most
> variable assignments, all memcpy's, etc.), or does it only happen when two
> threads simultaneously work on the same memory area (and how narrow is the
> definition of the area?)? Perhaps there are some hardware sensors
> indicating that two cores have the same memory area loaded into their
> caches?
As far as I know (correct me if I'm wrong), executing a CPU instruction that
writes to a memory location that's also cached by another CPU core will cause
the system to run its cache coherence protocol. This protocol invalidates the
cache lines for that memory location in the other CPU cores' caches. MESI is
a protocol that's in wide use. See http://www.akkadia.org/drepper/cpumemory.pdf

You can think of your multi-core system as a network of computers: each CPU
cache is a computer's local memory, and main memory is a big NFS server.

> (2.2) Are there ways to avoid unnecessary synchronizations (apart from
> switching to processes)? Because in real life there are only a few
> variables (or buffers) that really need to be shared between threads. I
> don't want all memory caches re-synchronized after every assembler
> instruction.

If the other CPU cores' caches do not have a memory location cached, then the
CPU does not need to do any work to keep those caches coherent. In other
words, just make sure that you don't read or write the same memory locations
from multiple threads or processes. Or, if you do read the same memory
locations from multiple threads, you should have at most 1 writer if you want
things to be fast. See "Single Writer Principle":
http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html

When I say "memory locations" I actually mean cache lines. Memory is cached
in blocks of usually 64 bytes. Google "false sharing". (There's a small
sketch of this further down, after my answer to point (3).)

That said, locks are still necessary. Locks are usually implemented with
atomic instructions; those are already fairly expensive and will continue to
become more expensive as CPU vendors add more cores. The JVM implements what
they call "biased locking", which avoids atomic instructions when there's
little contention on locks:
http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html
The benchmark in the above post turned out to be wrong; he fixed it here:
http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html
As you can see, biased locking results in a huge performance boost.
Unfortunately I don't know of any pthread library that implements biased
locking.

> Point (3) is completely unclear to me. What kind of process data is this
> all about? How often does this data need to be accessed?

This depends on your application. In
http://lists.schmorp.de/pipermail/libev/2011q4/001663.html I presented two
applications, one using threads and one using child processes. The one using
child processes can store its working data in global variables. These global
variables have a constant address, so accessing them takes only 1 step. In
the application that uses threads, each thread has to figure out where its
working data is by first dereferencing the 'data' pointer. That's two steps.

You can't use global variables in multithreaded applications without locking
them (which makes things slow). In multiprocess software you don't have to
lock, because processes don't share memory. However, I don't think this
really makes that much of a difference in practice.

Using global variables is considered bad practice by many people. I tend to
avoid global variables these days, even when writing non-multithreaded
software, because not relying on global variables automatically makes my code
reentrant. That makes it easy to use my code in multithreaded software later,
and makes things easier to test in unit tests and easier to maintain.
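Coming back to false sharing for a moment: here's a rough, minimal sketch of
what it looks like in C with pthreads. This is my own illustration, not
anything from libev; the 64-byte line size and the alignas(64) padding are
assumptions (check your CPU's actual cache line size), and the struct and
function names are made up. Two threads each bump their own counter; in the
first layout both counters happen to sit on the same cache line, so every
write invalidates the other core's copy of that line even though the threads
never touch each other's data.

/* False sharing sketch. Compile with: gcc -O2 -pthread false_sharing.c
 * (on older glibc you may also need -lrt for clock_gettime). */
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000UL   /* 100 million increments per thread */

/* Both counters on the same (assumed 64-byte) cache line: false sharing. */
struct shared_counters {
    alignas(64) unsigned long a;   /* offset 0 ...                       */
    unsigned long b;               /* ... offset 8, same cache line      */
};

/* Each counter on its own cache line. */
struct padded_counters {
    alignas(64) unsigned long a;   /* offset 0                           */
    alignas(64) unsigned long b;   /* pushed to offset 64, next line     */
};

static struct shared_counters shared;
static struct padded_counters padded;

/* volatile so the compiler really performs every load/store instead of
 * collapsing the loop into a single addition. */
static void *bump(void *arg)
{
    volatile unsigned long *counter = arg;
    for (unsigned long i = 0; i < ITERATIONS; i++)
        (*counter)++;
    return NULL;
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void run(const char *name, unsigned long *x, unsigned long *y)
{
    pthread_t t1, t2;
    double start = now_sec();

    pthread_create(&t1, NULL, bump, x);
    pthread_create(&t2, NULL, bump, y);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("%-24s %.2f seconds\n", name, now_sec() - start);
}

int main(void)
{
    run("same cache line:", &shared.a, &shared.b);
    run("separate cache lines:", &padded.a, &padded.b);
    return 0;
}

On a multi-core machine the second run should finish noticeably faster than
the first, even though the threads do exactly the same work; the only
difference is whether their counters share a cache line.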
> (4.1) Can TLS be used as a means of _unsharing_ thread memory, so there are
> no synchronization costs between CPU cores?

Yes. Though of course you can still do strange things such as passing a
pointer to a TLS variable to another thread and having that thread read or
write to it. Just don't do that.

You don't necessarily need to use __thread for this. The 'void *data'
argument in pthread_create's callback is also (conceptually) thread-local
storage. You can store your thread-local data in there.

> (4.2) Does TLS impose extra overhead (performance cost) compared to regular
> memory storage? Is it recommended for use in performance-sensitive code?

It requires an extra indirection. The app has to:

1. Figure out which thread it's currently running on.
2. Look up the location of the requested data based on the current thread ID.

How fast this is, and whether there's any locking involved, depends on the
implementation. I've benchmarked __thread in the past and glibc/NPTL's
implementation seems pretty fast. It was fast enough for the things I wanted
to do with it, though I didn't benchmark how it compares to a regular pointer
access.

> For example, I have a program that has several threads, and each thread has
> some data bound to that thread, e.g. event loop structures. Solution number
> one: pass a reference to this structure as a parameter to every function
> call; solution number two: store such a reference in a TLS variable and
> access it from every function that needs to reference the loop. Which
> solution will work faster?

This really depends on the TLS implementation. You should benchmark it.
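As a rough illustration of the two options (my own sketch, not anything from
libev; the real ev_loop is replaced by a dummy struct, and names like
worker() and handle_request_*() are made up), here's what they look like side
by side. Option 1 passes the loop pointer as an argument to every call;
option 2 stores it once per thread in a __thread variable. The pthread_create
'data' pointer I mentioned above is what seeds both.

/* TLS vs. explicit parameter sketch. Compile with: gcc -O2 -pthread tls_sketch.c */
#include <pthread.h>
#include <stdio.h>

struct loop {                  /* stand-in for e.g. a struct ev_loop */
    int id;
};

/* Option 1: pass the loop down through every call. One extra argument,
 * but a plain pointer access, no TLS lookup. */
static void handle_request_arg(struct loop *loop)
{
    printf("thread %d handling request (explicit argument)\n", loop->id);
}

/* Option 2: store the loop in thread-local storage once, then let any
 * function find it without a parameter. The cost of the lookup depends
 * on the __thread implementation. */
static __thread struct loop *current_loop;

static void handle_request_tls(void)
{
    printf("thread %d handling request (TLS lookup)\n", current_loop->id);
}

static void *worker(void *arg)
{
    struct loop *loop = arg;   /* pthread_create's data pointer: conceptually TLS too */

    current_loop = loop;       /* set up option 2 once per thread */

    handle_request_arg(loop);  /* option 1 */
    handle_request_tls();      /* option 2 */
    return NULL;
}

int main(void)
{
    struct loop loops[2] = { { .id = 0 }, { .id = 1 } };
    pthread_t threads[2];

    for (int i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, &loops[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Whether option 2's TLS lookup is as cheap as option 1's plain pointer access
is exactly the thing to benchmark on your platform, as said above.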
-- 
Phusion | Ruby & Rails deployment, scaling and tuning solutions

Web: http://www.phusion.nl/
E-mail: [email protected]
Chamber of commerce no: 08173483 (The Netherlands)

_______________________________________________
libev mailing list
[email protected]
http://lists.schmorp.de/cgi-bin/mailman/listinfo/libev