> > (4.1) Does TLS impose extra overhead (performance cost) compared to
> > regular memory storage? Is it recommended to use in performance-concerned
> > code?
>
> It requires an extra indirection. The app has to:
> 1. Figure out which thread it's currently running on.
> 2. Look up the location of the requested data based on the current thread ID.
> How fast this is and whether there's any locking involved depends on
> the implementation. I've benchmarked __thread in the past and
> glibc/NPTL's implementation seems pretty fast. It was fast enough for
> the things I wanted to do with it, though I didn't benchmark how fast
> it is compared to a regular pointer access.
>
> > For example, I have a program that has several threads, and each thread
> > has some data bound to that thread, e.g. event loop structures. Solution
> > number one: pass a reference to this structure as a parameter to every
> > function call; solution number two: store such a reference in a TLS
> > variable and access it from every function that needs to reference the
> > loop. Which solution will work faster?
>
> This really depends on the TLS implementation. You should benchmark it.
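A minimal sketch of the kind of comparison being discussed, i.e. passing the per-thread structure as a parameter versus reaching it through a __thread variable. This is illustrative only (the actual test code was not posted); all names are hypothetical, and it assumes gcc/glibc with -pthread:

/* Hypothetical micro-benchmark: parameter passing vs. __thread lookup.
   Build with something like: gcc -O2 -pthread tls_bench.c -o tls_bench */
#include <pthread.h>
#include <stdint.h>

#define NCALLS   100000000UL   /* 100 million calls per thread */
#define NTHREADS 4             /* the test described below used 256 threads */

struct loop_data { unsigned long counter; };

static __thread struct loop_data *tls_data;   /* solution two: TLS pointer */

/* noinline keeps the calls from being folded into the loop, so the
   per-call indirection is actually exercised. */
__attribute__((noinline)) static void work_param(struct loop_data *d)
{
    d->counter++;              /* solution one: pointer passed as parameter */
}

__attribute__((noinline)) static void work_tls(void)
{
    tls_data->counter++;       /* solution two: pointer found via __thread */
}

static void *thread_main(void *arg)
{
    int use_tls = (int)(long)arg;
    struct loop_data d = { 0 };
    unsigned long i;

    tls_data = &d;
    if (use_tls)
        for (i = 0; i < NCALLS; i++)
            work_tls();
    else
        for (i = 0; i < NCALLS; i++)
            work_param(&d);
    return (void *)(uintptr_t)d.counter;
}

int main(int argc, char **argv)
{
    pthread_t t[NTHREADS];
    long use_tls = argc > 1;   /* run with any argument for the TLS variant */
    long i;

    (void)argv;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, thread_main, (void *)use_tls);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);   /* time the whole run externally with `time` */
    return 0;
}

For scale, the 18 vs 22 seconds reported below for 256 threads x 100 million calls on a quad-core machine works out to only a few nanoseconds per call for either variant.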
I made a simple test. Simple pointer passing works about 15-20% faster
than TLS, although both are quite fast. For example, with 256 threads x
100 million function calls each:

  pointer: ~18 sec
  TLS:     ~22 sec

(measured on a quad-core CPU running Ubuntu 11.10 64-bit)

Interesting observation: removing the __thread storage class makes the
thread data shared by all threads. Even without any locks, concurrent
modifications of the same memory area result in a 5-10 fold increase in
test time. I.e., a write to a shared variable is about 5-10 times slower
than a write to a non-shared one, even without any locks.

Yaroslav

On Mon, Jan 2, 2012 at 2:05 PM, Hongli Lai <[email protected]> wrote:
> On Mon, Jan 2, 2012 at 10:29 AM, Yaroslav <[email protected]> wrote:
> > (2.1) At what moments exactly do these synchronizations occur? Is it on
> > every assembler instruction, or on every write to memory (i.e. on most
> > variable assignments, all memcpy's, etc.), or is it only happening when
> > two threads simultaneously work on the same memory area (how narrow is
> > the definition of the area?)? Perhaps there are some hardware sensors
> > indicating that two cores have the same memory area loaded into their
> > caches?
>
> As far as I know (correct me if I'm wrong), executing a CPU
> instruction that writes to a memory location that's also cached by
> another CPU core will cause the system to execute its cache coherence
> protocol. This protocol will invalidate the cache lines in other CPU
> cores for this memory location. MESI is a protocol that's in wide use.
> See http://www.akkadia.org/drepper/cpumemory.pdf
>
> You can see your multi-core system as a network of computers. You can
> see each CPU cache as a computer's memory, and the main memory as a
> big NFS server.
>
> > (2.2) Are there ways to avoid unnecessary synchronizations (apart from
> > switching to processes)? Because in real life there are only a few
> > variables (or buffers) that really need to be shared between threads. I
> > don't want all memory caches re-synchronized after every assembler
> > instruction.
>
> If other CPU cores' caches do not have this memory location cached,
> then the CPU does not need to do work to ensure the other caches are
> coherent. In other words, just make sure that you don't read or write
> to the same memory locations in other threads or processes. Or, if you
> do read from the same memory locations in other threads, you shouldn't
> have more than 1 writer if you want things to be fast. See
> "Single Writer Principle":
> http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html
>
> When I say "memory locations" I actually mean cache lines. Memory is
> cached in blocks of usually 64 bytes. Google "false sharing".
>
> That said, locks are still necessary. Locks are usually implemented
> with atomic instructions, but those are already fairly expensive and
> will continue to become more expensive as CPU vendors add more cores.
> The JVM implements what they call "biased locking", which avoids atomic
> instructions if there's little contention on locks:
> http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html
> The benchmark in the above post turned out to be wrong; he fixed the
> benchmark here:
> http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html
> As you can see, biased locking results in a huge performance boost.
> Unfortunately I don't know of any pthread library that implements biased
> locking.
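To make the false-sharing point above concrete, here is an illustrative sketch (not code from the thread; the 64-byte figure is the usual x86 cache line size). Each thread writes only its own counter, yet in the unpadded layout several counters sit in the same cache line, so the writes still fight over that line through the coherence protocol. The same cache-line ping-pong is presumably behind the 5-10x slowdown reported earlier in this message, where all threads wrote to genuinely shared data.

/* Illustrative only: per-thread counters with and without padding. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000000UL

struct counter        { volatile unsigned long n; };
struct padded_counter { volatile unsigned long n; char pad[64 - sizeof(unsigned long)]; };

static struct counter        packed[NTHREADS];   /* counters share cache lines  */
static struct padded_counter padded[NTHREADS];   /* one 64-byte line per counter */

static void *bump_packed(void *arg)
{
    long id = (long)arg;
    unsigned long i;

    for (i = 0; i < NITER; i++)
        packed[id].n++;          /* no locks, no logically shared data */
    return NULL;
}

static void *bump_padded(void *arg)
{
    long id = (long)arg;
    unsigned long i;

    for (i = 0; i < NITER; i++)
        padded[id].n++;
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t t[NTHREADS];
    /* Run with any argument to use the padded layout; time each variant
       separately, e.g. with `time`. The padded layout is typically several
       times faster on a multi-core machine. */
    void *(*bump)(void *) = argc > 1 ? bump_padded : bump_packed;
    long i;

    (void)argv;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, bump, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counters: %lu %lu\n", packed[0].n, padded[0].n);
    return 0;
}

Giving each padded element a 64-byte footprint is enough to put every counter in its own cache line; an explicit aligned(64) attribute would state the intent even more directly.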
> > Point (3) is completely unclear to me. What kind of process data is this
> > all about? How often does this data need to be accessed?
>
> This depends on your application. In
> http://lists.schmorp.de/pipermail/libev/2011q4/001663.html I presented
> two applications, one using threads and one using child processes. The
> one using child processes can store its working data in global
> variables. These global variables have a constant address, so
> accessing them only takes 1 step. In the application that uses
> threads, each thread has to figure out where its working data is by
> first dereferencing the 'data' pointer. That's two steps. You can't
> use global variables in multithreaded applications without locking
> them (which makes things slow). In multiprocess software you don't
> have to lock because processes don't share memory.
>
> However, I don't think this really makes that much of a difference in
> practice. Using global variables is often considered bad practice by
> many people. I tend to avoid global variables these days, even when
> writing non-multithreaded software, because not relying on global
> variables automatically makes my code reentrant. This makes it easy to
> use my code in multithreaded software later, and makes things easier
> to test in unit tests and easier to maintain.
>
> > (4.1) Can TLS be used as a means of _unsharing_ thread memory, so there
> > are no synchronization costs between CPU cores?
>
> Yes. Though of course you can still do strange things such as passing
> a pointer to a TLS variable to another thread, and have the other
> thread read or write to it. Just don't do that.
>
> You don't necessarily need to use __thread for that. The 'void *data'
> argument in pthread_create's callback is also (conceptually)
> thread-local storage. You can store your thread-local data in there.
>
> > (4.1) Does TLS impose extra overhead (performance cost) compared to
> > regular memory storage? Is it recommended to use in performance-concerned
> > code?
>
> It requires an extra indirection. The app has to:
> 1. Figure out which thread it's currently running on.
> 2. Look up the location of the requested data based on the current thread ID.
> How fast this is and whether there's any locking involved depends on
> the implementation. I've benchmarked __thread in the past and
> glibc/NPTL's implementation seems pretty fast. It was fast enough for
> the things I wanted to do with it, though I didn't benchmark how fast
> it is compared to a regular pointer access.
>
> > For example, I have a program that has several threads, and each thread
> > has some data bound to that thread, e.g. event loop structures. Solution
> > number one: pass a reference to this structure as a parameter to every
> > function call; solution number two: store such a reference in a TLS
> > variable and access it from every function that needs to reference the
> > loop. Which solution will work faster?
>
> This really depends on the TLS implementation. You should benchmark it.
>
> --
> Phusion | Ruby & Rails deployment, scaling and tuning solutions
>
> Web: http://www.phusion.nl/
> E-mail: [email protected]
> Chamber of commerce no: 08173483 (The Netherlands)
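As a small illustration of the reentrancy point above (hypothetical code, not from the thread): the same counting routine written once against a global variable and once against a caller-supplied context. Only the second form can safely be used from several threads without locking, because each caller can hand it its own private state.

/* Illustrative sketch: global state vs. an explicit context parameter. */
#include <stdio.h>

/* Non-reentrant version: all callers share this one counter, so two
   threads calling count_event() would need a lock around it. */
static unsigned long global_events;

void count_event(void)
{
    global_events++;
}

/* Reentrant version: the state lives in a context supplied by the caller.
   Each thread can own its own struct event_stats, so nothing is shared
   and no locking is needed. */
struct event_stats {
    unsigned long events;
};

void count_event_r(struct event_stats *stats)
{
    stats->events++;
}

int main(void)
{
    struct event_stats my_stats = { 0 };

    count_event();            /* touches shared global state */
    count_event_r(&my_stats); /* touches only the caller's state */

    printf("global: %lu, private: %lu\n", global_events, my_stats.events);
    return 0;
}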
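Finally, a minimal sketch of the 'void *data' pattern mentioned in the answer to (4.1) above (hypothetical names, not code from the thread): each thread receives a pointer to its own working data through pthread_create's argument, so the data stays unshared between threads and no locking is involved.

/* Hypothetical sketch: per-thread working data passed via pthread_create's
   'void *' argument instead of __thread or globals. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

struct thread_data {
    int id;
    unsigned long work_done;     /* stand-in for e.g. an event loop structure */
};

static void *thread_main(void *arg)
{
    struct thread_data *data = arg;  /* one pointer dereference, then private use */
    unsigned long done = 0;
    int i;

    for (i = 0; i < 1000; i++)
        done++;                      /* stand-in for real per-thread work */

    data->work_done = done;          /* single write back; nothing else is shared */
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    struct thread_data data[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++) {
        data[i].id = i;
        data[i].work_done = 0;
        pthread_create(&threads[i], NULL, thread_main, &data[i]);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    for (i = 0; i < NTHREADS; i++)
        printf("thread %d did %lu units of work\n", data[i].id, data[i].work_done);
    return 0;
}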
