Hi everybody, I've been following this discussion from the very beginning because I'm also trying to learn. The topic seems very interesting.
What I've learned so far is that there are extra performance costs associated with threads:

(1) many libc calls (malloc specifically) use locks (mutexes) when the program is threaded, which makes them much slower;

(2) on multiple CPU cores, each core's cache (L1/L2) has to be explicitly synchronized, because it is not automatically shared;

(3) the MMU cannot be used to store process-specific data (state), which leads to extra indirection when accessing that data.

Point (1) is perfectly clear to me (I sketch the kind of micro-benchmark I have in mind below). About point (2) I have questions:

(2.1) At what moments exactly do these synchronizations occur? On every assembler instruction, on every write to memory (i.e. on most variable assignments, all memcpy's, etc.), or only when two threads simultaneously work on the same memory area (and how narrowly is the area defined)? Perhaps there is some hardware mechanism indicating that two cores have the same memory area loaded into their caches?

(2.2) Are there ways to avoid unnecessary synchronizations (apart from switching to processes)? In real life only a few variables (or buffers) really need to be shared between threads, and I don't want all memory caches re-synchronized after every assembler instruction. (The cache-line padding sketch below is my current guess at an answer.)

Point (3) is completely unclear to me. What kind of process data is this all about? How often does this data need to be accessed?

In addition to the above I have some questions related to thread-local storage (TLS), in particular gcc's __thread storage class as used in Linux on x86/64:

(4.1) Can TLS be used as a means of _unsharing_ thread memory, so that there are no synchronization costs between CPU cores?

(4.2) Does TLS impose extra overhead (a performance cost) compared to regular memory storage? Is it recommended for performance-sensitive code? For example, I have a program with several threads, and each thread has some data bound to it, e.g. event loop structures. Solution number one: pass a reference to this structure as a parameter to every function call. Solution number two: store such a reference in a TLS variable and access it from every function that needs the loop. Which solution will work faster? (I sketch both solutions below.)
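To make point (1) concrete, this is the kind of micro-benchmark I have in mind. It is my own sketch, not something from this thread, and the thread and iteration counts are arbitrary numbers I picked; running it under time(1) with NTHREADS set to 1 vs 4 should show the locking cost, if I understand point (1) correctly:

    /* malloc_bench.c - N threads doing nothing but malloc/free.
     * Compile: gcc -O2 -std=gnu99 -pthread malloc_bench.c -o malloc_bench */
    #include <pthread.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define ITERS    1000000

    static void *hammer(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++) {
            void *p = malloc(64);  /* may take a lock inside libc when threaded */
            free(p);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, hammer, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }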
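And here is my current guess at an answer to (2.2): pad each thread's hot data out to a full cache line, so two cores never write to the same line even when they write to different variables. The 64-byte line size is my assumption for x86/64, and I am not sure how big the effect really is in practice:

    /* false_sharing.c - padded vs naive per-thread counters.
     * Compile: gcc -O2 -std=gnu99 -pthread false_sharing.c */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define ITERS 100000000L

    /* Padded: each counter owns a full cache line, so two cores writing
     * to different counters never touch the same line. */
    struct padded {
        volatile long value;
        char pad[CACHE_LINE - sizeof(long)];
    } __attribute__((aligned(CACHE_LINE)));

    static struct padded padded_counters[2];

    /* Naive: both counters share one cache line. */
    static volatile long naive[2];

    static void *bump_padded(void *arg)
    {
        long idx = (long)arg;
        for (long i = 0; i < ITERS; i++)
            padded_counters[idx].value++;
        return NULL;
    }

    static void *bump_naive(void *arg)
    {
        long idx = (long)arg;
        for (long i = 0; i < ITERS; i++)
            naive[idx]++;
        return NULL;
    }

    static void run(void *(*fn)(void *))
    {
        pthread_t t[2];
        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, fn, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
    }

    int main(void)
    {
        run(bump_padded);  /* comment one run out and time(1) the program */
        run(bump_naive);
        printf("%ld %ld %ld %ld\n",
               padded_counters[0].value, padded_counters[1].value,
               naive[0], naive[1]);
        return 0;
    }

If my understanding of the coherency protocol is right, the naive run should be several times slower on two cores, even though the threads never touch the same variable.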
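For (4.1)/(4.2), these are the two solutions I mean, side by side. struct my_loop and the handler names are stand-ins I made up, not real libev types:

    /* tls_vs_param.c - pass-the-pointer vs gcc __thread.
     * Compile: gcc -O2 -std=gnu99 -pthread tls_vs_param.c */
    #include <pthread.h>
    #include <stdio.h>

    struct my_loop {          /* stand-in for per-thread event loop state */
        int fd_count;
    };

    /* Solution two: one TLS slot, written once at thread start. */
    static __thread struct my_loop *current_loop;

    /* Solution one: the loop travels as an explicit parameter. */
    static void handle_event_param(struct my_loop *loop)
    {
        loop->fd_count++;
    }

    /* Solution two: the loop is read from the TLS variable; on x86/64
     * this compiles to a load relative to the thread pointer (%fs). */
    static void handle_event_tls(void)
    {
        current_loop->fd_count++;
    }

    static void *thread_main(void *arg)
    {
        struct my_loop loop = { 0 };
        (void)arg;

        current_loop = &loop;      /* bind this thread's loop once */

        handle_event_param(&loop); /* solution one */
        handle_event_tls();        /* solution two */

        printf("fd_count = %d\n", loop.fd_count);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, thread_main, NULL);
        pthread_join(t, NULL);
        return 0;
    }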
I'd like to thank everyone, and Marc in particular, for this very interesting discussion.

Yaroslav Stavnichiy

On Sat, Dec 31, 2011 at 3:42 PM, Marc Lehmann <[email protected]> wrote:

> On Thu, Dec 22, 2011 at 02:53:52PM +0100, Hongli Lai <[email protected]> wrote:
> > I know that, but as you can read from my very first email I was planning
> > on running I threads, with I=number of cores, where each thread has 1
> > event loop. My question now has got nothing to do with the threads vs
> > events debate. Marc is claiming that running I *processes* instead of I
> > threads is faster thanks to MMU stuff and I'm asking for clarification.
>
> I thought I explained this earlier, and I am not sure I can make it any
> clearer.
>
> Just try, mentally, to imagine what happens on your cache when you access
> a mutex, or mmap/munmap some memory (e.g. as a result of free), in the
> presence of concurrently executing threads.
>
> Now imagine you have far-away cpus, where it is beneficial to have per-cpu
> memory pools, e.g. in systems with a higher number of cores or good old
> multi-cpu systems.
>
> Your cache lines bounce around, and memory is slow, or there will be IPIs.
>
> Maybe the dthreads paper mentioned earlier explains this better, as they
> also have real-world data where unsharing memory and joining it later can
> have substantial performance benefits.
>
> Maybe it is just too obvious to me: memory isn't shared between cores at
> the hardware level, where memory means cache and the main memory is some
> distant slow storage device with complex and slow coherency protocols to
> give you the illusion of shared memory.
>
> It's a bit like using (physical) disk files to exchange data instead of
> using memory. It is going to be slower, and vastly more complex to keep
> synchronised.
>
> I think the problem is vice versa - whoever claims that threads are as
> fast as processes on *different* cores or cpus has to explain how this
> can be possible - for every design using threads I think I can give a
> faster design using processes, because processes can also share memory
> (but I wished it was easier).
>
> --
> The choice of a Deliantra, the free code+content MORPG
> -----==- _GNU_ http://www.deliantra.net
> ----==-- _ generation
> ---==---(_)__ __ ____ __ Marc Lehmann
> --==---/ / _ \/ // /\ \/ / [email protected]
> -=====/_/_//_/\_,_/ /_/\_\
_______________________________________________
libev mailing list
[email protected]
http://lists.schmorp.de/cgi-bin/mailman/listinfo/libev
