> > (4.1) Does TLS impose extra overhead (performance cost) compared to
> > regular memory storage? Is it recommended to use in performance-concerned
> > code?
>
> It requires an extra indirection. The app has to:
> 1. Figure out which thread it's currently running on.
> 2. Look up the location of the requested data based on the current thread ID.
> How fast this is and whether there's any locking involved depends on
> the implementation. I've benchmarked __thread in the past and
> glibc/NPTL's implementation seems pretty fast. It was fast enough for
> the things I wanted to do with it, though I didn't benchmark how fast
> it is compared to a regular pointer access.
>
> > For example, I have a program that has several threads, and each thread
> > has some data bound to that thread, e.g. event loop structures. Solution
> > number one: pass a reference to this structure as a parameter to every
> > function call; solution number two: store such a reference in a TLS
> > variable and access it from every function that needs to reference the
> > loop. Which solution will work faster?
>
> This really depends on the TLS implementation. You should benchmark it.
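A minimal sketch of the kind of comparison being discussed, i.e. passing the per-thread structure as a parameter versus reaching it through a __thread variable. This is illustrative only (the actual test code was not posted); all names are hypothetical, and it assumes gcc/glibc with -pthread:

/* Hypothetical micro-benchmark: parameter passing vs. __thread lookup.
   Build with something like: gcc -O2 -pthread tls_bench.c -o tls_bench */
#include <pthread.h>
#include <stdint.h>

#define NCALLS   100000000UL   /* 100 million calls per thread */
#define NTHREADS 4             /* the test described below used 256 threads */

struct loop_data { unsigned long counter; };

static __thread struct loop_data *tls_data;   /* solution two: TLS pointer */

/* noinline keeps the calls from being folded into the loop, so the
   per-call indirection is actually exercised. */
__attribute__((noinline)) static void work_param(struct loop_data *d)
{
    d->counter++;              /* solution one: pointer passed as parameter */
}

__attribute__((noinline)) static void work_tls(void)
{
    tls_data->counter++;       /* solution two: pointer found via __thread */
}

static void *thread_main(void *arg)
{
    int use_tls = (int)(long)arg;
    struct loop_data d = { 0 };
    unsigned long i;

    tls_data = &d;
    if (use_tls)
        for (i = 0; i < NCALLS; i++)
            work_tls();
    else
        for (i = 0; i < NCALLS; i++)
            work_param(&d);
    return (void *)(uintptr_t)d.counter;
}

int main(int argc, char **argv)
{
    pthread_t t[NTHREADS];
    long use_tls = argc > 1;   /* run with any argument for the TLS variant */
    long i;

    (void)argv;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, thread_main, (void *)use_tls);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);   /* time the whole run externally with `time` */
    return 0;
}

For scale, the 18 vs 22 seconds reported below for 256 threads x 100 million calls on a quad-core machine works out to only a few nanoseconds per call for either variant.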
I made a simple test. Simple pointer passing works about 15-20% faster
than TLS, although both are quite fast. For example, with 256 threads x
100 million function calls each:

  pointer: ~18 sec
  TLS:     ~22 sec

(measured on a quad-core CPU running Ubuntu 11.10 64-bit)

Interesting observation: removing the __thread storage class makes the
thread data shared by all threads. Even without any locks, concurrent
modifications of the same memory area result in a 5-10 fold increase in
test time. I.e., a write to a shared variable is about 5-10 times slower
than a write to a non-shared one, even without any locks.

Yaroslav

On Mon, Jan 2, 2012 at 2:05 PM, Hongli Lai <[email protected]> wrote:
> On Mon, Jan 2, 2012 at 10:29 AM, Yaroslav <[email protected]> wrote:
> > (2.1) At what moments exactly do these synchronizations occur? Is it on
> > every assembler instruction, or on every write to memory (i.e. on most
> > variable assignments, all memcpy's, etc.), or is it only happening when
> > two threads simultaneously work on the same memory area (how narrow is
> > the definition of the area?)? Perhaps there are some hardware sensors
> > indicating that two cores have the same memory area loaded into their
> > caches?
>
> As far as I know (correct me if I'm wrong), executing a CPU
> instruction that writes to a memory location that's also cached by
> another CPU core will cause the system to execute its cache coherence
> protocol. This protocol will invalidate the cache lines in other CPU
> cores for this memory location. MESI is a protocol that's in wide use.
> See http://www.akkadia.org/drepper/cpumemory.pdf
>
> You can see your multi-core system as a network of computers. You can
> see each CPU cache as a computer's memory, and the main memory as a
> big NFS server.
>
> > (2.2) Are there ways to avoid unnecessary synchronizations (apart from
> > switching to processes)? Because in real life there are only a few
> > variables (or buffers) that really need to be shared between threads. I
> > don't want all memory caches re-synchronized after every assembler
> > instruction.
>
> If other CPU cores' caches do not have this memory location cached,
> then the CPU does not need to do work to ensure the other caches are
> coherent. In other words, just make sure that you don't read or write
> to the same memory locations in other threads or processes. Or, if you
> do read from the same memory locations in other threads, you shouldn't
> have more than 1 writer if you want things to be fast. See
> "Single Writer Principle":
> http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html
>
> When I say "memory locations" I actually mean cache lines. Memory is
> cached in blocks of usually 64 bytes. Google "false sharing".
>
> That said, locks are still necessary. Locks are usually implemented
> with atomic instructions, but those are already fairly expensive and
> will continue to become more expensive as CPU vendors add more cores.
> The JVM implements what they call "biased locking", which avoids atomic
> instructions if there's little contention on locks:
> http://mechanical-sympathy.blogspot.com/2011/11/java-lock-implementations.html
> The benchmark in the above post turned out to be wrong; he fixed the
> benchmark here:
> http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html
> As you can see, biased locking results in a huge performance boost.
> Unfortunately I don't know of any pthread library that implements biased
> locking.
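To make the false-sharing point above concrete, here is an illustrative sketch (not code from the thread; the 64-byte figure is the usual x86 cache line size). Each thread writes only its own counter, yet in the unpadded layout several counters sit in the same cache line, so the writes still fight over that line through the coherence protocol. The same cache-line ping-pong is presumably behind the 5-10x slowdown reported earlier in this message, where all threads wrote to genuinely shared data.

/* Illustrative only: per-thread counters with and without padding. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITER    100000000UL

struct counter        { volatile unsigned long n; };
struct padded_counter { volatile unsigned long n; char pad[64 - sizeof(unsigned long)]; };

static struct counter        packed[NTHREADS];   /* counters share cache lines  */
static struct padded_counter padded[NTHREADS];   /* one 64-byte line per counter */

static void *bump_packed(void *arg)
{
    long id = (long)arg;
    unsigned long i;

    for (i = 0; i < NITER; i++)
        packed[id].n++;          /* no locks, no logically shared data */
    return NULL;
}

static void *bump_padded(void *arg)
{
    long id = (long)arg;
    unsigned long i;

    for (i = 0; i < NITER; i++)
        padded[id].n++;
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t t[NTHREADS];
    /* Run with any argument to use the padded layout; time each variant
       separately, e.g. with `time`. The padded layout is typically several
       times faster on a multi-core machine. */
    void *(*bump)(void *) = argc > 1 ? bump_padded : bump_packed;
    long i;

    (void)argv;
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, bump, (void *)i);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counters: %lu %lu\n", packed[0].n, padded[0].n);
    return 0;
}

Giving each padded element a 64-byte footprint is enough to put every counter in its own cache line; an explicit aligned(64) attribute would state the intent even more directly.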
> > Point (3) is completely unclear to me. What kind of process data is this
> > all about? How often does this data need to be accessed?
>
> This depends on your application. In
> http://lists.schmorp.de/pipermail/libev/2011q4/001663.html I presented
> two applications, one using threads and one using child processes. The
> one using child processes can store its working data in global
> variables. These global variables have a constant address, so
> accessing them only takes 1 step. In the application that uses
> threads, each thread has to figure out where its working data is by
> first dereferencing the 'data' pointer. That's two steps. You can't
> use global variables in multithreaded applications without locking
> them (which makes things slow). In multiprocess software you don't
> have to lock because processes don't share memory.
>
> However, I don't think this really makes that much of a difference in
> practice. Using global variables is often considered bad practice by
> many people. I tend to avoid global variables these days, even when
> writing non-multithreaded software, because not relying on global
> variables automatically makes my code reentrant. This makes it easy to
> use my code in multithreaded software later, and makes things easier
> to test in unit tests and easier to maintain.
>
> > (4.1) Can TLS be used as a means of _unsharing_ thread memory, so there
> > are no synchronization costs between CPU cores?
>
> Yes. Though of course you can still do strange things such as passing
> a pointer to a TLS variable to another thread, and have the other
> thread read or write to it. Just don't do that.
>
> You don't necessarily need to use __thread for that. The 'void *data'
> argument in pthread_create's callback is also (conceptually)
> thread-local storage. You can store your thread-local data in there.
>
> > (4.1) Does TLS impose extra overhead (performance cost) compared to
> > regular memory storage? Is it recommended to use in performance-concerned
> > code?
>
> It requires an extra indirection. The app has to:
> 1. Figure out which thread it's currently running on.
> 2. Look up the location of the requested data based on the current thread ID.
> How fast this is and whether there's any locking involved depends on
> the implementation. I've benchmarked __thread in the past and
> glibc/NPTL's implementation seems pretty fast. It was fast enough for
> the things I wanted to do with it, though I didn't benchmark how fast
> it is compared to a regular pointer access.
>
> > For example, I have a program that has several threads, and each thread
> > has some data bound to that thread, e.g. event loop structures. Solution
> > number one: pass a reference to this structure as a parameter to every
> > function call; solution number two: store such a reference in a TLS
> > variable and access it from every function that needs to reference the
> > loop. Which solution will work faster?
>
> This really depends on the TLS implementation. You should benchmark it.
>
> --
> Phusion | Ruby & Rails deployment, scaling and tuning solutions
>
> Web: http://www.phusion.nl/
> E-mail: [email protected]
> Chamber of commerce no: 08173483 (The Netherlands)
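As a small illustration of the reentrancy point above (hypothetical code, not from the thread): the same counting routine written once against a global variable and once against a caller-supplied context. Only the second form can safely be used from several threads without locking, because each caller can hand it its own private state.

/* Illustrative sketch: global state vs. an explicit context parameter. */
#include <stdio.h>

/* Non-reentrant version: all callers share this one counter, so two
   threads calling count_event() would need a lock around it. */
static unsigned long global_events;

void count_event(void)
{
    global_events++;
}

/* Reentrant version: the state lives in a context supplied by the caller.
   Each thread can own its own struct event_stats, so nothing is shared
   and no locking is needed. */
struct event_stats {
    unsigned long events;
};

void count_event_r(struct event_stats *stats)
{
    stats->events++;
}

int main(void)
{
    struct event_stats my_stats = { 0 };

    count_event();            /* touches shared global state */
    count_event_r(&my_stats); /* touches only the caller's state */

    printf("global: %lu, private: %lu\n", global_events, my_stats.events);
    return 0;
}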
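Finally, a minimal sketch of the 'void *data' pattern mentioned in the answer to (4.1) above (hypothetical names, not code from the thread): each thread receives a pointer to its own working data through pthread_create's argument, so the data stays unshared between threads and no locking is involved.

/* Hypothetical sketch: per-thread working data passed via pthread_create's
   'void *' argument instead of __thread or globals. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

struct thread_data {
    int id;
    unsigned long work_done;     /* stand-in for e.g. an event loop structure */
};

static void *thread_main(void *arg)
{
    struct thread_data *data = arg;  /* one pointer dereference, then private use */
    unsigned long done = 0;
    int i;

    for (i = 0; i < 1000; i++)
        done++;                      /* stand-in for real per-thread work */

    data->work_done = done;          /* single write back; nothing else is shared */
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    struct thread_data data[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++) {
        data[i].id = i;
        data[i].work_done = 0;
        pthread_create(&threads[i], NULL, thread_main, &data[i]);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    for (i = 0; i < NTHREADS; i++)
        printf("thread %d did %lu units of work\n", data[i].id, data[i].work_done);
    return 0;
}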
