Brandon Black wrote:
> However, the thread model as typically used scales poorly across
> multiple CPUs as compared to distinct processes, especially as one
> scales up from simple SMP to the ccNUMA style we're seeing with
> large-core-count Opteron- and Xeon-based machines these days.

This is just not true.

> This is mostly because of memory access and data caching issues, not
> because of context switching. The threads thrash on caching memory
> that they're both writing to (and/or contend on locks; it's related),
> and some of the threads are running on a different NUMA node than
> where the data is (in some cases this is very pathological,
> especially if you haven't had each thread allocate its own memory
> with a smart malloc).
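For concreteness, here's a toy of the write-thrash being described - not from Brandon's mail; the names and the 64-byte line size are my assumptions, and it presumes GCC or Clang on a multi-core Linux box:

/* False-sharing sketch: two threads bump two different counters.
 * When the counters share a cache line the line bounces between
 * cores on every store; padding them apart removes the thrash.
 * The 64-byte line size is an assumption.  Build: cc -O2 -pthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS (100UL * 1000 * 1000)

/* Adjacent: both counters land on the same cache line. */
static volatile unsigned long same_line[2];

/* Each element aligned (and thus padded) to its own 64-byte line. */
struct padded { volatile unsigned long v; } __attribute__((aligned(64)));
static struct padded own_line[2];

static void *bump(void *p)
{
    volatile unsigned long *c = p;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static double run_pair(volatile unsigned long *c0, volatile unsigned long *c1)
{
    pthread_t t0, t1;
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    pthread_create(&t0, NULL, bump, (void *)c0);
    pthread_create(&t1, NULL, bump, (void *)c1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    printf("same cache line: %.2fs\n", run_pair(&same_line[0], &same_line[1]));
    printf("own cache lines: %.2fs\n", run_pair(&own_line[0].v, &own_line[1].v));
    return 0;
}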
This affects processes too. The key aspect is how much data is actively
shared between the different threads *and is changing*. The heap is a
key point of contention, but thread-caching heap managers are a big
help - and the process model only works when you don't need to share
the data.
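The thread-caching idea, reduced to a toy (names mine - real allocators like tcmalloc, jemalloc, or ptmalloc's arenas are far more sophisticated): each thread recycles its own frees through a thread-local list and only hits the shared heap on a miss.

/* Per-thread free-list sketch; __thread is a GCC/Clang extension. */
#include <stdlib.h>

struct node {
    struct node *next;
    char payload[56];          /* fixed-size objects keep the sketch simple */
};

static __thread struct node *free_list;  /* this thread's private cache */

static struct node *node_alloc(void)
{
    struct node *n = free_list;
    if (n) {                   /* cache hit: no lock, no shared heap */
        free_list = n->next;
        return n;
    }
    return malloc(sizeof *n);  /* miss: fall through to the shared heap */
}

static void node_free(struct node *n)
{
    n->next = free_list;       /* return to this thread's cache */
    free_list = n;
}

int main(void)
{
    struct node *n = node_alloc();
    node_free(n);
    return 0;
}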
Large-core-count Opteron and Xeon chips are *reducing* the NUMA effects
with modest numbers of threads, because the number of physical sockets
is going down and the integration on a socket is higher - and when you
communicate between your co-processes you'll have the same data
copying; it's not free.
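A sketch of the one-worker-per-NUMA-node placement I'd aim for, using libnuma - illustrative only, and it assumes libnuma is installed (link with -lnuma):

/* Pin a worker to a NUMA node and allocate its working set locally. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this host\n");
        return 1;
    }

    printf("%d NUMA node(s)\n", numa_num_configured_nodes());

    int node = 0;                      /* real code: one worker per node */
    numa_run_on_node(node);            /* schedule this thread on that node */

    size_t len = 64UL << 20;
    void *buf = numa_alloc_onnode(len, node);  /* memory local to the node */
    if (!buf)
        return 1;

    /* ... do node-local work on buf ... */

    numa_free(buf, len);
    return 0;
}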
Sure - scaling with threads on a 48-core system is a challenge. Scaling
on a glueless 8-core system (or on a share of a host with more cores)
is more relevant to what most of us do, though.
Clock speed isn't going anywhere at the moment, but core count is - and
so is the available LAN performance, I think.
> In the case that you can either (a) just use threads instead of event
> loops, but you still want one process per core and several threads
No, I think one process per core is not necessarily a smart way to
partition things - one per NUMA area, sure. Once you've gone to one per
core you can no longer partition compute across multiple cores (even to
do some crypto while going back to IO), which is a loss.
> You can make threads scale up well anyway by simply designing your
> multi-threaded software to not contend on pthread mutexes and not
> having multiple threads writing to the same shared blocks of memory,
Yes.
> but then you're effectively describing the behavior of processes, and
> you've implemented a multi-process model by using threads but not
No. I start with 'it's shared' and design for parallel execution
without contention, and I can pass arbitrary data structures around and
freely reference complex arbitrary structures that are static. That's a
long way different from sharing only a small flat memory resource and
some pipe IPC. It's more convenient and can be a lot faster.
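A toy of what I mean by passing structures around - one handoff slot, all names mine; the point is that the request carries a bare pointer into a big static structure instead of a serialised copy over a pipe:

/* Pointer handoff between threads.  Build: cc -pthread handoff.c */
#include <pthread.h>
#include <stdio.h>

struct route_table {               /* stand-in for a complex static structure */
    const char *name;
    int         entries;
};

struct request {
    int                       id;
    const struct route_table *routes;  /* borrowed pointer, not a copy */
};

static const struct route_table table = { "main", 100000 };

static struct request  *slot;      /* one-deep handoff queue, for brevity */
static pthread_mutex_t  mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   cv = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mu);
    while (!slot)
        pthread_cond_wait(&cv, &mu);
    struct request *r = slot;
    slot = NULL;
    pthread_mutex_unlock(&mu);

    /* Free use of the shared structure: no marshalling, no pipe. */
    printf("req %d sees table '%s' with %d entries\n",
           r->id, r->routes->name, r->routes->entries);
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct request req = { 42, &table };

    pthread_create(&t, NULL, worker, NULL);

    pthread_mutex_lock(&mu);
    slot = &req;                   /* hand over a pointer, not the data */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);

    pthread_join(t, NULL);
    return 0;
}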
> using most of the defining features of threads. You may as well save
> yourself some sanity and use processes at that point, and have any
No, I do it to stay sane. I let the CPUs handle the infrequent
on-demand transfer of the shared data structures instead of having to
marshal and stream them myself; the resulting real memory usage is much
smaller, and I have much easier rendezvous semantics where I have
delegated work to compute tasks.
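The rendezvous part, again as a toy with made-up names: the delegating thread hands off a task, can keep servicing IO, then blocks until the compute task flags completion.

/* Minimal delegate-and-rendezvous.  Build: cc -pthread rendezvous.c */
#include <pthread.h>
#include <stdio.h>

struct task {
    int             input;
    int             result;
    int             done;
    pthread_mutex_t mu;
    pthread_cond_t  cv;
};

static void *compute(void *arg)
{
    struct task *t = arg;
    int r = t->input * 2;          /* stand-in for crypto, compression, ... */

    pthread_mutex_lock(&t->mu);
    t->result = r;
    t->done = 1;
    pthread_cond_signal(&t->cv);   /* rendezvous: wake whoever is waiting */
    pthread_mutex_unlock(&t->mu);
    return NULL;
}

int main(void)
{
    struct task t = { .input = 21 };
    pthread_t worker;

    pthread_mutex_init(&t.mu, NULL);
    pthread_cond_init(&t.cv, NULL);
    pthread_create(&worker, NULL, compute, &t);

    /* ... the delegating thread could keep servicing IO here ... */

    pthread_mutex_lock(&t.mu);
    while (!t.done)
        pthread_cond_wait(&t.cv, &t.mu);
    pthread_mutex_unlock(&t.mu);

    printf("result %d\n", t.result);
    pthread_join(worker, NULL);
    return 0;
}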
> shared read-only data either in memory pre-fork (copy-on-write, and
> write never happens to these blocks), or via mmap(MAP_SHARED), or
> some other data-sharing mechanism. So if you've got software that's
> scaling well by adding threads as you add CPU cores, you've probably
> got software that could have just as efficiently been written as
> processes instead of threads, and been less error-prone to boot.
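(For concreteness, the MAP_SHARED variant of what he describes looks roughly like this - a toy, with an arbitrary one-page size; read-only data could instead rely on plain copy-on-write after fork:)

/* A MAP_SHARED anonymous mapping created before fork() is visible to
 * parent and child alike.  Linux; MAP_ANONYMOUS is assumed available. */
#include <sys/mman.h>
#include <sys/wait.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Shared with every child forked after this point. */
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return 1;

    strcpy(shared, "written before fork");

    pid_t pid = fork();
    if (pid == 0) {                       /* child sees the same page */
        printf("child reads: %s\n", shared);
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    munmap(shared, 4096);
    return 0;
}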
Subprocesses have their place and they do decouple things well. I'm not
going to dispute that a bunch of completely independent threads in
process containers with no shared state at all can run with zero
contention (there's no shared state, after all) and be faster. But
faster at what? We're talking about scaling a single logical 'thing',
and that implies that the boss process is handing out work in a way
that streams all the requests and results *and* all the shared
information needed by the broken-out tasks. Sometimes you can do that,
but often it's rather hard; the complexity hurts development and the
copying hurts runtime.
You are trying to deny away the fact that threading scales remarkably
well if done properly. Perhaps you think that no Java or .NET processes
scale well on 4- or 8-core systems? I'm not going to argue that 'Java
is as fast as C', but it's remarkably close at some things, the
performance of Netty-based systems can be good, and Jetty is pretty
handy.
Let me ask you - how do you think memcached should have scaled past its
original single-thread performance? libevent isn't massively slow with
modest numbers of connections, and there's a lot of shared state. And
that's even before you consider running a secure or authenticated
connection.
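What memcached actually adopted is roughly one event loop per worker thread plus a notify pipe to hand accepted connections over. It uses libevent, but a skeletal libev version (one worker, a fake fd standing in for the accept loop, all names mine, error handling trimmed) looks like:

/* One loop per worker thread; the listener pushes fds down a pipe
 * that the worker's loop watches.  Build: cc -pthread sketch.c -lev  */
#include <ev.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct worker {
    struct ev_loop *loop;        /* private loop, one per thread */
    int             notify_recv; /* worker end of the handoff pipe */
    int             notify_send; /* listener end */
    ev_io           notify_w;
    pthread_t       tid;
};

/* Runs in the worker thread whenever the listener pushes it an fd. */
static void notify_cb(struct ev_loop *loop, ev_io *w, int revents)
{
    struct worker *wk = w->data;
    int conn_fd;
    (void)revents;

    if (read(wk->notify_recv, &conn_fd, sizeof conn_fd) == sizeof conn_fd) {
        /* Real code would start an ev_io read watcher on conn_fd and
         * attach per-connection state; the shared cache (memcached's
         * hash table) stays directly reachable from every worker.    */
        printf("worker got fd %d\n", conn_fd);
        close(conn_fd);
    }
    ev_break(loop, EVBREAK_ALL); /* sketch only: stop after one handoff */
}

static void *worker_main(void *arg)
{
    struct worker *wk = arg;
    ev_run(wk->loop, 0);         /* each thread drives its own loop */
    return NULL;
}

int main(void)
{
    struct worker wk;
    int p[2], fake_fd;

    if (pipe(p) < 0)
        return 1;
    wk.notify_recv = p[0];
    wk.notify_send = p[1];
    wk.loop = ev_loop_new(EVFLAG_AUTO);

    ev_io_init(&wk.notify_w, notify_cb, wk.notify_recv, EV_READ);
    wk.notify_w.data = &wk;
    ev_io_start(wk.loop, &wk.notify_w);

    pthread_create(&wk.tid, NULL, worker_main, &wk);

    /* Listener side: pretend we just accepted a connection. */
    fake_fd = dup(0);
    if (write(wk.notify_send, &fake_fd, sizeof fake_fd) < 0)
        return 1;

    pthread_join(wk.tid, NULL);
    ev_loop_destroy(wk.loop);
    return 0;
}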
James