Brandon Black wrote:
> However, the thread model as typically used scales poorly across
> multiple CPUs as compared to distinct processes, especially as one
> scales up from simple SMP to the ccNUMA style we're seeing with
> large-core-count Opteron and Xeon-based machines these days.  This is
> mostly because of memory access and data caching issues, not because
> of context switching.  The threads thrash on caching memory that
> they're both writing to (and/or contend on locks; it's related), and
> some of the threads are running on a different NUMA node than where
> the data is (in some cases this is very pathological, especially if
> you haven't had each thread allocate its own memory with a smart
> malloc).

This is just not true: it affects processes too. The key aspect is how
much data is actively shared between the different threads *and is
changing*. The heap is a key point of contention, but thread-caching heap
managers are a big help - and the process model only works when you don't
need to share the data.
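
Roughly, the trick those thread-caching allocators (tcmalloc, jemalloc)
play looks like this - a toy sketch, not production code: each thread
carves small allocations out of a private arena, so the hot path never
touches a shared lock.

    /* Toy per-thread bump allocator; __thread is the GCC/Clang TLS
     * extension. Real allocators chain arenas and support free(). */
    #include <stdlib.h>
    #include <stddef.h>

    #define ARENA_SIZE (1 << 20)

    static __thread char  *arena;       /* private to the calling thread */
    static __thread size_t arena_used;

    static void *arena_alloc(size_t n)
    {
        if (arena == NULL) {
            arena = malloc(ARENA_SIZE); /* one contended call per thread... */
            if (arena == NULL)
                return NULL;
            arena_used = 0;
        }
        n = (n + 15) & ~(size_t)15;     /* keep 16-byte alignment */
        if (arena_used + n > ARENA_SIZE)
            return NULL;                /* a real allocator grabs a new arena */
        void *p = arena + arena_used;   /* ...then this path is lock-free */
        arena_used += n;
        return p;
    }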

Large-core-count Opteron and Xeon chips are *reducing* the NUMA effects
with modest numbers of threads, because the number of physical sockets is
going down and the integration on a socket is higher. And when you
communicate between your coprocesses you'll have the same data copying -
it's not free.
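
Here is that copying in miniature: the kernel copies the record out of the
writer's buffer and back into the reader's, and anything containing
pointers has to be flattened first. The struct and helper names are
illustrative:

    #include <unistd.h>

    struct record {
        long key;
        char payload[4096];
    };

    static int send_record(int wfd, const struct record *r)
    {
        /* copy #1: writer's buffer -> kernel pipe buffer */
        return write(wfd, r, sizeof *r) == (ssize_t)sizeof *r ? 0 : -1;
    }

    static int recv_record(int rfd, struct record *r)
    {
        size_t got = 0;
        while (got < sizeof *r) {
            /* copy #2: kernel pipe buffer -> reader's buffer */
            ssize_t n = read(rfd, (char *)r + got, sizeof *r - got);
            if (n <= 0)
                return -1;
            got += (size_t)n;
        }
        return 0;
    }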

Sure - scaling with threads on a 48-core system is a challenge. Scaling on
a glueless 8-core system (or on a share of a host with more cores) is more
relevant to what most of us do, though.

Clock speed isn't going anywhere at the moment, but core count is - and so
is the available LAN performance.

> In the case that you can either (a) just use threads instead of event
> loops, but you still want one process per core and several threads

No, I think one process per core is not necessarily a smart way to
partition things - one per NUMA node, sure. Once you've gone to one per
core you can no longer partition compute across multiple cores (even to do
some crypto while going back to IO), which is a loss.
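
Assuming Linux and libnuma (compile with -lnuma), the partitioning I mean
is roughly this - one pinned process per NUMA node, threads inside each.
A sketch with error handling trimmed:

    #include <numa.h>
    #include <unistd.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0)
            exit(1);                    /* no NUMA support on this box */

        int nodes = numa_num_configured_nodes();
        for (int node = 0; node < nodes; node++) {
            if (fork() == 0) {
                numa_run_on_node(node); /* keep our CPUs on this node  */
                numa_set_localalloc();  /* and our memory local to it  */
                /* ...spawn one thread per core on this node and work; */
                /* those threads can still share compute and data.     */
                _exit(0);
            }
        }
        return 0;
    }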

> You can make threads scale up well anyways by simply designing your
> multi-threaded software to not contend on pthread mutexes and not
> having multiple threads writing to the same shared blocks of memory,
Yes.
> but then you're effectively describing the behavior of processes, and
> you've implemented a multi-process model by using threads but not
No. I start with 'it's shared' and design for parallel execution without
contention, and I can pass arbitrary data structures around and freely
reference complex arbitrary structures that are static. That's a long way
different from sharing only a small flat memory resource and some pipe
IPC. It's more convenient and can be a lot faster.
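
Concretely, handing work between threads can be just a pointer through a
small queue - the consumer dereferences the very same memory, and nothing
is flattened or streamed. A minimal sketch (names illustrative; a sane
design keeps this handoff infrequent so the lock stays cold):

    #include <pthread.h>

    #define QCAP 64

    struct ptrq {
        void *slots[QCAP];
        int head, tail, count;
        pthread_mutex_t mu;
        pthread_cond_t  not_empty;
        pthread_cond_t  not_full;
    };

    /* usable as: static struct ptrq q = PTRQ_INIT; */
    #define PTRQ_INIT { {0}, 0, 0, 0, PTHREAD_MUTEX_INITIALIZER, \
                        PTHREAD_COND_INITIALIZER, PTHREAD_COND_INITIALIZER }

    static void ptrq_push(struct ptrq *q, void *item)
    {
        pthread_mutex_lock(&q->mu);
        while (q->count == QCAP)
            pthread_cond_wait(&q->not_full, &q->mu);
        q->slots[q->tail] = item;      /* only 8 bytes cross the boundary */
        q->tail = (q->tail + 1) % QCAP;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->mu);
    }

    static void *ptrq_pop(struct ptrq *q)
    {
        pthread_mutex_lock(&q->mu);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->mu);
        void *item = q->slots[q->head]; /* however big the structure is */
        q->head = (q->head + 1) % QCAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->mu);
        return item;
    }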
> using most of the defining features of threads.  You may as well save
> yourself some sanity and use processes at that point, and have any
No, I do it to stay sane. I let the CPUs handle the infrequent on-demand
transfer of the shared data structures instead of having to marshal and
stream them myself, the real memory usage resulting is much smaller, and I
have much easier rendezvous semantics where I have delegated work to
compute tasks.
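
That rendezvous is no more than a one-shot 'future': the submitter blocks
only at the moment it actually needs the answer. A sketch, error handling
omitted:

    #include <pthread.h>

    struct task {
        void *(*fn)(void *);    /* work to run on the compute thread */
        void *arg;
        void *result;
        int   done;
        pthread_mutex_t mu;
        pthread_cond_t  cv;
    };

    /* the compute thread calls this when it picks the task up */
    static void task_run(struct task *t)
    {
        void *r = t->fn(t->arg);
        pthread_mutex_lock(&t->mu);
        t->result = r;
        t->done = 1;
        pthread_cond_signal(&t->cv);
        pthread_mutex_unlock(&t->mu);
    }

    /* the submitter blocks here only when it needs the answer */
    static void *task_wait(struct task *t)
    {
        pthread_mutex_lock(&t->mu);
        while (!t->done)
            pthread_cond_wait(&t->cv, &t->mu);
        pthread_mutex_unlock(&t->mu);
        return t->result;
    }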
> shared read-only data either in memory pre-fork (copy-on-write, and
> write never happens to these blocks), or via mmap(MAP_SHARED), or some
> other data-sharing mechanism.  So if you've got software that's
> scaling well by adding threads as you add CPU cores, you've probably
> got software that could have just as efficiently been written as
> processes instead of threads, and been less error-prone to boot.
Subprocesses have their place and they do decouple things well. I'm not
going to dispute that a bunch of completely independent threads in process
containers with no shared state at all can run with zero contention
(there's no shared state, after all) and be faster. But faster at what?
We're talking about scaling a single logical 'thing', and that implies
that the boss process is handing out work in a way that streams all the
requests and results *and* all the shared information needed by the
broken-out tasks. Sometimes you can do that, but often it's rather hard,
and the complexity hurts development and the copying hurts runtime.
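
For reference, the pre-fork sharing you describe looks roughly like this -
a minimal, Linux-flavoured sketch:

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 1 << 20;

        /* one physical copy of the table, visible to every child */
        char *table = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (table == MAP_FAILED)
            return 1;
        memset(table, 'x', len);      /* build read-only data pre-fork */

        for (int i = 0; i < 4; i++) {
            if (fork() == 0) {
                char c = table[42];   /* children read it: no copying, */
                (void)c;              /* no streaming, no marshalling  */
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;
        return 0;
    }

But the moment the workers need to *update* that table and see each
other's changes, you're back to exactly the cache-line and locking
questions that threads face - with clumsier tools.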

You are trying to deny away the fact that threading scales remarkably well
if done properly. Perhaps you think that no Java or .NET processes scale
well on 4- or 8-core systems? I'm not going to argue that 'Java is as fast
as C', but it's remarkably close at some things, the performance of
Netty-based systems can be good, and Jetty is pretty handy.

Let me ask you - how do you think memcached should have scaled past its
original single-thread performance? libevent isn't massively slow with
modest numbers of connections, and there's a lot of shared state. And
that's even before you consider running a secure or authenticated
connection.
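
As I understand it, what memcached actually did is one libevent loop per
worker thread, with the accept thread handing new fds across a
notification pipe. A heavily trimmed sketch against the libevent 2.x API
(pipe creation and the accept loop are omitted):

    #include <event2/event.h>
    #include <unistd.h>

    struct worker {
        struct event_base *base;     /* private loop: no cross-thread */
        int notify_recv;             /* contention inside the loop    */
        int notify_send;
    };

    static void on_client_readable(evutil_socket_t fd, short what, void *arg)
    {
        (void)what; (void)arg;
        char buf[4096];
        if (read(fd, buf, sizeof buf) <= 0)
            close(fd);               /* real code parses the protocol */
    }

    /* runs in the worker when the accept thread pushes it a new fd */
    static void on_notify(evutil_socket_t fd, short what, void *arg)
    {
        (void)what;
        struct worker *w = arg;
        int client;
        if (read(fd, &client, sizeof client) != sizeof client)
            return;
        /* real code keeps this event and frees it on close */
        struct event *ev = event_new(w->base, client, EV_READ | EV_PERSIST,
                                     on_client_readable, NULL);
        event_add(ev, NULL);
    }

    static void *worker_main(void *arg)   /* pthread entry point */
    {
        struct worker *w = arg;
        struct event *ev = event_new(w->base, w->notify_recv,
                                     EV_READ | EV_PERSIST, on_notify, w);
        event_add(ev, NULL);
        event_base_dispatch(w->base);
        return NULL;
    }

    /* accept thread: round-robin new connections across workers;
       fds are process-global, so only the integer crosses the pipe */
    static void dispatch_fd(struct worker *workers, int nworkers, int client)
    {
        static int next;
        struct worker *w = &workers[next++ % nworkers];
        ssize_t n = write(w->notify_send, &client, sizeof client);
        (void)n;                     /* delivery-failure handling omitted */
    }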

James


_______________________________________________
libev mailing list
[email protected]
http://lists.schmorp.de/cgi-bin/mailman/listinfo/libev
