Brandon Black wrote:
> However, the thread model as typically used scales poorly across
> multiple CPUs as compared to distinct processes, especially as one
> scales up from simple SMP to the ccNUMA style we're seeing with
> large-core-count Opteron- and Xeon-based machines these days.

This is just not true.

> This is mostly because of memory access and data caching issues, not
> because of context switching. The threads thrash on caching memory
> that they're both writing to (and/or contend on locks; it's related),
> and some of the threads are running on a different NUMA node than
> where the data is (in some cases this is very pathological,
> especially if you haven't had each thread allocate its own memory
> with a smart malloc).
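For concreteness, here's a toy of the write-thrash being described - not from Brandon's mail; the names and the 64-byte line size are my assumptions, and it presumes GCC or Clang on a multi-core Linux box:

/* False-sharing sketch: two threads bump two different counters.
 * When the counters share a cache line the line bounces between
 * cores on every store; padding them apart removes the thrash.
 * The 64-byte line size is an assumption.  Build: cc -O2 -pthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS (100UL * 1000 * 1000)

/* Adjacent: both counters land on the same cache line. */
static volatile unsigned long same_line[2];

/* Each element aligned (and thus padded) to its own 64-byte line. */
struct padded { volatile unsigned long v; } __attribute__((aligned(64)));
static struct padded own_line[2];

static void *bump(void *p)
{
    volatile unsigned long *c = p;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static double run_pair(volatile unsigned long *c0, volatile unsigned long *c1)
{
    pthread_t t0, t1;
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    pthread_create(&t0, NULL, bump, (void *)c0);
    pthread_create(&t1, NULL, bump, (void *)c1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    clock_gettime(CLOCK_MONOTONIC, &b);

    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    printf("same cache line: %.2fs\n", run_pair(&same_line[0], &same_line[1]));
    printf("own cache lines: %.2fs\n", run_pair(&own_line[0].v, &own_line[1].v));
    return 0;
}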
This affects processes too. The key aspect is how much data is actively
shared between the different threads *and is changing*. The heap is a
key point of contention, but thread-caching heap managers are a big
help - and the process model only works when you don't need to share
the data.
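The thread-caching idea, reduced to a toy (names mine - real allocators like tcmalloc, jemalloc, or ptmalloc's arenas are far more sophisticated): each thread recycles its own frees through a thread-local list and only hits the shared heap on a miss.

/* Per-thread free-list sketch; __thread is a GCC/Clang extension. */
#include <stdlib.h>

struct node {
    struct node *next;
    char payload[56];          /* fixed-size objects keep the sketch simple */
};

static __thread struct node *free_list;  /* this thread's private cache */

static struct node *node_alloc(void)
{
    struct node *n = free_list;
    if (n) {                   /* cache hit: no lock, no shared heap */
        free_list = n->next;
        return n;
    }
    return malloc(sizeof *n);  /* miss: fall through to the shared heap */
}

static void node_free(struct node *n)
{
    n->next = free_list;       /* return to this thread's cache */
    free_list = n;
}

int main(void)
{
    struct node *n = node_alloc();
    node_free(n);
    return 0;
}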
Large-core-count Opteron and Xeon chips are *reducing* the NUMA effects
with modest numbers of threads, because the number of physical sockets
is going down and the integration on a socket is higher - and when you
communicate between your co-processes you'll have the same data
copying; it's not free.
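A sketch of the one-worker-per-NUMA-node placement I'd aim for, using libnuma - illustrative only, and it assumes libnuma is installed (link with -lnuma):

/* Pin a worker to a NUMA node and allocate its working set locally. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this host\n");
        return 1;
    }

    printf("%d NUMA node(s)\n", numa_num_configured_nodes());

    int node = 0;                      /* real code: one worker per node */
    numa_run_on_node(node);            /* schedule this thread on that node */

    size_t len = 64UL << 20;
    void *buf = numa_alloc_onnode(len, node);  /* memory local to the node */
    if (!buf)
        return 1;

    /* ... do node-local work on buf ... */

    numa_free(buf, len);
    return 0;
}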
Sure - scaling with threads on a 48-core system is a challenge. Scaling
on a glueless 8-core system (or on a share of a host with more cores)
is more relevant to what most of us do, though.
Clock speed isn't going anywhere at the moment, but core count is - and
so is the available LAN performance, I think.
> In the case that you can either (a) just use threads instead of event
> loops, but you still want one process per core and several threads
No, I think one process per core is not necessarily a smart way to
partition things - one per NUMA area, sure. Once you've gone to one per
core you can no longer partition compute across multiple cores (even to
do some crypto while going back to IO), which is a loss.
> You can make threads scale up well anyway by simply designing your
> multi-threaded software to not contend on pthread mutexes and not
> having multiple threads writing to the same shared blocks of memory,
Yes.
> but then you're effectively describing the behavior of processes, and
> you've implemented a multi-process model by using threads but not
No. I start with 'it's shared' and design for parallel execution
without contention, and I can pass arbitrary data structures around and
freely reference complex arbitrary structures that are static. That's a
long way different from sharing only a small flat memory resource and
some pipe IPC. It's more convenient and can be a lot faster.
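A toy of what I mean by passing structures around - one handoff slot, all names mine; the point is that the request carries a bare pointer into a big static structure instead of a serialised copy over a pipe:

/* Pointer handoff between threads.  Build: cc -pthread handoff.c */
#include <pthread.h>
#include <stdio.h>

struct route_table {               /* stand-in for a complex static structure */
    const char *name;
    int         entries;
};

struct request {
    int                       id;
    const struct route_table *routes;  /* borrowed pointer, not a copy */
};

static const struct route_table table = { "main", 100000 };

static struct request  *slot;      /* one-deep handoff queue, for brevity */
static pthread_mutex_t  mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   cv = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&mu);
    while (!slot)
        pthread_cond_wait(&cv, &mu);
    struct request *r = slot;
    slot = NULL;
    pthread_mutex_unlock(&mu);

    /* Free use of the shared structure: no marshalling, no pipe. */
    printf("req %d sees table '%s' with %d entries\n",
           r->id, r->routes->name, r->routes->entries);
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct request req = { 42, &table };

    pthread_create(&t, NULL, worker, NULL);

    pthread_mutex_lock(&mu);
    slot = &req;                   /* hand over a pointer, not the data */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mu);

    pthread_join(t, NULL);
    return 0;
}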
> using most of the defining features of threads. You may as well save
> yourself some sanity and use processes at that point, and have any
No, I do it to stay sane. I let the CPUs handle the infrequent
on-demand transfer of the shared data structures instead of having to
marshal and stream them myself; the resulting real memory usage is much
smaller, and I have much easier rendezvous semantics where I have
delegated work to compute tasks.
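The rendezvous part, again as a toy with made-up names: the delegating thread hands off a task, can keep servicing IO, then blocks until the compute task flags completion.

/* Minimal delegate-and-rendezvous.  Build: cc -pthread rendezvous.c */
#include <pthread.h>
#include <stdio.h>

struct task {
    int             input;
    int             result;
    int             done;
    pthread_mutex_t mu;
    pthread_cond_t  cv;
};

static void *compute(void *arg)
{
    struct task *t = arg;
    int r = t->input * 2;          /* stand-in for crypto, compression, ... */

    pthread_mutex_lock(&t->mu);
    t->result = r;
    t->done = 1;
    pthread_cond_signal(&t->cv);   /* rendezvous: wake whoever is waiting */
    pthread_mutex_unlock(&t->mu);
    return NULL;
}

int main(void)
{
    struct task t = { .input = 21 };
    pthread_t worker;

    pthread_mutex_init(&t.mu, NULL);
    pthread_cond_init(&t.cv, NULL);
    pthread_create(&worker, NULL, compute, &t);

    /* ... the delegating thread could keep servicing IO here ... */

    pthread_mutex_lock(&t.mu);
    while (!t.done)
        pthread_cond_wait(&t.cv, &t.mu);
    pthread_mutex_unlock(&t.mu);

    printf("result %d\n", t.result);
    pthread_join(worker, NULL);
    return 0;
}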
> shared read-only data either in memory pre-fork (copy-on-write, and
> write never happens to these blocks), or via mmap(MAP_SHARED), or
> some other data-sharing mechanism. So if you've got software that's
> scaling well by adding threads as you add CPU cores, you've probably
> got software that could have just as efficiently been written as
> processes instead of threads, and been less error-prone to boot.
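(For concreteness, the MAP_SHARED variant of what he describes looks roughly like this - a toy, with an arbitrary one-page size; read-only data could instead rely on plain copy-on-write after fork:)

/* A MAP_SHARED anonymous mapping created before fork() is visible to
 * parent and child alike.  Linux; MAP_ANONYMOUS is assumed available. */
#include <sys/mman.h>
#include <sys/wait.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Shared with every child forked after this point. */
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return 1;

    strcpy(shared, "written before fork");

    pid_t pid = fork();
    if (pid == 0) {                       /* child sees the same page */
        printf("child reads: %s\n", shared);
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    munmap(shared, 4096);
    return 0;
}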
Subprocesses have their place and they do decouple things well. I'm not
going to dispute that a bunch of completely independent threads in
process containers with no shared state at all can run with zero
contention (there's no shared state, after all) and be faster. But
faster at what? We're talking about scaling a single logical 'thing',
and that implies that the boss process is handing out work in a way
that streams all the requests and results *and* all the shared
information needed by the broken-out tasks. Sometimes you can do that,
but often it's rather hard; the complexity hurts development and the
copying hurts runtime.
You are trying to deny away the fact that threading scales remarkably
well if done properly. Perhaps you think that no Java or .NET processes
scale well on 4- or 8-core systems? I'm not going to argue that 'Java
is as fast as C', but it's remarkably close at some things, the
performance of Netty-based systems can be good, and Jetty is pretty
handy.
Let me ask you - how do you think memcached should have scaled past its
original single-thread performance? libevent isn't massively slow with
modest numbers of connections, and there's a lot of shared state. And
that's even before you consider running a secure or authenticated
connection.
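What memcached actually adopted is roughly one event loop per worker thread plus a notify pipe to hand accepted connections over. It uses libevent, but a skeletal libev version (one worker, a fake fd standing in for the accept loop, all names mine, error handling trimmed) looks like:

/* One loop per worker thread; the listener pushes fds down a pipe
 * that the worker's loop watches.  Build: cc -pthread sketch.c -lev  */
#include <ev.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct worker {
    struct ev_loop *loop;        /* private loop, one per thread */
    int             notify_recv; /* worker end of the handoff pipe */
    int             notify_send; /* listener end */
    ev_io           notify_w;
    pthread_t       tid;
};

/* Runs in the worker thread whenever the listener pushes it an fd. */
static void notify_cb(struct ev_loop *loop, ev_io *w, int revents)
{
    struct worker *wk = w->data;
    int conn_fd;
    (void)revents;

    if (read(wk->notify_recv, &conn_fd, sizeof conn_fd) == sizeof conn_fd) {
        /* Real code would start an ev_io read watcher on conn_fd and
         * attach per-connection state; the shared cache (memcached's
         * hash table) stays directly reachable from every worker.    */
        printf("worker got fd %d\n", conn_fd);
        close(conn_fd);
    }
    ev_break(loop, EVBREAK_ALL); /* sketch only: stop after one handoff */
}

static void *worker_main(void *arg)
{
    struct worker *wk = arg;
    ev_run(wk->loop, 0);         /* each thread drives its own loop */
    return NULL;
}

int main(void)
{
    struct worker wk;
    int p[2], fake_fd;

    if (pipe(p) < 0)
        return 1;
    wk.notify_recv = p[0];
    wk.notify_send = p[1];
    wk.loop = ev_loop_new(EVFLAG_AUTO);

    ev_io_init(&wk.notify_w, notify_cb, wk.notify_recv, EV_READ);
    wk.notify_w.data = &wk;
    ev_io_start(wk.loop, &wk.notify_w);

    pthread_create(&wk.tid, NULL, worker_main, &wk);

    /* Listener side: pretend we just accepted a connection. */
    fake_fd = dup(0);
    if (write(wk.notify_send, &fake_fd, sizeof fake_fd) < 0)
        return 1;

    pthread_join(wk.tid, NULL);
    ev_loop_destroy(wk.loop);
    return 0;
}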
James