I've run some more tests with much higher concurrency (so far only on
my uniprocessor solaris 8/sparc machine, but preliminary results from
Ian's 8-way sun box suggest things only get worse with more CPUs). I've
tried to match the usage pattern of each of our major MPMs, described
here:
- prefork: one listener/worker per process, many processes.
- threaded: multiple listeners/workers per process, many processes but
  many fewer than prefork.
- worker: single listener, multiple workers per process, similar number of
  processes to threaded.
I ran these three tests, each with 50 concurrent {threads,processes} that
contend for a lock, increment a shared counter, and unlock, exiting after
the counter has reached 1 million (a sketch of the test loop follows the
results):
pthread_mutex across threads: 18.5 sec
-- applicable to threaded and worker
pthread_mutex across processes: 18.0 sec
-- applicable to threaded, prefork, and worker
fcntl() across processes: 2790.2 sec (46.5 minutes!!)
-- applicable to threaded, prefork, and worker
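
For reference, the thread case was essentially the loop below. This is a
rough sketch from memory, not the actual test code; names like NTHREADS
and TARGET are mine.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 50
#define TARGET   1000000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Each thread contends for the mutex, bumps the shared counter,
 * and releases the mutex, until the counter reaches TARGET. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        if (counter >= TARGET) {
            pthread_mutex_unlock(&lock);
            break;
        }
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);

    printf("final counter: %ld\n", counter);
    return 0;
}

The cross-process variants did the same thing with forked children
instead of threads, substituting the lock primitive under test.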
My interpretation of this is that the overhead of acquiring and
releasing a lock 1 million times is roughly two orders of magnitude
higher (about 150x in these runs) with fcntl() than with a posix mutex.
At first glance this may seem like an extreme case, but given a high
request load there will be on the order of n LWPs waiting on the same
accept lock in both the prefork and threaded MPMs (where n is the number
of processes * workers/process).
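
For comparison, the fcntl() case pays for two syscalls on every
iteration, roughly as below. Again a sketch of my own; 'fd' is assumed
to be a lock file opened before fork() so all children share it.

#include <fcntl.h>
#include <unistd.h>

/* Acquire an exclusive lock on the whole file, blocking until granted. */
static void fcntl_lock(int fd)
{
    struct flock fl;
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;              /* 0 length == lock the whole file */
    fcntl(fd, F_SETLKW, &fl);  /* one syscall per lock */
}

/* Release the lock. */
static void fcntl_unlock(int fd)
{
    struct flock fl;
    fl.l_type = F_UNLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    fcntl(fd, F_SETLK, &fl);   /* and one per unlock */
}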
Given these results, it is clear to me that we should attempt to use
posix mutexes whenever possible (even more so on large n-way machines,
as fcntl()'s cost appears to climb even more steeply with each
additional processor). This may only be true for Solaris (8/sparc), but
I think that in order to properly evaluate other platforms we'll need to
run similar tests.
Would it be prudent for APR to provide a shared-memory implementation of
posix mutexes? It seems to me that we don't have to rely on PROCESS_SHARED
being available on a particular platform if we handle our own shared
memory allocation. Are there any known caveats to this type of
implementation?
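
The usual way to put a posix mutex in shared memory looks roughly like
the sketch below (untested, my own names; it still assumes the platform
honors PTHREAD_PROCESS_SHARED, and the mmap flags would need
per-platform attention, e.g. mapping /dev/zero where MAP_ANON isn't
available):

#include <pthread.h>
#include <sys/mman.h>
#include <stddef.h>

/* Sketch: allocate a mutex in anonymous shared memory before fork()
 * and mark it process-shared so children can contend on it. */
static pthread_mutex_t *create_shared_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t *m;

    m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANON, -1, 0);
    if (m == MAP_FAILED)
        return NULL;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return m;
}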
-aaron