On Fri, 11 Nov 2011, Paul Querna wrote:
> After r1201149, we now lock for lots of things where, in an ideal
> case, we shouldn't need it.
I agree that that's not optimal, but it is not a regression: the event MPM
has always used locking for the timeout queues. But what we really should do
in 2.4.0 is remove all the MPM-implementation-specific details from
conn_state_t. The only field that is actually used outside of the MPMs is
'state'. If we make the rest private and change the logic so that the
conn_state_t is created by the MPM, we can switch to an improved
implementation of the timeout queues in 2.4.x.
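Making conn_state_t opaque could follow the usual C pattern of a forward-declared struct plus accessors. A hypothetical sketch, not actual httpd code (the accessor names are invented for illustration; only the two enum values are real):

```c
/* --- what a public header would expose: the state enum and an opaque
 *     handle, so non-MPM code cannot touch the queue fields --- */
typedef enum {
    CONN_STATE_CHECK_REQUEST_LINE_READABLE,
    CONN_STATE_WRITE_COMPLETION
} conn_state_e;

typedef struct conn_state_t conn_state_t;            /* opaque outside the MPM */

/* --- what would stay MPM-private: the full definition, free to change
 *     in 2.4.x without breaking the public API --- */
struct conn_state_t {
    conn_state_e state;           /* the one field used outside the MPMs */
    long expiration_usec;         /* illustrative timeout bookkeeping */
    struct conn_state_t *q_next;  /* illustrative timeout-queue linkage */
};

/* invented accessor names, for illustration only */
static conn_state_e ap_conn_state_get(const conn_state_t *cs)
{
    return cs->state;
}

static void ap_conn_state_set(conn_state_t *cs, conn_state_e s)
{
    cs->state = s;
}
```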
> I'm toying around with ideas on how to eliminate the need for a mutex at all.
> My current 'best' idea, I think:
>
> 1) Create a new struct, ap_pollset_operation_and_timeout_info_t, which
> contains the pollset operation to perform (Add, Remove, Flags), the
> timeout_type, the timeout value, and a pointer to the conn_rec.
>
> 2) Implement a single-reader, single-writer circular ring buffer of
> these structures (using two uint16_t head/tail offsets packed into a
> single uint32_t, so it can be updated portably and atomically with
> apr_atomic_cas32).
>
> 3) Allocate this ring buffer per worker.
>
> 4) Have the single event thread de-queue operations from all the worker threads.
>
> This would remove the 4 or 5 separate timeout queues we have
> developed, and their associated mutex, and basically move all of the
> apr_pollset operations to the single main thread.
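For reference, the packed head/tail SPSC ring described here might look roughly like this. A sketch using C11 atomics in place of apr_atomic_cas32 (in httpd this would use APR types; struct contents are stand-ins):

```c
#include <stdint.h>
#include <stdatomic.h>

/* Must be a power of two <= 65536 so free-running uint16_t
 * counters wrap consistently. */
#define RING_SIZE 256u

typedef struct {
    int op;      /* stand-in for pollset op + timeout info */
    void *conn;  /* pointer to the conn_rec */
} op_entry_t;

typedef struct {
    /* low 16 bits: head (consumer), high 16 bits: tail (producer) */
    _Atomic uint32_t pos;
    op_entry_t entries[RING_SIZE];
} op_ring_t;

/* Called only by the owning worker thread (single producer). */
static int ring_push(op_ring_t *r, op_entry_t e)
{
    uint32_t pos, next;
    uint16_t head, tail;
    do {
        pos = atomic_load(&r->pos);
        head = (uint16_t)pos;
        tail = (uint16_t)(pos >> 16);
        if ((uint16_t)(tail - head) == RING_SIZE)
            return 0;                      /* full */
        r->entries[tail % RING_SIZE] = e;  /* slot owned by producer until tail moves */
        next = ((uint32_t)(uint16_t)(tail + 1) << 16) | head;
        /* the CAS can only fail because the consumer advanced head; retry */
    } while (!atomic_compare_exchange_weak(&r->pos, &pos, next));
    return 1;
}

/* Called only by the event thread (single consumer). */
static int ring_pop(op_ring_t *r, op_entry_t *out)
{
    uint32_t pos, next;
    uint16_t head, tail;
    do {
        pos = atomic_load(&r->pos);
        head = (uint16_t)pos;
        tail = (uint16_t)(pos >> 16);
        if (head == tail)
            return 0;                      /* empty */
        *out = r->entries[head % RING_SIZE];
        next = ((uint32_t)tail << 16) | (uint16_t)(head + 1);
    } while (!atomic_compare_exchange_weak(&r->pos, &pos, next));
    return 1;
}
```

Because there is exactly one producer and one consumer per ring, a failed CAS only ever means the other side moved its half of the word, so each loop makes progress without a mutex.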
But then the main thread would need to do a sorted insert into its
main timeout queue, which is expensive. Or how would you find out when the
next timeout is due?
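For what it's worth, one way to avoid the sorted insert would be a binary min-heap keyed on expiry time: O(log n) insert and O(1) peek at the next due timeout. A hypothetical sketch, not httpd code (a real version would store the conn pointer alongside each expiry):

```c
#include <stddef.h>

#define HEAP_MAX 1024

typedef struct {
    long expires[HEAP_MAX];  /* expiry timestamps; smallest at the root */
    size_t n;
} timeout_heap_t;

/* Insert in O(log n) instead of an O(n) sorted list insert. */
static void heap_push(timeout_heap_t *h, long t)
{
    size_t i = h->n++;
    while (i > 0 && h->expires[(i - 1) / 2] > t) {  /* sift up */
        h->expires[i] = h->expires[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    h->expires[i] = t;
}

/* The next due timeout is always at the root: O(1). */
static long heap_peek(const timeout_heap_t *h)
{
    return h->expires[0];
}

/* Remove the earliest timeout in O(log n). */
static long heap_pop(timeout_heap_t *h)
{
    long root = h->expires[0];
    long last = h->expires[--h->n];
    size_t i = 0;
    for (;;) {                                      /* sift down */
        size_t c = 2 * i + 1;
        if (c >= h->n)
            break;
        if (c + 1 < h->n && h->expires[c + 1] < h->expires[c])
            c++;
        if (h->expires[c] >= last)
            break;
        h->expires[i] = h->expires[c];
        i = c;
    }
    h->expires[i] = last;
    return root;
}
```

The event thread could then compute its poll timeout directly from heap_peek() instead of scanning a queue.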
> Without modification to the event MPM, it would potentially cause some
> issues, as the event thread isn't always waking up that often, but I
> think we can figure out a way to do this without too much pain (either
> a pipe trigger to wake up when in 'slow mode', or just lower the
> default timeout on the pollset from 500ms to something like 5ms).
There would have to be some wakeup. Otherwise it is possible that the
worker would only write to the socket every 5ms in write completion mode,
which would hurt the maximum transfer rate big time.
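The pipe trigger mentioned above is essentially the classic self-pipe trick: the listener keeps the read end of a pipe in its pollset, and a worker wakes it immediately by writing one byte. A minimal POSIX sketch (function names are illustrative, not the actual event MPM code):

```c
#include <poll.h>
#include <unistd.h>

static int wakeup_fds[2];  /* [0] = read end (in the pollset), [1] = write end */

static int wakeup_init(void)
{
    return pipe(wakeup_fds);  /* 0 on success */
}

/* Worker side: one byte is enough; poll() reports the fd readable. */
static void wakeup_listener(void)
{
    char b = 0;
    (void)write(wakeup_fds[1], &b, 1);
}

/* Listener side: wait up to timeout_ms, drain the byte(s) if woken. */
static int listener_wait(int timeout_ms)
{
    struct pollfd pfd = { wakeup_fds[0], POLLIN, 0 };
    int rc = poll(&pfd, 1, timeout_ms);
    if (rc > 0) {
        char buf[64];
        (void)read(wakeup_fds[0], buf, sizeof(buf));
    }
    return rc;  /* >0: woken, 0: timed out, <0: error */
}
```

With this in place the pollset timeout can stay at 500ms, since a worker that queues an operation wakes the listener immediately instead of waiting out the interval.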
> Thoughts?
BTW, IMHO such a major overhaul should be done in a separate branch for now.
Other improvements could include having a per-thread struct that contains
all per-thread data and is aligned to e.g. 512 bytes to avoid cache-line
ping-pong. Currently the worker_sockets array is causing false sharing.
But as far as I remember, this was not (yet) relevant for performance,
because other parts (mainly dir/file walk and config merging) are much
slower anyway.
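The alignment idea could look roughly like this with C11 (512 above is one choice; a cache line is typically 64 bytes, so any multiple of that avoids the ping-pong; field names are illustrative):

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_ALIGN 64  /* typical cache-line size; 512 would also work */

/* All data a worker thread touches frequently, in one aligned slot.
 * alignas on a member forces the struct's alignment, and sizeof is
 * rounded up to a multiple of it, so adjacent array elements can
 * never share a cache line. */
typedef struct {
    alignas(CACHE_ALIGN) int worker_socket_fd;  /* replaces worker_sockets[i] */
    unsigned long requests_handled;             /* illustrative counter */
} worker_data_t;

/* One slot per worker thread. */
static worker_data_t per_worker[16];
```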
Cheers,
Stefan