I've been thinking some more about scalability and what we need to
measure in order to locate and remove the next set of bottlenecks.
The lock wait time distribution and the sum of lock held time are of
interest in understanding contention.
Shared locks present some complexities for analysing contention stats.
If we look at the sum of the lock held time then we will get the wrong
answer because many backends can hold an LW_SHARED mode lock at the same
time. Moreover, LW_SHARED locks have queue jumping characteristics that make
LW_EXCLUSIVE locks wait for substantial lengths of time. The worst of
those situations was the old CheckpointStartLock which could starve a
starting checkpoint for many minutes on a busy server. For locks that
can be both shared and exclusive we should measure the lock wait time
for shared and exclusive separately and we should measure the lock hold
time only for exclusive mode.
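A minimal sketch of that accounting, in Python purely for illustration.
Nothing here is PostgreSQL code; the class and method names are made up,
and a real implementation would live in the LWLock code in C.

```python
class LockStats:
    """Contention counters for one lock (hypothetical, for illustration)."""

    def __init__(self):
        # Wait times are kept separately for each requested mode.
        self.wait_us = {"shared": [], "exclusive": []}
        # Hold time is summed for exclusive mode only: shared holders
        # overlap, so adding up their hold times would overstate it.
        self.exclusive_held_us = 0.0

    def record_wait(self, mode, waited_us):
        self.wait_us[mode].append(waited_us)

    def record_release(self, mode, held_us):
        if mode == "exclusive":
            self.exclusive_held_us += held_us

    def wait_summary(self, mode):
        """Return count, mean and max wait for one mode, or None if empty."""
        samples = self.wait_us[mode]
        if not samples:
            return None
        return {
            "count": len(samples),
            "mean_us": sum(samples) / len(samples),
            "max_us": max(samples),
        }
```

Keeping the raw samples per mode is the simplest way to get at the
distribution; a histogram with fixed buckets would be cheaper if the
sample count gets large.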
We've discussed the possibility of a third type of lock, a queued shared
lock. I've not found any benefit in prototypes so far, but one day...
RARE EVENTS AND TRAFFIC JAMS
For queued exclusive locks the queue length is an interesting
measurement over time. This is because we may find that certain rare
events cause effects out of proportion to their actual duration.
If the mean interval between randomly arriving lock requests approaches
the lock hold time (service time) then when a traffic jam forms it can
take a long time to clear again.
e.g. if a lock is randomly requested every 11us and lock service time is
10us then the lock seems like it will mostly be clear. Should the lock
ever be held for an extended time, e.g. 1ms (=1000us), then a long queue
will form, about 91 long (1000us / 11us). But then every 100us we serve
10 lock requestors while about 9 more arrive. So after the traffic jam
forms it will
take 10,000us to clear, i.e. the traffic jam takes 10 times as long to
clear as the original event that caused it.
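The arithmetic can be checked with a small calculation. This is only a
sketch that treats arrivals as a steady flow rather than random events,
and the function name is made up for the example.

```python
def jam_clear_time_us(interarrival_us, service_us, stall_us):
    """Steady-flow estimate of the queue left by a stall and its drain time.

    Returns (queue_length_at_end_of_stall, time_to_clear_us).
    """
    if service_us >= interarrival_us:
        raise ValueError("requests arrive at least as fast as they are served")
    # Requests that pile up while the lock is held for stall_us.
    queue_len = stall_us / interarrival_us
    # Net drain rate: served per us minus newly arriving per us.
    net_drain_per_us = 1.0 / service_us - 1.0 / interarrival_us
    return queue_len, queue_len / net_drain_per_us
```

With the numbers above, jam_clear_time_us(11, 10, 1000) gives a queue of
about 91 and a clear time of about 10,000us. The clear time simplifies to
stall * service / (interarrival - service), so it grows without bound as
the arrival interval gets close to the service time.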
Taken to the extreme, very rare events can still be the major source of
contention in a dynamic system. Now introduce non-random effects into
the arrival rate distribution and you can see that flash queues can form
easily and yet take a long time to clear.
The maths for this is fairly hard...
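Where the maths gets hard, a simulation may be easier. The sketch below
assumes exponentially distributed gaps between requests, a fixed service
time and strict FIFO service, none of which is claimed to match real
LWLock behaviour. The function name and defaults are made up.

```python
import random


def simulate_jam(interarrival_us=11.0, service_us=10.0,
                 stall_us=1000.0, seed=0):
    """Simulate one stall followed by FIFO service of the backlog.

    Returns (waiters_at_end_of_stall, time_to_clear_us).
    """
    rng = random.Random(seed)
    rate = 1.0 / interarrival_us

    # Requests arriving while the lock is held for stall_us just queue up.
    next_arrival = rng.expovariate(rate)
    queue = 0
    while next_arrival < stall_us:
        queue += 1
        next_arrival += rng.expovariate(rate)
    waiters = queue

    # Serve one waiter at a time; anything arriving meanwhile joins the queue.
    now = stall_us
    while queue > 0:
        now += service_us
        queue -= 1
        while next_arrival <= now:
            queue += 1
            next_arrival += rng.expovariate(rate)

    return waiters, now - stall_us
```

Running this over many seeds and looking at the spread of clear times
would show how variable the recovery is, which the steady-flow arithmetic
hides.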
WHY ARE WE WAITING?
Up to now we've looked at contention on single well-known LWLocks, such
as BufMappingLock. There will be times when we need to return to
looking at those contention points, but I'm thinking we may need to
begin looking at other points of contention in the server. The single
well-known locks behave in different ways because each lock has
different lock service times and also different access frequencies on
different lock modes (shared or exclusive). We should be careful not to
consider all of these locks similarly in any analysis.
The second source of contention issues I see is where we hold multiple
well-known locks. For example holding WALInsertLock is normal, as is
holding WALWriteLock, but holding WALInsertLock while we perform a
write with WALWriteLock held is a bad thing and we would want to avoid
that condition. So I'd like to look at what combinations of locks we
hold and why they were taken.
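One way to collect that, sketched in Python for illustration only. The
tracker class, its methods and the reason strings are all hypothetical;
the idea is simply to charge elapsed time to the exact set of locks (and
reasons) held during each interval.

```python
from collections import Counter


class HeldLockTracker:
    """Time spent holding each combination of locks, per backend."""

    def __init__(self):
        self.held = {}              # lock name -> reason it was taken
        self.combo_us = Counter()   # frozenset of (lock, reason) -> total us
        self.last_change_us = None  # time of the last acquire or release

    def _account(self, now_us):
        # Charge the time since the last change to the set held until now.
        if self.held and self.last_change_us is not None:
            key = frozenset(self.held.items())
            self.combo_us[key] += now_us - self.last_change_us
        self.last_change_us = now_us

    def acquire(self, lock, reason, now_us):
        self._account(now_us)
        self.held[lock] = reason

    def release(self, lock, now_us):
        self._account(now_us)
        del self.held[lock]
```

Sorting combo_us by total time would then show which combinations, taken
for which reasons, account for the most held time.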
The third source of contention is data block events. These are much
harder to spot because they are spread across the whole buffer space. An
example might be index block splits. These will occur at the same
logical place in the index, though because of the way we split, the new
right page is always a new data block and so in a different buffer. So
contention on the value "123" in an index could actually move across
different buffer locks and not be visible for what it really is.
Recursive block splits can cause very long waits. We need ways to be
able to track those types of event.
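To illustrate why grouping matters, here is a small sketch with made-up
wait events. The field names ("index", "buffer", "wait_us") and the
values are invented for the example and do not come from any real trace.

```python
from collections import defaultdict


def total_wait_by(events, key):
    """Sum wait_us over events, grouped by the given field name."""
    totals = defaultdict(float)
    for ev in events:
        totals[ev[key]] += ev["wait_us"]
    return dict(totals)


# Invented events: one hot index whose waits land on three different
# buffers, plus one unrelated wait on another index.
events = [
    {"index": "idx_a", "buffer": 10, "wait_us": 40.0},
    {"index": "idx_a", "buffer": 11, "wait_us": 40.0},
    {"index": "idx_a", "buffer": 12, "wait_us": 40.0},
    {"index": "idx_b", "buffer": 20, "wait_us": 50.0},
]

by_buffer = total_wait_by(events, "buffer")
by_index = total_wait_by(events, "index")
```

Grouped by buffer, buffer 20 looks like the worst offender; grouped by
index, idx_a clearly dominates. Any real tracking would need some logical
identifier recorded with each wait for this kind of regrouping to work.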
So our sources of contention are at least:
1. single well-known locks
2. multiple well-known locks
3. data block contention events
I've thought about ways of understanding the root cause of a lock wait
and there are some. But because of what we said earlier about traffic
jams lasting much longer than the original event, it's hard to accurately
explain why certain tasks wait. Are we waiting because an earlier event
caused a traffic jam, or are we waiting because a sudden rush of lock
requests occurred before the original traffic jam cleared?