Manfred Spraul <[EMAIL PROTECTED]> writes:

> But: According to the descriptions the problem is a context switch
> storm. I don't see that cache line bouncing can cause a context switch
> storm. What causes the context switch storm?

As best I can tell, the CS storm arises because the backends get into
some sort of lockstep timing that makes it far more likely than you'd
expect for backend A to try to enter the bufmgr when backend B is already
holding the BufMgrLock. In the profiles we were looking at back in
April, it seemed that about 10% of the time was spent inside bufmgr
(which is bad enough in itself) but the odds of LWLock collision were
much higher than 10%, leading to many context swaps.

This is not totally surprising given that they are running identical
queries and so are running through loops of the same length, but still
it seems like there must be some effect driving their timing to converge
instead of diverge away from the point of conflict.

What I think (and here is where it's a leap of logic, cause I can't
prove it) is that the excessive time spent passing the spinlock cache
line back and forth is exactly the factor causing that convergence.
Somehow, the delay caused when a processor has to wait to get the cache
line contributes to keeping the backend loops in lockstep.

It is critical to understand that the CS storm is associated with LWLock
contention, not spinlock contention: what we saw was a lot of semop()s,
not a lot of select()s.
> If it's the pg_usleep in s_lock, then my patch should help a lot: with
> pthread_rwlock locks, this line doesn't exist anymore.

The profiles showed that s_lock() is hardly entered at all, and the
select() delay is reached even more seldom. So changes in that area
will make exactly zero difference. This is the surprising and
counterintuitive thing: oprofile clearly shows that very large fractions
of the CPU time are being spent at the initial TAS instructions in
LWLockAcquire and LWLockRelease, and yet those TASes hardly ever fail,
as proven by the fact that oprofile shows s_lock() is seldom entered.
So as far as the Postgres code can tell, there isn't any contention
worth mentioning for the spinlock. This is indeed the way it was
designed to be, but when so much time is going to the TAS instructions,
you'd think there'd be more software-visible contention for the spinlock.
It could be that I'm all wet and there is no relationship between the
cache line thrashing and the seemingly excessive BufMgrLock contention.
They are after all occurring at two very different levels of abstraction.
But I think there is some correlation that we just don't understand yet.

regards, tom lane