On Sat, Aug 17, 2013 at 10:55 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
>> *) What I think is happening:
>> I think we are again getting burned by getting de-scheduled while
>> holding the free list lock.  I've been chasing this problem for a long
>> time now (for example, see:
>> http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
>> but now I've got a reproducible case.  What is happening is this:
>>
>> 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
>> 2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
>> 3. lock the free list, go into the clock sweep loop
>> 4. while holding the free list lock in the clock sweep, hit a 'hot'
>>    buffer, spin on it
>> 5. get de-scheduled
>> 6. now enter the 'hot buffer spin lock lottery'
>> 7. more and more backends pile on, the linux scheduler goes berserk,
>>    reducing the chances of winning #6
>> 8. finally win the lottery.  lock released.  everything back to normal.
>
> This is an interesting theory, but where's the evidence?  I've seen
> spinlock contention come from enough different places to be wary of
> arguments that start with "it must be happening because...".
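Before responding, for anyone following along without the source handy,
here is a toy model of the code path described in steps 1-8 above.  It is
a simplified sketch of the shape of StrategyGetBuffer()'s clock sweep as I
read it, using plain pthread primitives and made-up names (ToyBufferDesc,
toy_get_victim_buffer), not the actual buffer manager code:

/*
 * Toy model of steps 1-8: NOT the actual PostgreSQL source, just the
 * shape of it.  The "free list" lock is held for the entire clock sweep,
 * and inside the sweep each inspected buffer's header spinlock is taken.
 * A de-schedule while spinning at (*) stalls every other backend that
 * needs to allocate a buffer.
 */
#include <pthread.h>
#include <stdio.h>

#define NBUFFERS 16

typedef struct
{
    pthread_spinlock_t hdr_lock;    /* stand-in for the buffer header spinlock */
    int         refcount;
    int         usage_count;
} ToyBufferDesc;

static ToyBufferDesc buffers[NBUFFERS];
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static int  next_victim = 0;

static ToyBufferDesc *
toy_get_victim_buffer(void)
{
    pthread_mutex_lock(&freelist_lock);     /* stand-in for the free list lwlock */

    for (;;)
    {
        ToyBufferDesc *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;

        pthread_spin_lock(&buf->hdr_lock);  /* (*) spin on a possibly hot buffer
                                             * while still holding freelist_lock */
        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--;         /* age it and keep sweeping */
            else
            {
                /* usable victim: release the list lock, return it header-locked */
                pthread_mutex_unlock(&freelist_lock);
                return buf;
            }
        }
        pthread_spin_unlock(&buf->hdr_lock);
    }
}

int
main(void)
{
    for (int i = 0; i < NBUFFERS; i++)
        pthread_spin_init(&buffers[i].hdr_lock, PTHREAD_PROCESS_PRIVATE);

    ToyBufferDesc *victim = toy_get_victim_buffer();

    printf("victim buffer: %ld\n", (long) (victim - buffers));
    pthread_spin_unlock(&victim->hdr_lock);
    return 0;
}

The point to take away is that freelist_lock stays held across every
iteration of the sweep, including while spinning on a hot buffer's header
lock, which is exactly where the de-schedule in step 5 bites.  Now, on to
the evidence question: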
Absolutely.  My evidence is circumstantial at best -- let's call it a
hunch.  I also do not think we are facing pure spinlock contention, but
something more complex: a combination of spinlocks, the free list
lwlock, and the linux scheduler.  This problem showed up going from
RHEL 5 to 6, which brought a lot of scheduler changes.  A lot of other
things changed too, but the high sys cpu really suggests we are getting
some feedback from the scheduler.

> IMHO, the thing to do here is run perf record -g during one of the
> trouble periods.  The performance impact is quite low.  You could
> probably even set up a script that runs perf for five minute intervals
> at a time and saves all of the perf.data files.  When one of these
> spikes happens, grab the one that's relevant.

Unfortunately, that's not on the table.  Dropping shared buffers to 2GB
(thanks RhodiumToad) seems to have fixed the issue, and there is zero
chance I will get approval to revert that setting in order to force the
problem to reappear.  So far, I have not been able to reproduce it in
testing.  By the way, this problem has popped up in other places too,
and the typical remedies are applied until it goes away :(.

> If you see that s_lock is where all the time is going, then you've
> proved it's a PostgreSQL spinlock rather than something in the kernel
> or a shared library.  If you can further see what's calling s_lock
> (which should hopefully be possible with perf -g), then you've got it
> nailed dead to rights.

Well, I don't think it's that simple.  So my plan of action is this:

1) Improvise a patch that removes one *possible* trigger for the
problem, or at least makes it much less likely to occur.  Also, in real
world cases where usage_count is examined N times before returning a
candidate buffer, the amount of overall spinlocking from buffer
allocation is reduced by approximately (N-1)/N; for example, if the
sweep typically inspects 10 buffers before finding a victim, roughly
90% of those header spinlock acquisitions go away.  Even though spin
locking is cheap, it's hard to argue with that...

2) Exhaustively performance test patch #1.  I think this is win-win
since the StrategyGetBuffer() clock sweep loop is, quite frankly,
relatively un-optimized.  I don't see how reducing the amount of
locking could hurt performance, but I've been, uh, wrong about these
types of things before.

3) If a general benefit without downside is shown from #2, I'll simply
advance the patch for the next CF and see how things shake out.

If and when I feel like there's a decent shot at getting it accepted, I
may go through the motions of setting up a patched server in production
and attempting to raise shared buffers again.  But that's a long way
off.

merlin

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers