Just as an update from my angle on the THP side... I put together a
systemtap script last night and so far it's confirming my theory (at least
in our environment).  I want to go through some more data and make some
changes on our test box to see if we can make it go away before declaring
success - it's always possible two problems are intertwined or that the THP
thing is only showing up because of the *real* problem... you know how it
goes.

Basically, the systemtap script does this:
- probes the compaction function
- keeps track of the number of calls to it and the aggregate time spent in
it, per process
- at the end, prints the collected info.

So far, when I run the script for a short period during which I know THP
compactions are happening, I have been able to match up the compaction
durations collected via systemtap with queries in the pg logs that took that
amount of time or slightly longer (as expected).  A lot of these are only a
second or so, so I haven't been able to catch everything, but at least the
data I am getting is consistent.
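
For anyone who wants to do the same matching less by hand, here is a rough
Python sketch of the idea, not the exact procedure I used.  It assumes
log_min_duration_statement is on and that log_line_prefix puts the backend
pid in square brackets (e.g. '%t [%p] '); the function names, the 500 ms
slack, and the sample numbers are all made up for illustration.

```python
import re

# Assumption: log lines look like "... [PID] LOG:  duration: 1203.412 ms  statement: ..."
# which requires log_line_prefix to include "[%p]". Adjust the regex otherwise.
DURATION_RE = re.compile(r"\[(\d+)\].*?duration: ([\d.]+) ms")


def parse_durations(log_lines):
    """Return (pid, duration_ms, line) for every line that logs a duration."""
    found = []
    for line in log_lines:
        m = DURATION_RE.search(line)
        if m:
            found.append((int(m.group(1)), float(m.group(2)), line.strip()))
    return found


def match_compactions(compaction_ms, log_lines, slack_ms=500.0):
    """Pair per-pid compaction time with queries that took that long or slightly longer.

    compaction_ms: dict of pid -> time spent in compaction (ms), e.g. copied
    from the systemtap output. slack_ms is a hypothetical tolerance.
    """
    matches = []
    for pid, query_ms, line in parse_durations(log_lines):
        stall = compaction_ms.get(pid)
        if stall is not None and stall <= query_ms <= stall + slack_ms:
            matches.append((pid, stall, query_ms, line))
    return matches


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the inputs.
    sample_compactions = {4242: 1180.0, 4250: 950.0}
    sample_log = [
        "2000-01-01 00:00:01 UTC [4242] LOG:  duration: 1203.412 ms  statement: SELECT 1",
        "2000-01-01 00:00:02 UTC [4250] LOG:  duration: 12.300 ms  statement: SELECT 2",
    ]
    for pid, stall, query_ms, _ in match_compactions(sample_compactions, sample_log):
        print(f"pid {pid}: compaction {stall:.0f} ms, query {query_ms:.0f} ms")
```

The lower bound matters more than the slack: a query can't finish faster
than the time its backend spent stuck in compaction, so anything shorter is
not a match.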

Will be interested to see what you find, Johnny.
