On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure <mmonc...@gmail.com> wrote: >>> I think the basic problem is that the cache pages are so large. It's >>> hard to make them smaller because that increases the cost of accessing >>> the cache, as you already noted, but at the same time, making an >>> eviction decision on a cache that holds 64K entries based on 100 data >>> points seems like waving around a pretty large-caliber weapon in a >>> fairly uncontrolled fashion. Maybe it works OK 90% of the time, but >>> it's hard to avoid the niggling fear that someone might accidentally >>> get shot. >> >> Right...it's essentially a rolling statistical sampling of transaction >> IDs based on past acceses. Going from say, 100 to 1000 will probably >> drop your errror margin quite a bit but I really wonder how benefit >> you can get from going past that. > > An interesting idea might be to forcibly cause a replacement every 100 > tuples (or every 1000 tuples) and see if that causes a noticeable > performance regression. If it doesn't, that's a good data point.
To test this I disabled the cache check on top of HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a bogus xid (so every tuple scanned was treated as a 'miss'. Scanning 1 million records in memory over and over went up a few percentage points -- converging on about 280ms instead of 270ms. This is with bumped to 1000 miss array: oprofile said: regular hinted tuple case: 120 10.2041 advance_aggregates 107 9.0986 heapgettup_pagemode 77 6.5476 ExecProject 74 6.2925 heapgetpage 72 6.1224 ExecProcNode 72 6.1224 ExecScan 69 5.8673 advance_transition_function 66 5.6122 heap_getnext 58 4.9320 HeapTupleSatisfiesMVCC hinted tuple with pathological cache entry: 111 8.9588 advance_aggregates 109 8.7974 heapgettup_pagemode 85 6.8604 ExecProject 81 6.5375 heapgetpage 77 6.2147 ExecScan 70 5.6497 ExecStoreTuple 70 5.6497 HeapTupleSatisfiesMVCC this was a fairly short profiling run but the numbers are fairly consistent. the replacement is fairly cheap...sorting 1000 ints doesn't take all that long. 100 is even faster. > I think the sour spot for this whole idea is likely to be the case > where you have a workload that is 90% read and 10% write, or something > like that. On an all-read workload (like pgbench -S), you're > hopefully going to converge to a state where all the hint-bits are > set, once VACUUM kicks in. And on an all-write workload I think that > you might lose the effect in the noise. Not sure if we have an easy > way to set up such a workload with pgbench, though. it's trivial to test with pgbench -- just use the random number facility to generate an id for some table and update where random() > .9. However this does not generate nearly enough 'misses' to exercise cache replacement in any meaningful way. My workstation vm can punch out about 5k transactions/sec. With only 10% getting a new xid and writing back a tuple to the table only 500 xids are getting generated/second. At that rate it takes quite a while to burn through the 256k transactions the cache ranges over. Also this case is not bound by scan performance anyways making profiling it a little silly -- HeapTupleSatisfiesMVCC is simply not as big of a bottleneck in OLTP workloads. Scan heavy loads is where this matters, and scan heavy loads naturally tend to generate less xids per tuple scanned. >>> Just for the sake of argument, let's suppose we made an array with 64K >>> elements, each one representing one 64K chunk of the XID space. Each >>> element is a 4-byte unsigned integer, so the array takes up 256kB of >>> memory... probably not good, but I'm just thinking out loud here. >>> Every time we're asked about an XID, we bump the corresponding >>> counter. Periodically (say, every 64K probes) we zip through all the >>> counters and consider performing a cache replacement. We look at the >>> not-already-cached page that has the highest counter value and compare >>> it to the counter value for the cached pages. If it is higher and the >>> difference exceeds some safety margin (to protect against hysteresis), >>> then we do the replacement. I think something like this would allow >>> you to have a lot more information available to make a direct >>> apples-to-apples comparison at cache replacement time, as compared >>> with what you have now. >> >> yeah -- what you've done here is basically two things: 1. by mapping >> static ranges you get to skip the sort (but not the scan) during the >> replacement, and 2. by vastly expanding the sampling size you >> eliminate thrashing via noise in the rolling sample. This comes at a >> significant memory cost and loss of some nimbleness in the cache (i'm >> not completely sure more aggressive replacement is 'bad') -- although >> that could be mitigated by replacing at more frequent intervals. > > Well, that gets at an interesting question: how often SHOULD we > replace cache pages? My gut is that it's 10K-100K and your gut is > that it's 100-1000, which is a hundred-fold difference. Seems like we > need some kind of emprical data to decide what actually works well in > real life. I think using a smaller 'n' for a sort based replacement is on solid mathematical grounds. Amortized replacement performance is (lg(n) + (k/n)) * q where q is the miss rate and k is the page replacement cost. k/n is close to zero for n >= 100 so it's really lg(n) * q. From this we can deduce that since lg(10k) is roughly double lg(100), you'd have to get double the hit rate to break even and I just don't think that's going to happen -- it's pretty hard to even contrive cases with miss rates > .05 (and from there replacement performance is moot). OTOH, your approach wants n to be larger because presumably the hit rate would be better -- amortized replacement performance is k/n * q. The value that gives the lowest q within reasonable memory constraints is what you'd want. hm. merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers