Tom Lane wrote:
... and not guaranteeing to reset theaccess pattern on failure, either.
Good catch, I thought I had that covered but apparently not.
I think we've got to get rid of the global variable and make the access pattern be a parameter to StrategyGetBuffer, instead. Which in turn suggests a variant ReadBuffer, maybe ReadBufferWithStrategy(), at the next level up. This would serve directly for the heapscan case, and for the main heap accesses in vacuum/analyze, but it might be a bit of a pain to make index vacuuming pay attention to the strategy.
I thought of that as well, but settled on the global variable for that same reason, the UnpinBuffer problem, and the fact that we'd end up with quite many different ReadBuffer variants: ReadBuffer, ReadOrZeroBuffer, ReadBufferWithStrategy, ReleaseAndReadBuffer, ReleaseAndReadBufferWithStrategy (if we care about that).
Index vacuum code for all the indexams could be changed to use the ReadBufferWithStrategy if we took that approach, though there's quite a many of those calls in total. We could use a hybrid approach, where you could use ReadBufferWithStrategy to give the strategy on page-by-page basis, or use SetStrategyHintVacuum/SetAccessPattern to set it for all subsequent ReadBuffer-calls.
The other case that the patch addresses is COPY TO REL, which we could handle if we were willing to pass a strategy parameter down through heap_insert and RelationGetBufferForTuple ... is it worth the trouble? I don't recall anyone having presented measurements to show COPY TO REL being helped by this patch.
The theory was that it'll reduce the cache-spoiling of COPY, just like it does for selects. I can run tests on that.
I don't especially care for the ring data structure being global inside freelist.c, either. What I'm inclined to do is have the "access strategy parameter" actually be a pointer to some struct type, with NULL representing the default strategy. At the beginning of a bulk operation we'd create an access strategy object, which would be internally the current Ring data structure, but it'd be opaque to code outside freelist.c. This'd mean that a query using two large seqscans would use two rings not one, but that's not necessarily bad; it would in any case make it a lot easier to generalize the strategy concept in future to support more than one kind of non-default policy being active in different parts of a query.
I thought about the two large scans scenario but came to the conclusion that it's not possible to have two large scans active at the same time in a single backend. You could have a query like "SELECT * FROM bigtable_1, bigtable_2" with a cartesian product, or something with a function that scans a large table, but even then one of the scans would progress veery slowly compared to the other scan, and using more than one ring buffer would make no difference.
I agree on the generalization argument, though.
A point I have not figured out how to deal with is that in the patch as given, UnpinBuffer needs to know the strategy; and getting it that info would make the changes far more invasive. But the patch's behavior here is pretty risky anyway, since the strategy global at the time of unpin might have nothing to do with how it was set when the buffer was acquired. What I'm tempted to do is remove the special case there and adjust buffer acquisition instead (maybe make it decrement the usage_count when re-using a buffer from the ring).
It's assumed that the same strategy is used when unpinning, which is true for the current usage (and apparently needs to be documented).
Normally buffers that are in the ring are recycled quickly, in which case the usage_count will always be increased from 0->1 in UnpinBuffer, regardless of the check. The check is there to avoid inflating the usage_count on buffers that happened to be already in shared_buffers. It might not be that important for the selects, but that's what keeps COPY from bumping the usage_count all the way up to 5.
If we figure out another way to deal with the COPY usage_count, maybe we could remove the check altogether. I've been thinking of changing COPY anyway so that it always kept the last page it inserted to pinned, to avoid the traffic to buffer manager on each tuple.
Comments? I'm still playing with the ideas at this point...
Let me know if you want me to make changes and submit a new version. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 5: don't forget to increase your free space map settings