On Mon, Aug 24, 2020 at 6:38 AM John Naylor <john.nay...@2ndquadrant.com> wrote:
> On Fri, Aug 21, 2020 at 8:53 PM Peter Geoghegan <p...@bowt.ie> wrote:
> > Note that there is a ~20% reduction in blks_hit here, even though the
> > patch does ~1% more transactions (the rate limiting doesn't work
> > perfectly). There is also a ~5.5% reduction in aggregate
> > blk_read_time, and a ~9% reduction in blk_write_time. I'd say that
> > that's pretty significant.
>
> Indeed.
Most of this seems to come from the new_orders table, which has heap
pages that are continually inserted into and then deleted a little
later on. new_orders is a bit like a queue that never gets too big. It
is probably the component of TPC-C where we have the most room for
improvement, fragmentation-wise. OTOH, despite all the churn, the high
watermark size of the new_orders table isn't all that high -- maybe
~50MB with a 1TB database. So it's not like we'll save very many
shared_buffers misses there.

> > Preferring close-by blocks in the FSM is something that there was
> > discussion of when the current FSM implementation went in back in 8.4.
>
> Right, I found a pretty long one here:
>
> https://www.postgresql.org/message-id/flat/1253201179.9666.174.camel%40ebony.2ndQuadrant

Thanks for finding that.

> I imagine you're suggesting to make this change in the FSM data? I'm
> thinking we could change the category byte to a signed integer, and
> reduce FSM_CATEGORIES to 128.

Yeah, something like that. I don't think that we need very many
distinct FSM_CATEGORIES. Even 128 seems like way more than could ever
be needed.

> Any change of the FSM file would require pg_upgrade to rewrite the
> FSM, but it still doesn't seem like a lot of code.

I think that the sloppy approach to locking for the
fsmpage->fp_next_slot field in functions like fsm_search_avail() (i.e.
not using real atomic ops, even though we could) is one source of
problems here. That might end up necessitating fixing the on-disk
format, just to get the FSM to behave sensibly -- assuming that the
value won't be too stale in practice is extremely dubious.

This fp_next_slot business interacts poorly with the "extend a
relation by multiple blocks" logic added by commit 719c84c1be5 --
concurrently inserting backends are liable to get the same heap block
from the FSM, causing "collisions". That almost seems like a bug IMV.
We really shouldn't be giving out the same block twice, but that's
what my own custom instrumentation shows happens here. With atomic
ops, it isn't a big deal to restart using a compare-and-swap at the
end (when we set/reset fp_next_slot for other backends) -- see the
rough sketch further down.

> Other ideas?

I've been experimenting with changing the way that we enforce heap
fill factor for calls to heap_insert() (and even heap_update()) that
happen to occur at a "natural temporal boundary". This works by
remembering an XID alongside the target block in the relcache when the
target block is set. When we have an existing target block whose XID
does not match our backend's current XID (i.e. it's an old XID for the
backend), then that means we're at one of these boundaries. We require
that the page have a little more free space before we'll insert on it
when we're at a boundary.

If we barely have enough space to insert the incoming heap tuple, and
it's the first of a few tuples the transaction will ultimately insert,
then we should start early on a new page instead of using up the last
little bit of space (note that the "last little bit" of space does not
include the space left behind by fill factor). The overall effect is
that groups of related tuples are much less likely to span a heap page
boundary unless and until we have lots of updates -- though maybe not
even then. I think that it's very common for transactions to insert a
group of 2 - 15 logically related tuples into a table at a time.
Roughly speaking, you can think of this as the heapam equivalent of
the nbtree page split choice logic added by commit fab25024.
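To make that a bit more concrete, here is a minimal sketch of the kind
of check I have in mind -- just an illustration, not the prototype
itself. The rd_targxid relcache field and the amount of extra
"boundary slack" are made-up names/values:

#include "postgres.h"
#include "access/transam.h"
#include "access/xact.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Sketch: decide whether the cached relcache target block can take
 * another tuple.  rd_targxid stands in for an XID remembered alongside
 * the target block; it is not a real field today.
 */
static bool
target_block_still_ok(Relation relation, Page page, Size tuplen)
{
	TransactionId curxid = GetCurrentTransactionId();
	Size		needed = tuplen;

	/*
	 * If the target block was last used by an earlier transaction in
	 * this backend, treat the insert as crossing a "natural temporal
	 * boundary": demand some extra free space before squeezing onto
	 * the old page, so that a new group of related tuples tends to
	 * start out on a fresh page instead.
	 */
	if (TransactionIdIsValid(relation->rd_targxid) &&
		!TransactionIdEquals(relation->rd_targxid, curxid))
		needed += BLCKSZ / 16;	/* arbitrary slack, for illustration only */

	return PageGetHeapFreeSpace(page) >= needed;
}

Presumably a check like this would live near RelationGetBufferForTuple(),
and the slack amount would have to account for fillfactor rather than
being a constant; the above is only meant to show the shape of the idea.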
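And going back to fp_next_slot for a moment, this is roughly what I
mean by "real atomic ops" there. It assumes fp_next_slot were turned
into a pg_atomic_uint32 (exactly the kind of on-disk change mentioned
above), and fsm_search_one_page() is only a hypothetical stand-in for
the existing slot search inside fsm_search_avail():

#include "postgres.h"
#include "port/atomics.h"

extern int	fsm_search_one_page(int start_slot);	/* hypothetical stand-in */

/*
 * Sketch: claim an FSM slot without handing the same slot to two
 * concurrently inserting backends.
 */
static int
fsm_claim_slot(pg_atomic_uint32 *fp_next_slot)
{
	for (;;)
	{
		uint32		start = pg_atomic_read_u32(fp_next_slot);
		int			slot = fsm_search_one_page((int) start);

		if (slot < 0)
			return -1;			/* nothing suitable on this FSM page */

		/*
		 * Publish the next starting point.  If another backend advanced
		 * fp_next_slot concurrently, it may well have found the same
		 * slot we did, so restart the search rather than give out a
		 * duplicate heap block.
		 */
		if (pg_atomic_compare_exchange_u32(fp_next_slot, &start,
										   (uint32) (slot + 1)))
			return slot;
	}
}

The real thing would have to worry about slot wraparound and about how
the atomic is laid out on the FSM page, but a compare-and-swap at the
end ought to make the duplicate-block collisions far less likely.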
We ought to go to at least a little bit of effort to minimize the
number of distinct XIDs that are present on each heap page (in the
tuple headers). We can probably figure out heuristics that respect
heap fill factor on average, while giving inserts (and even non-HOT
updates) a little wiggle room at heap page boundaries.

By applying both of these techniques together (the locality/page split
thing and real atomic ops for fp_next_slot), the prototype patch I'm
working on mostly restores the system's current ability to reuse space
(as measured by the final size of relations when everything is done),
while maintaining most of the performance benefits of not using the
FSM at all. The code is still pretty rough, though.

I haven't decided how far to pursue this. It's not as if there are
many ways left to make TPC-C go 5%+ faster; it's very write-heavy, and
stresses many different parts of the system all at once. I'm sure that
anything like my current prototype patch will be controversial,
though. Maybe it will be acceptable if we only change the behavior for
people who explicitly set heap fillfactor.

-- 
Peter Geoghegan