On Fri, 2008-07-04 at 15:20 +0300, Heikki Linnakangas wrote:
> 2. We only update the FSM when we try to insert a tuple and find that it
> doesn't fit. That reduces the amount of FSM updates dramatically when
> you're doing bulk inserts. (This is the same as in the current
> implementation, though I'm not sure it's optimal anymore.)
OK, so the ideal search algorithm would be to search the FSM page you
just updated, rather than starting right from the top again. That way,
we'd only need to do one page access to do a page-full-need-another-page
operation. That's O(1) whatever the table size.
> I haven't been able to come up with a simple test case that shows any
> meaningful performance difference between the current and this new
> implementation. Got any ideas? I fear that we're going overboard trying
> to avoid contention that simple isn't there, but it's hard to argue
> without a concrete test case to analyze..
We do very poorly at measuring contention every where else too... yet we
know it exists and that we have a long way to go yet.
My concern is too avoid contention theoretically. Almost all the cases
of contention we've seen have been pulsing or occasional traffic jam
cases not nice steady contention. As a result we've repeatedly
underestimated the knock-on effects through the system.
Right now this scheme introduces new sources of wait
* access to FSM page while wait for buffer partition lock
* access to FSM page while I/O occurs
* access to FSM blocked by exclusive lock while waiting for
Both of those are significantly longer than any wait on existing FSM
> > The FSM tree as proposed has exact values.
> Not quite. Free space is tracked in BLCKSZ/256 (=32 with default BLCKSZ)
> increments, so that we only need one byte per heap page.
OK, so that probably addresses my concerns then. That is imprecise to
avoid changes across pages when only a few bytes differ between values
higher up the FSM tree. I like that.
> One idea is to make the mapping from "amount of free space in bytes" to
> the 1-byte "FSM category" non-linear. For example, use just one category
> for > 2000 bytes free, on the assumption that anything larger than that
> is toasted anyway, and divide the 255 categories evenly across the
> remaining 2000 bytes.
Hmm. Why not use 4 bits per page and store a log2 value? That still
gives nice big chunks, without too much difference.
0-31 bytes free = 0
32-64 bytes free = 1
65-127 bytes free = 2
128-255 bytes free = 3
256-511 bytes free = 4
512-1023 bytes free = 5
1024-2047 bytes free = 6
That works for BLCKSZ > 8192, which the 1 byte linear scheme doesn't.
It would also give you a fan out of 8 rather than just 4.
It also leaves a free bit per page to record other information also,
when BLCKSZ = 8192.
> > I'm not happy about the FSM being WAL logged for minor changes (new
> > pages, yes). The high number of changes will cause extra traffic where
> > we don't want it. This will accentuate the locks in key areas and will
> > introduce additional delays into the code paths that use FSM, but don't
> > currently write WAL. Writing WAL will further exacerbate the expected
> > contention around the FSM root page.
> There's no such code paths AFAICS. The FSM is only updated on INSERTS,
> UPDATEs and VACUUMs, and they're already WAL-logged. We will be writing
> an extra WAL record, though, in addition to the ones we already write.
Right. So we'll have to queue for the damn lock twice.
> At first, I tried to piggy-back on the WAL replay routines of heap
> inserts and updates, but it turned out to be quite complicated because
> of the bubbling up. If we don't bubble up at update time, per your idea,
> perhaps it would be possible after all.
> > We will write dirty FSM pages at checkpoint, so just allow that rather
> > than logging every change. If we crash, going back to the last
> > checkpoint is probably OK, since all mistakes will cause corrections in
> > the FSM anyway and it will soon correct itself. If the values are only
> > approximate this will make the post-crash imprecision of the FSM less
> > important anyway.
> If we go down that path, we really need to make sure that the FSM is
> self-correcting. The current implementation isn't, the bubble up
> operations need to be replayed from WAL, or you end up with an
> incoherent FSM.
Well, make changes across pages WAL-logged, so the FSM always links up
correctly, but might be wrong on some of the internal nodes. We don't want the
structure to be incoherent, but the values themselves aren't very exciting.
> Changing the logic so that the upper nodes are fixed at searches, rather
> than the updates, helps, but vacuums that increase the amount of free
> space on page would still need to be WAL-logged. Mixing WAL-logged and
> non-WAL-logged operations sounds like a sure way to get confused.
> > I'd suggest that FSM page locks be
> > usually taken conditionally after the root level. So if one route is
> > busy, try another, unless the caller prefers to wait to allow keeping
> > the heap in order.
> We're only talking about lightweight locks here. Doing another
> ReadBuffer, possibly causing I/O, and trying to lock another buffer
> instead is surely more expensive than just waiting on the first lock.
Then you should do it with conditional read buffer, so we don't do I/O
if we don't need to.
> > Maybe re-initialising the level value when jumping between pages, so it
> > restarts at level zero. Maybe call them level 1+ so you can spot that
> > more easily.
> Now you lost me. 'level' is re-initialized in GetPageWithFreeSpace when
> we start over.
> BTW, level zero is the *bottom* level, and 1 is the "middle" level, and
> 2 is the root page.
Yes,so reinitialising the level variable would cause the search to start
again at the bottom of the tree, hence looping.
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Sent via pgsql-patches mailing list (firstname.lastname@example.org)
To make changes to your subscription: