Did someone come up with a bitmap compression scheme for on-disk bitmap indexes that would help out here? Some form of compression could make a big difference in mostly-dead pages. If nothing else, it would likely be worth special-casing an entire page being dead, which is a common case for queue tables. That could be done by making an entry in the page number array with a special offset value.
On Sun, May 13, 2007 at 07:42:36PM -0400, Tom Lane wrote: > Heikki Linnakangas <[EMAIL PROTECTED]> writes: > >>> Or we could switch to a more compact representation of the dead tuples, > >>> and not need such a big maintenance_work_mem in the first place. > > > One idea is to use a compressed bitmap like in the bitmap index patch, > > and a tree of block numbers or ranges to allow random access to it. > > I thought a bit about that but it doesn't seem tremendously appealing, > at least not as the only representation, because it's not more compact > for small numbers of dead tuples per page. (And we don't have the "out" > of switching to lossy representation.) > > Here's a design sketch that works if we are willing to limit VACUUM's > usable maintenance_work_mem to 4GB: > > 1. Store an array of page numbers plus offsets into a second working > array of uint16 (the offsets are 32 bits, whence the 4GB limitation). > This requires 8 bytes per page-with-dead-tuples, and since it will be > built in order as a byproduct of our scanning order, it can be > binary-searched on the page number. > > 2. The offset part of the per-page entry points at a segment of the > uint16 array belonging to this page. It can have one of 2 formats. > For a small number of dead tuples on the page, we just store an > array of line numbers. For a larger number, we store a bitmap > showing the positions of dead tuples. While we scan a page, we > accumulate dead tuple line numbers in a small working array, and > then either copy those to the large array or build a bitmap from > them, depending on which will be smaller. Since the offsets into > the uint16 array will always be even, we can usurp the low-order > bit of the pointer word to distinguish which representation is > stored. > > 3. You might think we need to spend an additional word storing > how many line numbers or bitmap words there are per page, but > we can save that space by comparing offsets of adjacent entries > in the per-page array, since we know they're stored adjacently. > > I envision the per-page array as being built upwards from the bottom of > a single large maintenance_work_mem-sized array, and the uint16 array > data as being filled from the top down, and whenever the two pointers > are in danger of crossing, we stop and do an index vacuum cycle, just > like in the current logic. This lets us avoid having to guess in > advance how much per-page versus per-tuple space we will need. Note > this means the end of a page entry's uint16 data is determined by > looking at the prior page entry's offset instead of the next one, > but that seems no big problem. > > So the lookup process involves a binary search on the page number only, > and then either a scan of the tuple line numbers or a single bitmap > probe. (We could make the scan be a binary search, but since that > representation will only be used with small numbers of tuples, it'd > probably not be any faster than a simple search loop.) AFAICS that > ought to be as fast or faster than the current lookup methodology; > significantly faster where there are many dead tuples per page. > > The space requirements are: > > No dead tuples on page 0 bytes (same as now) > 1 dead tuple on page 10 bytes (vs 6 now) > 2 dead tuples 12 bytes (same as now) > 3 dead tuples 14 bytes (vs 18 now) > > and so on, except that for typical table densities of say 100 tuples per > page, we will switch over to the bitmap representation at 6 or so dead > tuples per page, and so the requirement will never go beyond about 20 > bytes per page whereas the current method could get as bad as 600 bytes > for an all-dead page. > > What this says is that it's not worth changing if you expect low > dead-tuple densities (which IIRC was the assumption I made when I > designed the current representation). In a table with an average of > less than 1 dead tuple per page, this way is a loser. OTOH, with very > low dead-tuple densities it may hardly matter, since you're going to > have to scan many GB of heap before you fill maintenance_work_mem > anyway. > > If you assume that a more typical scenario is vacuuming after 10% > or so tuple "churn", then there would be 10 or so dead tuples per > page, which makes this a good win: about 20 vs about 60 bytes per > page, with the win going up the longer vacuum is delayed. > > HOT would take away some of the win, probably, but I'm not sure > how much. > > Comments? Can anyone invent a better data structure? > > regards, tom lane > -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend