Hi,

This is a preliminary proposal for Minmax indexes. I'm experimenting with the code, but it's too crude to post yet, so here's a document explaining what they are and how they work, so that reviewers can poke holes in it and help improve the design. My intention is to have a patch to show for CF2, so please do have a look at this and comment.
This is part of the AXLE project http://www.axleproject.eu and the intention is to support tables of very large size. In a quick experiment, I have a table of ~12 GB and its corresponding index is 65 kB in size; the time to do the equivalent of a seqscan is thus a small fraction of that taken by a real seqscan. This technique sits between a bitmap scan of a normal btree and a seqscan: the minmax index tells the bitmap heap scan which pages to seqscan, allowing it to skip a large fraction of pages that are known not to contain tuples matching the query quals. This is a huge win for large data warehouses.

Without further ado, here's what I propose.


Minmax Range Indexes
====================

Minmax indexes are a new access method intended to enable very fast scanning of extremely large tables. The essential idea of a minmax index is to keep track of the min() and max() values in consecutive groups of heap pages (page ranges). These values can be used by constraint exclusion to avoid scanning such pages, depending on the query quals. The main drawback is having to update the stored min/max values of each page range as tuples are inserted into it.

Other database systems already have this feature. Some examples:

* Oracle Exadata calls this "storage indexes"
  http://richardfoote.wordpress.com/category/storage-indexes/

* Netezza has "zone maps"
  http://nztips.com/2010/11/netezza-integer-join-keys/

* Infobright has this automatically within their "data packs"
  http://www.infobright.org/Blog/Entry/organizing_data_and_more_about_rough_data_contest/

* MonetDB seems to have it
  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.2662
  "Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS"


Grammar
-------

To create a minmax index, we use

  CREATE INDEX foo_minmax_idx ON foo USING MINMAX (a, b, e);

Partial indexes are not supported; since the index is concerned with the minimum and maximum values of the involved columns across all the pages in the table, it doesn't make sense to exclude values. Another way to see "partial" indexes here would be those that only consider some pages in the table instead of all of them; but this would be difficult to implement and manage and, most likely, pointless.

Expressional indexes can probably be supported in the future, but we disallow them initially for conceptual simplicity.

Having multiple minmax indexes on the same table is acceptable, though most of the time it would make more sense to have a single index covering all the interesting columns. Multiple indexes might be useful for columns added later.


Access Method Design
--------------------

Since item pointers are not stored inside indexes of this type, it is not possible to support the amgettuple interface. Instead, we only provide amgetbitmap support; scanning a relation using this index requires a recheck node on top. The amgetbitmap routine would return a TIDBitmap comprising all the pages in those page ranges that comply with the query quals; the recheck node prunes tuples that are not visible per the snapshot and tuples that do not match the query quals.

For each supported datatype, we need an opclass with the following catalog entries:

- support functions (pg_amproc):
  * pg_proc entry for min()
  * pg_proc entry for max()
- support operators (pg_amop): same as btree (<, <=, =, >=, >)

The min() and max() support functions are used during index construction. The support operators are used in the optimizer, so that the index is chosen when queries on the indexed table are planned. (We also use them in the amgetbitmap routine, to compare ScanKeys and decide whether to emit a certain block or not.)
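To make the role of the support operators a bit more concrete, here is a minimal sketch of the per-range exclusion test that the amgetbitmap routine would perform for one ScanKey. This is not the actual patch code: it assumes a single column of an int64-like type, and the names (MMStrategy, range_is_excluded) are made up; the real code would go through the opclass operators and work for arbitrary datatypes.

  /*
   * Hypothetical sketch, not actual PostgreSQL code: can a summarized
   * page range be skipped for one scan key, using only the stored
   * min/max values and btree-like comparison semantics?
   */
  #include <stdbool.h>
  #include <stdint.h>

  typedef enum MMStrategy { MM_LT, MM_LE, MM_EQ, MM_GE, MM_GT } MMStrategy;

  /* Return true if no value in [min, max] can possibly satisfy the qual. */
  static bool
  range_is_excluded(int64_t min, int64_t max, MMStrategy strat, int64_t arg)
  {
      switch (strat)
      {
          case MM_LT: return !(min < arg);    /* need some value < arg */
          case MM_LE: return !(min <= arg);
          case MM_EQ: return arg < min || arg > max;
          case MM_GE: return !(max >= arg);
          case MM_GT: return !(max > arg);
      }
      return false;            /* unknown strategy: cannot exclude */
  }

If this returns true for any ScanKey, the whole page range is skipped; otherwise all of its pages go into the bitmap and the recheck node sorts out the individual tuples.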
In each index tuple (corresponding to one page range), we store:

- the first block this tuple applies to
- the last block this tuple applies to
- for each indexed column:
  * the min() value across all tuples in the range
  * the max() value across all tuples in the range
  * whether nulls are present in any tuple

With the default INDEX_MAX_KEYS of 32, and considering columns of 8-byte length types (timestamptz, bigint), each index tuple would be at most 524 bytes in length, which seems reasonable. Of course, larger columns are possible, such as varchar, but creating minmax indexes on such columns seems of little practical usefulness. The maximum index tuple size is calculated as:

  BlockNumber (4 bytes) * 2 + data value (8 bytes) * 32 * 2 + null bitmap (4 bytes) = 524 bytes

Block ranges mentioned in index entries shouldn't overlap. However, there can be gaps where some pages have no covering index entry. (In particular, the last few pages of the table would commonly not be summarized.)

In order to scan a table, a sequential scan of the index is done; for each index tuple whose range limits (min/max) show that the range must be scanned given the ScanKeys, we obtain the block range covered by that tuple and return a lossy TIDBitmap of the pages in that range; any unsummarized pages and invalid page ranges (see below) must also be returned.

Note that we store pg_proc entries for the aggregate functions. It is the responsibility of the access method to obtain the pg_aggregate rows starting from the pg_proc rows; we avoid the parser code that deals with aggregates, because the mechanism it uses to resolve from parser to executor starts from aggregate names, which is not useful for our purposes. Also, running the aggregate is not done through the executor support for aggregates either, because that code is heavily specialized to running inside a node with a child node feeding tuples, which IndexBuildHeapScan cannot do. Instead, the AM is in charge of initializing state and running the transition functions for each new tuple read by the index-build heap scan.


Data changes in the heap
------------------------

Value changes in columns that are part of a minmax index, and tuple insertion into summarized pages, invalidate the stored min/max values. To support this, each minmax index has a validity map; a range can only be considered in a scan if it hasn't been invalidated by such changes. (A range "not considered" in the scan needs to be returned in whole regardless of the stored min/max values; that is, it cannot be pruned per the query quals.) The validity map is very similar to the visibility map in terms of performance characteristics: quick enough that it's not contentious, allowing updates and insertions to proceed even when data values violate the minmax index conditions. An invalidated range can be made valid again by re-summarization (see below).

Re-summarization is relatively expensive, because the complete page range has to be scanned. To avoid this, a table having a minmax index would be configured so that inserts only go to the page(s) at the end of the table; this avoids frequent invalidation of ranges in the middle of the table. We provide a table reloption that tweaks the FSM behavior, so that summarized pages are not candidates for insertion.
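Putting the scan description and the validity map together, here is a rough sketch of how the scan could decide which heap pages go into the lossy TIDBitmap. Again, every type and function name here (MMTuple, range_is_valid, range_matches_quals, emit_lossy_page) is made up for illustration; it assumes a single int64 column and index tuples visited in block order, and the real code would add pages to a TIDBitmap as discussed above.

  /* Hypothetical sketch, not the proposed patch. */
  #include <stdbool.h>
  #include <stdint.h>

  typedef uint32_t BlockNumber;

  typedef struct MMTuple
  {
      BlockNumber first_block;  /* first heap block covered */
      BlockNumber last_block;   /* last heap block covered */
      int64_t     min_value;    /* min() over the range (single column) */
      int64_t     max_value;    /* max() over the range */
      bool        has_nulls;    /* any nulls in the range? */
  } MMTuple;

  /* Stand-ins for machinery assumed to exist elsewhere in this sketch. */
  extern bool range_matches_quals(const MMTuple *tup);   /* via ScanKeys  */
  extern bool range_is_valid(const MMTuple *tup);        /* validity map  */
  extern void emit_lossy_page(BlockNumber blk);          /* add to bitmap */

  static void
  scan_index(const MMTuple *tuples, int ntuples, BlockNumber table_nblocks)
  {
      BlockNumber next_unsummarized = 0;
      BlockNumber blk;
      int         i;

      for (i = 0; i < ntuples; i++)
      {
          const MMTuple *tup = &tuples[i];

          /* pages with no covering index entry must always be visited */
          for (blk = next_unsummarized; blk < tup->first_block; blk++)
              emit_lossy_page(blk);
          next_unsummarized = tup->last_block + 1;

          /* invalidated ranges cannot be pruned; valid ones may be skipped */
          if (!range_is_valid(tup) || range_matches_quals(tup))
          {
              for (blk = tup->first_block; blk <= tup->last_block; blk++)
                  emit_lossy_page(blk);
          }
      }

      /* trailing pages beyond the last summarized range */
      for (blk = next_unsummarized; blk < table_nblocks; blk++)
          emit_lossy_page(blk);
  }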
A possible optimization is that whenever an update does not modify the values covered by a minmax index, or the changed values are not outside the min()/max() interval for the corresponding page range, the page range does not need to be marked invalid. This is particularly useful when considering same-page updates (i.e. updates which find enough empty space in the same page that the new tuple fits in it) and HOT (which frees space used by previous updates). This suggests we ought to recommend lower-than-100 fillfactor values to permit these features to kick in.

When tuples are added to unsummarized pages, nothing needs to happen to the index.

Tuples can be removed from anywhere without restriction. See below for more on vacuuming.


Summarization
-------------

At index creation time, the whole table is scanned; for each page range the min() and max() values and the nulls bitmap are collected and stored in the index. The possibly-incomplete range at the end of the table is also included.

Once in a while, it is necessary to summarize a bunch of unsummarized pages (because the table has grown since the index was created), or to re-summarize a range that has been marked invalid. This is simple: scan the page range calculating the min() and max() for each indexed column, then insert the new index entry at the end of the index. The main interesting questions are:

a) when to do it

The perfect time to do it is as soon as a complete page range of the configured range size has been filled (assuming page ranges are of constant size).

b) who does it (what process)

It doesn't seem a good idea to have a client-connected process do it; it would incur unwanted latency. Three other options are (i) to spawn a specialized process to do it, which perhaps can be signalled by a client-connected process that executes a scan and notices the need to run summarization; (ii) to let autovacuum do it, as a separate new maintenance task -- this seems simple enough to bolt on top of the already existing autovacuum infrastructure, though its timing constraints might be undesirable; or (iii) to wait for a user command.

The easiest way around this seems to be to have vacuum do it. That way we can simply do re-summarization in the amvacuumcleanup routine. Other answers would mean we need a separate AM routine, which appears unwarranted at this stage.


Vacuuming
---------

Vacuuming a table that has a minmax index does not represent a significant challenge. Since no CTIDs are stored, it's not necessary to scan the index when heap tuples are removed. It might be that some min() value can be incremented, or some max() value decremented; but this would represent an optimization opportunity only, not a correctness issue. Perhaps it's simpler to represent this as the need to re-run summarization on the affected page range.

Note that if there are no indexes on the table other than the minmax index, usage of maintenance_work_mem by vacuum can be decreased significantly, because no detailed index scan needs to take place (and thus it's not necessary for vacuum to save TIDs to remove). This optimization opportunity is best left for future improvement.


Optimizer
---------

In order to make this all work, the only thing we need to do is ensure we have a good enough opclass and amcostestimate. With this, the optimizer is able to pick up the index on its own.
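Going back to the summarization step described above, here is a minimal sketch of what (re)summarizing one page range could look like, again for a single int64 column, reusing the MMTuple layout from the earlier sketch and with made-up helper names; the real code would run the min()/max() aggregate transition functions obtained through pg_proc/pg_aggregate, as described under Access Method Design.

  /* Hypothetical summarization sketch, not the proposed patch. */
  #include <stdbool.h>
  #include <stdint.h>

  typedef uint32_t BlockNumber;

  typedef struct MMTuple
  {
      BlockNumber first_block;
      BlockNumber last_block;
      int64_t     min_value;
      int64_t     max_value;
      bool        has_nulls;
  } MMTuple;

  /* Stand-ins for the heap-scan machinery assumed by this sketch. */
  extern int  heap_page_ntuples(BlockNumber blk);
  extern bool heap_page_value(BlockNumber blk, int off, int64_t *value);

  static MMTuple
  summarize_range(BlockNumber first, BlockNumber last)
  {
      /* start from an "empty" summary; an all-null range keeps these bounds */
      MMTuple     tup = {first, last, INT64_MAX, INT64_MIN, false};
      BlockNumber blk;

      for (blk = first; blk <= last; blk++)
      {
          int ntuples = heap_page_ntuples(blk);
          int off;

          for (off = 0; off < ntuples; off++)
          {
              int64_t value;

              if (!heap_page_value(blk, off, &value))
              {
                  tup.has_nulls = true;    /* NULL: only the bit changes */
                  continue;
              }
              if (value < tup.min_value)
                  tup.min_value = value;
              if (value > tup.max_value)
                  tup.max_value = value;
          }
      }
      return tup;               /* to be inserted at the end of the index */
  }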
Open questions
--------------

* Same-size page ranges?

Current related literature seems to consider that each "index entry" in a minmax index must cover the same number of pages. There doesn't seem to be a hard reason for this to be so; it might make sense to allow the index to self-tune so that some index entries cover smaller page ranges, if this allows the min()/max() values to be more compact. This would incur a larger space overhead for the index itself, but might allow better pruning of page ranges during a scan. In the limit of one index tuple per page, the index itself would occupy too much space, even though we would be able to skip reading most heap pages because the min()/max() ranges are tight; in the opposite limit of a single tuple that summarizes the whole table, we wouldn't be able to prune anything from the seqscan even though the index is very small.

* More compact representation for TIDBitmap?

TIDBitmap is the structure currently used to represent bitmap scans. The representation of lossy page ranges is not optimal for our purposes, because it uses a Bitmapset to represent pages in the range; since we're going to return all pages in a large range, it might be more convenient to allow for a struct that instead uses start and end page numbers to represent the range. This is mentioned in passing in the tidbitmap.c comments, so perhaps the time has come for its implementation. (In experimentation, tidbitmap has no trouble with ranges of a thousand pages.)


References:

Email thread on pgsql-hackers
  http://www.postgresql.org/message-id/1199296574.7260.149.ca...@ebony.site
  From: Simon Riggs
  To: pgsql-hackers
  Subject: Dynamic Partitioning using Segment Visibility Map

http://wiki.postgresql.org/wiki/Segment_Exclusion
http://wiki.postgresql.org/wiki/Segment_Visibility_Map

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services