Overview -------- CREATE INDEX, CREATE INDEX CONCURRENTLY and VACUUM FULL all need some adaptation to work correctly with HOT.
[This summary and proposal supercedes all previous proposals by me regarding utilities with HOT] The Problem ----------- With HOT, CREATE INDEX may find tuples that are HEAP_ONLY_TUPLE, yet have an index value(s) that differs from the indexable value of the tuple version which is at the root of the hot chain. e.g. we have a simple hot chain that looks like this on one page: tup1(attr1=1, attr2=A)[ROOT] -> tup2(attr1=1, attr2=B)[HOT) there is already an index(attr1), and we are trying to build an index on attr2. Index(attr2) will to point to both tup1 and tup2. So by the end of the build we need to have inserted an additional index pointer into index(attr1) (yes, the other index) for tup2 and also broken the link between tup1->tup2. This operation is known as chilling, or de-HOT-ifying. If we don't do this, we will have an index inconsistency. If we attempt to chill a tuple concurrently, we also get the possibility of a concurrent index scan following multiple paths to the same tuple. This effects all types of indexes, not just b-trees. Solutions --------- There are three basic solutions: LOCK, UPDATE & WAIT and MARK 1. LOCK - Use some form of locking to prevent concurrent index scans while we modify the various tuple versions and fiddle with the index entries. This has some sub-options: a) AccessExclusiveLock on main relation prevents all index accesses b) Lock all of the indexes with an AccessExclusiveLock, in a consistent order, so that no concurrent index scans are possible. This would be a non-transactional lock, so could be dropped before end of transaction - we're not locking the data we're just locking the physical access mechanism. 2. UPDATE & WAIT - From Hannu - Perform cold-null-UPDATEs on each tuple, which can happen concurrently. This can happen concurrently, but then requires us to wait until we are older than all non-utility transactions, so that these UPDATEs are visible to all. 3. MARK - From Heikki - verbatim: One solution I thought of is to add another flag to heap tuples, CHILL_IN_PROGRESS. To chill a tuple, you would: 1. Mark heap tuple with CHILL_IN_PROGRESS 2. Insert missing index entries 3. Clear CHILL_IN_PROGRESS and HEAP_ONLY_TUPLE flags Index scans would ignore tuples with CHILL_IN_PROGRESS and directly pointed to from the index. That way if we crash in the middle of step 2, scans and updates would work normally after replay, as if the index entries weren't there. CREATE INDEX would have to fail if there's any CHILL_IN_PROGRESS tuples, because we wouldn't know which index entries need to be inserted; some might already be there. To clear the CHILL_IN_PROGRESS flag, a vacuum would be needed. Vacuum would remove all index entries for those tuples, but instead of removing the heap tuple in the 2nd scan it would just clear the CHILL_IN_PROGRESS flag, bringing us back to where we started. Thoughts -------- - CREATE INDEX CONCURRENTLY can use UPDATE & WAIT or MARK - VACUUM FULL could use any of the solutions. It does currently hold AccessExclusiveLock. UPDATE & WAIT would be very good. - CREATE INDEX has the most problems, but MARK solves most of them. Multiple index builds may need to chill different sets of tuples, but if there is any overlap then the second build will fail when it sees the CHILL_IN_PROGRESS flag that the other has set. So we can no longer achieve multiple concurrent index builds without great care. It might be possible to clean up old CHILL_IN_PROGRESS flags as we go, but if we did that we'd need to make sure concurrent index scans never pass each other, otherwise one index build would clean up the others flags. So we have two choices: a) we continue to allow concurrent index builds, but we throw an ERROR when an index build finds a CHILL_IN_PROGRESS flag b) we clean up CHILL_IN_PROGRESS flags as we go, but we give up the ability to perform concurrent index builds Proposal -------- - CREATE INDEX can use the MARK solution as part of the first scan (cos there's only one), setting flags called HEAP_CHILL_IN_PROGRESS. As the index build scan proceeds we insert records directly into each index using the standard WAL-logged insertion path. No spool file is built up while we do this. Setting HEAP_CHILL_IN_PROGRESS must must also write WAL records and dirty the block, since we might fail after we have performed an index insert but before we remove the mark. This could mean we'd end up writing lots of database blocks when we do CREATE INDEX. (Note here that we use WAL to chill, even though the main index build may still skip writing WAL as it does now, if archive_command is not set). Note we might fail to build an index because of a uniqueness violation that occurs *during* the tuplesort run. So we must mark the tuple, insert the indexes, clean the tuple and then offer the tuple to the index callback. We pick option a): continue to allow concurrent index builds, but we throw an ERROR when an index build finds a HEAP_CHILL_IN_PROGRESS flag. That's the optimistic approach. Building indexes concurrently only usually happens after a restore/large load, which will be when there are no HOT tuples or flags set anyway. This means we also need to modify the VACUUM process to clean up after a failed index build. VACUUM will build a second array of tids that can be used to perform removal of HEAP_CHILL_IN_PROGRESS during lazy_scan_index() [[perhaps with uncool_tuple()? :-) ]]. We use a second array because we need only remove HEAP_CHILL_IN_PROGRESS flags when a tuple will not be removed by the normal vacuum process, so there is by definition no overlap between the two lists. We use an array of tids to allow us to reuse all the existing code for heap/index cleaning. - CREATE INDEX CONCURRENTLY can use the MARK solution as part of the second scan. After the first scan, the new index will be visible to other transactions and no new HOT tuples will be created that could effect the new index. So chilling as part of the second scan will ensure that we catch all HOT tuples. - VACUUM FULL - The best solution, for now, is to make VACUUM FULL perform a reindex on all indexes on the table. Chilling may require us to modify considerably more index entries than previously. UPDATE & WAIT would be very good, but probably should wait for the next release now, since we now have changes to make to 4 utilities. Are we cool with that? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq