On Fri, 2007-03-16 at 21:56 +0530, Pavan Deolasee wrote:

> Any thoughts on the overall approach ? Any suggestions to
> simplify things or any alternate designs ? 

Well your design is very different from what we discussed, so I think I
should post my proposed design alongside this, for further discussion.

- - -


CREATE INDEX CONCURRENTLY has no feature changes to work with HOT.

CREATE INDEX works normally, except when HOT tuples are found, in which
case special processing occurs requiring additional locking. The index
build occurs in a single scan, as now.

We perform no chilling during the initial scan. We index the tuple
identified by SnapshotNow, but we take the root tuple's htid, not the
htid of the tuple version being indexed. We assume that the tuple
version indexed will be the root of the HOT chain by the time the index
is complete.

Currently, we wait until all pre-existing transactions exit before we
allow this to COMMIT. With HOT, we simply move the wait so it occurs
*before* the second scan, then we can prune the HOT chains as we pass
through the heap on the second scan. There will be no pre-existing HOT
tuples and so no chilling is required.


We add a field, xchill, to pg_class that stores TransactionIds. This is
set to InvalidTransactionId if no index has been built yet. The concept
of "main indexer" is introduced, so we allow multiple concurrent index
builds, but only one of these can chill tuples at a time.

1. In IndexBuildHeapScan, as we pass through the table:

a) if we find any any HOT rows, we check xchill and do one of steps
(i-iii). Until this point, it hasn't been important whether we are the
main or a secondary indexer.

i) if xchill is InvalidTransactionId or is committed then we attempt to
become main indexer immediately, following these steps before we
continue building the index (1b)

-- If the table is temp, or if we created the table then we immediately
become the main indexer, so return immediately. If the table being
indexed is already visible to everybody, then:
-- Update pg_class entry for the table, setting the xchill field for the
table to the builder's Xid. (Use heap_inplace_update for this, which is
OK, whether we commit or abort).
-- acquire AccessExclusiveLock on all existing indexes (not the heap)

ii) If xchill is in-progress we wait for that transaction to complete,
then do either step i) or iii). We cannot continue building our index
until the other transaction commits because we cannot yet see the other
index, yet we have to insert into it in order to correctly chill a tuple
to allow *our* index to be built.

iii) if xchill is aborted we abort also, saying that a VACUUM is needed.

b) If we get here then we are the main indexer and can chill tuples. As
we move through the scan we chill all HOT tuples, mark them
HEAP_CHILL_IN_PROGRESS, write WAL for this and insert index entries for
them in all existing indexes, as well as this one. Then remove
CHILL_IN_PROGRESS flags, without writing WAL.

c) release locks on indexes, before end of transaction

2. If we crash or a transaction abort occurs:
- we cannot prune a HEAP_ONLY_TUPLE that points to a tuple with

- VACUUM must be used to clean up after an aborted index build and needs
some additional code to allow this to occur.

3. Concurrent index builds are allowed. If we are not the main indexer,
then we can attempt to build an index, but any scan that sees a HOT
tuple will block and wait for the main index builder to complete before
it proceeds.

4. When an indexscan reads the table, if it finds a
HEAP_CHILL_IN_PROGRESS tuple it may or may not be valid. Concurrent
index scans and tuple chilling can mean that an index scan find the same
tuple twice, by different routes, if a CREATE INDEX crashed. To avoid
this an IndexScan will only find a tuple visible if it came across a
HEAP_CHILL_IN_PROGRESS tuple using an indirect route, i.e. it followed
the path from root->HOT tuple.

In this design, CREATE INDEX does have a deadlock risk when it is used
within a transaction *and* the index is being built on a publicly
visible table (i.e. not just-built and not temp). IMHO that risk is
acceptable, since if users are worried about concurrent access to a

  Simon Riggs             
  EnterpriseDB   http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Reply via email to