On Thu, 2007-03-22 at 22:11 +0530, Pavan Deolasee wrote:
> With this background, I propose to index ONLY the head of the
> HOT-chain. The TID of the root tuple is used instead of the actual
> TID of the tuple being indexed. This index will not be available to
> the transactions which are started before the CREATE INDEX
> transaction. Just like we use "indisvalid" flag to avoid including
> an invalid index in the plan, we use the pg_index "xid" to decide
> whether to use the index in the plan or not. Only transactions with
> txid > pg_index:xid can see the index and use it.
> In fact, the serializable transactions started before CREATE INDEX
> can not anyway see the index so all this is done to handle
> read-committed transactions.
> In this proposal we indexed only the latest version. But none of the
> transactions started after CREATE INDEX can anyway see the
> older tuples and hence we should be fine even if we don't index
> them in the new index. And none of the older transaction can see
> the index, so again we are safe. The design also helps us to
> preserve the heap HOT semantics and chain pruning and does not
> need VACUUM or any special handling.
Well, ISTM you've nailed it.
CREATE INDEX returns as soon as possible, but people will have to wait
for their next transaction before they can see it and use it too. Nice
role reversal to avoid having CREATE INDEX wait. No restrictions on the
number of indexes, no restrictions on multiple concurrent index builders
and we can do this in just one pass.
The ShareLock taken by CREATE INDEX guarantees all transactions that
wrote data to the table have completed and that no new data can be added
until after the index build commits. So the end of the chain is visible
to CREATE INDEX and won't change. As long as you index the latest
committed version on each HOT chain, then I think it works. Clearly want
to ignore aborted versions. Sounds like you'll need to read the HOT
chains in sequence to ensure we don't repeat the VACUUM FULL error. If
there are no HOT chains then it will be just a normal seq scan of each
block, so there's no real speed loss for situations where no HOT updates
have taken place, such as reload from pg_dump.
Sounds like you'll need to store the Next TransactionId rather than the
TransactionId of the CREATE INDEX. We don't need to store the ComboId as
well, since all commands are planned in ComboId sequence, assuming plan
invalidation blows away any earlier plans held by our own backend.
There is a slight hole in that SERIALIZABLE transactions won't be able
to use any indexes they build during their transaction, since they may
need to be able to see prior data, but I don't think anybody is going to
complain about that restriction. Anyone?
Maybe we can use this technique for CREATE INDEX CONCURRENTLY as well,
so that it doesn't have to wait either. That needs some careful
thinking... it may not work the same because of the locking differences.
---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not