On 8/30/07, Tom Lane <[EMAIL PROTECTED]> wrote: > > "Pavan Deolasee" <[EMAIL PROTECTED]> writes: > > You are right - a new index might mean that an existing HOT chain > > is broken as far as the new index is concerned. The way we address > > that is by indexing the root tuple of the chain, but the index key is > > extracted from the last tuple in the chain. The index is marked > "unusable" > > for all those existing transactions which can potentially see any > > intermediate tuples in the chain. > > I don't think that works --- what if the last tuple in the chain isn't > committed good yet? If its inserter ultimately rolls back, you've > indexed the wrong value.
I am confused. How could we get ShareLock on the relation while there is some open transaction which has inserted a tuple in the table ? (Of course, I am not considering the system tables here) If the transaction which inserted the last tuple in the chain is already aborted, we are not indexing that tuple (even if that tuple is at the end). We would rather index the last committed-good tuple in the chain. Running the tuples with HeapTupleSatisfiesVacuum guarantees that. Isn't it ? > Please see this document written by Greg Stark. He has nicely summarized > > how CREATE INDEX and CREATE INDEX CONCURRENTLY works with HOT. > > http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php > > Isn't the extra machination for C.I.C. just useless complication? > What's the point of avoiding creation of new broken HOT chains when > you still have to deal with existing ones? IMHO the extra step in C.I.C simplifies the index build. The transaction-waits takes care of the existing broken chains and the early catalog entry for the index helps us avoid creating new broken chains. I am not against doing it a different way, but this is the best solution we could come up when we worked on CIC. >> I also don't think I > >> believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated > >> tuples: what if the index completion commits, but the concurrent delete > >> rolls back? Then you've got a valid tuple that's not in the index. > > > Since CREATE INDEX works with ShareLock on the relation, we can > > safely assume that we can't find any DELETE_IN_PROGRESS tuples except > > those deleted by our own transaction. The only exception is system > relation, > > but we don't do HOT updates on system relation. > > That chain of reasoning is way too shaky for me. Essentially what > you're saying is you'll promise not to corrupt non-system indexes. > Nor am I really thrilled about having to disable HOT for system > catalogs. I am not against supporting HOT for system catalogs. But IMHO its not a strict requirements because system catalogs are small, less frequently updated and HOT adds little value to them. If we don't have a generic solution which works for system and non-system tables, thats the best we can get. We can start with non-system tables and expand its scope later. > The only reason to redefine MaxHeapTuplesPerPage to higher side is > > because HOT allows us to accommodate more tuples per page because > > of redirect-dead line pointers. > > No, it doesn't allow more tuples per page. It might mean there can be > more line pointers than that on a page, but not more actual tuples. > The question is whether there is any real use in allowing more line > pointers than that --- the limit is already unrealistically high, > since it assumes no data content in any of the tuples. If there is a > rationale for it then you should invent a different constant > MaxLinePointersPerPage or some such, but I really think there's no > point. I agree. I think the current limit on MaxHeapTuplesPerPage is sufficiently large. May be we can keep it the same and tune it later if we have numbers to prove its worthiness. > Doubling the value is chosen as a balance between heap page > > utilization, line pointer bloating and overhead for bitmap scans. > > I'd say it allows a ridiculous amount of line pointer bloat. OK. Lets keep MaxHeapTuplesPerPage unchanged. >> Even if it's safe, ISTM what you're mostly accomplishing there is to > >> expend a lot of cycles while holding exclusive lock on the page, when > >> there is good reason to think that you're blocking other people who are > >> interested in using the page. Eliminating the separation between that > >> and cleanup would also allow eliminating the separate "PD_FRAGMENTED" > >> page state. > > > The reason we did it that way because repairing fragmentation seems > > much more costly that pruning. Please note that we prune a single > > chain during index fetch. Its only for heap-scans (and VACUUM) that > > we try to prune all chains in the page. So unless we are doing > heap-scan, > > I am not sure if we are spending too much time holding the exclusive > > lock. I agree we don't have any specific numbers to prove that though. > > If you don't have numbers proving that this extra complication has a > benefit, I'd vote for simplifying it. The SnapshotAny case is going to > bite other people besides you in future. OK. So if I get you correctly, you are suggesting to acquire cleanup lock. If we don't get that, we don't to any maintenance work. Otherwise, we prune and repair fragmentation in one go. A related question is: should we always prune all chains in the page ? I guess if we are going to repair fragmentation, it would make more sense to do that. > Another reasoning behind separating these two steps is: pruning > > requires exclusive lock whereas repairing fragmentation requires > > cleanup lock. > > This is nonsense. Those are the same lock. If you have the former and > not the latter, it just means that you *know* there is contention for > the page. It seems to me that performing optional maintenance work in > such a situation is completely wrong. > > Oh yes, they are the same lock. The difference is the chances of getting one is more than the other. But I agree with your argument about the contention and maintenance work. I think we can do what you are suggesting and then fine tune it. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com