On 8/31/07, Tom Lane <[EMAIL PROTECTED]> wrote: > > "Pavan Deolasee" <[EMAIL PROTECTED]> writes: > > Not if someone else releases lock before committing. Which I remind you > is a programming technique we use quite a lot with respect to the system > catalogs. I'm not prepared to guarantee that there is no code path in > core (much less in contrib or third-party addons) that doesn't do it on > a user table; even less that no such path will ever appear in future. > As things stand it's a pretty harmless error --- the worst consequence > I know of is that a VACUUM FULL starting right then might bleat and > refuse to do shrinking. What you propose is to turn the case into a > cause of silent index corruption. > > A more robust solution would be to wait out the source transaction if > CREATE INDEX comes across an INSERT_IN_PROGRESS or DELETE_IN_PROGRESS > HOT tuple. Or you could throw an error, since it's not really likely to > happen (except maybe in a system catalog REINDEX operation). But the > way the patch approaches this at the moment is far too fragile IMHO.
OK. I looked at the code again. In CVS HEAD we assume that we should never see an DELETE_IN_PROGRESS/INSERT_IN_PROGRESS unless its deleted/inserted by our own transaction or we are indexing a system table. Except for these cases, we throw an error. With HOT, we keep it the same i.e. if we see a DELETE_IN_PROGRESS/INSERT_IN_PROGRESS tuple, we throw an error, unless its deleted/inserted by us or this is a system table.. What we change is if we find a DELETE_IN_PROGRESS tuple deleted by our own transaction and its HOT-updated, we skip that tuple. This is fine because if the CREATE INDEX commits the DELETE is also committed and the tuple is not visible (I am still assuming the indcreatxid mechanism is in place) So if we don't do HOT update on system tables, the current algorithm should work fine because we should never find a HOT updated tuple in the system table and the error-out mechanism should ensure that we don't corrupt user tables. So I guess what you are suggesting is to turn on HOT on system tables and then wait-out any DELETING/INSETING transaction if we find such a tuple during CREATE INDEX. I think this would work and since we are talking about system tables, the wait-out business won't be harmful - I remember there were objections when I suggested this as a general solution. If we approach it this way, we might also be able to jettison some of > the other complexity such as idxcreatexid. I doubt, unless we replace it with something better. I think indcreatexid serves another purpose. While building an index, we skip any RECENTLY_DEAD HOT-updated tuples. This is required to handle the existing HOT chains which are broken with respect to the new index. Essentially what it means is we don't want to index anything other than the last committed good tuple in the HOT chain. So when we skip any intermediate RECENTLY_DEAD tuples, which are potentially visible to the existing running transactions, we want those transactions NOT to use this new index. > > IMHO the extra step in C.I.C simplifies the index build. > > If you make the change suggested above, I think you don't need to do > things differently in C.I.C. I am not sure if I follow you correctly here. The issue with CIC is that it works with a snapshot. So during initial index build, we might index a tuple which is in the middle of a broken HOT chain. In the second phase, we just don't need to index the tuple at the head of the chain, but also remove the previously inserted tuple, otherwise there would be two references from the index to the same root heap tuple. The additional step which does two things: 1) Create catalog entry - stops any new broken HOT chains being created 2) Wait-out existing transactions - makes sure that when the index is built, we only index the last committed good tuple in the chain. > >> > >> Since CREATE INDEX works with ShareLock on the relation, we can > >> safely assume that we can't find any DELETE_IN_PROGRESS tuples except > >> those deleted by our own transaction. The only exception is system > >> relation, but we don't do HOT updates on system relation. > > Same issue as above: this makes correctness utterly dependent on nobody > releasing locks early. As I said above, we in fact throw an error for user tables. But if we want to support HOT updates on system tables, we may need to do the wait-out business. In fact, now that I think about it there is no other fundamental reason to not support HOT on system tables. So we can very well do what you are suggesting. > OK. So if I get you correctly, you are suggesting to acquire cleanup > lock. > > If we don't get that, we don't to any maintenance work. Otherwise, we > prune > > and repair fragmentation in one go. > > Yeah, that's what I'm thinking; then there's no need to track a separate > page state where we've pruned but not defragmented. > > I agree. Lets keep it simple and we can always improve it later. The only thing that worries me how to balance between repair fragmentation (which is costly operation since it involves several memmoves) and chain pruning. It might be alright to delay repair operation, but if we end up with long chains, fetches might suffer. Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com