On Sun, 17 Dec 2023, 21:14 Michail Nikolaev, <michail.nikol...@gmail.com> wrote:
>
> Hello!
>
> > I've thought about alternative solutions, too: how about getting a new
> > snapshot every so often?
> > We don't really care about the liveness of the already-scanned data; the
> > snapshots used for RIC are used only during the scan. C/RIC's relation's
> > lock level means vacuum can't run to clean up dead line items, so as long
> > as we only swap the backend's reported snapshot (thus xmin) while the
> > scan is between pages, we should be able to reduce the time C/RIC is the
> > one backend holding back cleanup of old tuples.
>
> Hm, it looks like an interesting idea! It may be more dangerous, but at
> least it feels much more elegant than an LP_DEAD-related way.
> Also, it feels like we may apply this to both phases (the first and the
> second scans).
> The original patch (1) was helping only the second one (after the call
> to set_indexsafe_procflags).
>
> But for the first scan we allowed doing so only for non-unique indexes,
> because of:
>
> > * The reason for doing that is to avoid
> > * bogus unique-index failures due to concurrent UPDATEs (we might see
> > * different versions of the same row as being valid when we pass over them,
> > * if we used HeapTupleSatisfiesVacuum). This leaves us with an index that
> > * does not contain any tuples added to the table while we built the index.
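As an inline clarification of the quoted idea: the point of swapping the backend's reported snapshot (and thus its xmin) between pages is that a long C/RIC scan stops being the one backend pinning the global cleanup horizon. A toy model of that effect (Python; all names invented for illustration, not PostgreSQL code) might look like this:

```python
# Toy model: each backend reports an xmin; the oldest reported xmin is
# the global horizon below which dead tuples can be cleaned up (a crude
# stand-in for PostgreSQL's notion of the oldest non-removable xid).

class Backend:
    def __init__(self, xmin=None):
        self.xmin = xmin  # oldest transaction this backend may still see

def global_xmin(backends, next_xid):
    """Oldest xmin any backend reports; if nobody holds a snapshot,
    the horizon is simply the next xid."""
    xmins = [b.xmin for b in backends if b.xmin is not None]
    return min(xmins, default=next_xid)

others = [Backend() for _ in range(3)]  # idle backends, no snapshot

# Case 1: C/RIC scans 1000 heap pages with one long-held snapshot.
next_xid = 100
scanner = Backend(xmin=100)  # snapshot taken at xid 100
for page in range(1000):
    next_xid += 1            # concurrent transactions keep committing
horizon_stuck = global_xmin([scanner] + others, next_xid)

# Case 2: same scan, but a fresh snapshot is swapped in between pages
# (only between pages: within one page, one snapshot is used).
next_xid = 100
scanner.xmin = 100
for page in range(1000):
    next_xid += 1
    scanner.xmin = next_xid  # "reset the snapshot" between pages
horizon_moving = global_xmin([scanner] + others, next_xid)

print(horizon_stuck, horizon_moving)  # 100 1100
```

In the first case the horizon stays pinned at the scan's starting snapshot for the whole scan; in the second it tracks the concurrent workload, so cleanup of old tuples is held back for at most one page's worth of work at a time.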
Yes, for that we'd need an extra scan of the index that validates uniqueness. I think there was a proposal (though it may only have been an idea) some time ago about turning existing non-unique indexes into unique ones by validating the data. Such a system would likely be very useful for enabling this optimization.

> Also, (1) was limited to indexes without expressions and predicates
> (2) because such may execute queries to other tables (sic!).

Note that such use of expressions would be a violation of the function's definition: it would depend on data from other tables, which makes the function behave like a STABLE function, as opposed to the IMMUTABLE that is required for index expressions. So I don't think we should specially care about being correct for incorrectly marked function definitions.

> One possible solution is to add some checks to make sure no
> user-defined functions are used.
> But as far as I understand, it affects only CIC for now and does not
> affect the ability to use the proposed technique (updating the snapshot
> from time to time).
>
> However, I think we need some more or less formal proof that it is safe -
> it is really challenging to keep all the possible cases in one's head.
> I'll try to do something here.

I just realised there is one issue with this design: we can't cheaply reset the snapshot during the second table scan.

It is critically important that the second scan of R/CIC uses an index contents summary (made with index_bulk_delete) that was created while the current snapshot was already registered. If we didn't do that, the following would occur:

1. The index is marked ready for inserts from concurrent backends, but not
   yet ready for queries.
2. We get the bulkdelete data.
3. A concurrent backend inserts a new tuple T on heap page P, inserts it
   into the index, and commits. This tuple is not in the summary, but has
   been inserted into the index.
4. R/CIC resets the snapshot, making T visible.
5. R/CIC scans page P, finds that tuple T has to be indexed but is not
   present in the summary, and thus inserts that tuple into the index
   (which already had it inserted at 3).

This would be a logic bug, as indexes assume at-most-once semantics for index tuple insertion; duplicate insertions are an error.

So, the "reset the snapshot every so often" trick cannot be applied in phase 3 (the rescan), or we'd have to do an index_bulk_delete call every time we reset the snapshot. Such a rescan might be cheap enough to be worth the cost in some cases (e.g. when using BRIN), but in general it is very unlikely to be.

Alternatively, we'd need to find another way to prevent ourselves from inserting these duplicate entries - maybe by storing the scan's data in a buffer, to be loaded into the index after another index_bulk_delete()? Counterpoint: for BRIN indexes that would likely require a buffer much larger than the resulting index.

Either way, for the first scan (i.e. phase 2, "build new indexes") this is not an issue: we don't care about which transactions add or delete tuples at that point. For all we know, all tuples of the table may be deleted concurrently before we even allow concurrent backends to start inserting tuples, and the algorithm would still work as it does right now.

> Another possible issue may be caused by the new locking pattern - we
> will be required to wait for all transactions started before the end
> of the phase to exit.

What do you mean by "new locking pattern"? We already keep a ShareUpdateExclusiveLock on every heap table we're accessing during R/CIC, and that should already prevent any concurrent VACUUM operations, right?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
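PS: the race in steps 1-5 above can be illustrated with a toy simulation (Python; invented names, not PostgreSQL code) - taking the index-contents summary before the validation snapshot is registered lets a concurrently inserted tuple be indexed twice, while taking it after does not:

```python
# Toy model of the phase-3 race. Tuples are integers; a "snapshot" is
# modeled as the highest tuple id visible; the "summary" stands in for
# the index_bulk_delete result.

def validation_scan(heap, index, summary, snapshot):
    """Insert every heap tuple visible to `snapshot` that the summary
    says is missing from the index (a crude model of the R/CIC rescan).
    Index insertion has at-most-once semantics, as in a real index AM."""
    for tid in sorted(heap):
        if tid <= snapshot and tid not in summary:
            if tid in index:
                raise RuntimeError(f"duplicate index insertion for {tid}")
            index.add(tid)

heap, index = set(), set()

# Steps 1-2: index is ready for inserts; take the summary *now*,
# before the validation snapshot exists:
summary_before = frozenset(index)

# Step 3: a concurrent backend inserts tuple T=1 into heap and index:
heap.add(1)
index.add(1)

# Step 4: R/CIC resets its snapshot, so T becomes visible.
snapshot = max(heap, default=0)

# Step 5: the rescan now tries to index T a second time:
try:
    validation_scan(heap, set(index), summary_before, snapshot)
    outcome = "ok"
except RuntimeError:
    outcome = "duplicate"

# By contrast, a summary taken *after* the snapshot is registered
# already contains T, so the rescan skips it and no duplicate occurs:
summary_after = frozenset(index)
validation_scan(heap, index, summary_after, snapshot)

print(outcome)  # duplicate
```

This is only a sketch of the ordering argument, of course; the real invariant is about the summary being created while the scan's snapshot is already registered, so that every tuple the snapshot can see is either in the summary or genuinely missing from the index.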