Hello, Matthias!

> We can just release the current snapshot, and get a new one, right? I
> mean, we don't actually use the transaction for much else than
> visibility during the first scan, and I don't think there is a need
> for an actual transaction ID until we're ready to mark the index entry
> with indisready.

> I suppose we could be resetting the snapshot every so often? Or use
> multiple successive TID range scans with a new snapshot each?

It seems it is not so easy in that case: because we still need to hold the
catalog snapshot's xmin, releasing the snapshot used for the scan does not
affect the xmin propagated to the horizon. That's why
d9d076222f5b94a85e0e318339cfc44b8f26022d [1] affects only the data
horizon, but not the catalog one.
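
For reference, after that commit the relevant part of
`ComputeXidHorizons` (procarray.c) looks roughly like this (paraphrased
by me from [1]):

   /* A backend flagged as PROC_IN_SAFE_IC is ignored for the data
    * horizon, but its xmin still holds back the catalog horizon. */
   if (!(statusFlags & PROC_IN_SAFE_IC))
           h->data_oldest_nonremovable =
                   TransactionIdOlder(h->data_oldest_nonremovable, xmin);

   /* The catalog horizon must consider all backends in this database. */
   h->catalog_oldest_nonremovable =
           TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);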

So, in such a situation, we may:

1) start the scan from scratch with some TID range, multiple times. But
such an approach feels too complex and error-prone to me.

2) split the horizons propagated by `MyProc` into a data-related xmin and
a catalog-related xmin, like `xmin` and `catalogXmin`. We could just mark
snapshots as affecting one of the horizons, or both. Such a change seems
easy to make, but it touches pretty core logic, so we probably need
someone's approval for such a proposal.

3) provide a less invasive (but more kludgy) way: add a process flag like
`PROC_IN_SAFE_IC_XMIN` and a function like `AdvanceIndexSafeXmin` which
change the way the backend affects horizon calculation. With
`PROC_IN_SAFE_IC_XMIN` set, `ComputeXidHorizons` uses the value from
`proc->safeIcXmin`, which is updated by `AdvanceIndexSafeXmin` while
switching scan snapshots.

So, with option 2 or 3, we may avoid holding back the data horizon during
the first phase scan by resetting the scan snapshot every so often (and,
in the case of the 3rd approach, calling `AdvanceIndexSafeXmin` on each
reset; a sketch follows below).
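
Just to make the 3rd option concrete, a rough sketch (everything here is
hypothetical: the `PROC_IN_SAFE_IC_XMIN` flag, the `safeIcXmin` field and
`AdvanceIndexSafeXmin` itself do not exist; the sketch is modeled on the
existing `set_indexsafe_procflags`):

   /* Called while switching scan snapshots: publish the new snapshot's
    * xmin so that ComputeXidHorizons can use it for this backend instead
    * of MyProc->xmin (which stays pinned by the catalog snapshot). */
   void
   AdvanceIndexSafeXmin(Snapshot snapshot)
   {
           LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
           MyProc->safeIcXmin = snapshot->xmin;
           MyProc->statusFlags |= PROC_IN_SAFE_IC_XMIN;
           ProcGlobal->statusFlags[MyProc->pgxactoff] = MyProc->statusFlags;
           LWLockRelease(ProcArrayLock);
   }

Then `ComputeXidHorizons` would take `proc->safeIcXmin` instead of `xmin`
for the data horizon of such a backend:

   if (statusFlags & PROC_IN_SAFE_IC_XMIN)
           h->data_oldest_nonremovable =
                   TransactionIdOlder(h->data_oldest_nonremovable,
                                      proc->safeIcXmin);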


The same will be possible for the second phase (validate).

We may apply the same "resetting the snapshot every so often" technique,
but there is still the issue of how to distinguish tuples that were missed
by the first phase scan from tuples that were inserted into the index
after the visibility snapshot was taken.

So, I see two options here:

1) the approach with an additional index using some custom AM, proposed
by you.

   It looks correct and reliable, but it feels complex to implement and
maintain. Also, it negatively affects the performance of table access
(because of the additional index) and of the validation scan (because we
need to merge the additional index content with the visibility snapshot).

2) one more tricky approach.

We may add a boolean flag to `Relation` carrying the information that an
index build is in progress (`indexisbuilding`).

It may be easily calculated as `(index->indisready &&
!index->indisvalid)`, as in the sketch below. For a more reliable
solution, we also need to somehow check that the backend/transaction
building the index is still in progress. It is also better to check that
the index is being built concurrently in the "safe_index" way.
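
For example, something like this in the relcache code (a sketch only;
`rd_indexisbuilding` is a hypothetical new `RelationData` field):

   /* While scanning the pg_index entries of the relation (e.g. in
    * RelationGetIndexList), remember whether any of its indexes is in
    * the "ready but not yet valid" state, i.e. being built
    * concurrently. */
   Form_pg_index index = (Form_pg_index) GETSTRUCT(htup);

   if (index->indisready && !index->indisvalid)
           relation->rd_indexisbuilding = true;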

I think there is a way to do so that is neither too complex nor too
expensive, probably by adding some flag to the index catalog record.

Once we have such a flag, we may "legally" prevent `heap_page_prune_opt`
from affecting the relation, by updating `GlobalVisHorizonKindForRel` like
this:

   if (rel != NULL && rel->rd_indexvalid && rel->rd_indexisbuilding)
           return VISHORIZON_CATALOG;

So, in general, it works this way:

* the backend building the index affects the catalog horizon as usual,
but the data horizon is regularly advanced during the scan. So, other
relations are processed by vacuum and `heap_page_prune_opt` without any
restrictions;

* but our relation (with CIC in progress) is accessed by
`heap_page_prune_opt` (or any other vacuum-like mechanics) using the
catalog horizon, to honor the CIC work. Therefore, the validating scan may
be sure that none of the HOT chains will be truncated. Even a regular
vacuum can't affect it (but yes, it can't anyway because of relation
locking).

As a result, we may easily distinguish tuples missed by the first phase
scan, just by testing them against the reference snapshot (the one used to
take the visibility snapshot).
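
In code, the test could look roughly like this (a sketch; `refSnapshot`
stands for the snapshot used to take the visibility snapshot):

   /* The TID is absent from the visibility snapshot. If the tuple's
    * xmin is not "in" the reference snapshot, the tuple was already
    * visible when the first phase scan ran, so the scan missed it and
    * we must insert it into the index; otherwise it was inserted later
    * and is covered by on-the-fly index inserts. */
   if (!XidInMVCCSnapshot(HeapTupleHeaderGetXmin(heapTuple->t_data),
                          refSnapshot))
           missed_by_first_phase_scan = true;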

So, to me, this approach feels non-kludgy enough, and safe and effective
at the same time.

I have a prototype of this approach, and it looks like it works (I have a
good test that catches issues with index content during CIC).

What do you think about all this?

[1]:
https://github.com/postgres/postgres/commit/d9d076222f5b94a85e0e318339cfc44b8f26022d#diff-8879f0173be303070ab7931db7c757c96796d84402640b9e386a4150ed97b179R1779-R1793
