On Wed, Oct 19, 2016 at 12:53 PM, Simon Riggs <si...@2ndquadrant.com> wrote:

> On 18 October 2016 at 23:46, Alexander Korotkov
> <a.korot...@postgrespro.ru> wrote:
>
> > Then vacuum removes (0;1) from heap, reference to (0;1) from tbl_pk_idx.
> > But how will it remove (1,1) tuple from tbl_val_indirect_idx?  Thus,
> before
> > vacuuming tbl_val_indirect_idx we should know not only values of id which
> > are being removed, but actually (id, val) pairs which are being removed.
> > Should we collect those paris while scanning heap?  But we should also
> take
> > into account that multiple heap tuples might have same (id, val) pair
> values
> > (assuming there could be other columns being updated).  Therefore, we
> should
> > take into account when last pair of particular (id, val) pair value was
> > deleted from heap.  That would be very huge change to vacuum, may be even
> > writing way more complex vacuum algorithm from scratch.  Probably, you
> see
> > the better solution of this problem.
>
> The best way to sum up the problem is to consider how we deal with
> repeated updates to a single tuple that flip the value from A to B
> then back to A then to B then A etc.. Any value in the index can point
> to multiple versions of the same tuple and multiple index values can
> point to the same tuple (PK value). This problem behaviour was already
> known to me from Claudio's earlier analysis of WARM (thanks Claudio).
>

Thank you for pointing.  I didn't follow details of WARM discussion.

Yes, VACUUMing that is likely to be a complex issue, as you say. At
> the moment I don't have a plan for that, but am not worried.
>

AFAICS, the main goal of indirect indexes is to reduce their maintenance
cost.  Indirect indexes are much easier to maintain during UPDATEs and this
is good.  But it's harder to VACUUM them.  So, we need to figure out how
much maintenance cost would be reduced for indirect indexes.  This is why I
think digging into VACUUM problems is justified for now.

Indirect indexes produce less index entries in general than current,
> so the problem is by-design much smaller than current situation.
> Indirect indexes can support killed-tuple interface, so scanning the
> index by users will result in natural index maintenance, further
> reducing the problem.
>

That makes sense.  But that is not necessary true for any workload.  For
instance, keys, which are frequently updated, are not necessary same that
keys, which are frequently selected.  Thus, there is still some risk of
bloat.

So there will be a much reduced need for bulk maintenance. Bulk
> maintainence of the index, when needed, can be performed by scanning
> the whole table via the index, after the PK index has been vacuumed.
>

That's possible, but such vacuum is going to be very IO consuming when heap
doesn't fit cache.  It's even possible that rebuilding of index would be
cheaper.


> That can be optimized using an index-only scan of the PK to avoid
> touching the heap, which should be effective since the VM has been so
> recently refreshed.


But we can't get which of indirect index keys still persist in heap by
using index only scan by PK, because PK doesn't contain those keys.  So, we
still need to scan heap for it.


> For correctness it would require the index blocks
> to be locked against write while checking for removal, so bulk
> collection of values to optimize the underlying index doesn't seem
> useful. The index scan could also be further optimized by introducing
> a visibility map for the index, which is something that would also
> optimize normal index VACUUMs as well, but that is a later project and
> not something for 10.x
>

Visibility map for indexes sounds interesting.  And that means including
visibility information into index.  It's important property of current MVCC
implementation of PostgreSQL, that while updating heap tuple, we don't have
to find location of old index tuples referring it, we only have to insert
new index tuples.  Finding location of old index tuples, even for barely
updating index visibility map, would be a substantial change.

At this stage, the discussion should be 1) can it work? 2) do we want
> it?


I think that we definitely need indirect indexes and they might work.  The
question is design.  PostgreSQL MVCC is designed so that index contain no
visibility information.  So, currently we're discussing approach which
implies expanding of this design to indirect indexes.  The downsides of
this approach are (at least): 1) we should always recheck results obtained
from index, 2) VACUUM becomes very difficult.

There is also alternative approach: include visibility information into
indirect index.  In this approach we should include fields required for
visibility (xmin, xmax, etc) into indirect index tuple and keep them up to
date.  Then while updating indexed column we would have to update old index
tuple as well.  This is the downside.  But we would be able to scan without
recheck and VACUUM will be much more easier.  We would be even able to
VACUUM indirect index independently from heap.  Implementation of this
approach would be way more intrusive.  But in my opinion it's much more
clear design.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Reply via email to