On Thu, Nov 24, 2016 at 10:09 PM, Robert Haas <robertmh...@gmail.com> wrote:
> I don't really understand how a system of this type copes with page
> splits.  Suppose there are entries 1000, 2000, ..., 1e6 in the btree.
> I now start two overlapping transactions, one inserting all the
> positive odd values < 1e6 and the other inserting all the positive
> even values < 1e6 that are not already present.  The insertions are
> interleaved with each other.  After they get done inserting, what does
> the 3D time traveling btree look like at this point?
> The problem I'm getting at here is that the heap, unlike an index, is
> unordered.  So when I've got all values 1..1e6, they've got to be in a
> certain order.  Necessarily, the values from the in-flight
> transactions are interleaved with each other and with the old data.
> If they're not, I can no longer support efficient searches by the
> in-flight transactions, plus I have an unbounded amount of cleanup
> work to do at commit time.  But if I *do* have all of those values
> interleaved in the proper order, then an abort leaves fragmented free
> space.

Yeah, that's right, but at next insert, on the same page, we can
defragment the page and claim all the space.  We might want to perform
such a cleanup only in certain cases like when the remaining space in
the page is below a certain threshold.

>  I can go and kill all of the inserted tuples during UNDO of an
> aborted transactions, but because page splits have happened, the index
> tuples I need to kill may not be on the same pages where I inserted
> them, so I might need to just search from the top of the tree, so it's
> expensive.

Another way to do is to write UNDO log for split operation (with exact
record locations on pages that are split) so that you don't need to
perform this expensive search operation.  I can understand writing
UNDO for split could be costly but so is writing WAL for it.  I think
for UNDO, we should follow a generic rule as for heap which is that
any change in the page should have it's corresponding UNDO.

>  And even if I do it anyway, the pages are left half-full.
> The point is that the splits are the joint result of the two
> concurrent transactions taken together - you can't blame individual
> splits on individual transactions.

Sure, but you are talking about an extreme case, so it might be okay
to leave pages half-full in such cases, next inserts will consume this
space.  Also aborts are less frequent operations, so the impact should
be minimal.

> Do you see some trick here that I'm missing?

I think we should be able to find some way to tackle the problems you
have listed above as there are databases (including Oracle) which
write UNDO for indexes (btree's) as well.  However, I think here the
crucial question is whether the only-heap-based-undo design can give
us consistent data for all possible cases if so, then it is better to
do that in the first phase and then later we can implement UNDO for
indexes as well.   As of now, nobody has reported any such problem, so
maybe we can stick with the only-heap-based-undo design.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to