On Thu, Nov 24, 2016 at 10:09 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> I don't really understand how a system of this type copes with page
> splits.  Suppose there are entries 1000, 2000, ..., 1e6 in the btree.
> I now start two overlapping transactions, one inserting all the
> positive odd values < 1e6 and the other inserting all the positive
> even values < 1e6 that are not already present.  The insertions are
> interleaved with each other.  After they get done inserting, what does
> the 3D time traveling btree look like at this point?
>
> The problem I'm getting at here is that the heap, unlike an index, is
> unordered.  So when I've got all values 1..1e6, they've got to be in a
> certain order.  Necessarily, the values from the in-flight
> transactions are interleaved with each other and with the old data.
> If they're not, I can no longer support efficient searches by the
> in-flight transactions, plus I have an unbounded amount of cleanup
> work to do at commit time.  But if I *do* have all of those values
> interleaved in the proper order, then an abort leaves fragmented free
> space.
>
Yeah, that's right, but at the next insert on the same page we can
defragment the page and reclaim all the space.  We might want to
perform such a cleanup only in certain cases, such as when the
remaining free space in the page falls below a certain threshold.

> I can go and kill all of the inserted tuples during UNDO of an
> aborted transaction, but because page splits have happened, the index
> tuples I need to kill may not be on the same pages where I inserted
> them, so I might need to just search from the top of the tree, so it's
> expensive.
>

Another way to do this is to write an UNDO log for the split operation
(with the exact record locations on the pages that are split), so that
you don't need to perform this expensive search operation.  I can
understand that writing UNDO for a split could be costly, but so is
writing WAL for it.  I think for UNDO we should follow the same generic
rule as for the heap, which is that any change to a page should have
its corresponding UNDO.

> And even if I do it anyway, the pages are left half-full.
>
> The point is that the splits are the joint result of the two
> concurrent transactions taken together - you can't blame individual
> splits on individual transactions.
>

Sure, but you are talking about an extreme case, so it might be okay to
leave pages half-full in such cases; subsequent inserts will consume
this space.  Also, aborts are less frequent operations, so the impact
should be minimal.

> Do you see some trick here that I'm missing?
>

I think we should be able to find some way to tackle the problems you
have listed above, as there are databases (including Oracle) that write
UNDO for indexes (btrees) as well.  However, I think the crucial
question here is whether the only-heap-based-undo design can give us
consistent data for all possible cases.  If so, then it is better to do
that in the first phase, and later we can implement UNDO for indexes as
well.  As of now, nobody has reported any such problem, so maybe we can
stick with the only-heap-based-undo design.
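For illustration, the threshold-based lazy defragmentation idea could
look roughly like the sketch below.  This is a toy in-memory model, not
PostgreSQL's actual page layout or btree code; all names (Page, Item,
page_insert, the threshold constant) are made up for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of an index page: fixed-size item slots, some of which
 * may be dead (left behind by an aborted transaction). */
#define PAGE_ITEMS        8
#define DEFRAG_THRESHOLD  2   /* defragment when this few free slots remain */

typedef struct
{
    int  key;
    bool dead;                /* tuple belongs to an aborted transaction */
} Item;

typedef struct
{
    Item items[PAGE_ITEMS];
    int  nitems;              /* used slots, live or dead */
} Page;

/* Compact the page in place, reclaiming all dead slots at once. */
static void
page_defragment(Page *page)
{
    int dst = 0;

    for (int src = 0; src < page->nitems; src++)
        if (!page->items[src].dead)
            page->items[dst++] = page->items[src];
    page->nitems = dst;
}

/* Insert a key, defragmenting lazily only when free space is low,
 * so aborted inserts don't force immediate cleanup work. */
static bool
page_insert(Page *page, int key)
{
    if (PAGE_ITEMS - page->nitems <= DEFRAG_THRESHOLD)
        page_defragment(page);
    if (page->nitems >= PAGE_ITEMS)
        return false;         /* page truly full; caller would split */
    page->items[page->nitems].key = key;
    page->items[page->nitems].dead = false;
    page->nitems++;
    return true;
}
```

The point of the sketch is just that cleanup cost is deferred to the
next inserter on that page and amortized, rather than paid eagerly
during UNDO of the aborting transaction.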
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers