On 16.11.2010 20:01, Tom Lane wrote:
Heikki Linnakangas<heikki.linnakan...@enterprisedb.com>  writes:
2. When a page is split, we mark the new left page with a flag to
indicate that the downlink for the page to the right hasn't been
inserted yet. When the downlink is inserted, the flag is cleared. Again
the purpose is to ensure that the tree is self-consistent at all times.
If we crash just after a page split, before the downlink is inserted,
scans will find the tuples on the right page by following the rightlink.
It's slightly less performant, but correct.

The one thought that comes to mind is how does the flag business work
after multiple splittings?  That is, assume we have a page that has the
flag set because of a previous crash.  If we have to split either that
page or its right sibling, what do we do with the flags?

As I mentioned in the README, the insertion algorithm finishes any incomplete splits it sees before proceeding. AFAICS that should ensure that the situation never arises where you try to split a page that already has the flag set. Or its right sibling; such a page can only be reached via the rightlink so you would see the page with the flag set first.

Hmm, there is one corner-case that I didin't think of before: One backend splits a leaf page, and another backend concurrently splits the parent of that leaf page. If for some reason the backend operating on the parent page dies, releasing the locks, the other backend will see the incomplete split when it walks up to insert the downlink. Although it should be possible to handle that, I think we can simply give up on inserting the downlink in that case, and leave that split incomplete as well. It's still correct, and next insert that comes along will complete the splits, from top to bottom.

BTW, I don't try to fix incomplete splits during vacuum in the patch. That's perhaps a bit surprising, and probably would be easy to add, but I left it out for now as it's not strictly necessary.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to