On Wed, Nov 16, 2016 at 5:06 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> + * Ideally, we'd compare every item in the index against every other
> + * item in the index, and not trust opclass obedience of the transitive
> + * law to bridge the gap between children and their grandparents (as
> + * well as great-grandparents, and so on).  We don't go to those
> + * lengths because that would be prohibitively expensive, and probably
> + * not markedly more effective in practice.
> + */
> I skimmed the Modern B-Tree Techniques paper recently, and there was a
> section on "the cousin problem" when verifying btrees, which this
> comment reminded me of.

I'm impressed that you made the connection. This comment reminded you
of that section because it was directly influenced by it. I actually
use the term "cousin page" elsewhere, because it's really useful to
distinguish cousins from "true siblings" when talking about page
deletion in particular...and I have a lot to say about page deletion.

> I tried to come up with an example of a pair
> of characters being switched in a collation that would introduce
> undetectable corruption:
> T1.  Order   = a < b < c < d
>      Btree   =        [a|c]
>                       /   \
>                      /     \
>                     /       \
>                   [a]-------[c]
>                    |         |
>                    |         |
>                   [b]-------[d]
> T2.  Order   = a < c < b < d

So, the Btree you show is supposed to be good? You've left it up to me
to draw T2, the corrupt version? If so, here is a drawing of T2 for
the sake of clarity:

      Btree   =        [a|b]
                       /   \
                      /     \
                     /       \
                   [a]-------[b]
                    |         |
                    |         |
                   [c]-------[d]

> Now c and b have been switched around.  Won't amcheck still pass since
> we only verify immediate downlinks and siblings?  Yet searches for b
> will take a wrong turn at the root.  Do I have this right?  I wonder
> if there is a way to verify that each page's high key < the 'next' key
> for all ancestors.

I don't think you have it right, because your argument is predicated
on classic L&Y, not the Postgres variant. Immediately after a page
split, the high key of the left half is guaranteed to be *equal* to
the first value in the new right half. The new downlink for the right
half is taken from that same value in turn, so it will be exactly
equal as well. There is a comment which says all this within
_bt_insert_parent(), actually.
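
To make the post-split relationship concrete, here is a rough toy
model in C. It is only a sketch: the ToySplit struct and toy_split()
are made-up names for illustration, a "page" is just a sorted array of
single-character keys, and none of this is the real nbtree code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of a split; not a real nbtree structure. */
typedef struct ToySplit
{
    size_t nleft;           /* number of items kept on the left half */
    char   left_high_key;   /* upper bound recorded on the left half */
    char   right_first;     /* first item on the new right half */
    char   right_downlink;  /* key inserted into the parent for the right half */
} ToySplit;

/*
 * Split a sorted toy page in the middle.  The left half's high key is a
 * copy of the right half's first item, and the new downlink is a copy of
 * that high key, so all three values are equal right after the split.
 */
static ToySplit
toy_split(const char *items, size_t nitems)
{
    ToySplit s;

    assert(nitems >= 2);
    s.nleft = nitems / 2;
    s.right_first = items[s.nleft];
    s.left_high_key = s.right_first;
    s.right_downlink = s.left_high_key;
    return s;
}
```

Splitting the toy page "abcd" keeps two items on the left and gives
'c' as the left high key, the first right-hand item, and the new
downlink alike.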

In general, I think it's pretty bad that we're so loose about
representing the key space in downlinks compared to classic L&Y,
because it probably makes index bloat a lot worse in internal pages,
messing up the key space long term. As long as we're doing that,
though, we clearly cannot usefully "verify that each page's high key <
the 'next' key for all ancestors", because we deviate from L&Y in a
way that makes that check always fail. It's okay that it always fails
for us (it doesn't actually indicate corruption), because of our
duplicate-aware index scan logic (which changes the invariant).

In summary, I'm pretty sure that our "Ki <= v <= Ki+1" thing makes the
cousin problem a non-issue, because we don't follow downlinks that are
equal to our scankey -- we go to their immediate left and start
*there*, in order to support duplicates in leaf pages. Index scans for
'b' work for both T1 and T2 in Postgres, because we refuse to follow
the 'b' downlinks when doing an index scan for 'b' (relevant to T2). I
would actually like to change things to make the invariant the classic
L&Y "Ki < v <= Ki+1", to avoid bloat in the internal pages and to make
suffix truncation in internal pages work.
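
The descent rule described above can be sketched with a toy internal
page. This is an illustration under loud assumptions: both function
names are hypothetical, keys are single characters compared with plain
'<', slot 0 is treated as "minus infinity", and the second function
exists only as a contrast. Neither is the actual nbtree binary search.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy internal page: keys[0..nkeys-1] are downlink keys in ascending
 * order.  Slot 0 always qualifies, as if its key were minus infinity.
 */

/* Pick the last downlink whose key is strictly less than v. */
static size_t
toy_descend_left_of_equal(const char *keys, size_t nkeys, char v)
{
    size_t slot = 0;

    assert(nkeys > 0);
    for (size_t i = 1; i < nkeys; i++)
    {
        if (keys[i] < v)
            slot = i;
        else
            break;          /* keys are sorted; nothing further qualifies */
    }
    return slot;
}

/* Hypothetical contrast: also follow a downlink whose key equals v. */
static size_t
toy_descend_follow_equal(const char *keys, size_t nkeys, char v)
{
    size_t slot = 0;

    assert(nkeys > 0);
    for (size_t i = 1; i < nkeys; i++)
    {
        if (keys[i] <= v)
            slot = i;
        else
            break;
    }
    return slot;
}
```

With the T2 root [a|b], a search for 'b' picks slot 0 (the downlink to
the left of 'b') under the first rule, and slot 1 under the second.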

So, we don't have the cousin problem, but since I'd like us to adopt
the stricter classic L&Y invariant at some point, you could say that I
also wish we had the cousin problem.

Confused yet? :-)

[1] https://archive.org/stream/symmetricconcurr00lani#page/6/mode/2up
Peter Geoghegan

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)