05.04.2016 01:48, Peter Geoghegan :
On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
It's a bit more complicated to add it into index creation algorithm.
There's a trick with a "high key".
          * We copy the last item on the page into the new page, and then
          * rearrange the old page so that the 'last item' becomes its high
          * rather than a true data item.  There had better be at least two
          * items on the page already, else the page would be empty of useful
          * data.
          * Move 'last' into the high key position on opage

To be consistent with other steps of algorithm ( all high keys must be
truncated tuples), I had to update this high key on place:
delete the old one, and insert truncated high key.
Hmm. But the high key comparing equal to the Scankey gives insertion
the choice of where to put its IndexTuple (it can go on the page with
the high key, or its right-sibling, according only to considerations
about fillfactor, etc). Is this changed? Does it not matter? Why not?
Is it just worth it?

I would say, this is changed, but it doesn't matter.
Performing any search in btree (including choosing suitable page for insertion), we use only key attributes.
We assume that included columns are stored in index unordered.
Simple example.
create table tbl(id int, data int);
create index idx on tbl (id) including (data);

Select query does not consider included columns in scan key.
It selects all tuples satisfying the condition on key column. And only after that it applies filter to remove wrong rows from the result. If key attribute doesn't satisfy query condition, there are no more tuples to return and we can interrupt scan.

You can find more explanations in the attached sql script,
that contains queries to recieve detailed information about index structure using pageinspect.

The right-most page on every level has no high-key. But you say those
pages have an "imaginary" *positive* infinity high key, just as
internal pages have (non-imaginary) minus infinity downlinks as their
first item/downlink. So tuples in a (say) leaf page are always bound
by the downlink lower bound in parent, while their own high key is an
upper bound. Either (and, rarely, both) could be (positive or
negative) infinity.

Maybe you now see why I talked about special _bt_compare() logic for
this. I proposed special logic that is similar to the existing minus
infinity thing _bt_compare() does (although _bt_binsrch(), an
important caller of _bt_compare() also does special things for
internal .vs leaf case, so I'm not sure any new special logic must go
in _bt_compare()).

In my view, it's the correct way to fix this problem, because the caller is
responsible for passing proper arguments to the function.
Of course I will add a check into bt_compare, but I'd rather make it an
assertion (see the patch attached).
I see what you mean, but I think we need to decide what to do about
the key space when leaf high keys are truncated. I do think that
truncating the high key was the right idea, though, and it nicely
illustrates that nothing special should happen in upper levels. Suffix
truncation should only happen when leaf pages are split, generally
As I said, the high key is very similar to the downlinks, in that both
bound the things that go on each page. Together with downlinks they
represent *discrete* ranges *unambiguously*, so INCLUDING truncation
needs to make it clear which page new items go on. As I said,
_bt_binsrch() already takes special actions for internal pages, making
sure to return the item that is < the scankey, not <= the scankey
which is only allowed for leaf pages. (See README, from "Lehman and
Yao assume that the key range for a subtree S is described by Ki < v
<= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent

To give a specific example, I worry about the case where two sibling
downlinks in a parent page are distinct, but per specific-to-Postgres
"Ki <= v <= Ki+1" thing (which differs from the classic L&Y
invariant), some tuples with all right downlink's attributes matching
end up in left child page, not right child page. I worry that since
_bt_findsplitloc() doesn't consider this (for example), the split
point doesn't *reliably* and unambiguously divide the key space
between the new halves of a page being split. I think the "Ki <= v <=
Ki+1"/_bt_binsrch() thing might save you in common cases where all
downlink attributes are distinct, so maybe that simpler case is okay.
But to be even more specific, what about the more complicated case
where the downlinks *are* fully _bt_compare()-wise equal? This could
happen even though they're constrained to be unique in leaf pages, due
to bloat. Unique indexes aren't special here; they just make it far
less likely that this would happen in practice, because it takes a lot
of bloat. Less importantly, when that bloat happens, you don't want to
have to do a linear scan through many leaf pages (that should only
happen when there are many fully matching IndexTuples at the leaf
level -- not just matching on constrained attributes).

"just matching on constrained attributes" is the core idea of the whole patch. Included columns just provide us possibility to use index-only scan. Nothing more. We assume use case where index-only-scan is faster than index-scan + heap fetch. For example, in queries like "select data from tbl where id = 1;" we have no scan condition on data. Maybe you afraid of long linear scan when we have enormous index bloat even on unique index. It will happen anyway, whether we have index-only scan on covering index or index-scan on unique index + heap fetch. The only difference is that the covering index is faster.

At the very beginning of the proposal discussion, I suggested to implement third kind of columns, which are not constrained, but used in scankey. They must have op class to do it, and they are not truncated. But it was decided to abandon this feature.

The more I think about it, the more I doubt that it's okay to not
ensure downlinks are always distinct with their siblings, by sometimes
including non-constrained (truncatable) attributes within internal
pages, as needed to *distinguish* downlinks (also, we must
occasionally have *all* attributes including truncatable attributes in
internal pages -- we must truncate nothing to keep the key space sane
in the parent). Unfortunately, these requirements are very close to
the actual full requirements for a full, complete suffix truncation
patch, including storing how many attributes are stored in each and
every internal IndexTuple (no general thing for the index), page split
code to determine where to truncate to make adjacent downlinks
distinct, etc.

You may think: But that fully-matching-downlink case is okay, because
it only makes us do more linear scanning due to the lack of
non-truncatable attributes, which is still correct, if a little more
slow when there is bloat -- at the leaf level, we'll start at the
correct place (the first place the item could be on), per the "Ki <= v
<= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We
need to be able to reliably detect a concurrent page-split. Otherwise,
we'll move right within _bt_search() before even considering if
anything of interest for our index scan *might* be on the initial page
found from downlink (before even calling _bt_binsrch()). Even this bug
wouldn't happen in the common case where nextkey = true, but what
about when nextkey = false (e.g. for backwards scans)? We'd skip stuff
we are not supposed to by spuriously moving right, I think. I have a
bad feeling that even then we'd "accidentally fail to fail", because
of how backwards scans work at a higher level, but it's just too hard
to prove that that is correct. It's just too complicated to rely on so
much from a great distance.

This might not be the simplest example of where we could run into
trouble, but it's one example that I could see. The assumption that
downlinks and highkeys discretely separate ranges in the key space is
probably made many times. There could be more problematic spots, and
it's really hard to know where they might be. :-(

In general, it's common for any modification to the B-Tree code to
only break in a very subtle way, like this. I would be more
comfortable if I knew the patch received extensive stress-testing,
probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But
generally, I believe we should not allow the key space to fail to be
separated fully by downlinks and high keys, even if our original "Ki
<= v <= Ki+1" changes to the L&Y algorithm to make duplicates work
happens to mask the problems in simple testing. It's too different to
what we have today.

Frankly, I still do not understand what you're worried about.
If high key is greater than the scan key, we definitely cannot find any more tuples, because key attributes are ordered. If high key is equal to the scan key, we will continue searching and read next page. The code is not changed here, it is the same as processing of duplicates spreading over several pages. If you do not trust postgresql btree changes to the L&Y to make duplicates work, I don't know what to say, but it's definitely not related to my patch.

Of course I do not mind if someone will do more testing.
I did some tests and didn't find anything special. Besides, don't we have special alpha and beta release stages to find tricky bugs?

Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment: test_including.sql
Description: application/sql

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to