05.04.2016 01:48, Peter Geoghegan :
On Mon, Mar 21, 2016 at 9:53 AM, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote:It's a bit more complicated to add it into index creation algorithm. There's a trick with a "high key". /* * We copy the last item on the page into the new page, and then * rearrange the old page so that the 'last item' becomes its high key * rather than a true data item. There had better be at least two * items on the page already, else the page would be empty of useful * data. */ /* * Move 'last' into the high key position on opage */To be consistent with other steps of algorithm ( all high keys must be truncated tuples), I had to update this high key on place: delete the old one, and insert truncated high key.Hmm. But the high key comparing equal to the Scankey gives insertion the choice of where to put its IndexTuple (it can go on the page with the high key, or its right-sibling, according only to considerations about fillfactor, etc). Is this changed? Does it not matter? Why not? Is it just worth it?
I would say, this is changed, but it doesn't matter.Performing any search in btree (including choosing suitable page for insertion), we use only key attributes.
We assume that included columns are stored in index unordered. Simple example. create table tbl(id int, data int); create index idx on tbl (id) including (data); Select query does not consider included columns in scan key.It selects all tuples satisfying the condition on key column. And only after that it applies filter to remove wrong rows from the result. If key attribute doesn't satisfy query condition, there are no more tuples to return and we can interrupt scan.
You can find more explanations in the attached sql script,that contains queries to recieve detailed information about index structure using pageinspect.
The right-most page on every level has no high-key. But you say those pages have an "imaginary" *positive* infinity high key, just as internal pages have (non-imaginary) minus infinity downlinks as their first item/downlink. So tuples in a (say) leaf page are always bound by the downlink lower bound in parent, while their own high key is an upper bound. Either (and, rarely, both) could be (positive or negative) infinity. Maybe you now see why I talked about special _bt_compare() logic for this. I proposed special logic that is similar to the existing minus infinity thing _bt_compare() does (although _bt_binsrch(), an important caller of _bt_compare() also does special things for internal .vs leaf case, so I'm not sure any new special logic must go in _bt_compare()).In my view, it's the correct way to fix this problem, because the caller is responsible for passing proper arguments to the function. Of course I will add a check into bt_compare, but I'd rather make it an assertion (see the patch attached).I see what you mean, but I think we need to decide what to do about the key space when leaf high keys are truncated. I do think that truncating the high key was the right idea, though, and it nicely illustrates that nothing special should happen in upper levels. Suffix truncation should only happen when leaf pages are split, generally speaking. As I said, the high key is very similar to the downlinks, in that both bound the things that go on each page. Together with downlinks they represent *discrete* ranges *unambiguously*, so INCLUDING truncation needs to make it clear which page new items go on. As I said, _bt_binsrch() already takes special actions for internal pages, making sure to return the item that is < the scankey, not <= the scankey which is only allowed for leaf pages. (See README, from "Lehman and Yao assume that the key range for a subtree S is described by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent page..."). To give a specific example, I worry about the case where two sibling downlinks in a parent page are distinct, but per specific-to-Postgres "Ki <= v <= Ki+1" thing (which differs from the classic L&Y invariant), some tuples with all right downlink's attributes matching end up in left child page, not right child page. I worry that since _bt_findsplitloc() doesn't consider this (for example), the split point doesn't *reliably* and unambiguously divide the key space between the new halves of a page being split. I think the "Ki <= v <= Ki+1"/_bt_binsrch() thing might save you in common cases where all downlink attributes are distinct, so maybe that simpler case is okay. But to be even more specific, what about the more complicated case where the downlinks *are* fully _bt_compare()-wise equal? This could happen even though they're constrained to be unique in leaf pages, due to bloat. Unique indexes aren't special here; they just make it far less likely that this would happen in practice, because it takes a lot of bloat. Less importantly, when that bloat happens, you don't want to have to do a linear scan through many leaf pages (that should only happen when there are many fully matching IndexTuples at the leaf level -- not just matching on constrained attributes).
"just matching on constrained attributes" is the core idea of the whole patch. Included columns just provide us possibility to use index-only scan. Nothing more. We assume use case where index-only-scan is faster than index-scan + heap fetch. For example, in queries like "select data from tbl where id = 1;" we have no scan condition on data. Maybe you afraid of long linear scan when we have enormous index bloat even on unique index. It will happen anyway, whether we have index-only scan on covering index or index-scan on unique index + heap fetch. The only difference is that the covering index is faster.
At the very beginning of the proposal discussion, I suggested to implement third kind of columns, which are not constrained, but used in scankey. They must have op class to do it, and they are not truncated. But it was decided to abandon this feature.
The more I think about it, the more I doubt that it's okay to not ensure downlinks are always distinct with their siblings, by sometimes including non-constrained (truncatable) attributes within internal pages, as needed to *distinguish* downlinks (also, we must occasionally have *all* attributes including truncatable attributes in internal pages -- we must truncate nothing to keep the key space sane in the parent). Unfortunately, these requirements are very close to the actual full requirements for a full, complete suffix truncation patch, including storing how many attributes are stored in each and every internal IndexTuple (no general thing for the index), page split code to determine where to truncate to make adjacent downlinks distinct, etc. You may think: But that fully-matching-downlink case is okay, because it only makes us do more linear scanning due to the lack of non-truncatable attributes, which is still correct, if a little more slow when there is bloat -- at the leaf level, we'll start at the correct place (the first place the item could be on), per the "Ki <= v <= Ki+1"/_bt_binsrch() thing. I don't think it's correct, though. We need to be able to reliably detect a concurrent page-split. Otherwise, we'll move right within _bt_search() before even considering if anything of interest for our index scan *might* be on the initial page found from downlink (before even calling _bt_binsrch()). Even this bug wouldn't happen in the common case where nextkey = true, but what about when nextkey = false (e.g. for backwards scans)? We'd skip stuff we are not supposed to by spuriously moving right, I think. I have a bad feeling that even then we'd "accidentally fail to fail", because of how backwards scans work at a higher level, but it's just too hard to prove that that is correct. It's just too complicated to rely on so much from a great distance. This might not be the simplest example of where we could run into trouble, but it's one example that I could see. The assumption that downlinks and highkeys discretely separate ranges in the key space is probably made many times. There could be more problematic spots, and it's really hard to know where they might be. :-( In general, it's common for any modification to the B-Tree code to only break in a very subtle way, like this. I would be more comfortable if I knew the patch received extensive stress-testing, probably involving amcheck, lots of bloat, lots of VACUUMing, etc. But generally, I believe we should not allow the key space to fail to be separated fully by downlinks and high keys, even if our original "Ki <= v <= Ki+1" changes to the L&Y algorithm to make duplicates work happens to mask the problems in simple testing. It's too different to what we have today.
Frankly, I still do not understand what you're worried about.If high key is greater than the scan key, we definitely cannot find any more tuples, because key attributes are ordered. If high key is equal to the scan key, we will continue searching and read next page. The code is not changed here, it is the same as processing of duplicates spreading over several pages. If you do not trust postgresql btree changes to the L&Y to make duplicates work, I don't know what to say, but it's definitely not related to my patch.
Of course I do not mind if someone will do more testing.I did some tests and didn't find anything special. Besides, don't we have special alpha and beta release stages to find tricky bugs?
-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
-- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers