On Thu, Aug 18, 2016 at 6:26 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>>> In fact, that's why non-leaf index tuples need a different format:
>>> while leaf index tuples already contain the heap pointer, non-leaf
>>> ones contain only the downlink, not the pointer into the heap. To be
>>> able to do comparisons and pick the right downlink, the original
>>> heap pointer in the leaf index tuple is copied, when splitting
>>> pages, into an additional IndexTupleData header that is prepended
>>> only to non-leaf index tuples.
>>
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> I see that. I could try to measure average tree depth to gauge the
> impact this has on fan-in.
>
> While it should cut fan-in in half for narrow indexes, half of very
> high is still high. Wide indexes, which are the ones that would
> suffer from poor fan-in, would feel this far less, since the overhead
> is constant.
>
> Even if it does have an impact, I don't see an alternative without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header when it isn't necessary, though I would
> have to use up the last available flag bit in t_info for that.
So, I did some tests. TL;DR version is, as expected, there is a big
impact on narrow indexes, but I don't believe it's something to be
worried about.

**Begin LONG version** (skip to end of long version if not interested)

Regardless of the fanout (or fan-in, if you look at it that way), what
matters is tree depth, so I used pageinspect to check how the depth of
indexes changed after patching. This is all on pristine, recently
built indexes. I will try to measure the same on bloated indexes later
on, but the tests are painfully slow.

I used the following query to inspect all user indexes, since
pageinspect failed to work for toast indexes for some reason (it
probably needed qualified relnames or something like that). Anyway,
user indexes should be enough.

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(relname)
where relkind = 'i' and relnamespace = 2200 and relam = 403
group by level
order by level desc;

I tested patched and unpatched versions against my test database
(roughly 23GB), pgbench scale 1000 (15GB), and pgbench scale 10000
(146GB). I believe pgbench is a worst case, because it only has very
narrow PK indexes, which are the ones that will suffer the most from
this patch.

The numbers:

mat, patched
level;count;biggest
3;18;"1369 MB"
2;60;"322 MB"
1;26;"808 kB"
0;50;"16 kB"

mat, unpatched
level;count;biggest
3;15;"1367 MB"
2;63;"334 MB"
1;26;"800 kB"
0;49;"16 kB"

pgbench, scale 1000, patched
level;count;biggest
3;1;"2145 MB"
1;2;"456 kB"

pgbench, scale 1000, unpatched
level;count;biggest
3;1;"2142 MB"
1;2;"456 kB"

pgbench, scale 10000, patched
level;count;biggest
3;1;"21 GB"
2;1;"2224 kB"
1;1;"240 kB"

pgbench, scale 10000, unpatched
level;count;biggest
3;1;"21 GB"
1;2;"2208 kB"

So, clearly some indexes are pushed to the next level, but it isn't a
widespread effect - it looks more like the threshold index size at
which a root split is needed is pushed slightly downwards. That
totally agrees with the math.
The index tuple header is 8 bytes, and the datums need to be
maxaligned, so they will also be at least 8 bytes. That means that for
the narrowest possible tuples the extra header is a 50% size increase
(16 bytes become 24), so the B in B-tree gets multiplied by 2/3 at the
worst, but only for inner tuples. So you get that if

  a * log_B(N/B) = log_{(2/3)B}(N/B)

then for big enough N

  a < 3/2

Which means the depth of huge B-trees is increased by 50% at the
worst, but the effect only begins to be seen at level 2, which is what
we have here, and level 2 for such a narrow index seems to start at
about 2GB. So we're talking about very big indexes already.

If level 2 can hold 2GB of index, level 3 can hold around 2GB x 8k /
16 = 1TB of index, and I have yet to see (even in very big databases)
a 1TB index on a PK. If anyone has such an index, yes, this patch will
hurt performance by 25% (5 page reads instead of 4). Bad, but since it
only affects very extreme cases, it doesn't seem so bad.

**End LONG version**

So, that said, I started mocking up a version of the patch that uses a
"has aux data" flag to signal another set of flags from which
variable-size attributes can be read, which would enable implementing
suffix truncation and prefix compression in the future with minimal
data format changes, and I was surprised to find that it seems not
only more compact, but cleaner. So I will probably go that way.

-- 
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers