On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> So. I don't know much about the btree code, so don't believe anything I
> say.
I think that showing up and reviewing this patch makes you somewhat of an expert, by default. There just isn't enough expertise in this area.

> I was very interested in the bloat test case that you posted on
> 2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily as a project to make B-Trees less bloated when there are many duplicates, with maybe as many as a dozen or more secondary benefits. That's what I thought it would say in the release notes, even though the patch was always a broader strategic thing.

Now I think that the TPC-C multiple insert point bloat issue might be the primary headline benefit, though. I hate to add more complexity to get it to work well, but just look at how much smaller the indexes are following an initial bulk load (bulk insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to run TPC-C are listed, so you're seeing the full picture for TPC-C bulk loading here (actually, there is another index with a definition identical to oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which I've omitted as redundant). As you can see, all of the largest indexes are *significantly* smaller, with the exception of oorder_o_w_id_o_d_id_o_c_id_o_id_key. You won't be able to see this improvement until I post the next version, though, since this is a brand new development.

Note that VACUUM hasn't been run at all, and doesn't need to be run, as there are no dead tuples. Note also that this has *nothing* to do with "getting tired" -- almost all of these indexes are unique indexes.

I'm also testing TPC-E and TPC-H in a very similar way. Both have improved noticeably, but to a degree that's much less compelling than what we see with TPC-C. They have "getting tired" cases that benefit quite a bit, but those are in the minority.

Have you ever used HammerDB? I got this data from oltpbench, but I think that HammerDB might be the way to go for TPC-C testing of Postgres.

> You propose to address this by appending the tid to the index key, so
> each key, even if its "payload" is a duplicate value, is unique and has
> a unique place, so we never have to do this "tiresome" search. This
> makes a lot of sense, and the results in the bloat test you posted are
> impressive and reproducible.

Thanks.

> I tried a silly alternative approach by placing a new duplicate key in a
> random location. This should be equivalent since tids are effectively
> random.

You're never going to get any other approach to work remotely as well, because while the TIDs may seem to be random in some sense, they have various properties that are very useful from a high-level, data life cycle point of view. For insertions of duplicates, heap TID has temporal locality -- you are only going to dirty one or two leaf pages, rather than potentially dirtying dozens or hundreds.
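To make the duplicates case a bit more concrete, here's a sketch of the kind of workload I have in mind (a made-up table, not the 2018-07-09 test case -- pgstatindex() from pgstattuple is just a convenient way to eyeball leaf density):

-- A made-up low cardinality index, purely for illustration:
CREATE TABLE events (
    id     bigserial PRIMARY KEY,
    status integer NOT NULL    -- only a handful of distinct values
);
CREATE INDEX events_status_idx ON events (status);

-- Many duplicate insertions of the same value.  With heap TID as a
-- tiebreaker, the new "status = 1" tuples (whose TIDs keep increasing)
-- all land at the rightmost edge of the existing duplicates, so only
-- one or two leaf pages are dirtied at a time:
INSERT INTO events (status) SELECT 1 FROM generate_series(1, 100000);

-- Compare leaf density on master and with the patch applied:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT leaf_pages, avg_leaf_density
FROM pgstatindex('events_status_idx');

On master, the same insertions eventually "get tired" and scatter across the existing duplicates, which is where the extra leaf pages come from.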
Furthermore, heap TID is generally strongly correlated with primary key values, so VACUUM can be much, much more effective at killing duplicates in low cardinality secondary indexes when there are DELETEs with a range predicate on the primary key. That is a lot more realistic than the 2018-07-09 test case, yet it could still make just as big a difference.

> I didn't quite get this to fully work yet, but at least it
> doesn't blow up, and it gets the same regression test ordering
> differences for pg_depend scans that you are trying to paper over. ;-)

FWIW, I actually just added to the papering over, rather than creating a new problem. There are plenty of instances of "\set VERBOSITY terse" in the regression tests already, for the same reason. If you run the regression tests with ignore_system_indexes=on, there are very similar failures [1].

> As far as the code is concerned, I agree with Andrey Lepikhov that one
> more abstraction layer that somehow combines the scankey and the tid or
> some combination like that would be useful, instead of passing the tid
> as a separate argument everywhere.

I've already drafted this in my working copy. It is a clear improvement. You can expect it in the next version.

> I think it might help this patch move along if it were split up a bit,
> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
> That way, it would also be easier to test out each piece separately.
> For example, how much space does suffix truncation save in what
> scenario, are there any performance regressions, etc.

I'll do my best. I don't think I can sensibly split out suffix truncation from the TID stuff -- those seem truly inseparable, since my mental model for suffix truncation breaks without fully unique keys. I can break out all the cleverness around choosing a split point into its own patch, though -- _bt_findsplitloc() has only been changed to give weight to several factors that become important. It's the "brain" of the optimization, where 90% of the complexity actually lives. Removing the _bt_findsplitloc() changes would make the performance of the other stuff pretty poor, and in particular would totally remove the benefit for cases that "get tired" on the master branch. That could be slightly interesting, I suppose.

> In the last few
> versions, the patches have still been growing significantly in size and
> functionality, and most of the supposed benefits are not readily visible
> in tests.

I admit that this patch has continued to evolve up until this week, despite the fact that I thought it would be a lot more settled by now. It has actually become simpler in recent months, though, and I think that the results justify the iterative approach I've taken. This stuff is inherently very subtle, and I've had to spend a lot of time paying attention to tiny regressions across a fairly wide variety of test cases.

> And of course we need to think about how to handle upgrades, but you
> have already started a separate discussion about that.

Right.

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-mvcm_ph...@mail.gmail.com

--
Peter Geoghegan