On Fri, Sep 28, 2018 at 7:50 AM Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> So. I don't know much about the btree code, so don't believe anything I
> say.
I think that showing up and reviewing this patch makes you somewhat of an expert, by default. There just isn't enough expertise in this area.

> I was very interested in the bloat test case that you posted on
> 2018-07-09 and I tried to understand it more.

Up until recently, I thought that I would justify the patch primarily as a project to make B-Trees less bloated when there are many duplicates, with maybe as many as a dozen or more secondary benefits. That's what I thought it would say in the release notes, even though the patch was always a broader strategic thing.

Now I think that the TPC-C multiple insert point bloat issue might be the primary headline benefit, though. I hate to add more complexity to get it to work well, but just look at how much smaller the indexes are following an initial bulk load (bulk insertions) using my working copy of the patch:

Master

customer_pkey: 75 MB
district_pkey: 40 kB
idx_customer_name: 107 MB
item_pkey: 2216 kB
new_order_pkey: 22 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 60 MB
oorder_pkey: 78 MB
order_line_pkey: 774 MB
stock_pkey: 181 MB
warehouse_pkey: 24 kB

Patch

customer_pkey: 50 MB
district_pkey: 40 kB
idx_customer_name: 105 MB
item_pkey: 2216 kB
new_order_pkey: 12 MB
oorder_o_w_id_o_d_id_o_c_id_o_id_key: 61 MB
oorder_pkey: 42 MB
order_line_pkey: 429 MB
stock_pkey: 111 MB
warehouse_pkey: 24 kB

All of the indexes used by oltpbench to run TPC-C are listed, so you're seeing the full picture for TPC-C bulk loading here (actually, there is another index with a definition identical to oorder_o_w_id_o_d_id_o_c_id_o_id_key for some reason, which I've omitted as redundant). As you can see, all of the largest indexes are *significantly* smaller, with the exception of oorder_o_w_id_o_d_id_o_c_id_o_id_key. You won't be able to see this improvement until I post the next version, though, since this is a brand new development.

Note that VACUUM hasn't been run at all, and doesn't need to be run, as there are no dead tuples. Note also that this has *nothing* to do with "getting tired" -- almost all of these indexes are unique indexes.

I'm also testing TPC-E and TPC-H in a very similar way. Both have improved noticeably, but to a degree that's much less compelling than what we see with TPC-C. They have "getting tired" cases that benefit quite a bit, but those are in the minority.

Have you ever used HammerDB? I got this data from oltpbench, but I think that HammerDB might be the way to go for TPC-C testing of Postgres.

> You propose to address this by appending the tid to the index key, so
> each key, even if its "payload" is a duplicate value, is unique and has
> a unique place, so we never have to do this "tiresome" search. This
> makes a lot of sense, and the results in the bloat test you posted are
> impressive and reproducible.

Thanks.

> I tried a silly alternative approach by placing a new duplicate key in a
> random location. This should be equivalent since tids are effectively
> random.

You're never going to get any other approach to work remotely as well, because while the TIDs may seem to be random in some sense, they have various properties that are very useful from a high-level, data life cycle point of view. For insertions of duplicates, heap TID has temporal locality -- you are only going to dirty one or two leaf pages, rather than potentially dirtying dozens or hundreds.
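To make the duplicates case a bit more concrete, here's a sketch of the kind of workload I have in mind (a made-up table, not the 2018-07-09 test case -- pgstatindex() from pgstattuple is just a convenient way to eyeball leaf density):

-- A made-up low cardinality index, purely for illustration:
CREATE TABLE events (
    id     bigserial PRIMARY KEY,
    status integer NOT NULL    -- only a handful of distinct values
);
CREATE INDEX events_status_idx ON events (status);

-- Many duplicate insertions of the same value.  With heap TID as a
-- tiebreaker, the new "status = 1" tuples (whose TIDs keep increasing)
-- all land at the rightmost edge of the existing duplicates, so only
-- one or two leaf pages are dirtied at a time:
INSERT INTO events (status) SELECT 1 FROM generate_series(1, 100000);

-- Compare leaf density on master and with the patch applied:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT leaf_pages, avg_leaf_density
FROM pgstatindex('events_status_idx');

On master, the same insertions eventually "get tired" and scatter across the existing duplicates, which is where the extra leaf pages come from.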
Furthermore, heap TID is generally strongly correlated with primary key values, so VACUUM can be much, much more effective at killing duplicates in low cardinality secondary indexes when there are DELETEs with a range predicate on the primary key. That is a lot more realistic than the 2018-07-09 test case, yet it could still make just as big a difference.

> I didn't quite get this to fully work yet, but at least it
> doesn't blow up, and it gets the same regression test ordering
> differences for pg_depend scans that you are trying to paper over. ;-)

FWIW, I actually just added to the papering over, rather than creating a new problem. There are plenty of instances of "\set VERBOSITY terse" in the regression tests already, for the same reason. If you run the regression tests with ignore_system_indexes=on, there are very similar failures [1].

> As far as the code is concerned, I agree with Andrey Lepikhov that one
> more abstraction layer that somehow combines the scankey and the tid or
> some combination like that would be useful, instead of passing the tid
> as a separate argument everywhere.

I've already drafted this in my working copy. It is a clear improvement. You can expect it in the next version.

> I think it might help this patch move along if it were split up a bit,
> for example 1) suffix truncation, 2) tid stuff, 3) new split strategies.
> That way, it would also be easier to test out each piece separately.
> For example, how much space does suffix truncation save in what
> scenario, are there any performance regressions, etc.

I'll do my best. I don't think I can sensibly split out suffix truncation from the TID stuff -- those seem truly inseparable, since my mental model for suffix truncation breaks without fully unique keys. I can break out all the cleverness around choosing a split point into its own patch, though -- _bt_findsplitloc() has only been changed to give weight to several factors that become important. It's the "brain" of the optimization, where 90% of the complexity actually lives. Removing the _bt_findsplitloc() changes would make the performance of the other stuff pretty poor, and in particular would totally remove the benefit for cases that "get tired" on the master branch. That could be slightly interesting, I suppose.

> In the last few
> versions, the patches have still been growing significantly in size and
> functionality, and most of the supposed benefits are not readily visible
> in tests.

I admit that this patch has continued to evolve up until this week, despite the fact that I thought it would be a lot more settled by now. It has actually become simpler in recent months, though, and I think that the results justify the iterative approach I've taken. This stuff is inherently very subtle, and I've had to spend a lot of time paying attention to tiny regressions across a fairly wide variety of test cases.

> And of course we need to think about how to handle upgrades, but you
> have already started a separate discussion about that.

Right.

[1] https://postgr.es/m/CAH2-Wz=wAKwhv0PqEBFuK2_s8E60kZRMzDdyLi=-mvcm_ph...@mail.gmail.com

--
Peter Geoghegan