On Mon, Jul 24, 2017 at 11:12 AM, Peter Geoghegan <p...@bowt.ie> wrote: > You can get about a 1/3 loss of space by inserting randomly, rather > than inserting in sorted order, which is what REINDEX will more or > less do for you. That's because random workloads almost entirely get > 50:50 page splits, whereas sorted input will always split the > rightmost page, and so will always get 90:10 splits. The space in the > random case isn't exactly wasted; it's there for the taking, for key > values that happen to fit on the page.
I looked into this again. I decided to see for myself what was up. Actually, I used oltpbench [1], since that was the easiest TPC-C benchmark available. The order_line primary key may have been designed to be as unfriendly as possible to implementations that don't have good suffix truncation. This is also true of several of other TPC-C indexes. I initialized a scale 50 TPC-C database: pg@bat:~/code/oltpbench$ ./oltpbenchmark -b tpcc -c config/my_postgres_tpcc_config.xml --create=true --load=true 15:26:57,214 (DBWorkload.java:259) INFO - ====================================================================== Benchmark: TPCC {com.oltpbenchmark.benchmarks.tpcc.TPCCBenchmark} Configuration: config/sample_tpcc_config.xml Type: POSTGRES Driver: org.postgresql.Driver URL: jdbc:postgresql:tpcc Isolation: TRANSACTION_SERIALIZABLE Scale Factor: 50.0 Note that there is no garbage for VACUUM to clean up just yet. There hasn't been any deletes or updates yet. And yet, order_line_pkey is bloated. It's 783 MB, but if I run REINDEX it shrinks right down to 451 MB. This is on the master branch -- not with one of my patches. You can also arrange to make the index much smaller if you force the insertions to occur in a totally random order, rather than the order that the benchmark actually inserts them. The problem for us is that tuples are inserted in clustered/monotonically increasing order for the last 3 columns, while the first column (ol_w_id) is a low cardinality column whose values appear in random order, more or less. This isn't actually unrealistic - it makes sense that associated records would be inserted as per-warehouse groups like this, while the order of the groups remains unpredictable. I am working on adding suffix truncation at the moment. Right now, my patch doesn't help, because it isn't sophisticated enough about the choice of split point (we need to care about suffix truncation, and not just balancing space among the two halves of a split). It should try to find a split point that maximizes the chances of a new pivot tuple not having any low cardinality final column (ol_number). We should be able to mostly not have any of the last column in tuples that make it into the branch/internal nodes: pg@tpcc[1452]=# select count(*) from order_line ; count ──────────── 15,001,784 (1 row) pg@tpcc[1452]=# select count(distinct(ol_w_id, ol_d_id, ol_o_id)) from order_line ; count ─────────── 1,500,000 (1 row) As I said, the final ol_number column makes the keyspace unbalanced by unnecessarily appearing in the internal/branch nodes: pg@tpcc[1452]=# select count(distinct(ol_number)) from order_line ; count ─────── 15 (1 row) Since there can only be 15 distinct values for any given (ol_w_id, ol_d_id, ol_o_id, *), and since over 100 index tuples will fit on each leaf page, we just have to pick a split point that's between (x, y, z, 15) and (x, y, z + 1, 1). This makes it legal to truncate away the ol_number column, which allows us to balance the use of free space for items that are inserted after the split, by later transactions. [1] https://github.com/oltpbenchmark/oltpbench -- Peter Geoghegan