Hi,

Attached is a rebased version of this BRIN patch series, mostly fixing the breakage due to 372728b0 (aka the initial-catalog-data format changes). As the 2018-07 CF is meant for almost-ready patches, this is more 2018-09 material. But perhaps someone would like to take a look - and I'd have to fix it anyway ...
At the pgcon dev meeting I suggested that long-running patches should have a "summary" post once in a while, so that reviewers don't have to reread the whole thread and follow all the various discussions. So let me start with this thread, although it's not a particularly long or complex one, nor does it have a long discussion. But anyway ...

The patches introduce two new BRIN opclasses - minmax-multi and bloom.

minmax-multi
============

minmax-multi is a variant of the current minmax opclass that handles cases where the plain minmax opclass degrades due to outlier values. Imagine almost perfectly correlated data (say, timestamps in a log table) - that works great with regular minmax indexes. But if you go and delete a bunch of historical messages (for whatever reason), new rows with new timestamps will be routed to the empty space, and the minmax indexes will degrade because the ranges will get much "wider" due to the new values.

The minmax-multi indexes deal with that by maintaining not a single minmax range but several of them. That allows tracking the outlier values separately, without constructing one wide minmax range. Consider this artificial example:

  create table t (a bigint, b int);
  alter table t set (fillfactor = 95);

  insert into t
  select i + 1000 * random(), i + 1000 * random()
    from generate_series(1, 100000000) s(i);

  update t set a = 1, b = 1 where random() < 0.001;
  update t set a = 100000000, b = 100000000 where random() < 0.001;

Now if you create a regular minmax index, it's going to perform terribly, because pretty much every minmax range is [1,100000000] thanks to the update of 0.1% of rows.
  create index on t using brin (a);

  explain analyze select * from t where a between 1923300::int and 1923600::int;

                               QUERY PLAN
  -----------------------------------------------------------------
   Bitmap Heap Scan on t  (cost=75.11..75884.45 rows=319 width=12) (actual time=948.906..101739.892 rows=308 loops=1)
     Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
     Rows Removed by Index Recheck: 99999692
     Heap Blocks: lossy=568182
     ->  Bitmap Index Scan on t_a_idx  (cost=0.00..75.03 rows=22587 width=0) (actual time=89.357..89.357 rows=5681920 loops=1)
           Index Cond: ((a >= 1923300) AND (a <= 1923600))
   Planning Time: 2.161 ms
   Execution Time: 101740.776 ms
  (8 rows)

But with the minmax-multi opclass, this is not an issue:

  create index on t using brin (a int8_minmax_multi_ops);

                               QUERY PLAN
  -------------------------------------------------------------------
   Bitmap Heap Scan on t  (cost=1067.11..76876.45 rows=319 width=12) (actual time=38.906..49.763 rows=308 loops=1)
     Recheck Cond: ((a >= 1923300) AND (a <= 1923600))
     Rows Removed by Index Recheck: 22220
     Heap Blocks: lossy=128
     ->  Bitmap Index Scan on t_a_idx  (cost=0.00..1067.03 rows=22587 width=0) (actual time=28.069..28.069 rows=1280 loops=1)
           Index Cond: ((a >= 1923300) AND (a <= 1923600))
   Planning Time: 1.715 ms
   Execution Time: 50.866 ms
  (8 rows)

Which is clearly a big improvement.

Doing this required some changes to how BRIN evaluates conditions on page ranges. With a single minmax range it was enough to evaluate them one by one, but minmax-multi needs to see all of them at once (to match them against the partial ranges).

Most of the complexity is in building the summary, particularly in picking which values (partial ranges) to merge. The maximum number of values in the summary is specified by the values_per_range index reloption; by default it's set to 64, so there can be 64 points, 32 intervals, or some combination of those.
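To illustrate the merging step, here's a hypothetical Python sketch (mine, not the patch's actual C code) of one plausible reduction strategy: once a summary exceeds values_per_range entries, repeatedly merge the two neighboring ranges separated by the smallest gap, which is what keeps outliers isolated in their own narrow ranges:

```python
def reduce_ranges(ranges, max_ranges):
    """Sketch: merge a sorted list of (lo, hi) intervals until at most
    max_ranges remain, always closing the smallest gap first.
    A single point is represented as (v, v)."""
    ranges = sorted(ranges)
    while len(ranges) > max_ranges:
        # find the pair of adjacent ranges with the smallest gap between them
        gaps = [(ranges[i + 1][0] - ranges[i][1], i)
                for i in range(len(ranges) - 1)]
        _, i = min(gaps)
        # merge ranges[i] and ranges[i+1] into one wider range
        lo = ranges[i][0]
        hi = ranges[i + 1][1]
        ranges[i:i + 2] = [(lo, hi)]
    return ranges

# The two "bulk" ranges get merged; the outliers 1 and 100000000
# stay as isolated points instead of widening everything.
print(reduce_ranges([(1, 1), (500, 600), (610, 700),
                     (100000000, 100000000)], 3))
```

The greedy smallest-gap choice is only one possible heuristic, but it shows why the resulting summary stays tight around the bulk of the data even with extreme outliers present.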
I've been thinking about some automated way to tune this (either globally or for each page range independently), but so far I have not been very successful. The challenge is that making good decisions requires global information about the values in the column (e.g. the global minimum and maximum). I think the reloption with 64 as a default is a good enough solution for now. Perhaps the stats from pg_statistic would be useful for improving this in the future, but I'm not sure.

bloom
=====

As the name suggests, this opclass uses a bloom filter for the summary. Compared to minmax-multi it's a somewhat more experimental idea, but I believe the foundations are sound. Using a bloom filter means that the index can only support equalities, but for many use cases that's an acceptable limitation - UUIDs, IP addresses, ... (various identifiers in general).

Of course, the question is how to size the bloom filter. It's worth noting that the false positive rate of the filter is essentially the fraction of the table that will be scanned every time. Similarly to minmax-multi, the parameters for computing the optimal filter size are set as reloptions (false_positive_rate, n_distinct_per_range), with some reasonable defaults (1% false positive rate, and distinct values at 10% of the maximum number of heap tuples in a page range).

Note: When building the filter, we don't compute the hashes from the original values - we first apply the type-specific hash function (the same one we'd use for hash indexes or hash joins) and then use that hash as the input for the bloom filter. This generally works fine, but if "our" hash function generates a lot of collisions, it increases the false positive rate of the whole filter. I'm not aware of a case where this would be an issue, though.

What further complicates sizing of the bloom filter is the available space - the whole bloom filter needs to fit onto an 8kB page, and "full" bloom filters with about 1/2 the bits set are pretty much incompressible. So there are maybe ~8000 bytes for the bitmap.
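For reference, the standard bloom filter sizing formulas that these two reloptions feed into look roughly like this (a Python sketch of the textbook math, not the patch's actual code; the 3700 ndistinct figure below is just an illustrative value of the same order as 10% of the heap tuples in a default 128-page range):

```python
import math

def bloom_size(ndistinct, fpr):
    """Textbook bloom filter sizing: for n expected distinct values and
    target false positive rate p, the optimal bitmap has
    m = -n * ln(p) / (ln 2)^2 bits and k = (m/n) * ln 2 hash functions."""
    m = math.ceil(-ndistinct * math.log(fpr) / (math.log(2) ** 2))  # bits
    k = max(1, round((m / ndistinct) * math.log(2)))                # hashes
    return m, k

# e.g. ~3700 distinct values per page range at a 1% false positive rate
m, k = bloom_size(3700, 0.01)
print(m // 8, k)   # bitmap size in bytes, number of hash functions
```

With these example numbers the bitmap comes out at roughly 4.4kB, i.e. comfortably within the ~8000 bytes available - but doubling ndistinct would double the bitmap, which is exactly the constraint discussed above.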
So for columns with many distinct values, it may be necessary to make the page range smaller, to reduce the number of distinct values in it. And of course it requires good ndistinct estimates, not just for the column as a whole but for a single page range (because that's what matters for sizing the bloom filter) - which is not a particularly reliable estimate, I'm afraid. So reloptions seem like a sufficient solution, at least for now.

open questions
==============

* I suspect the definitions of the cross-type opclasses (int2 vs. int8) are not entirely correct. That probably needs another look.

* The bloom filter now works in two modes - sorted (where it stores the hashes directly, in sorted order) and hashed (the usual bloom filter behavior). The idea is that for ranges with significantly fewer distinct values, we only store the hashes themselves to save space, instead of allocating the whole bloom filter with mostly 0 bits.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz
Description: application/gzip
0002-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz
Description: application/gzip
0003-BRIN-bloom-indexes.patch.gz
Description: application/gzip
0004-BRIN-multi-range-minmax-indexes.patch.gz
Description: application/gzip