On Sat, Oct 15, 2022 at 5:34 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
> Hi,
>
> There have been a couple of discussions about using BRIN indexes for
> sorting - in fact this was mentioned even in the "Improving Indexing
> Performance" unconference session this year (I don't remember by whom).
> But I haven't seen any patches, so here's one.
>
> The idea is that we can use information about ranges to split the table
> into smaller parts that can be sorted in smaller chunks. For example, if
> you have a tiny 2MB table with two ranges, with values in the [0,100]
> and [101,200] intervals, then it's clear we can sort the first range,
> output its tuples, and then sort/output the second range.
>
> The attached patch builds "BRIN Sort" paths/plans, closely resembling
> index scans, only for BRIN indexes. This special type of index scan
> does what was mentioned above - it incrementally sorts the data. It's a
> bit more complicated because of overlapping ranges, ASC/DESC, NULLs,
> etc.
>
> This is disabled by default, using a GUC enable_brinsort (you may need
> to tweak other GUCs to disable parallel plans etc.).
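If I understand the approach correctly, the core idea could be sketched like this (a toy Python model of my own, not the patch's C code; it assumes non-overlapping ranges, while the patch also has to deal with overlaps, ASC/DESC, NULLs, etc.):

```python
# Toy illustration of the BRIN Sort idea: instead of sorting the whole
# table at once, sort it one block range at a time, in range order.
# Assumes the per-range summaries (minval, maxval) do not overlap.

def brin_sort(ranges):
    """ranges: list of (minval, maxval, tuples), as a BRIN index
    would summarize each block range."""
    # Visit ranges in order of their minimum value...
    for _, _, tuples in sorted(ranges, key=lambda r: r[0]):
        # ...and sort each small chunk independently before emitting it,
        # so tuples can flow to the parent node with low startup cost.
        yield from sorted(tuples)

ranges = [
    (101, 200, [150, 101, 200]),   # second block range
    (0, 100, [42, 7, 100]),        # first block range
]
print(list(brin_sort(ranges)))     # fully sorted output
```

With non-overlapping ranges the concatenation of the per-range sorts is globally sorted, which is why only small chunks ever need to be held in memory.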
>
> A trivial example, demonstrating the benefits:
>
>   create table t (a int) with (fillfactor = 10);
>   insert into t select i from generate_series(1,10000000) s(i);
>
> First, a simple LIMIT query:
>
>   explain (analyze, costs off) select * from t order by a limit 10;
>
>                                QUERY PLAN
>   ------------------------------------------------------------------------
>    Limit (actual time=1879.768..1879.770 rows=10 loops=1)
>      ->  Sort (actual time=1879.767..1879.768 rows=10 loops=1)
>            Sort Key: a
>            Sort Method: top-N heapsort  Memory: 25kB
>            ->  Seq Scan on t
>                (actual time=0.007..1353.110 rows=10000000 loops=1)
>    Planning Time: 0.083 ms
>    Execution Time: 1879.786 ms
>   (7 rows)
>
>                                QUERY PLAN
>   ------------------------------------------------------------------------
>    Limit (actual time=1.217..1.219 rows=10 loops=1)
>      ->  BRIN Sort using t_a_idx on t
>          (actual time=1.216..1.217 rows=10 loops=1)
>          Sort Key: a
>    Planning Time: 0.084 ms
>    Execution Time: 1.234 ms
>   (5 rows)
>
> That's a pretty nice improvement - of course, this is thanks to the
> data being perfectly sequential, and the difference can be made almost
> arbitrary by making the table smaller/larger. Similarly, if the table
> gets less sequential (making the ranges overlap), the BRIN plan will be
> more expensive. Feel free to experiment with other data sets.
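One small note on reproducing this: the quoted setup doesn't show it, but the plans reference a BRIN index t_a_idx, which presumably was created with something along these lines:

```sql
-- Not shown in the quoted example; assumed from the index name in the
-- EXPLAIN output. Default pages_per_range may differ from what was used.
create index t_a_idx on t using brin (a);
```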
>
> However, it's not only LIMIT queries that can improve - consider a
> sort of the whole table:
>
>   test=# explain (analyze, costs off) select * from t order by a;
>
>                                QUERY PLAN
>   -------------------------------------------------------------------------
>    Sort (actual time=2806.468..3487.213 rows=10000000 loops=1)
>      Sort Key: a
>      Sort Method: external merge  Disk: 117528kB
>      ->  Seq Scan on t (actual time=0.018..1498.754 rows=10000000 loops=1)
>    Planning Time: 0.110 ms
>    Execution Time: 3766.825 ms
>   (6 rows)
>
>   test=# explain (analyze, costs off) select * from t order by a;
>
>                                QUERY PLAN
>   ----------------------------------------------------------------------------------
>    BRIN Sort using t_a_idx on t (actual time=1.210..2670.875 rows=10000000 loops=1)
>      Sort Key: a
>    Planning Time: 0.073 ms
>    Execution Time: 2939.324 ms
>   (4 rows)
>
> Right - not a huge difference, but still a nice 25% speedup, mostly due
> to not having to spill data to disk and to sorting smaller amounts of
> data.
>
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.
>
> 1) The paths are created in build_index_paths() because that's what
> creates index scans (which the new path resembles). But that is
> expected to produce IndexPath, not BrinSortPath, so it's not quite
> correct. It should probably be somewhere "higher", I guess.
>
> 2) BRIN indexes don't have internal ordering, i.e. ASC/DESC and NULLS
> FIRST/LAST do not really matter for them. The patch just generates
> paths for all 4 combinations (or tries to). Maybe there's a better way.
>
> 3) I'm not quite sure the separation of responsibilities between
> opfamily and opclass is optimal. I added a new amproc, but maybe this
> should be split differently. At the moment only minmax indexes have
> this, but adding it to minmax-multi should be trivial.
>
> 4) The state changes in nodeBrinSort are a bit confusing. It works, but
> may need cleanup and refactoring. Ideas welcome.
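Regarding #4, my mental model of the node's state changes is roughly the following loop (my own toy sketch, not the patch's actual states or naming):

```python
# Hypothetical model of a BRIN Sort executor node as a small state
# machine: load the next block range, sort its tuples, emit them,
# repeat. State names are illustrative, not taken from nodeBrinSort.

LOAD, SORT, EMIT, DONE = "load", "sort", "emit", "done"

def brin_sort_node(ranges):
    """ranges: list of (minval, maxval, tuples) summaries."""
    it = iter(sorted(ranges, key=lambda r: r[0]))  # ranges by minval
    state, buf = LOAD, []
    while state != DONE:
        if state == LOAD:
            try:
                _, _, buf = next(it)   # fetch tuples of the next range
                state = SORT
            except StopIteration:
                state = DONE           # no more ranges to process
        elif state == SORT:
            buf = sorted(buf)          # sort just this chunk
            state = EMIT
        elif state == EMIT:
            yield from buf             # hand tuples to the parent node
            state = LOAD
```

Overlapping ranges presumably complicate this, since tuples from a later range may sort before the current range's maximum, so the real node can't always emit a whole chunk at once.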
>
> 5) The costing is essentially just plain cost_index. I have some ideas
> about BRIN costing in general, which I'll post in a separate thread (as
> it's not specific to this patch).
>
> 6) At the moment this only picks one of the index keys, specified in
> the ORDER BY clause. I think we can generalize this to multiple keys,
> but thinking about multi-key ranges was a bit too much for me. The good
> thing is this nicely combines with IncrementalSort.
>
> 7) Only plain index keys are supported for the ORDER BY keys, no
> expressions. That should not be hard to fix, though.
>
> 8) Parallel version is not supported, but I think it shouldn't be
> possible. Just make the leader build the range info, and then let the
> workers acquire/sort ranges and merge them by Gather Merge.
>
> 9) I was also thinking about leveraging other indexes to quickly
> eliminate ranges that need to be sorted. The node does evaluate the
> filter, of course, but only after reading the tuple from the range. But
> imagine we allow BrinSort to utilize BRIN indexes to evaluate the
> filter - in that case we might skip many ranges entirely. Essentially
> like a bitmap index scan does, except that building the bitmap
> incrementally with BRIN is trivial - you can quickly check whether a
> particular range matches or not. With other indexes (e.g. btree) you
> essentially need to evaluate the filter completely, and only then you
> can look at the bitmap. That seems rather against the idea of this
> patch, which is about low startup cost. Of course, the condition might
> be very selective, but then you probably can just fetch the matching
> tuples and do a Sort.
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi,

I am still going over the patch.

Minor: for #8, I guess you meant `it should be possible`.

Cheers
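P.S. To make sure I understand #9: the per-range check is just a min/max comparison against the filter, so ruling out whole ranges is cheap. A toy sketch (all names mine, not from the patch):

```python
# Toy sketch of the range-elimination idea in #9: with per-range
# min/max summaries, a filter like "a > lo" can discard entire block
# ranges before a single tuple is read from them.

def matching_ranges(summaries, lo):
    """Keep only block ranges whose summary can satisfy 'a > lo'.
    summaries: list of (minval, maxval) pairs, one per block range."""
    return [r for r in summaries if r[1] > lo]   # maxval must exceed lo

summaries = [(0, 100), (101, 200), (201, 300)]
# Only ranges whose maximum exceeds 150 can contain matching tuples.
print(matching_ranges(summaries, 150))           # [(101, 200), (201, 300)]
```

Since each check is O(1) per range, this fits the low-startup-cost goal, unlike building a full bitmap from a btree first.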