On 6 January 2012 21:14, Tom Lane <t...@sss.pgh.pa.us> wrote:
> When there are lots of duplicates of a particular indexed value, the
> existing code will cause an indexscan to search them in physical order,
> whereas if we remove the existing logic it'll be random --- in
> particular, accesses to the same heap page can no longer be expected to
> be clustered.
Isn't it possible to get them in physical order anyway, by reading them
into memory in that order? Efficient quicksort implementations are not
stable, and ours is no exception, but we could perhaps come up with a
cheaper tie-breaker value at that stage, if you're determined to
maintain this behaviour. We have sufficient incentive to, as I describe
below.

> Admittedly, I don't have any numbers quantifying just how useful that
> might be, but on the other hand you've not presented any evidence
> justifying removing the behavior either. If we believe your position
> that indexes don't generally have lots of duplicates, then the code in
> question will seldom be reached and therefore there would be no
> performance benefit in removing it.

I ran the same btree benchmark on master, but without the "cheap
insurance". The results were interesting, to say the least.

The columns indexed were the same columns and data that I've been using
all along. Initially this made sense, as the performance of sorts with
multiple sort keys often largely hinges on being able to get away with
doing one comparison per pair of tuples - with many duplicates, I could
avoid cheating and show something closer to the worst case for the
patch. I didn't think it mattered that indexing the same columns would
produce what happened to be a not-so-useful index in the real world,
due to having so many duplicates - better to have figures that were
somewhat comparable for btree tuple sorting and heap tuple sorting.

When I ran the same benchmark on a server that differed from master
only in that there was no insurance, it momentarily appeared that *all*
of the gains for btree index creation came from being able to elide the
"cheap insurance", but only where it would have to be paid for a high
number of times. I soon realised that I'd made a blunder: the code
(that is, the patch that I posted most recently) wasn't even using my
specialisation for qsorting, because the SortSupport pointer was null!
I did not have tuplesort_begin_index_btree initialise the SortSupport
struct as tuplesort_begin_heap did, so my earlier benchmark was
effectively meaningless, except that it indicated the benefits of
eliding the cheap insurance alone, if only for that not-so-compelling
case. You should note that the benefits of not paying for the
insurance can be very significant indeed.

Attached are figures for an identical run of the same btree python
script, but with a version of the patch that actually uses my
specialisations. Granted, these numbers are still partially predicated
on the index in question having a large number of duplicates, but it's
worth noting:

1. The gain from specialisation isn't bad; not as good as the
improvements we saw for heap tuples, but not so bad either, especially
considering that binary bloat should be much less controversial for
btree tuples.

2. The index that results from the tests is still useful; the planner
is perfectly willing to use it rather than performing an in-memory
sort. It will also use it to satisfy a query like "select * from
orderlines where prod_id = 5", albeit via a bitmap index scan. I took
the precaution of increasing default_statistics_target to its maximum
value, as well as performing an ANALYZE on orderlines, in advance of
checking this.

Revision to this patch that fixes the bug to follow - I produced these
new numbers from a rough cut of that.

--
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
--
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers