On 19.11.2025 03:19, Tom Lane wrote:
> I wrote:
>> Thinking a bit harder, we are comparing these costs:
>> [ theoretical arguments trimmed ]
>
> I spent some effort on actually measuring timings of the v6 patch,
> and concluded that this is all splitting hairs that we don't need
> to split. The actual crossover between hash-loses and hash-wins
> is more than what my theoretical argument suggested, but still
> probably less than 100 MCVs on each side. I think we should go with
>
>     (sslot1.nvalues + sslot2.nvalues) >= 200
>
> and call it good.
>
> To arrive at this result, I built the v6 patchset with
> EQJOINSEL_MCV_HASH_THRESHOLD changed to either 0 (to force hashing)
> or 1000000 (to prevent it). I then ran the attached scripts with
> different values of "nstats" and collected timings from the postmaster
> log output produced by the 0001 patch.
>
> The scripts are designed to test both the cheap-comparisons scenario
> (integer columns) and the expensive-comparisons scenario (text columns
> with a case-insensitive ICU collation). My motivation for splitting
> them into a setup and a test step was to allow the tests to be run
> repeatedly against the same underlying data. (Although I soon realized
> that because VACUUM ANALYZE takes a random sample each time, the stats
> we're working from aren't totally the same each time anyway.) Also
> you'll notice that the test data is based on log(random()), which
> I did to roughly approximate a zipfian distribution. If you remove
> the log() call you'll get a flat distribution instead, but it didn't
> seem to change the conclusions much.

Thanks for working out the details!
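In case someone wants to follow along without digging out the attached
scripts, the test data has roughly the following shape (my own simplified
sketch, not the exact scripts; table names and constants are made up):

-- Case-insensitive ICU collation for the expensive-comparisons case
CREATE COLLATION ci (provider = icu, locale = 'und-u-ks-level2',
                     deterministic = false);

CREATE TABLE t1 (i int, t text COLLATE ci);
CREATE TABLE t2 (i int, t text COLLATE ci);

-- log(random()) skews the value frequencies towards a zipfian-ish shape
INSERT INTO t1
  SELECT v, 'val_' || v
  FROM (SELECT (-log(random()) * 1000)::int AS v
        FROM generate_series(1, 1000000)) AS s;
INSERT INTO t2
  SELECT v, 'val_' || v
  FROM (SELECT (-log(random()) * 1000)::int AS v
        FROM generate_series(1, 1000000)) AS s;

-- The statistics target bounds the MCV list size; 1000/100/50 were tested
ALTER TABLE t1 ALTER COLUMN i SET STATISTICS 1000;
ALTER TABLE t1 ALTER COLUMN t SET STATISTICS 1000;
ALTER TABLE t2 ALTER COLUMN i SET STATISTICS 1000;
ALTER TABLE t2 ALTER COLUMN t SET STATISTICS 1000;
VACUUM ANALYZE t1, t2;

-- Cheap (INT) and expensive (TEXT) join clauses for eqjoinsel to estimate;
-- the timings come from the postmaster log output of the 0001 patch
EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.i = t2.i;
EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.t = t2.t;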
I've run your script on my development machine with 1000, 100 and 50 MCVs,
with the following results. As the runtimes had quite some variance, I
didn't bother trying more variations. I think your proposal to go with
200 is fine. ("off"/"on" means hashing prevented or forced via
EQJOINSEL_MCV_HASH_THRESHOLD, as above.)

 nstats | off INT | off TEXT | on INT | on TEXT
--------+---------+----------+--------+---------
   1000 |     697 |     8907 |     14 |    2417
    100 |    13.7 |      213 |    2.3 |     239
     50 |     1.4 |      7.6 |    1.5 |      49

The results suggest that the hash function for the non-deterministic
collation is really slow. If we could properly include the operator cost,
we could enable the optimization earlier for simple data types such as
INT. That can be future work.

--
David Geier
