On 08.09.2025 13:08, David Geier wrote:
Hi Ilia!

On 05.09.2025 16:03, Ilia Evdokimov wrote:
I propose an optimization: when the column datatype supports
ordering (i.e., has < and >), we can sort both MCV lists and apply a
merge-style algorithm to detect matches. This reduces the runtime from
O(N^2) to O(N log N), where N is the number of MCV entries. The patch
also applies the same optimization to semi-join clauses, which show
similar performance behavior.
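
(For illustration only: a standalone sketch of the sort-and-merge matching
over two MCV lists, not the patch itself; plain integer values stand in for
the real datum comparisons.)

#include <stdio.h>
#include <stdlib.h>

/* comparator for qsort over the illustrative integer MCV values */
static int
cmp_int(const void *a, const void *b)
{
    int va = *(const int *) a;
    int vb = *(const int *) b;

    return (va > vb) - (va < vb);
}

/* Count values present in both sorted lists (MCV values are unique per side). */
static int
count_matches_merge(const int *a, int na, const int *b, int nb)
{
    int i = 0, j = 0, matches = 0;

    while (i < na && j < nb)
    {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else
        {
            matches++;
            i++;
            j++;
        }
    }
    return matches;
}

int
main(void)
{
    int mcv1[] = {42, 7, 13, 99, 5};
    int mcv2[] = {13, 8, 42, 77};

    qsort(mcv1, 5, sizeof(int), cmp_int);   /* O(N log N) */
    qsort(mcv2, 4, sizeof(int), cmp_int);

    /* single linear pass instead of the O(N^2) nested loops */
    printf("common MCVs: %d\n", count_matches_merge(mcv1, 5, mcv2, 4));  /* prints 2 */
    return 0;
}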
Why do you sort both lists and then merge instead of putting the smaller
list into a hash map and then doing hash lookups (if the type is hashable)?
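(Again for illustration only: a standalone sketch of the hash-lookup variant
under the same simplifying assumptions; a tiny open-addressing set with
integer keys stands in for PostgreSQL's real hashing machinery.)

#include <stdbool.h>
#include <stdio.h>

#define NSLOTS 16               /* power of two, comfortably above 2x the smaller list */

static int  slots[NSLOTS];
static bool used[NSLOTS];

static unsigned int
hash_int(int key)
{
    unsigned int h = (unsigned int) key;

    h ^= h >> 16;
    h *= 0x45d9f3b;
    h ^= h >> 16;
    return h;
}

/* insert a key from the smaller MCV list, linear probing on collision */
static void
insert(int key)
{
    unsigned int idx = hash_int(key) & (NSLOTS - 1);

    while (used[idx] && slots[idx] != key)
        idx = (idx + 1) & (NSLOTS - 1);
    used[idx] = true;
    slots[idx] = key;
}

/* probe for a key from the larger MCV list; O(1) expected */
static bool
contains(int key)
{
    unsigned int idx = hash_int(key) & (NSLOTS - 1);

    while (used[idx])
    {
        if (slots[idx] == key)
            return true;
        idx = (idx + 1) & (NSLOTS - 1);
    }
    return false;
}

int
main(void)
{
    int mcv_small[] = {13, 8, 42, 77};      /* smaller MCV list goes into the set */
    int mcv_large[] = {42, 7, 13, 99, 5};   /* larger list only probes */
    int matches = 0;

    for (int i = 0; i < 4; i++)
        insert(mcv_small[i]);
    for (int i = 0; i < 5; i++)
        if (contains(mcv_large[i]))
            matches++;

    printf("common MCVs: %d\n", matches);   /* prints 2 */
    return 0;
}
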
I've tested both your code and mine with the following script:

CREATE TABLE foo(col TEXT);
CREATE TABLE bar(col TEXT);
SET default_statistics_target = 10000;

-- Generate MCV values. PostgreSQL doesn't store MCVs if the table has
-- only a single value or every value has exactly the same cardinality.
DO $$
BEGIN
   FOR i IN 1..10000 LOOP
     FOR j IN 1..least(i, 50) LOOP
       INSERT INTO foo VALUES ('aaaaaaaaaaaaaaaaaaaa' || i::TEXT);
       INSERT INTO bar VALUES ('aaaaaaaaaaaaaaaaaaaa' || (i + 100000)::TEXT);
     END LOOP;
   END LOOP;
END;
$$;

ANALYZE foo, bar;
\timing on
EXPLAIN SELECT * FROM foo f, bar b WHERE f.col = b.col;

Results are:

- master: 433 ms
- Order+Merge: 11 ms
- Hash map: 4 ms

I've attached my draft patch.

--
David Geier


Hi David,

Thank you for reviewing.

I have read all the previous messages - and yes, you are right. I don’t know why I didn’t consider using a hash table approach initially. Your idea makes a lot of sense.

To evaluate it, I ran benchmarks on the Join Order Benchmark (JOB) with three variants:

$ ./benchmark.sh master
$ ./benchmark.sh merge
$ ./benchmark.sh hash

I compared total planning time across all 113 queries.

$ python3 planning_time.py master hash
default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
100                       | 1.00                | 1892.627            | 1886.969
1000                      | 1.09                | 2286.922            | 2100.099
2500                      | 1.94                | 4647.167            | 2400.711
5000                      | 6.15                | 17964.779           | 2919.914
7500                      | 10.58               | 38622.443           | 3650.375
10000                     | 16.33               | 69538.085           | 4257.864

$ python3 planning_time.py master merge
default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
--------------------------------------------------------------------------------
100                       | 1.00                | 1892.627            | 1898.622
1000                      | 1.12                | 2286.922            | 2033.553
2500                      | 1.92                | 4647.167            | 2423.552
5000                      | 5.94                | 17964.779           | 3025.739
7500                      | 10.48               | 38622.443           | 3684.262
10000                     | 16.72               | 69538.085           | 4159.418

Based on these results, I’d prefer the hash lookup implementation, so I think it makes sense to improve your patch further and bring it into good shape. Shall I take care of that, or would you prefer to do it yourself?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com

<<attachment: benchmark.zip>>
