Hi hackers,

On 10.09.2025 16:56, Ilia Evdokimov wrote:
Unfortunately, the JOB benchmark does not contain semi join nodes. However, TPC-DS does. I'll look for the queries with slowest planner times there and check them.

I'll need some time to check both join and semi join cases with small and large default_statistics_target. I'll share the results later.

JOIN
==============================

I’ve benchmarked the new implementation of eqjoinsel() with different values of default_statistics_target. On small targets (1, 5, 10, 25, 50, 75, 100) the results are all within statistical noise, and I did not observe any regressions. In my view, it’s reasonable to keep the current condition that the hash table is not used for default_statistics_target = 1. Raising that threshold does not seem useful.

Here are the results for JOB queries (where the effect of semi join is not visible due to different data distributions):

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
------------------------------------------------------------------------------------------
1                         | 1.00                | 1846.643            | 1847.409 5                         | 1.00                | 1836.391            | 1828.318 10                        | 0.95                | 1841.750            | 1929.722 25                        | 0.99                | 1873.172            | 1890.741 50                        | 0.98                | 1869.897            | 1898.470 75                        | 1.02                | 1969.368            | 1929.521 100                       | 0.97                | 1857.890            | 1921.207 1000                      | 1.14                | 2279.700            | 1997.102 2500                      | 1.78                | 4682.658            | 2636.202 5000                      | 6.45                | 15943.696           | 2471.242 7500                      | 12.45               | 34350.855           | 2758.565 10000                     | 20.52               | 62519.342           | 3046.819

SEMI JOIN
==============================

Unfortunately, in TPC-DS it is not possible to clearly see improvements for semi joins. To address this, I designed a synthetic example where the data distribution forces the loop to run fully, without exiting early, which makes the effect on semi joins more visible. In this setup, I also ensured that the length of the MCV array is equal to the chosen default_statistics_target.

CREATE TABLE t1 AS
SELECT CASE
         WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 1
         ELSE (g % 1000000) + 10000
       END AS id
FROM generate_series(1, 3000000) g;

CREATE TABLE t2 AS
SELECT CASE
         WHEN g <= 3000000 * 0.9 THEN (g % 10000) + 10001
         ELSE (g % 1000000) + 20000
       END AS id
FROM generate_series(1, 3000000) g;

ANALYZE t1, t2;

The results of the query are:

SELECT * FROM t1
WHERE id IN (SELECT id FROM t2);

default_statistics_target | Planner Speedup (×) | Planner Before (ms) | Planner After (ms)
------------------------------------------------------------------------------------------
1                         | 1.12                | 1.191               | 1.062 5                         | 1.02                | 0.493               | 0.481 10                        | 0.92                | 0.431               | 0.471 25                        | 1.27                | 0.393               | 0.309 50                        | 1.04                | 0.432               | 0.416 75                        | 0.96                | 0.398               | 0.415 100                       | 0.95                | 0.450               | 0.473 1000                      | 9.42                | 6.742               | 0.716 2500                      | 19.15               | 21.621              | 1.129 5000                      | 46.74               | 85.667              | 1.833 7500                      | 73.26               | 194.806             | 2.659 10000                     | 107.95              | 349.981             | 3.242

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC,
https://tantorlabs.com



Reply via email to