Re: Convert NOT IN sublinks to anti-joins when safe

wenhui qiu Tue, 03 Feb 2026 19:53:39 -0800

Hi Richard
    That's just brilliant! Here’s my testing process, which also tackles a
commonly criticized aspect of PostgreSQL, effectively addressing its
shortcomings.Moreover, the path test scenarios are thorough, encompassing
cases that involve "NOT IN".Once again, thank you for your hard work.
###########
create table join1 (id integer,name varchar(300),k1 integer);
create table join2 (id integer,name varchar(300),score integer);
insert into join1 values (
generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values (
generate_series(150201,1350300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join2 values (
generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',1);
insert into join2 values (
generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAABBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',2);
insert into join2 values (
generate_series(20001,22000),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
insert into join2 values (
generate_series(150201,950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
create index idx_j1 on join1(id);
create index idx_j2 on join2(id);
VACUUM ANALYZE JOIN1;
VACUUM ANALYZE JOIN2;


postgres=#   explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 WHERE T2.ID>10000);
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=20060.65..58729.40 rows=630150 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
   SubPlan any_1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..17904.72
rows=862371 width=4)
           Index Cond: (id > 10000)
(5 rows)

postgres=#   explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 and t2.id is not null);
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=22216.57..60885.32 rows=630150 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
   SubPlan any_1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..20060.65
rows=862371 width=4)
           Index Cond: ((id > 10000) AND (id IS NOT NULL))
(5 rows)
postgres=# alter table join1 alter id set not null;
ALTER TABLE
postgres=# alter table join2 alter id set not null;
ALTER TABLE
postgres=#   explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 );
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=28684.35..72424.36 rows=393902 width=8)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on join1 t1  (cost=0.00..35518.00 rows=1260300 width=8)
   ->  Hash  (cost=17904.72..17904.72 rows=862371 width=4)
         ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..17904.72
rows=862371 width=4)
               Index Cond: (id > 10000)
(6 rows)

postgres=#
postgres=# explain  analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
                                                                     QUERY
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=27134.57..70874.58 rows=393902 width=8) (actual
time=239.777..602.433 rows=400200.00 loops=1)
   Hash Cond: (t1.id = t2.id)
   Buffers: shared hit=25304
   ->  Seq Scan on join1 t1  (cost=0.00..35518.00 rows=1260300 width=8)
(actual time=0.007..85.255 rows=1260300.00 loops=1)
         Buffers: shared hit=22915
   ->  Hash  (cost=16108.32..16108.32 rows=882100 width=4) (actual
time=220.949..220.951 rows=882100.00 loops=1)
         Buckets: 1048576  Batches: 1  Memory Usage: 39204kB
         Buffers: shared hit=2389
         ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..16108.32
rows=882100 width=4) (actual time=0.008..103.306 rows=882100.00 loops=1)
               Heap Fetches: 0
               Index Searches: 1
               Buffers: shared hit=2389
 Planning:
   Buffers: shared hit=12
 Planning Time: 0.193 ms
 Execution Time: 617.668 ms
(16 rows)

postgres=# explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 );
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=27134.57..70874.58 rows=393902 width=8)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on join1 t1  (cost=0.00..35518.00 rows=1260300 width=8)
   ->  Hash  (cost=16108.32..16108.32 rows=882100 width=4)
         ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..16108.32
rows=882100 width=4)
(5 rows)

postgres=# alter table join2 alter id drop not null ;
ALTER TABLE
postgres=# explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 );
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=18313.57..56982.32 rows=630150 width=8)
   Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
   SubPlan any_1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..16108.32
rows=882100 width=4)
(4 rows)

postgres=# explain  SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN
(SELECT T2.id FROM join2 t2 where t2.id is not null);
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=29339.83..73079.83 rows=393902 width=8)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on join1 t1  (cost=0.00..35518.00 rows=1260300 width=8)
   ->  Hash  (cost=18313.58..18313.58 rows=882100 width=4)
         ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.42..18313.58
rows=882100 width=4)
               Index Cond: (id IS NOT NULL)
(6 rows)




postgres=# insert into join2 values (
generate_series(950300,1950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 1000001
postgres=# analyze join2;
ANALYZE
postgres=#
postgres=# insert into join2 values (
generate_series(1950300,3950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 2000001
postgres=# analyze join2;
ANALYZE
postgres=# explain  analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);

 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.86..73003.87 rows=1 width=8) (actual
time=348.632..371.158 rows=200.00 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   Buffers: shared hit=99166 read=3398
   ->  Merge Anti Join  (cost=0.86..72003.77 rows=1 width=8) (actual
time=177.574..334.714 rows=50.00 loops=4)
         Merge Cond: (t1.id = t2.id)
         Buffers: shared hit=99166 read=3398
         ->  Parallel Index Scan using idx_j1 on join1 t1
 (cost=0.43..37380.06 rows=406548 width=8) (actual time=0.090..81.338
rows=315075.00 loops=4)
               Index Searches: 1
               Buffers: shared hit=85241 read=3398
         ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.43..80682.41
rows=3882102 width=4) (actual time=0.049..145.187 rows=1281766.50 loops=4)
               Index Cond: (id IS NOT NULL)
               Heap Fetches: 0
               Index Searches: 4
               Buffers: shared hit=13925
 Planning:
   Buffers: shared hit=12
 Planning Time: 0.191 ms
 Execution Time: 371.191 ms

 postgres=# set work_mem ='128MB';
SET
postgres=# explain  analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
                                                                    QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=80682.42..119351.17 rows=630150 width=8)
(actual time=1477.040..1835.205 rows=200.00 loops=1)
   Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
   Rows Removed by Filter: 1260100
   Buffers: shared hit=33502
   SubPlan any_1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.43..70977.16
rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
           Heap Fetches: 0
           Index Searches: 1
           Buffers: shared hit=10587
 Planning Time: 0.083 ms
 Execution Time: 1845.041 ms
(11 rows)

postgres=# set work_mem ='8MB';
SET
postgres=# explain  analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
                                                                    QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=80682.42..119351.17 rows=630150 width=8)
(actual time=1468.254..1825.373 rows=200.00 loops=1)
   Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
   Rows Removed by Filter: 1260100
   Buffers: shared hit=33502
   SubPlan any_1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.43..70977.16
rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
           Heap Fetches: 0
           Index Searches: 1
           Buffers: shared hit=10587
 Planning Time: 0.085 ms
 Execution Time: 1835.148 ms
(11 rows)

postgres=# set work_mem ='128MB';
SET
postgres=# explain  analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
                                                                    QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=80682.42..119351.17 rows=630150 width=8)
(actual time=1477.040..1835.205 rows=200.00 loops=1)
   Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
   Rows Removed by Filter: 1260100
   Buffers: shared hit=33502
   SubPlan any_1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.43..70977.16
rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
           Heap Fetches: 0
           Index Searches: 1
           Buffers: shared hit=10587
 Planning Time: 0.083 ms
 Execution Time: 1845.041 ms
(11 rows)

postgres=# set work_mem ='8MB';
SET
postgres=# explain  analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
                                                                    QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on join1 t1  (cost=80682.42..119351.17 rows=630150 width=8)
(actual time=1468.254..1825.373 rows=200.00 loops=1)
   Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
   Rows Removed by Filter: 1260100
   Buffers: shared hit=33502
   SubPlan any_1
     ->  Index Only Scan using idx_j2 on join2 t2  (cost=0.43..70977.16
rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
           Heap Fetches: 0
           Index Searches: 1
           Buffers: shared hit=10587
 Planning Time: 0.085 ms
 Execution Time: 1835.148 ms
(11 rows)

postgres=# set work_mem ='2MB';
SET
postgres=# explain  analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR:  canceling statement due to user request
postgres=# ^C
postgres=#

postgres=# set work_mem ='2MB';
SET
postgres=# select now();
              now
-------------------------------
 2026-02-04 11:31:42.957009+08
(1 row)

postgres=# select now();
              now
-------------------------------
 2026-02-04 11:32:10.811249+08
(1 row)

postgres=# explain  analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id
NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR:  canceling statement due to user request
postgres=# select  now();
              now
-------------------------------
 2026-02-04 11:42:29.698854+08
(1 row)

postgres=#


Thanks

On Tue, Feb 3, 2026 at 5:41 PM wenhui qiu <[email protected]> wrote:

> Hi Richard
>
> > I believe we are now in a much better position to attempt this again.
> > The planner has accumulated significant infrastructure that makes this
> > proof straightforward and reliable.  Specifically, we can now leverage
> > the outer-join-aware-Var infrastructure to tell whether a Var comes
> > from the nullable side of an outer join, and the not-null-attnums hash
> > table to efficiently check whether a Var is defined NOT NULL.  We also
> > have the expr_is_nonnullable() function that is smart enough to deduce
> > non-nullability for expressions more complex than simple Vars/Consts.
>      Thank you for working on this.Indeed, the benefits are substantial
> and highly necessary, as Oracle, SQL Server, and MySQL have all implemented
> varying degrees of support.I shall test this path in my spare time.
>
> Thanks ，
>
> On Tue, Feb 3, 2026 at 3:13 PM Richard Guo <[email protected]> wrote:
>
>> This topic has been discussed several times in the past.  Due to the
>> semantic mismatch regarding NULL handling, NOT IN is not ordinarily
>> equivalent to an anti-join.  However, if we can prove that neither the
>> outer expressions nor the subquery outputs can yield NULL values, it
>> should be safe to convert NOT IN to an anti-join.
>>
>> I believe we are now in a much better position to attempt this again.
>> The planner has accumulated significant infrastructure that makes this
>> proof straightforward and reliable.  Specifically, we can now leverage
>> the outer-join-aware-Var infrastructure to tell whether a Var comes
>> from the nullable side of an outer join, and the not-null-attnums hash
>> table to efficiently check whether a Var is defined NOT NULL.  We also
>> have the expr_is_nonnullable() function that is smart enough to deduce
>> non-nullability for expressions more complex than simple Vars/Consts.
>>
>> Attached is a draft patch for this attempt (part of the code is
>> adapted from an old patch [1] by David and Tom).  This patch aims for
>> a conservative implementation: the goal is not to handle every
>> theoretical case, but to handle canonical query patterns with minimal
>> code complexity.
>>
>> The patch primarily targets patterns like:
>>
>>   SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned_users);
>>
>> ... and
>>
>>   SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM
>> banned_users WHERE user_id IS NOT NULL);
>>
>> This is a very typical syntax for exclusion.  In well-modeled
>> databases, join keys like id and user_id are very likely to be defined
>> as NOT NULL.
>>
>> It seems to me that the ROI here is quite positive: the added code
>> complexity is very low (thanks to the existing infrastructure), while
>> the benefit is that users writing this typical pattern will finally
>> get efficient anti-join plans without needing manual rewrites.
>>
>> (For the outer expressions, we could potentially also use outer query
>> quals to prove non-nullability.  This patch does not attempt to do so.
>> Implementing this would require passing state down during the
>> pull_up_sublinks recursion; and given that find_nonnullable_vars can
>> fail to prove non-nullability in many cases due to the lack of
>> const-simplification at this stage, I'm not sure whether it is worth
>> the code complexity.  Besides, I haven't fully convinced myself that
>> doing this does not introduce correctness issues.)
>>
>> Any thoughts?
>>
>> [1] https://postgr.es/m/[email protected]
>>
>> - Richard
>>
>

Re: Convert NOT IN sublinks to anti-joins when safe

Reply via email to