Hi all,

I've figured out how to solve the performance issues I've been encountering
with a particular query, but I'm interested in better understanding the
intuition behind why the rewritten query is so much more performant.

The query in question involves a NOT IN filter from a CTE:

WITH temp AS (
  SELECT temp_id
  FROM other_table
  WHERE ...
)
SELECT COUNT(*)
FROM main_table m
WHERE m.main_id NOT IN (SELECT temp_id FROM temp);

The query plan for (the non-masked version of) this query includes:
Filter: (NOT (SubPlan 1))
   Rows Removed by Filter: 1
   SubPlan 1
     ->  CTE Scan on temp  (cost=0.00..1950.60 rows=97530 width=418) (actual time=0.018..4.170 rows=15697 loops=100731)

My understanding is that PostgreSQL is unable to turn this into a hashed
SubPlan because the estimated size of the CTE result (rows * width) is too
large for the work_mem setting, so it instead has to make repeated linear
passes over the CTE output to check whether each value of main_id appears
among the temp_ids.
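
For what it's worth, I believe that hypothesis could be checked by raising
work_mem for the session only (the value below is just a placeholder large
enough to hash the CTE; it would need tuning) and re-running EXPLAIN ANALYZE
to see whether the plan switches from "SubPlan" to "hashed SubPlan":

```sql
-- Placeholder value; large enough that the CTE result should fit.
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
WITH temp AS (
  SELECT temp_id
  FROM other_table
  WHERE ...
)
SELECT COUNT(*)
FROM main_table m
WHERE m.main_id NOT IN (SELECT temp_id FROM temp);

-- Restore the session to the server default afterwards.
RESET work_mem;
```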

The resolution to this problem came via
https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/,
and involves rewriting the query as follows:

WITH temp AS (
  SELECT temp_id
  FROM other_table
  WHERE ...
)
SELECT COUNT(*)
FROM main_table m
WHERE NOT EXISTS (SELECT temp_id FROM temp WHERE temp_id = m.main_id);

The query plan for this query (which I believe is equivalent to what it
would be if I instead explicitly LEFT JOINed main_table to temp and used a
WHERE clause to filter out rows with NULL temp_id) does not involve a high
loop count like the one above:

->  Merge Anti Join  (cost=147305.45..149523.68 rows=192712 width=4) (actual time=5050.773..5622.266 rows=158850 loops=1)
   Merge Cond: (m.main_id = temp.temp_id)
   ->  Sort  (cost=115086.98..115637.10 rows=220050 width=19) (actual time=1290.829..1655.066 rows=199226 loops=1)
         Sort Key: m.main_id
         Sort Method: external merge  Disk: 5632kB
   ->  Materialize  (cost=32218.47..32764.37 rows=109180 width=418) (actual time=3759.936..3787.724 rows=38268 loops=1)
         ->  Sort  (cost=32218.47..32491.42 rows=109180 width=418) (actual time=3759.933..3771.117 rows=38268 loops=1)
               Sort Key: temp.temp_id
               Sort Method: quicksort  Memory: 3160kB
               ->  CTE Scan on temp  (cost=0.00..2183.60 rows=109180 width=418) (actual time=2316.745..3735.486 rows=38268 loops=1)

Instead, it sorts the larger input on disk and uses a Merge Anti Join,
which I understand eliminates the need for repeated passes through temp.
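
For reference, the explicit LEFT JOIN form I mentioned above would look
something like this (I haven't run this exact version against my data, but
I believe it should behave equivalently given that temp_id contains no
NULLs):

```sql
WITH temp AS (
  SELECT temp_id
  FROM other_table
  WHERE ...
)
SELECT COUNT(*)
FROM main_table m
LEFT JOIN temp ON temp.temp_id = m.main_id
WHERE temp.temp_id IS NULL;  -- keep only rows with no match in temp
```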

My primary question is: why is this approach only possible (for data too
large for memory) when using NOT EXISTS, and not when using NOT IN?

I understand that the two expressions differ slightly in meaning: NOT IN
yields NULL (rather than true) for a non-matching row if the right-hand
side contains any NULL values (in this case there are none, so the two
queries should return the same COUNT). But if anything, I would expect that
to improve the performance of NOT IN, since a single pass through the data
would reveal whether any NULLs are present, at which point that information
could be used to short-circuit. So I am a bit baffled.
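
To illustrate what I mean about the NULL behavior, a toy example (this is
just standard three-valued logic, but stating it explicitly in case my
understanding is off):

```sql
SELECT 2 NOT IN (1, 3);     -- true: 2 matches neither value
SELECT 2 NOT IN (1, NULL);  -- NULL: the comparison 2 = NULL is unknown,
                            -- so the whole NOT IN is unknown and the row
                            -- would be filtered out
```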

Thanks very much for your help!

Best,
Lincoln

--
Lincoln Swaine-Moore
