I was able to speedup original query a lot by using CTE. It still uses seq
scan on `diff` table, but looks like it does this once:

explain
with
      diff as (select id from drug_refills_diff)
delete from drug_refills
where id in (select id from diff);
                                            QUERY
PLAN
---------------------------------------------------------------------------------------------------
 Delete on drug_refills  (cost=989844.94..990366.86 rows=456888836 width=46)
   CTE diff
     ->  Seq Scan on drug_refills_diff  (cost=0.00..720404.88 rows=11975088
width=16)
   ->  Nested Loop  (cost=269440.05..269961.98 rows=456888836 width=46)
         ->  HashAggregate  (cost=269439.48..269441.48 rows=200 width=56)
               Group Key: diff.id
               ->  CTE Scan on diff  (cost=0.00..239501.76 rows=11975088
width=56)
         ->  Index Scan using drug_refills_pkey on drug_refills
(cost=0.57..2.59 rows=1 width=22)
               Index Cond: (id = diff.id)
(9 rows)



2017-12-01 15:17 GMT+02:00 Danylo Hlynskyi <abcz2.upr...@gmail.com>:

> Oh, sorry, this happens on Postgresql 9.6.6. I've checked that on
> Postgresql 10.0 query plan from setup (1)
> now uses two seq scans, like in setup (2).
>
>
> 2017-12-01 15:03 GMT+02:00 Danylo Hlynskyi <abcz2.upr...@gmail.com>:
>
>> Hello. I want to remove rows from first table, that exist in second
>> (equality is done using PK). However I experience seq scan on second table,
>> which counters my intuition - I think it should be index-only. Because
>> tables are large, performance of query is very bad.
>>
>> However I got mixed results when trying to reproduce this behavior on
>> syntetic tables. Here I'll show 3 different plans, which I got for the same
>> query.
>>
>> 1. Setup is:
>> ---------------------------
>> create table diff (id uuid constraint diff_pkey primary key);
>> create table origin (id uuid constraint origin_pkey primary key);
>> ---------------------------
>>
>> The query generates correct plan, which performs only index scans:
>>
>> explain delete from origin where exists (select id from diff where
>> origin.id = diff.id);
>>                                         QUERY PLAN
>> ------------------------------------------------------------
>> -------------------------------
>>  Delete on origin  (cost=0.30..105.56 rows=1850 width=12)
>>    ->  Merge Semi Join  (cost=0.30..105.56 rows=1850 width=12)
>>          Merge Cond: (origin.id = diff.id)
>>          ->  Index Scan using origin_pkey on origin  (cost=0.15..38.90
>> rows=1850 width=22)
>>          ->  Index Scan using diff_pkey on diff  (cost=0.15..38.90
>> rows=1850 width=22)
>> (5 rows)
>>
>> 2. Setup is:
>> --------------------------------
>> create table origin (id uuid constraint origin_pkey primary key, data
>> jsonb);
>> create table diff (id uuid constraint diff_pkey primary key, data jsonb);
>> --------------------------------
>>
>> The query generates plan with two seq scans:
>>
>> explain delete from origin where exists (select id from diff where
>> origin.id = diff.id);
>>                                 QUERY PLAN
>> ------------------------------------------------------------
>> ---------------
>>  Delete on origin  (cost=34.08..69.49 rows=1070 width=12)
>>    ->  Hash Semi Join  (cost=34.08..69.49 rows=1070 width=12)
>>          Hash Cond: (origin.id = diff.id)
>>          ->  Seq Scan on origin  (cost=0.00..20.70 rows=1070 width=22)
>>          ->  Hash  (cost=20.70..20.70 rows=1070 width=22)
>>                ->  Seq Scan on diff  (cost=0.00..20.70 rows=1070 width=22)
>> (6 rows)
>>
>> 3. My real `origin` table has 26 fields and 800 billion rows, real `diff`
>> table has 12 million rows and the query generates plan with nested loop and
>> seq scan on `diff` table:
>>
>> explain delete from drug_refills origin where exists (select id from
>> drug_refills_diff diff where origin.id = diff.id);
>>                                                 QUERY PLAN
>> ------------------------------------------------------------
>> ----------------------------------------------
>>  Delete on drug_refills origin  (cost=0.57..22049570.11 rows=11975161
>> width=12)
>>    ->  Nested Loop  (cost=0.57..22049570.11 rows=11975161 width=12)
>>          ->  Seq Scan on drug_refills_diff diff  (cost=0.00..720405.61
>> rows=11975161 width=22)
>>          ->  Index Scan using drug_refills_pkey on drug_refills origin
>> (cost=0.57..1.77 rows=1 width=22)
>>                Index Cond: (id = diff.id)
>> (5 rows)
>>
>> I have run ANALYZE on both tables, but it didn't help. Here are column
>> types in origin and diff (same schema), if that matters:
>>
>> uuid
>> timestamp with time zone
>> timestamp with time zone
>> character varying(255)
>> character varying(255)
>> character varying(1024)
>> numeric(10,4)
>> integer
>> numeric(14,8)
>> numeric(14,8)
>> numeric(14,8)
>> numeric(14,8)
>> numeric(14,8)
>> character varying(16)
>> character varying(16)
>> character varying(16)
>> character varying(16)
>> character varying(16)
>> character varying(16)
>> date
>> jsonb
>> text[]
>> uuid
>> uuid
>> uuid
>> uuid
>>
>>
>

Reply via email to