Tom Lane, 24.07.2012 19:12:
Well, it would only help if you're running a PG version that's new
enough to recognize the NOT EXISTS as an anti-join; and even then,
it's possible that joining on a tid column forecloses enough plan
types that you don't get any real benefit. But I'm just guessing.
Can you show exactly what you tried and what EXPLAIN ANALYZE results
you got?
I am using 9.1.4 (as I said in my initial post).
I finally found a solution that runs fine:
DELETE FROM dupes a
WHERE EXISTS (SELECT 1
FROM dupes b
WHERE b.first_name = a.first_name
AND b.last_name = a.last_name
AND b.ctid > a.ctid);
The execution plan for this is:
Delete on public.dupes a (cost=14575.95..16978.87 rows=25000 width=12) (actual
time=2419.334..2419.334 rows=0 loops=1)
Buffers: shared hit=18029
-> Merge Semi Join (cost=14575.95..16978.87 rows=25000 width=12) (actual
time=2043.674..2392.707 rows=17097 loops=1)
Output: a.ctid, b.ctid
Merge Cond: ((a.first_name = b.first_name) AND (a.last_name =
b.last_name))
Join Filter: (b.ctid > a.ctid)
Buffers: shared hit=930
-> Sort (cost=7287.98..7475.48 rows=75000 width=20) (actual
time=1024.195..1030.051 rows=75000 loops=1)
Output: a.ctid, a.first_name, a.last_name
Sort Key: a.first_name, a.last_name
Sort Method: quicksort Memory: 8870kB
Buffers: shared hit=465
-> Seq Scan on public.dupes a (cost=0.00..1215.00 rows=75000
width=20) (actual time=0.025..23.234 rows=75000 loops=1)
Output: a.ctid, a.first_name, a.last_name
Buffers: shared hit=465
-> Sort (cost=7287.98..7475.48 rows=75000 width=20) (actual
time=1019.148..1028.483 rows=105841 loops=1)
Output: b.ctid, b.first_name, b.last_name
Sort Key: b.first_name, b.last_name
Sort Method: quicksort Memory: 8870kB
Buffers: shared hit=465
-> Seq Scan on public.dupes b (cost=0.00..1215.00 rows=75000
width=20) (actual time=0.017..19.133 rows=75000 loops=1)
Output: b.ctid, b.first_name, b.last_name
Buffers: shared hit=465
Total runtime: 2420.953 ms
Which is a lot better than the plan using "WHERE ctid NOT IN (.....)":
Delete on public.dupes (cost=1777.50..4925055.00 rows=37500 width=6) (actual
time=582515.094..582515.094 rows=0 loops=1)
Buffers: shared hit=18027
-> Seq Scan on public.dupes (cost=1777.50..4925055.00 rows=37500 width=6)
(actual time=1038.164..582332.927 rows=17097 loops=1)
Output: dupes.ctid
Filter: (NOT (SubPlan 1))
Buffers: shared hit=930
SubPlan 1
-> Materialize (cost=1777.50..1890.00 rows=7500 width=20) (actual
time=0.001..2.283 rows=35552 loops=75000)
Output: (min(b.ctid)), b.first_name, b.last_name
Buffers: shared hit=465
-> HashAggregate (cost=1777.50..1852.50 rows=7500 width=20)
(actual time=90.964..120.228 rows=57903 loops=1)
Output: min(b.ctid), b.first_name, b.last_name
Buffers: shared hit=465
-> Seq Scan on public.dupes b (cost=0.00..1215.00
rows=75000 width=20) (actual time=0.008..25.515 rows=75000 loops=1)
Output: b.ctid, b.first_name, b.last_name
Buffers: shared hit=465
Total runtime: 582517.711 ms
Using "WHERE id NOT IN (...)" is the fastest way:
Delete on public.dupes (cost=1871.25..3273.75 rows=37500 width=6) (actual
time=187.949..187.949 rows=0 loops=1)
Buffers: shared hit=18490
-> Seq Scan on public.dupes (cost=1871.25..3273.75 rows=37500 width=6)
(actual time=125.351..171.108 rows=17097 loops=1)
Output: dupes.ctid
Filter: (NOT (hashed SubPlan 1))
Buffers: shared hit=930
SubPlan 1
-> HashAggregate (cost=1777.50..1852.50 rows=7500 width=18) (actual
time=73.131..93.421 rows=57903 loops=1)
Output: min(b.id), b.first_name, b.last_name
Buffers: shared hit=465
-> Seq Scan on public.dupes b (cost=0.00..1215.00 rows=75000
width=18) (actual time=0.004..8.515 rows=75000 loops=1)
Output: b.id, b.first_name, b.last_name
Buffers: shared hit=465
Total runtime: 189.222 ms
Regards
Thomas
--
Sent via pgsql-performance mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance