>>> On Mon, Oct 22, 2007 at  1:30 PM, Simon Riggs wrote: 
> On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:
>> I've requested this before without response, but I'm asking again
>> because it just caused me pain again: could we get a TODO added to
>> have the planner recognize equivalent IN and EXISTS constructs and
>> have them compete on cost estimates?  I know it's not a trivial
>> improvement, but if it's on the list maybe someone will pick it up,
>> and I see it as the single biggest weakness in PostgreSQL
>> performance.
> 
> I'll pick it up as a default unless someone requests they have it
from
> me.
 
Since Simon's focus has shifted to other issues, I'm hoping this can
go onto the TODO list.
 
I'm adding some NOT EXISTS examples to the thread for completeness of
what someone might want to address while working on it.  For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over five
orders of magnitude.  (Six if you compare to the LEFT JOIN ... WHERE
not_null_right_column IS NULL trick.)
 
Below are the cost estimates of the three techniques for a
medium-sized table joining to a large table, and for a large table
joining to a small table.  The IN behavior has the worst worst-case
behavior, at least in queries that I've run, although many people
report that it is usually faster.  The technique of doing an existence
test with a LEFT JOIN and then checking whether a NOT NULL column from
the right-hand table is null is often faster than either technique,
and seldom much worse than the best technique for any given test.
 
Queries and plans attached.  Summary of costs below, in millions of
cost units. (Fractions of a million discarded.)
 
NOT IN (independent_subquery)
19745843, 5
 
WHERE NOT EXISTS
74, 318
 
LEFT JOIN WHERE not_null_right_column IS NULL
10, 17
 
These cost estimates tend to come out in pretty consistent ratio to
the actual run times.
 
-Kevin
cir=> explain
cir-> SELECT "A"."countyNo", "A"."caseNo", "A"."chargeSeqNo", "A"."jdgmtSeqNo", 
"A"."jdgmtHistSeqNo"
cir->   FROM "Jdgmt" "A"
cir->   WHERE "A"."jdgmtHistSeqNo" IS NOT NULL
cir->     AND ("A"."countyNo", "A"."caseNo", "A"."jdgmtHistSeqNo") NOT IN 
(SELECT "B"."countyNo", "B"."caseNo", "B"."histSeqNo" FROM "CaseHist" "B")
cir->   ORDER BY "countyNo", "caseNo", "chargeSeqNo", "jdgmtSeqNo"
cir-> ;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Scan using "Jdgmt_pkey" on "Jdgmt" "A"  
(cost=4003123.55..19745843977029.96 rows=6896890 width=24)
   Filter: (("jdgmtHistSeqNo" IS NOT NULL) AND (NOT (subplan)))
   SubPlan
     ->  Materialize  (cost=4003123.55..6253176.10 rows=167689505 width=20)
           ->  Seq Scan on "CaseHist" "B"  (cost=0.00..3262276.55 
rows=167689505 width=20)
(5 rows)

cir=>
cir=> explain
cir-> SELECT "A"."countyNo", "A"."caseNo", "A"."chargeSeqNo", "A"."jdgmtSeqNo", 
"A"."jdgmtHistSeqNo"
cir->   FROM "Jdgmt" "A"
cir->   WHERE "A"."jdgmtHistSeqNo" IS NOT NULL
cir->     AND NOT EXISTS
cir->     (
cir(>       SELECT * FROM "CaseHist" "B"
cir(>         WHERE "B"."countyNo" = "A"."countyNo"
cir(>           AND "B"."caseNo" = "A"."caseNo"
cir(>           AND "B"."histSeqNo" = "A"."jdgmtHistSeqNo"
cir(>     )
cir->   ORDER BY "countyNo", "caseNo", "chargeSeqNo", "jdgmtSeqNo"
cir-> ;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using "Jdgmt_pkey" on "Jdgmt" "A"  (cost=0.00..74966880.92 
rows=6896890 width=24)
   Filter: (("jdgmtHistSeqNo" IS NOT NULL) AND (NOT (subplan)))
   SubPlan
     ->  Index Scan using "CaseHist_pkey" on "CaseHist" "B"  (cost=0.00..10.07 
rows=2 width=323)
           Index Cond: ((("countyNo")::smallint = ($0)::smallint) AND 
(("caseNo")::text = ($1)::text) AND (("histSeqNo")::smallint = ($2)::smallint))
(5 rows)

cir=>
cir=> explain
cir-> SELECT "A"."countyNo", "A"."caseNo", "A"."chargeSeqNo", "A"."jdgmtSeqNo", 
"A"."jdgmtHistSeqNo"
cir->   FROM "Jdgmt" "A"
cir->   LEFT JOIN "CaseHist" "B"
cir->     ON ( "B"."countyNo" = "A"."countyNo"
cir(>      AND "B"."caseNo" = "A"."caseNo"
cir(>      AND "B"."histSeqNo" = "A"."jdgmtHistSeqNo"
cir(>        )
cir->   WHERE "A"."jdgmtHistSeqNo" IS NOT NULL
cir->     AND "B"."countyNo" IS NULL
cir->   ORDER BY "countyNo", "caseNo", "chargeSeqNo", "jdgmtSeqNo"
cir-> ;
                                                                                
    QUERY PLAN                                                                  
                
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10616941.83..10645732.70 rows=11516349 width=24)
   Sort Key: "A"."countyNo", "A"."caseNo", "A"."chargeSeqNo", "A"."jdgmtSeqNo"
   ->  Merge Right Join  (cost=2020174.37..9175701.57 rows=11516349 width=24)
         Merge Cond: ((("B"."countyNo")::smallint = "inner"."?column6?") AND 
(("B"."caseNo")::text = "inner"."?column7?") AND (("B"."histSeqNo")::smallint = 
"inner"."?column8?"))
         Filter: ("B"."countyNo" IS NULL)
         ->  Index Scan using "CaseHist_pkey" on "CaseHist" "B"  
(cost=0.00..5437201.95 rows=167689505 width=20)
         ->  Sort  (cost=2020174.37..2054658.82 rows=13793779 width=24)
               Sort Key: ("A"."countyNo")::smallint, ("A"."caseNo")::text, 
("A"."jdgmtHistSeqNo")::smallint
               ->  Seq Scan on "Jdgmt" "A"  (cost=0.00..275965.51 rows=13793779 
width=24)
                     Filter: ("jdgmtHistSeqNo" IS NOT NULL)
(10 rows)


cir=> explain
cir-> SELECT "A"."countyNo", "A"."caseNo", "A"."histSeqNo", "A"."eventType"
cir->   FROM "CaseHist" "A"
cir->   WHERE "A"."eventType" NOT IN (SELECT "B"."eventType" FROM "HistEvent" 
"B")
cir->   ORDER BY "countyNo", "caseNo", "histSeqNo"
cir-> ;
                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Index Scan using "CaseHist_pkey" on "CaseHist" "A"  (cost=88.39..5855942.54 
rows=83836608 width=27)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on "HistEvent" "B"  (cost=0.00..83.51 rows=1951 width=7)
(4 rows)

cir=> explain
cir-> SELECT "A"."countyNo", "A"."caseNo", "A"."histSeqNo", "A"."eventType"
cir->   FROM "CaseHist" "A"
cir->   WHERE NOT EXISTS
cir->     (
cir(>       SELECT * FROM "HistEvent" "B"
cir(>         WHERE "B"."eventType" = "A"."eventType"
cir(>     )
cir->   ORDER BY "countyNo", "caseNo", "histSeqNo"
cir-> ;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Index Scan using "CaseHist_pkey" on "CaseHist" "A"  (cost=0.00..318612009.10 
rows=83836608 width=27)
   Filter: (NOT (subplan))
   SubPlan
     ->  Index Scan using "HistEvent_pkey" on "HistEvent" "B"  (cost=0.00..1.87 
rows=1 width=119)
           Index Cond: (("eventType")::text = ($0)::text)
(5 rows)

cir=> explain
cir-> SELECT "A"."countyNo", "A"."caseNo", "A"."histSeqNo", "A"."eventType"
cir->   FROM "CaseHist" "A"
cir->   LEFT JOIN "HistEvent" "B"
cir->     ON ("B"."eventType" = "A"."eventType")
cir->   WHERE "B"."eventType" IS NULL
cir->   ORDER BY "countyNo", "caseNo", "histSeqNo"
cir-> ;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Sort  (cost=17563734.98..17773326.50 rows=83836608 width=27)
   Sort Key: "A"."countyNo", "A"."caseNo", "A"."histSeqNo"
   ->  Hash Left Join  (cost=107.90..5777165.80 rows=83836608 width=27)
         Hash Cond: (("A"."eventType")::text = ("B"."eventType")::text)
         Filter: ("B"."eventType" IS NULL)
         ->  Seq Scan on "CaseHist" "A"  (cost=0.00..3261959.66 rows=167673216 
width=27)
         ->  Hash  (cost=83.51..83.51 rows=1951 width=7)
               ->  Seq Scan on "HistEvent" "B"  (cost=0.00..83.51 rows=1951 
width=7)
(8 rows)
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to