One of the few situations where I experience poor performance under PostgreSQL, 
compared to other, commercial databases, is when an EXISTS predicate is used.  
EXISTS predicates often do perform quite well, but there are some situations 
where other products detect and use optimizations 
which PostgreSQL misses.  Looking back at the mailing list archives, it appears 
that optimizations similar to what other products use for both IN and EXISTS 
were added to the IN predicate circa 2003:
 
http://archives.postgresql.org/pgsql-hackers/2003-08/msg00438.php
 
  -Make IN/NOT IN have similar performance to EXISTS/NOT EXISTS (Tom)
 
There are many situations where the EXISTS predicate and an IN predicate with a 
table subquery produce identical results.  After reviewing the SQL standard and 
thinking about it a bit, I think that the equivalence holds unless:
 
(1)  the equivalent IN predicate is capable of producing a result of UNKNOWN, 
and
 
(2)  the predicate is used in a context where the difference between UNKNOWN 
and FALSE is significant.
 
One additional point: Martijn van Oosterhout pointed out in an earlier email 
that if the subquery contains any non-immutable functions there could be a 
difference, although he described that issue as "minor".
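
To make the two conditions concrete, here is a minimal sketch of where the 
forms agree and where they diverge.  It is an illustration only: it runs 
against an in-memory SQLite database through Python's sqlite3 module rather 
than PostgreSQL (an assumption made so the example is self-contained), and the 
tables outer_t and inner_t are hypothetical.  The NULL handling shown is the 
standard three-valued logic, which both products follow for these predicates.

```python
import sqlite3

# Hypothetical tables; inner_t deliberately contains a NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outer_t (id INTEGER);
    CREATE TABLE inner_t (id INTEGER);
    INSERT INTO outer_t VALUES (1), (2);
    INSERT INTO inner_t VALUES (1), (NULL);
""")

def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# In a plain WHERE clause, UNKNOWN and FALSE both reject the row,
# so IN and EXISTS return the same rows even though a NULL is present.
in_rows = rows(
    "SELECT id FROM outer_t o "
    "WHERE o.id IN (SELECT id FROM inner_t) ORDER BY id")
exists_rows = rows(
    "SELECT id FROM outer_t o "
    "WHERE EXISTS (SELECT 1 FROM inner_t i WHERE i.id = o.id) ORDER BY id")

# Under NOT, the difference between UNKNOWN and FALSE becomes significant:
# for id = 2, "2 IN (1, NULL)" is UNKNOWN, and NOT UNKNOWN is still UNKNOWN,
# so NOT IN rejects the row while NOT EXISTS accepts it.
not_in_rows = rows(
    "SELECT id FROM outer_t o "
    "WHERE o.id NOT IN (SELECT id FROM inner_t) ORDER BY id")
not_exists_rows = rows(
    "SELECT id FROM outer_t o "
    "WHERE NOT EXISTS (SELECT 1 FROM inner_t i WHERE i.id = o.id) ORDER BY id")

print("IN:        ", in_rows)
print("EXISTS:    ", exists_rows)
print("NOT IN:    ", not_in_rows)
print("NOT EXISTS:", not_exists_rows)
```

If inner_t.id were declared NOT NULL, condition (1) could not arise and all 
four queries would pair up as expected.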
 
The most common case for UNKNOWN logical values is when comparisons involve a 
NULL on either or both sides of an operator.  In some cases, such as the one I 
posted about a month ago, it is quickly clear to a human reader that none of 
the above conditions applies -- the columns are all NOT NULL (so the result of 
the comparisons can never be UNKNOWN), they are used in a context where UNKNOWN 
and FALSE produce the same results, and no non-immutable functions are invoked.
 
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01408.php
 
The plan for the EXISTS predicate (running in 8.2.3) has a cost of 72736.37, 
while the logically equivalent query using IN has a cost of 36.38 (and these do 
approximate reality), so if the faster option were visible to the planner, it 
would be chosen.  Some would argue that I should just change the query to use 
IN.  There are three problems with that.
 
(1)  It requires the use of a multi-value row value constructor, which is a 
construct not supported by all databases we currently use.  (We have a 
heterogeneous environment, where the same queries must run on multiple 
platforms.)  We have a somewhat ugly but valid query to use as a workaround 
which runs on all of our databases; it has a PostgreSQL cost of 130.98, which 
is tolerable, if not optimal.
 
(2)  I have seen a number of cases where the logically equivalent EXISTS 
predicate performs better than the IN predicate.  The failure to recognize 
equivalence and to cost both approaches risks suboptimal performance.  Avoiding 
that requires careful testing of both forms to coerce the planner into choosing 
the best plan.
 
(3)  I have been trying to move our application programmers away from a focus 
on how they want to navigate through the data, toward declaring what they want 
as the result.  (Some programmers routinely use cursors to navigate each table, 
row by row, based on what they think is the best plan, stuffing the data into a 
temporary table as they go, then selecting from the temporary table, when a 
single SELECT statement with a few subqueries will produce the desired data.)  
The current situation with these predicates diverts the focus from "what to 
show" back to "how to get it".
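
For readers unfamiliar with the construct mentioned in point (1), the sketch 
below shows a correlated EXISTS on a two-column key next to the equivalent IN 
written with a multi-value row value constructor.  This is not the query from 
the linked post: the parent and child tables are hypothetical, and the example 
runs on an in-memory SQLite database via Python's sqlite3 module (assuming a 
SQLite build of 3.15 or later, which is where row values were added) purely so 
it is self-contained.  All key columns are NOT NULL, so neither form can 
produce UNKNOWN.

```python
import sqlite3

# Hypothetical two-column key, all columns NOT NULL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (k1 INTEGER NOT NULL, k2 INTEGER NOT NULL);
    CREATE TABLE child  (k1 INTEGER NOT NULL, k2 INTEGER NOT NULL);
    INSERT INTO parent VALUES (1, 1), (1, 2), (2, 1);
    INSERT INTO child  VALUES (1, 2), (2, 1), (2, 1), (3, 3);
""")

# Form 1: correlated EXISTS, portable across most products.
exists_rows = conn.execute("""
    SELECT p.k1, p.k2
      FROM parent p
     WHERE EXISTS (SELECT 1 FROM child c
                    WHERE c.k1 = p.k1 AND c.k2 = p.k2)
     ORDER BY p.k1, p.k2
""").fetchall()

# Form 2: IN with a multi-value row value constructor on the left side.
in_rows = conn.execute("""
    SELECT p.k1, p.k2
      FROM parent p
     WHERE (p.k1, p.k2) IN (SELECT c.k1, c.k2 FROM child c)
     ORDER BY p.k1, p.k2
""").fetchall()

print("EXISTS:", exists_rows)
print("IN:    ", in_rows)
```

Both queries declare the same result; which one runs faster is exactly the 
decision that ideally would be left to the planner.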
 
I hate to see any queries run slower on PostgreSQL than on other databases, so 
I'm suggesting we address this.  We are talking about an optimization that I've 
seen in some other products for at least 15 years.
 
-Kevin
 

