Alvaro Herrera <alvhe...@2ndquadrant.com> writes: > Tom Lane wrote: >> That is not an adequate argument for such a change being okay. Postgres, >> with its extensible set of datatypes, has to be much more careful about >> the semantic soundness of optimizations than some other DBs do.
> Can we use the record_image_ops mechanism here? Don't see how that helps? But I went through the logic and think I convinced myself that pushdown to LHS only is okay, and no more questionable than what happens for INTERSECT. Argue thus: Assume a group of non-distinct-according-to-the-setop rows includes m rows from left-hand side, n rows from right-hand side, of which the upper qual would pass m' <= m and n' <= n rows respectively. UNION: unoptimized setop produces 1 row (since we may assume the group is nonempty). That row might or might not pass the upper qual, so we get 0 or 1 row out. User can be certain about what will happen only if m'=m and n'=n (row must pass qual) or m'=n'=0 (row must not pass qual). After optimizing by pushing qual down: we get one row if m' > 0 or n' > 0. This is equivalent to the unoptimized behavior if we assume that the setop always magically selects a representative row passing the qual if possible, so OK. INTERSECT: unoptimized setop produces 1 row if m>0 and n>0, which again might or might not pass the upper qual, so 0 or 1 row out. Optimized case: we get 1 row if m' > 0 and n' > 0. This is equivalent to unoptimized behavior if you allow that the representative row could have been taken from either input. (For example, if m' = m, n > 0, and n' = 0, producing 0 rows is legit only if the hypothetical representative row was taken from RHS.) That's legal per SQL, though I'm unsure offhand if our implementation actually would do that. INTERSECT ALL: unoptimized setop produces min(m,n) rows, which might or might not pass the upper qual, so anywhere from 0 to min(m,n) rows out. (It is probably possible to set a tighter upper bound when m', n' are small, but I've not bothered to reason that out.) Optimized case: we get min(m', n') rows out, which seems fine; again it could be produced by the unoptimized implementation selecting representative row(s) that happen to have that many members passing the upper qual. But actually, our implementation doesn't produce some random collection of min(m,n) elements of the row group. What you get is min(m,n) copies of a single representative row; therefore, assuming the upper qual is nonvolatile, either they all pass or none do, so the actual output is either 0 or min(m,n) duplicates. (The spec says "Let R be a row that is a duplicate of some row in ET1 or of some row in ET2 or both. Let m be the number of duplicates of R in ET1 and let n be the number of duplicates of R in ET2, where m >= 0 and n >= 0. [For INTERSECT ALL] the number of duplicates of R that T contains is the minimum of m and n." So depending on what you want to assume that "duplicates of R" means, you could argue that the spec actually requires our implementation's behavior here. It certainly doesn't appear to forbid it.) The optimized output is therefore distinguishable from unoptimized, because it could produce a number of rows that the unoptimized implementation cannot. But nobody's complained about it, and it's hard to see there being a legitimate beef. EXCEPT: unoptimized setop produces 1 row if m>0 and n=0, which again might or might not pass the upper qual, so 0 or 1 row out. If we were to push qual into both sides: we'd get one row if m' > 0 and n' = 0. So if the qual passes some of LHS and none of RHS, we'd get a row, breaking SQL semantics. If we push into LHS only: we get one row if m' > 0 and n = 0. Certainly then m > 0. So this is equivalent to behavior where the setop magically selects a representative LHS row passing the qual. OTOH, if m' = 0, then the setop wouldn't produce a row, but if it had that row would have failed the upper qual, so behavior is still equivalent. EXCEPT ALL: unoptimized setop produces max(m-n,0) rows, which might or might not pass the upper qual, so anywhere from 0 to max(m-n,0) rows out. (Again it is probably possible to set a tighter upper bound when m', n' are small.) Again, pushing into RHS would allow n to be reduced, perhaps allowing more rows to be returned than SQL would allow, so we can't do that. If we push into LHS only: we get max(m'-n,0) rows. Again, it would be possible for the unoptimized setop to select its representative rows in such a way that this happens (it just has to prefer LHS rows passing the qual over those that don't). Again, this doesn't quite match reality of the current setop implementation, since it returns max(m-n,0) identical rows, not that many random members of the current group. But it's still hard to believe anyone would have a legitimate beef about it. tl;dr: I now think what the patch proposes to do is legit. There's a heck of a lot of work to be done on the comments it's falsified, though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers