Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

Jim Nasby Sun, 19 Mar 2017 19:55:45 -0700

On 3/19/17 2:32 PM, Tom Lane wrote:

Jim Nasby <jim.na...@openscg.com> writes:

Having thought about it, I share Tom's concerns. And now I'm worried
about what happens if there are multiple separate OR clauses. I guess
those would be handled by separate UNIONs?


As proposed, the patch would try to optimize by splitting each OR clause
independently, and would choose whichever way gave the best cost estimate.
I'm not sure it's possible to do better than that, and even if it is,
I think improving it could be left for later.


Agreed.

I'd also considered an approach of de-duping on the basis of all relation
ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
in which the rel was eliminated entirely.  But that doesn't fix it,
because in this example the first arm would return (a.ctid, NULL) while
the second arm would return (NULL, b.ctid), so that the UNION step would
still fail to detect any duplication.  To make it work, we'd have to not
eliminate the joins, which would pretty much defeat the usefulness of
the optimization for your original example case.

It might still be worth-while in some circumstances. In your example, ifthere were these indexes:


a__id ON a(id), a__x ON a(x)
b__id ON b(id), b__y ON b(y)

then it might be faster to nested loop a__x=42 to b__id=a.id and unionthat with b__y=43 nested to a__id=b.id.


That said, now isn't the time to be adding more complexity.

So full joins definitely break this whole optimization.  I think it's okay
with left joins though, because when starting from "a left join b" it will
never be possible to remove "a" so we'll always include a.ctid in the
UNION de-duping step.  If b is removed in some arm, then it must be true
that we get exactly one left-join output row per a row, regardless of the
contents of b, in that arm.  The argument for the patch being okay is
essentially that we must get exactly one left-join output row per a row,
regardless of the contents of b, in *every* arm, because the various
modified versions of the OR clause can't affect that conclusion.  In some
of the arms we might not remove b, and we might even be able to reduce the
left join to an inner join, but there should still be no more than one
join output row per a row.  That being the case, it should be sufficient
to de-dup using a.ctid while ignoring b.ctid.

The only case I can think of is: would it be possible (perhaps nottoday; maybe in the future) for other parts of the query to affect joinelimination? I can't conceive of how that might happen, but if it didthen it's possible that the elimination would work differently with theUNION than it would with an OR.

The comment on join_is_removable() does mention that there's otherpotentially interesting cases that we can't handle now; it's maybe worthmentioning


Other than that, I can't see any issues with your logic.

Any clearer yet?

Absolutely. I think it would be very valuable to include that with theinitial comment in planunionor.c. Join reduction and removal is alreadytricky enough to wrap your head around.


Other comments:

+  * is retty mechanical, but we can't do it until we have a RelOptInfo for the


s/retty/pretty/

I suspect that in many systems single-table queries are far more commonthan CTEs, so maybe it's worth reversing those two tests insplit_join_or_clauses().

For the Unique path, it would be nice if the code did what would benecessary to consider a TID hash join, since that means a user couldcreate the appropriate operator and it would just be picked up.Certainly not worth much effort at this point though.

+       /*
+        * Must not have any volatile functions in FROM or WHERE (see notes at
+        * head of file).
+        */
+       if (contain_volatile_functions((Node *) parse->jointree))

Is there by chance anywhere else that needs to check that? Maybe worthadding the info to the Query struct if so.

+        * We insist that all baserels used in the query be plain relations, so


Dumb question... views have already be decomposed at this point, right?

Perhaps less dumb question... what happens if the original query alreadyhad setOps? AIUI setOps work has already been done by the timesplit_join_or_clauses() happens; I just want to check that that's OK.

I'm not sure a GUC is worth it... I suspect that any query with multiplerels and an OR condition is going to be expensive enough that whateveradditional plan time there is won't be noticeable.


I've verified that the patch still applies and make check-world is clean.
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG                 http://OpenSCG.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

Reply via email to