Github user nsyca commented on the issue:

    https://github.com/apache/spark/pull/15763
  
    @srinathshankar It is intentional. It is not feasible to enumerate which 
in-between operations we can allow and which we cannot. Correlated predicates 
can be placed at an arbitrary depth inside a subquery. Spark may not support 
more than one level of correlation today, but it may in a future version. So my 
argument is: if we cannot prove that pulling a correlated predicate up through 
an operation still preserves its original semantics, then we should not do it. 
The paper @rxin mentioned in the previous note 
(http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf)
 makes this claim in the paragraph after Q2 on page 2.
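
    To make the arbitrary-depth point concrete, here is a hypothetical query 
(the tables t1, t2, t3 and their columns are made up for illustration and do 
not come from this PR) in which the correlated predicate t3.c = t1.a sits two 
levels below the outer query block:

        SELECT *
        FROM   t1
        WHERE  EXISTS (SELECT 1
                       FROM   t2
                       WHERE  t2.b IN (SELECT t3.c
                                       FROM   t3
                                       WHERE  t3.c = t1.a))  -- correlated two levels deep

    Pulling t3.c = t1.a all the way up would require reasoning about every 
operator it crosses, which is the per-operator analysis described above.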
    
    This PR is not a full solution. It is intended as a temporary stop-gap 
that closes off the incorrect-result cases Spark exposes today. It could be 
argued that your example above is a regression, but the statement can be 
rewritten to work by collapsing the two levels of subselects (sketched below). 
It is harder to implement a solution that walks the whole plan tree of a 
subquery to determine through which operations the correlated predicate may 
safely be pulled.
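
    Since the original example is not reproduced here, the following is only a 
hypothetical sketch (with made-up tables t1 and t2) of the kind of rewrite 
meant by collapsing the two levels of subselects. A correlated predicate 
underneath an extra derived-table subselect:

        SELECT *
        FROM   t1
        WHERE  t1.a IN (SELECT x.b
                        FROM   (SELECT t2.b, t2.c
                                FROM   t2) x
                        WHERE  x.c = t1.c)

    can be written equivalently with a single subselect, so the correlation 
spans only one level:

        SELECT *
        FROM   t1
        WHERE  t1.a IN (SELECT t2.b
                        FROM   t2
                        WHERE  t2.c = t1.c)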
    
    A permanent solution, as I proposed in one of my comments above, is to 
move the transformation of correlated predicates to the Optimizer phase and 
leave the Analyzer phase to just resolve the references and validate that the 
input SQL is valid. This way, the two subselects in your example would 
probably first be merged, and then the correlated-predicate pull-up would 
follow.

