[GitHub] spark pull request #15763: [SPARK-17348][SQL] Incorrect results from subquer...

hvanhovell Mon, 07 Nov 2016 12:18:31 -0800

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15763#discussion_r86859510
  
    --- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
    @@ -1044,6 +1044,34 @@ class Analyzer(
               failOnOuterReference(p)
               p
           }
    +
    +      // SPARK-17348
    +      // Looking for a potential incorrect result case.
    +      // When a correlated predicate is a non-equality predicate
    +      // it must be placed at the immediate child operator.
    +      // Otherwise, the pull up of the correlated predicate
    +      // will generate a plan with a different semantics
    +      // which could return incorrect result.
    +      var continue : Boolean = true
    --- End diff --
    
    @nsyca thanks for the explanation of how DB2 works with subqueries. A 
different perspective or approach can be very helpful; we all suffer from 
myopia at some point. It most certainly has merit to add a general node for 
subquery processing to Spark. Do you have time to work on this for Spark 2.2?
    
    I would also like to take the opportunity to explain why we do so much 
rewriting during analysis. We wanted support the following use case:
    ```sql
    -- hive: subquery_exists_having.q
    select b.key, min(b.value)
    from src b
    group by b.key
    having exists ( select a.key
                    from src a
                    where a.value > 'val_9' and a.value = min(b.value)
                    )
    ```
    The difficulty here is that we need to evaluate the `min(b.value)` in the 
aggregate. So we needed a way to extract the entire `min(b.value)` expression. 
The most straightforward way was to extract the entire predicate and rewrite 
the tree in the process. This is quite an aggressive approach, and it breaks as 
soon as you cannot/should not move the predicate. In hindsight it might have 
been better to isolate the entire outer expression instead of only isolating 
the outer reference, and to do the rewriting in a later stage.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #15763: [SPARK-17348][SQL] Incorrect results from subquer...

Reply via email to