[ 
https://issues.apache.org/jira/browse/SPARK-31519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31519:
----------------------------------
    Affects Version/s: 2.3.4

> Cast in having aggregate expressions returns the wrong result
> -------------------------------------------------------------
>
>                 Key: SPARK-31519
>                 URL: https://issues.apache.org/jira/browse/SPARK-31519
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.4, 2.4.5, 3.0.0
>            Reporter: Yuanjian Li
>            Assignee: Yuanjian Li
>            Priority: Major
>              Labels: correctness
>             Fix For: 2.4.6, 3.0.0
>
>
> Cast in having aggregate expressions returns the wrong result.
> See the tests below:
> {code:java}
> scala> spark.sql("create temp view t(a, b) as values (1,10), (2, 20)")
> res0: org.apache.spark.sql.DataFrame = []
> scala> val query = """
>      | select sum(a) as b, '2020-01-01' as fake
>      | from t
>      | group by b
>      | having b > 10;"""
> scala> spark.sql(query).show()
> +---+----------+
> |  b|      fake|
> +---+----------+
> |  2|2020-01-01|
> +---+----------+
> scala> val query = """
>      | select sum(a) as b, cast('2020-01-01' as date) as fake
>      | from t
>      | group by b
>      | having b > 10;"""
> scala> spark.sql(query).show()
> +---+----+
> |  b|fake|
> +---+----+
> +---+----+
> {code}
> The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING 
> query, and Spark has a special analyzer rule ResolveAggregateFunctions to 
> resolve the aggregate functions and grouping columns in the Filter operator.
>  
> It works for simple cases, but only in a fragile way, because it relies on 
> rule execution order:
> 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes 
> inside aggregate functions, but the function itself is still unresolved as 
> it's an UnresolvedFunction. This prevents the Filter operator from being 
> resolved, because its child Aggregate operator is still unresolved.
> 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the 
> Aggregate operator resolved.
> 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child 
> is a resolved Aggregate. This rule can correctly resolve the grouping columns.
>  
> In the example query, I added a CAST, which needs to be resolved by rule 
> ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 
> 3 because the Aggregate operator is still unresolved at that time. The 
> analyzer then starts the next round, and the Filter operator is resolved by 
> ResolveReferences, which wrongly resolves the grouping columns.
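
The ordering problem described above can be sketched with a toy model. This is not Spark's analyzer code: the Plan class, its fields, and the four rule functions are hypothetical stand-ins that only mimic the behaviour the description attributes to each rule, and the batch is assumed to run the rules in the order listed until nothing changes.

```scala
import scala.annotation.tailrec

// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
object RuleOrderSketch {
  final case class Plan(
      functionResolved: Boolean,      // has sum(a) been resolved?
      castResolved: Boolean,          // has the CAST been given a time zone?
      havingTarget: Option[String]) { // what `b` in HAVING was bound to
    // The Aggregate counts as resolved only when all its expressions are.
    def aggregateResolved: Boolean = functionResolved && castResolved
  }

  type Rule = Plan => Plan

  // Binds `b` against the output of an already resolved child (the alias).
  val resolveReferences: Rule = p =>
    if (p.havingTarget.isEmpty && p.aggregateResolved)
      p.copy(havingTarget = Some("alias sum(a)"))
    else p

  val resolveFunctions: Rule = _.copy(functionResolved = true)

  // The special rule: binds `b` to the grouping column, but only if the
  // child Aggregate is already resolved when the rule runs.
  val resolveAggregateFunctions: Rule = p =>
    if (p.havingTarget.isEmpty && p.aggregateResolved)
      p.copy(havingTarget = Some("grouping column t.b"))
    else p

  val resolveTimeZone: Rule = _.copy(castResolved = true)

  // Assumed order within one analyzer round, as in the description.
  val batch: Seq[Rule] = Seq(
    resolveReferences, resolveFunctions, resolveAggregateFunctions, resolveTimeZone)

  // Apply the whole batch repeatedly until the plan stops changing.
  @tailrec
  def analyze(plan: Plan): Plan = {
    val next = batch.foldLeft(plan)((p, rule) => rule(p))
    if (next == plan) plan else analyze(next)
  }

  def main(args: Array[String]): Unit = {
    // No CAST in the select list: nothing is waiting on ResolveTimeZone.
    val withoutCast = analyze(Plan(functionResolved = false, castResolved = true, havingTarget = None))
    // With the CAST: the Aggregate stays unresolved until the end of round one.
    val withCast = analyze(Plan(functionResolved = false, castResolved = false, havingTarget = None))
    println(s"without CAST: HAVING b -> ${withoutCast.havingTarget.get}")
    println(s"with CAST:    HAVING b -> ${withCast.havingTarget.get}")
  }
}
```

In this model the query without the CAST binds the HAVING reference to the grouping column in the first round, while the query with the CAST only gets a resolved Aggregate at the end of the first round, so the stand-in for ResolveReferences reaches the Filter first in the second round and binds it to the alias.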



