Github user aokolnychyi commented on the issue:

    https://github.com/apache/spark/pull/19193
  
    @cloud-fan @hvanhovell I created PR #21473 that fixes the StackOverflowError.
    
    Apart from that, I think there are other potential problems.
    
    **1. Window functions inside WHERE and HAVING**
    
    Why such cases should be prohibited is explained [here](https://stackoverflow.com/questions/13997177/why-no-windowed-functions-in-where-clauses). Spark, on the other hand, does not handle this explicitly and fails with a non-descriptive exception.
    
    ```
    val df = Seq((1, 2), (1, 3), (2, 4), (5, 5)).toDF("a", "b")
    df.createTempView("t1")
    
    spark.sql("SELECT t1.a FROM t1 WHERE RANK() OVER(ORDER BY t1.b) = 
1").explain(true)
    spark.sql("SELECT t1.a FROM t1 WHERE RANK() OVER(ORDER BY t1.b) = 
1").show(false)
    
    Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: rank(input[1, int, false]) windowspecdefinition(input[1, int, false] ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()))
        at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
        at org.apache.spark.sql.catalyst.expressions.WindowExpression.doGenCode(windowExpressions.scala:278)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
        at scala.Option.getOrElse(Option.scala:121)
        ...
    ```
    
    ```
    val df = Seq((1, 2), (1, 3), (2, 4), (5, 5)).toDF("a", "b")
    df.createTempView("t1")
    
    spark.sql("SELECT t1.a, MAX(t1.b) FROM t1 GROUP BY t1.a HAVING RANK() 
OVER(ORDER BY t1.a) = 1").explain(true)
    spark.sql("SELECT t1.a, MAX(t1.b) FROM t1 GROUP BY t1.a HAVING RANK() 
OVER(ORDER BY t1.a) = 1").show(false)
    
    Exception in thread "main" java.lang.UnsupportedOperationException: Cannot evaluate expression: rank(input[1, int, false]) windowspecdefinition(input[1, int, false] ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()))
        at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
        at org.apache.spark.sql.catalyst.expressions.WindowExpression.doGenCode(windowExpressions.scala:278)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
        at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
        at scala.Option.getOrElse(Option.scala:121)
    
    ```
    Should this be explicitly validated?
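    
    For reference, the usual workaround is to evaluate the window function in a subquery and filter on its alias in the outer query, since window functions are only evaluated after ``WHERE`` and ``HAVING``. A minimal sketch against the same ``t1`` view (the ``ranked`` alias and ``rnk`` column are my own names):
    
    ```
    spark.sql("""
      SELECT ranked.a
      FROM (SELECT t1.a, RANK() OVER(ORDER BY t1.b) AS rnk FROM t1) ranked
      WHERE ranked.rnk = 1
    """).show(false)
    ```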
    
    **2. HAVING clause using the Dataset API**
    
    It seems we cannot use a HAVING-style filter with aggregate functions in the Dataset API if the same query contains a window function.
    
    The following query works correctly because ``ResolveAggregateFunctions`` applies only once the complete plan is available, i.e., when you call ``where``.
    
    ```
    val df = Seq((1, 2), (1, 3), (2, 4), (5, 5)).toDF("a", "b")
    df.groupBy('a).agg(max('b)).where(sum('b) === 5).show(false)
    +---+------+
    |a  |max(b)|
    +---+------+
    |1  |3     |
    |5  |5     |
    +---+------+
    ```
    
    The query below, however, fails even though it is valid (note the window function).
    ```
    import org.apache.spark.sql.expressions.Window
    
    val df = Seq((1, 2), (1, 3), (2, 4), (5, 5)).toDF("a", "b")
    val window = Window.orderBy('a)
    
    df.groupBy('a).agg(max('b), rank().over(window)).where(sum('b) === 5).show(false)
    
    Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`b`' given input columns: [a, max(b), RANK() OVER (ORDER BY a ASC NULLS FIRST unspecifiedframe$())];;
    'Filter (sum('b) = 5)
    +- AnalysisBarrier
          +- Project [a#5, max(b)#14, RANK() OVER (ORDER BY a ASC NULLS FIRST unspecifiedframe$())#15]
             +- Project [a#5, max(b)#14, RANK() OVER (ORDER BY a ASC NULLS FIRST unspecifiedframe$())#15, RANK() OVER (ORDER BY a ASC NULLS FIRST unspecifiedframe$())#15]
                +- Window [rank(a#5) windowspecdefinition(a#5 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS RANK() OVER (ORDER BY a ASC NULLS FIRST unspecifiedframe$())#15], [a#5 ASC NULLS FIRST]
                   +- Aggregate [a#5], [a#5, max(b#6) AS max(b)#14]
                      +- Project [_1#2 AS a#5, _2#3 AS b#6]
                         +- LocalRelation [_1#2, _2#3]
    ```
    
    It fails because ``ExtractWindowExpressions`` applies when you call ``agg``, not ``where``. At that point, the full plan is not yet available, so ``ExtractWindowExpressions`` does not hit the correct case.
    
    The same query works in SQL.
    
    ```
    val df = Seq((1, 2), (1, 3), (2, 4), (5, 5)).toDF("a", "b")
    df.createTempView("t1")
    
    spark.sql("SELECT t1.a, MAX(t1.b), RANK() OVER(ORDER BY t1.a) FROM t1 GROUP 
BY t1.a HAVING SUM(t1.b) = 5").explain(true)
    spark.sql("SELECT t1.a, MAX(t1.b), RANK() OVER(ORDER BY t1.a) FROM t1 GROUP 
BY t1.a HAVING SUM(t1.b) = 5").show(false)
    
    
    +---+------+-----------------------------------------------------------------------------------------+
    |a  |max(b)|RANK() OVER (ORDER BY a ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)|
    +---+------+-----------------------------------------------------------------------------------------+
    |1  |3     |1                                                                                        |
    |5  |5     |2                                                                                        |
    +---+------+-----------------------------------------------------------------------------------------+
    ```
    
    I am just curious whether my statement is correct: you cannot use a HAVING clause in the Dataset API if your query contains a window function. Or am I missing something?
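    
    For completeness, a sketch of a Dataset-side rewrite that I believe is equivalent (my own workaround, not something this PR addresses): apply the HAVING-style filter before attaching the window function, which mirrors the SQL evaluation order where HAVING is applied before window functions are computed. The ``sum_b`` and ``rnk`` names are mine:
    
    ```
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    
    val df = Seq((1, 2), (1, 3), (2, 4), (5, 5)).toDF("a", "b")
    
    df.groupBy('a)
      .agg(max('b), sum('b).as("sum_b"))                  // materialize the HAVING expression as a column
      .where('sum_b === 5)                                // filter on the resolved column instead of sum('b)
      .withColumn("rnk", rank().over(Window.orderBy('a))) // attach the window only after the filter
      .drop("sum_b")
      .show(false)
    ```
    
    Computing the rank before the filter would rank the pre-filter groups and yield different values, which is why the window has to come after ``where`` here.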

