Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/12850
  
    Thank you for the deep discussion on this. My thoughts are as follows.
    
    For 1), there are **machine-generated** queries from BI tools. This is an important category of queries. In many cases, BI tools (or other tools with a UI) generate queries by simple rules, and those rules do not care about the shape of the output queries. Optimization is the role of the **DBMS** or **Spark**. So, static optimizations are always important. This PR also minimizes the size of the generated code.
    
    For 2), other optimizers already **remove** or **duplicate** UDF invocations. Spark does not give control over the execution order. As you know, we already concluded that we should leave an explicit note like the following about this (in SPARK-15282 and https://github.com/apache/spark/pull/13087).
    ```
    Note that the user-defined functions must be deterministic. Due to optimization,
    duplicate invocations may be eliminated or the function may even be invoked more
    times than it is present in the query.
    ```
    
    For 3), could you give some problematic real cases? This PR reorders only **additions** and **multiplications**, and I think it does not change the final result value. The following is the behavior of current Spark (not this PR; you can see that in the physical plan).
    
    ```scala
    scala> sql("select 2147483640 + a + 7 from (select explode(array(1,2,3)) a)").explain()
    == Physical Plan ==
    *Project [((2147483640 + a#8) + 7) AS ((2147483640 + a) + 7)#9]
    +- Generate explode([1,2,3]), false, false, [a#8]
       +- Scan OneRowRelation[]

    scala> sql("select 2147483640 + a + 7 from (select explode(array(1,2,3)) a)").collect()
    res1: Array[org.apache.spark.sql.Row] = Array([-2147483648], [-2147483647], [-2147483646])

    scala> sql("select a + 2147483647 from (select explode(array(1,2,3)) a)").collect()
    res2: Array[org.apache.spark.sql.Row] = Array([-2147483648], [-2147483647], [-2147483646])

    scala> sql("select 214748364 * a from (select explode(array(1,2,3)) a)").collect()
    res3: Array[org.apache.spark.sql.Row] = Array([214748364], [429496728], [644245092])

    scala> sql("select 214748364 * a * 10 from (select explode(array(1,2,3)) a)").collect()
    res4: Array[org.apache.spark.sql.Row] = Array([2147483640], [-16], [2147483624])

    scala> sql("select a * 2147483640 from (select explode(array(1,2,3)) a)").collect()
    res5: Array[org.apache.spark.sql.Row] = Array([2147483640], [-16], [2147483624])
    ```
    
    The optimization in this PR will behave the same way as shown above.
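
    The reason the reordering is safe for `Int` is that 32-bit two's-complement addition and multiplication wrap modulo 2^32, so they remain associative and commutative even on overflow. A minimal plain-Scala sketch (not Spark code) illustrating this:

    ```scala
    // Sketch: reordering/constant-folding Int additions and multiplications
    // cannot change the result, because wrap-around arithmetic is associative.
    object OverflowReorder {
      def main(args: Array[String]): Unit = {
        val a = 1
        // (2147483640 + a) + 7 vs. a + (2147483640 + 7): both wrap to the
        // same value, -2147483648, matching the Spark output above.
        val leftToRight = (2147483640 + a) + 7
        val folded      = a + (2147483640 + 7) // constants folded first
        println(leftToRight == folded)         // true
        // Same for multiplication: (214748364 * a) * 10 == a * 2147483640
        println((214748364 * a) * 10 == a * (214748364 * 10)) // true
      }
    }
    ```

    Note this argument holds for integer types; reordering floating-point operands is a different story, since `Double` addition is not associative.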

