Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/12850

Thank you for the deep discussion on this. Here is my thinking.

For 1), there are **machine-generated** queries from BI tools, and this is an important category of queries. In many cases, BI tools (or other tools with a UI) generate queries by simple rules, and those rules do not care about the shape of the output queries; the optimization is the role of the **DBMS** or **Spark**. So static optimizations are always important. This PR also minimizes the size of the generated code.

For 2), other optimizers already **remove** or **duplicate** UDFs, and Spark does not give control over the execution order. As you know, we already reached the conclusion to leave an explicit note like the following for this (in SPARK-15282 and https://github.com/apache/spark/pull/13087).

```
Note that the user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
```

For 3), could you give some problematic real cases? This PR reorders only **additions** and **multiplications**, and I think it does not change the final result value. The following is the behavior of current Spark (not this PR; you can see that in the physical plan).
```scala
scala> sql("select 2147483640 + a + 7 from (select explode(array(1,2,3)) a)").explain()
== Physical Plan ==
*Project [((2147483640 + a#8) + 7) AS ((2147483640 + a) + 7)#9]
+- Generate explode([1,2,3]), false, false, [a#8]
   +- Scan OneRowRelation[]

scala> sql("select 2147483640 + a + 7 from (select explode(array(1,2,3)) a)").collect()
res1: Array[org.apache.spark.sql.Row] = Array([-2147483648], [-2147483647], [-2147483646])

scala> sql("select a + 2147483647 from (select explode(array(1,2,3)) a)").collect()
res2: Array[org.apache.spark.sql.Row] = Array([-2147483648], [-2147483647], [-2147483646])

scala> sql("select 214748364 * a from (select explode(array(1,2,3)) a)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([214748364], [429496728], [644245092])

scala> sql("select 214748364 * a * 10 from (select explode(array(1,2,3)) a)").collect()
res4: Array[org.apache.spark.sql.Row] = Array([2147483640], [-16], [2147483624])

scala> sql("select a * 2147483640 from (select explode(array(1,2,3)) a)").collect()
res5: Array[org.apache.spark.sql.Row] = Array([2147483640], [-16], [2147483624])
```

Apparently, the optimization of this PR will work like the above.
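As a side note on why the reordering above is value-preserving even under overflow: JVM `Int` addition and multiplication wrap modulo 2^32, and wrapped arithmetic stays associative and commutative. A minimal standalone Scala sketch (plain JVM arithmetic, not Spark code; the object name `ReorderSafety` is just for illustration) checking the same expressions as the spark-shell session:

```scala
// Sketch: folding (2147483640 + a) + 7 into a + 2147483647, and
// (214748364 * a) * 10 into a * 2147483640, gives identical wrapped
// Int results, because two's-complement arithmetic is modular.
object ReorderSafety {
  def main(args: Array[String]): Unit = {
    for (a <- Seq(1, 2, 3)) {
      assert((2147483640 + a) + 7 == a + 2147483647)
      assert((214748364 * a) * 10 == a * 2147483640)
    }
    println("constant folding preserves wrapped Int results")
  }
}
```

This is the same reason the `res1`/`res2` and `res4`/`res5` pairs in the session agree: the intermediate overflow happens either way, and the wrapped final values coincide.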