Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/12850
Thank you for the deep discussion on this. Here is how I think about it.
For 1), there are **machine-generated** queries produced by BI tools. This is an
important category of queries. In many cases, BI tools (or other tools with a UI)
generate queries by simple rules, and those rules do not care about the shape of
the output queries. Optimizing them is the role of the **DBMS** or **Spark**, so
static optimizations are always important. This PR also minimizes the size of the
generated code.
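The kind of static optimization meant here can be sketched in plain Scala: given an integer expression with constants scattered through a chain of additions, collect the constants and fold them into a single literal at optimization time. The ADT and `reorderAdd` below are hypothetical names for illustration, not the PR's actual rule.

```scala
// Minimal sketch of reassociating additions so constants fold together,
// assuming JVM Int arithmetic with the usual two's-complement wraparound.
sealed trait Expr
case class Lit(v: Int) extends Expr       // a constant literal
case class Col(name: String) extends Expr // a column reference
case class Add(l: Expr, r: Expr) extends Expr

// Flatten a chain of Adds into its leaves.
def flatten(e: Expr): List[Expr] = e match {
  case Add(l, r) => flatten(l) ++ flatten(r)
  case other     => List(other)
}

// Reorder: fold all literals into one, keep non-literal leaves as-is.
def reorderAdd(e: Expr): Expr = {
  val (lits, rest) = flatten(e).partition(_.isInstanceOf[Lit])
  val folded = lits.collect { case Lit(v) => v }.sum
  rest.foldLeft(Lit(folded): Expr)(Add(_, _))
}

// "2147483640 + a + 7" becomes "2147483647 + a": one literal instead of two.
val before = Add(Add(Lit(2147483640), Col("a")), Lit(7))
val after  = reorderAdd(before) // Add(Lit(2147483647), Col("a"))
```

A machine-generated query full of such scattered constants then needs fewer expression nodes, and correspondingly less generated code.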
For 2), other optimizers already **remove** or **duplicate** UDF invocations, and
Spark does not give control over the execution order. As you know, we already
concluded to leave an explicit note like the following for this (in
SPARK-15282 and https://github.com/apache/spark/pull/13087).
```
Note that the user-defined functions must be deterministic. Due to optimization,
duplicate invocations may be eliminated or the function may even be invoked
more times than it is present in the query.
```
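The note matters because an optimizer that deduplicates invocations assumes that calling a function once and reusing the value is equivalent to calling it twice. A plain-Scala sketch of the two evaluation strategies (the `calls` counter is only there to make the invocation count observable):

```scala
// Sketch: why duplicate-invocation elimination requires determinism.
var calls = 0
def f(x: Int): Int = { calls += 1; x * 2 } // deterministic: result depends only on x

// Naive plan: evaluate f twice, e.g. for "SELECT f(a), f(a) FROM t".
def naive(x: Int): (Int, Int) = (f(x), f(x))

// Optimized plan: common-subexpression elimination calls f once and reuses it.
def optimized(x: Int): (Int, Int) = { val v = f(x); (v, v) }
```

For a deterministic `f` both plans return the same pair, so the optimizer is free to pick either; for a nondeterministic `f` (say, one reading a random source) they need not agree, which is exactly why the documentation restricts UDFs to deterministic functions.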
For 3), could you give some problematic real cases? This PR reorders only
**additions** or **multiplications**, and I believe it does not change the
final result value. The following is the behavior of current Spark (not this
PR; you can see that in the physical plan).
```scala
scala> sql("select 2147483640 + a + 7 from (select explode(array(1,2,3)) a)").explain()
== Physical Plan ==
*Project [((2147483640 + a#8) + 7) AS ((2147483640 + a) + 7)#9]
+- Generate explode([1,2,3]), false, false, [a#8]
   +- Scan OneRowRelation[]

scala> sql("select 2147483640 + a + 7 from (select explode(array(1,2,3)) a)").collect()
res1: Array[org.apache.spark.sql.Row] = Array([-2147483648], [-2147483647], [-2147483646])

scala> sql("select a + 2147483647 from (select explode(array(1,2,3)) a)").collect()
res2: Array[org.apache.spark.sql.Row] = Array([-2147483648], [-2147483647], [-2147483646])

scala> sql("select 214748364 * a from (select explode(array(1,2,3)) a)").collect()
res3: Array[org.apache.spark.sql.Row] = Array([214748364], [429496728], [644245092])

scala> sql("select 214748364 * a * 10 from (select explode(array(1,2,3)) a)").collect()
res4: Array[org.apache.spark.sql.Row] = Array([2147483640], [-16], [2147483624])

scala> sql("select a * 2147483640 from (select explode(array(1,2,3)) a)").collect()
res5: Array[org.apache.spark.sql.Row] = Array([2147483640], [-16], [2147483624])
```
The optimization in this PR behaves the same as the above: the reordered expression produces the same values the unoptimized one already does.
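The reason the reordering is value-preserving: JVM `Int` addition and multiplication wrap around in two's complement, and wrapping arithmetic is still associative and commutative. The same numbers as in the shell session above, checked in plain Scala:

```scala
// Two's-complement Int arithmetic wraps on overflow but remains associative
// and commutative, so reassociating + or * preserves the final result.
val as = Seq(1, 2, 3)

// (2147483640 + a) + 7  vs  a + (2147483640 + 7) == a + 2147483647
val addLeft  = as.map(a => (2147483640 + a) + 7)
val addRight = as.map(a => a + 2147483647)
// Both yield Seq(-2147483648, -2147483647, -2147483646), matching res1/res2.

// (214748364 * a) * 10  vs  a * (214748364 * 10) == a * 2147483640
val mulLeft  = as.map(a => (214748364 * a) * 10)
val mulRight = as.map(a => a * 2147483640)
// Both yield Seq(2147483640, -16, 2147483624), matching res4/res5.
```

Each pair overflows at different intermediate steps, yet the wrapped results coincide, which is what the collected `res` values above show at the SQL level.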