GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/15048
[SPARK-17409] [SQL] Do Not Optimize Query in CTAS More Than Once
### What changes were proposed in this pull request?
As explained in https://github.com/apache/spark/pull/14797:
>Some analyzer rules make assumptions about logical plans, and the optimizer
may break these assumptions. We should not pass an optimized query plan into
QueryExecution (where it will be analyzed again), otherwise we may hit some
weird bugs. For example, we have a rule for decimal calculation that promotes
the precision before binary operations, using PromotePrecision as a
placeholder to indicate that the rule should not be applied twice. But an
optimizer rule removes this placeholder, which breaks the assumption; the
rule is then applied twice and produces a wrong result.
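The placeholder mechanism the quote describes can be sketched in plain Scala (no Spark dependencies; `Promoted`, `promote`, and `stripMarkers` are illustrative stand-ins, not Spark's actual classes):

```scala
// A marker-based "apply once" rule, and how an optimizer pass that
// strips the marker makes the rule fire twice on re-analysis.
sealed trait Expr
case class Dec(precision: Int) extends Expr    // a decimal-typed expression
case class Promoted(child: Expr) extends Expr  // placeholder: rule already ran

// Analyzer rule: widen the precision once, then wrap in the placeholder.
def promote(e: Expr): Expr = e match {
  case Dec(p)      => Promoted(Dec(p + 10))  // widen once
  case p: Promoted => p                      // placeholder present => skip
}

// Optimizer pass that removes placeholders it considers redundant.
def stripMarkers(e: Expr): Expr = e match {
  case Promoted(child) => child
  case other           => other
}

val analyzed   = promote(Dec(18))        // Promoted(Dec(28))
val optimized  = stripMarkers(analyzed)  // Dec(28): placeholder lost
val reAnalyzed = promote(optimized)      // Promoted(Dec(38)): widened twice
```

Re-running the analyzer on the untouched plan is safe (`promote` is a no-op on `Promoted`); it is only after the optimizer strips the marker that the rule fires a second time, which mirrors the wrong decimal result shown below.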
We should not optimize the query in CTAS more than once. For example,
```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt =
  "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```
Before this PR, the results did not match:
```
== Results ==
!== Correct Answer - 2 == == Spark Answer - 2 ==
![100,100.000000000000000000] [100,null]
[99,99.000000000000000000] [99,99.000000000000000000]
```
After this PR, the results match:
```
+---+----------------------+
|id |num |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```
In this PR, we no longer treat the `query` in CTAS as a child. Thus, the
`query` is not optimized again when the CTAS statement itself is optimized.
However, we still need to analyze it to normalize and verify the CTAS in the
Analyzer. We do this in the analyzer rule `PreprocessDDL`, because so far only
this rule needs the analyzed plan of the `query`.
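The "not a child" idea can be illustrated with a minimal plain-Scala tree model (names like `Ctas` and `renameScans` are hypothetical, not the actual Catalyst classes): a transform that recurses via `children` never reaches a plan stored outside `children`.

```scala
// Illustrative-only model: optimizer transforms walk a plan via `children`,
// so a query stored outside `children` is never rewritten by them.
sealed trait Plan { def children: Seq[Plan] }
case class Scan(table: String) extends Plan { val children = Nil }
case class Project(child: Plan) extends Plan { val children = Seq(child) }
case class Ctas(table: String, query: Plan) extends Plan {
  val children = Nil  // `query` deliberately NOT exposed as a child
}

// A rewrite that renames every Scan reachable through `children`.
def renameScans(p: Plan): Plan = p match {
  case Scan(t)    => Scan(t + "_rewritten")
  case Project(c) => Project(renameScans(c))
  case c: Ctas    => c  // no children, so the stored `query` is untouched
}

// Inside a plain query the Scan is rewritten...
assert(renameScans(Project(Scan("tab1"))) == Project(Scan("tab1_rewritten")))
// ...but the same query held by a CTAS node is left alone:
assert(renameScans(Ctas("tab2", Project(Scan("tab1")))) ==
       Ctas("tab2", Project(Scan("tab1"))))
```

This is why the `query` still has to be analyzed explicitly in an analyzer rule such as `PreprocessDDL`: once it is outside `children`, no tree walk will touch it for free.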
### How was this patch tested?
Added a test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark ctasOptimized
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15048.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15048
----
commit f7941e846c5ed42a4453518500fbf4938f3f1032
Author: gatorsmile <[email protected]>
Date: 2016-09-11T04:09:02Z
fix
commit 3a203f920abf742b2f2ab344d0231f992d8e5355
Author: gatorsmile <[email protected]>
Date: 2016-09-11T04:20:39Z
Merge remote-tracking branch 'upstream/master' into ctasOptimized
commit da7deed2e1e9e350affcee909159a200a4b7d5b8
Author: gatorsmile <[email protected]>
Date: 2016-09-11T04:38:07Z
one more test case
----