GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/15502
[SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS More Than Once
#15048
### What changes were proposed in this pull request?
This PR is to backport https://github.com/apache/spark/pull/15048 and
https://github.com/apache/spark/pull/15459.
However, in 2.0, we do not have a unified logical node `CreateTable` and
the analyzer rule `PreWriteCheck` is also different. To minimize the code
changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it
as a new PR to review. Thanks!
As explained in https://github.com/apache/spark/pull/14797:
>Some analyzer rules have assumptions on logical plans, optimizer may break
these assumption, we should not pass an optimized query plan into
QueryExecution (will be analyzed again), otherwise we may some weird bugs.
For example, we have a rule for decimal calculation to promote the
precision before binary operations, use PromotePrecision as placeholder to
indicate that this rule should not apply twice. But a Optimizer rule will
remove this placeholder, that break the assumption, then the rule applied
twice, cause wrong result.
We should not optimize the query in CTAS more than once. For example,
```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18))
as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```
Before this PR, the results do not match
```
== Results ==
!== Correct Answer - 2 == == Spark Answer - 2 ==
![100,100.000000000000000000] [100,null]
[99,99.000000000000000000] [99,99.000000000000000000]
```
After this PR, the results match.
```
+---+----------------------+
|id |num |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```
In this PR, we do not treat the `query` in CTAS as a child. Thus, the
`query` will not be optimized when optimizing CTAS statement. However, we still
need to analyze it for normalizing and verifying the CTAS in the Analyzer.
Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this
rule needs the analyzed plan of the `query`.
### How was this patch tested?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark ctasOptimize2.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15502.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15502
----
commit a9931a538912aeed620df216beb355970979c3f0
Author: gatorsmile <[email protected]>
Date: 2016-10-14T05:16:08Z
the first set of changes
commit d5f91871b4329ca9292a9ca129d6c603f4cf47fc
Author: gatorsmile <[email protected]>
Date: 2016-10-15T15:06:21Z
2nd change set
commit a658da47983001260205c97406dbf744fd9abfcd
Author: gatorsmile <[email protected]>
Date: 2016-10-15T15:12:57Z
more comment
commit 9cfebc523e4b88c3df3ffae8ca5ea92e98a0a616
Author: gatorsmile <[email protected]>
Date: 2016-10-15T15:17:26Z
rename
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]