GitHub user gatorsmile opened a pull request:

    [SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS More Than Once 

    ### What changes were proposed in this pull request?
    This PR is to backport and 
    However, in 2.0, we do not have a unified logical node `CreateTable` and 
the analyzer rule `PreWriteCheck` is also different. To minimize the code 
changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it 
as a new PR to review. Thanks!
    As explained in
    >Some analyzer rules have assumptions on logical plans, optimizer may break 
these assumption, we should not pass an optimized query plan into 
QueryExecution (will be analyzed again), otherwise we may some weird bugs.
    For example, we have a rule for decimal calculation to promote the 
precision before binary operations, use PromotePrecision as placeholder to 
indicate that this rule should not apply twice. But a Optimizer rule will 
remove this placeholder, that break the assumption, then the rule applied 
twice, cause wrong result.
    We should not optimize the query in CTAS more than once. For example, 
    spark.range(99, 101).createOrReplaceTempView("tab1")
    val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) 
as num FROM tab1"
    sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
    checkAnswer(spark.table("tab2"), sql(sqlStmt))
    Before this PR, the results do not match
    == Results ==
    !== Correct Answer - 2 ==       == Spark Answer - 2 ==
    ![100,100.000000000000000000]   [100,null]
     [99,99.000000000000000000]     [99,99.000000000000000000]
    After this PR, the results match.
    |id |num                   |
    |99 |99.000000000000000000 |
    In this PR, we do not treat the `query` in CTAS as a child. Thus, the 
`query` will not be optimized when optimizing CTAS statement. However, we still 
need to analyze it for normalizing and verifying the CTAS in the Analyzer. 
Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this 
rule needs the analyzed plan of the `query`.
    ### How was this patch tested?

You can merge this pull request into a Git repository by running:

    $ git pull ctasOptimize2.0

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15502
commit a9931a538912aeed620df216beb355970979c3f0
Author: gatorsmile <>
Date:   2016-10-14T05:16:08Z

    the first set of changes

commit d5f91871b4329ca9292a9ca129d6c603f4cf47fc
Author: gatorsmile <>
Date:   2016-10-15T15:06:21Z

    2nd change set

commit a658da47983001260205c97406dbf744fd9abfcd
Author: gatorsmile <>
Date:   2016-10-15T15:12:57Z

    more comment

commit 9cfebc523e4b88c3df3ffae8ca5ea92e98a0a616
Author: gatorsmile <>
Date:   2016-10-15T15:17:26Z



If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at or file a JIRA ticket
with INFRA.

To unsubscribe, e-mail:
For additional commands, e-mail:

Reply via email to