GitHub user gatorsmile opened a pull request:
https://github.com/apache/spark/pull/15048
[SPARK-17409] [SQL] Do Not Optimize Query in CTAS More Than Once
### What changes were proposed in this pull request?
As explained in https://github.com/apache/spark/pull/14797:
>Some analyzer rules make assumptions about logical plans, and the optimizer
may break these assumptions. We should not pass an optimized query plan into
QueryExecution (where it will be analyzed again), otherwise we may hit some
weird bugs. For example, we have a rule for decimal calculation that promotes
the precision before binary operations, using PromotePrecision as a
placeholder to indicate that the rule should not be applied twice. But an
optimizer rule removes this placeholder, which breaks the assumption; the
rule is then applied twice and produces a wrong result.
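The placeholder mechanism the quote describes can be sketched in plain Scala (no Spark dependencies; `Promoted`, `promote`, and `stripMarkers` are illustrative stand-ins, not Spark's actual classes):

```scala
// A marker-based "apply once" rule, and how an optimizer pass that
// strips the marker makes the rule fire twice on re-analysis.
sealed trait Expr
case class Dec(precision: Int) extends Expr    // a decimal-typed expression
case class Promoted(child: Expr) extends Expr  // placeholder: rule already ran

// Analyzer rule: widen the precision once, then wrap in the placeholder.
def promote(e: Expr): Expr = e match {
  case Dec(p)      => Promoted(Dec(p + 10))  // widen once
  case p: Promoted => p                      // placeholder present => skip
}

// Optimizer pass that removes placeholders it considers redundant.
def stripMarkers(e: Expr): Expr = e match {
  case Promoted(child) => child
  case other           => other
}

val analyzed   = promote(Dec(18))        // Promoted(Dec(28))
val optimized  = stripMarkers(analyzed)  // Dec(28): placeholder lost
val reAnalyzed = promote(optimized)      // Promoted(Dec(38)): widened twice
```

Re-running the analyzer on the untouched plan is safe (`promote` is a no-op on `Promoted`); it is only after the optimizer strips the marker that the rule fires a second time, which mirrors the wrong decimal result shown below.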
We should not optimize the query in CTAS more than once. For example,
```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt =
  "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```
Before this PR, the results did not match:
```
== Results ==
!== Correct Answer - 2 == == Spark Answer - 2 ==
![100,100.000000000000000000] [100,null]
[99,99.000000000000000000] [99,99.000000000000000000]
```
After this PR, the results match:
```
+---+----------------------+
|id |num |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```
In this PR, we no longer treat the `query` in CTAS as a child. Thus, the
`query` is not optimized again when the CTAS statement itself is optimized.
However, we still need to analyze it to normalize and verify the CTAS in the
Analyzer. We do this in the analyzer rule `PreprocessDDL`, because so far only
this rule needs the analyzed plan of the `query`.
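The "not a child" idea can be illustrated with a minimal plain-Scala tree model (names like `Ctas` and `renameScans` are hypothetical, not the actual Catalyst classes): a transform that recurses via `children` never reaches a plan stored outside `children`.

```scala
// Illustrative-only model: optimizer transforms walk a plan via `children`,
// so a query stored outside `children` is never rewritten by them.
sealed trait Plan { def children: Seq[Plan] }
case class Scan(table: String) extends Plan { val children = Nil }
case class Project(child: Plan) extends Plan { val children = Seq(child) }
case class Ctas(table: String, query: Plan) extends Plan {
  val children = Nil  // `query` deliberately NOT exposed as a child
}

// A rewrite that renames every Scan reachable through `children`.
def renameScans(p: Plan): Plan = p match {
  case Scan(t)    => Scan(t + "_rewritten")
  case Project(c) => Project(renameScans(c))
  case c: Ctas    => c  // no children, so the stored `query` is untouched
}

// Inside a plain query the Scan is rewritten...
assert(renameScans(Project(Scan("tab1"))) == Project(Scan("tab1_rewritten")))
// ...but the same query held by a CTAS node is left alone:
assert(renameScans(Ctas("tab2", Project(Scan("tab1")))) ==
       Ctas("tab2", Project(Scan("tab1"))))
```

This is why the `query` still has to be analyzed explicitly in an analyzer rule such as `PreprocessDDL`: once it is outside `children`, no tree walk will touch it for free.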
### How was this patch tested?
Added a test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/gatorsmile/spark ctasOptimized
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15048.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15048
----
commit f7941e846c5ed42a4453518500fbf4938f3f1032
Author: gatorsmile <[email protected]>
Date: 2016-09-11T04:09:02Z
fix
commit 3a203f920abf742b2f2ab344d0231f992d8e5355
Author: gatorsmile <[email protected]>
Date: 2016-09-11T04:20:39Z
Merge remote-tracking branch 'upstream/master' into ctasOptimized
commit da7deed2e1e9e350affcee909159a200a4b7d5b8
Author: gatorsmile <[email protected]>
Date: 2016-09-11T04:38:07Z
one more test case
----