[GitHub] spark pull request #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query i...

gatorsmile Sat, 15 Oct 2016 08:23:00 -0700

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/15502


    [SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS More Than Once 
#15048

    ### What changes were proposed in this pull request?
    This PR is to backport https://github.com/apache/spark/pull/15048 and 
https://github.com/apache/spark/pull/15459. 
    
    However, in 2.0, we do not have a unified logical node `CreateTable` and 
the analyzer rule `PreWriteCheck` is also different. To minimize the code 
changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it 
as a new PR to review. Thanks!
    
    As explained in https://github.com/apache/spark/pull/14797:
    >Some analyzer rules have assumptions on logical plans, optimizer may break 
these assumption, we should not pass an optimized query plan into 
QueryExecution (will be analyzed again), otherwise we may some weird bugs.
    For example, we have a rule for decimal calculation to promote the 
precision before binary operations, use PromotePrecision as placeholder to 
indicate that this rule should not apply twice. But a Optimizer rule will 
remove this placeholder, that break the assumption, then the rule applied 
twice, cause wrong result.
    
    We should not optimize the query in CTAS more than once. For example, 
    ```Scala
    spark.range(99, 101).createOrReplaceTempView("tab1")
    val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) 
as num FROM tab1"
    sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
    checkAnswer(spark.table("tab2"), sql(sqlStmt))
    ```
    Before this PR, the results do not match
    ```
    == Results ==
    !== Correct Answer - 2 ==       == Spark Answer - 2 ==
    ![100,100.000000000000000000]   [100,null]
     [99,99.000000000000000000]     [99,99.000000000000000000]
    ```
    After this PR, the results match.
    ```
    +---+----------------------+
    |id |num                   |
    +---+----------------------+
    |99 |99.000000000000000000 |
    |100|100.000000000000000000|
    +---+----------------------+
    ```
    
    In this PR, we do not treat the `query` in CTAS as a child. Thus, the 
`query` will not be optimized when optimizing CTAS statement. However, we still 
need to analyze it for normalizing and verifying the CTAS in the Analyzer. 
Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this 
rule needs the analyzed plan of the `query`.
    
    ### How was this patch tested?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark ctasOptimize2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15502.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15502
    
----
commit a9931a538912aeed620df216beb355970979c3f0
Author: gatorsmile <[email protected]>
Date:   2016-10-14T05:16:08Z

    the first set of changes

commit d5f91871b4329ca9292a9ca129d6c603f4cf47fc
Author: gatorsmile <[email protected]>
Date:   2016-10-15T15:06:21Z

    2nd change set

commit a658da47983001260205c97406dbf744fd9abfcd
Author: gatorsmile <[email protected]>
Date:   2016-10-15T15:12:57Z

    more comment

commit 9cfebc523e4b88c3df3ffae8ca5ea92e98a0a616
Author: gatorsmile <[email protected]>
Date:   2016-10-15T15:17:26Z

    rename

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #15502: [SPARK-17892] [SQL] [2.0] Do Not Optimize Query i...

Reply via email to