[GitHub] spark pull request #16168: [SPARK-18209][SQL] More robust view canonicalizat...

jiangxb1987 Tue, 06 Dec 2016 02:26:07 -0800

GitHub user jiangxb1987 opened a pull request:

    https://github.com/apache/spark/pull/16168


    [SPARK-18209][SQL] More robust view canonicalization without full SQL 
expansion

    ## What changes were proposed in this pull request?
    
    Currently we canonicalize the view definition to provide the context for 
the database as well as star expansion. This is fragile for the following 
reasons:
    
    1. It is non-trivial to guarantee that the generated SQL is correct without 
being extremely verbose, given the current set of operators.
    2. We need extensive testing for all combinations of operators.
    3. Whenever we introduce a new logical plan operator, we need to be super 
careful because it might break SQL generation. 
    
    This PR uses a late-binding approach to view canonicalization, which 
resolves the dependent views every time the view is used. It is expected to 
achieve the following goals:
    
    1. Views should reflect changes to the underlying tables and intermediate 
views. For example:
        ```
            CREATE TABLE T1(a int, b string)
            CREATE TABLE T2(a int, b string, c int)
            CREATE VIEW A AS SELECT * FROM T1
            CREATE VIEW B AS SELECT * FROM A
            ALTER VIEW A AS SELECT * FROM T2
            SELECT * FROM B
        ```
    `SELECT * FROM B` will return data from table T2 (note that we still need 
to validate the schema).
    2. Make views more robust by not using SQL generation to store the view's 
logical plan.
    3. Views generated by older versions of Spark or Hive should still work, 
or throw a meaningful error if the view definition is invalid.
    4. A view should throw a meaningful error when the view is in an invalid 
state due to an underlying change. Such errors should be thrown as early as 
possible.
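
The first and fourth goals can be illustrated with a toy model of late binding. This is a minimal sketch, not Spark's implementation: the `Catalog`, `Table`, and `View` classes and their methods are hypothetical names invented for illustration, every view is assumed to be a plain `SELECT * FROM <relation>`, and schema validation is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class Table:
    columns: List[str]


@dataclass
class View:
    # Name of the relation this view selects from (SELECT * FROM source).
    source: str


class Catalog:
    """Hypothetical catalog that stores view references, not expanded plans."""

    def __init__(self) -> None:
        self.relations: Dict[str, Union[Table, View]] = {}

    def create_table(self, name: str, columns: List[str]) -> None:
        self.relations[name] = Table(list(columns))

    def create_or_alter_view(self, name: str, source: str) -> None:
        # Only the reference is stored; nothing is expanded at definition time.
        self.relations[name] = View(source)

    def resolve(self, name: str) -> Table:
        # Follow view references at query time until a table is reached.
        current = name
        while True:
            if current not in self.relations:
                raise LookupError(
                    f"cannot resolve view '{name}': relation '{current}' not found"
                )
            relation = self.relations[current]
            if isinstance(relation, Table):
                return relation
            current = relation.source


catalog = Catalog()
catalog.create_table("T1", ["a", "b"])
catalog.create_table("T2", ["a", "b", "c"])
catalog.create_or_alter_view("A", "T1")
catalog.create_or_alter_view("B", "A")
columns_before = catalog.resolve("B").columns

# ALTER VIEW A AS SELECT * FROM T2
catalog.create_or_alter_view("A", "T2")
columns_after = catalog.resolve("B").columns
print(columns_before, columns_after)
```

Because `B` stores only a reference to `A`, altering `A` changes what `B` resolves to without touching `B`'s definition. A dangling reference surfaces as an error that names both the view being resolved and the missing relation.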
    
    Follow-up work:
    
    1. Remove the SQL generation (in particular the SQLBuilder class) for Spark 
SQL operators, along with its test suites and golden files.
    2. Disallow cyclic view references (see 
[SPARK-18389](https://issues.apache.org/jira/browse/SPARK-18389)).
    3. View dependency management similar to Postgres.
    4. Handle underlying table schema changes when star is used.
    5. Support creating permanent views on non-SQL DataFrames.
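
With late binding, a cyclic view reference would make resolution loop forever, which is why follow-up item 2 matters. The check below is only a sketch of one possible approach, not what SPARK-18389 implements: `find_view_cycle` is a hypothetical helper, and each view is assumed to reference exactly one relation.

```python
from typing import Dict, List


def find_view_cycle(view_sources: Dict[str, str], start: str) -> List[str]:
    """Return the chain of view names forming a cycle reachable from start.

    view_sources maps each view name to the relation it selects from.
    Names that are not keys are treated as tables. Returns [] if no cycle.
    """
    path: List[str] = []
    current = start
    while current in view_sources:  # stop once a table is reached
        if current in path:
            # Seen this view before: report the loop, closing it explicitly.
            return path[path.index(current):] + [current]
        path.append(current)
        current = view_sources[current]
    return []


cyclic = find_view_cycle({"A": "B", "B": "A"}, "A")
acyclic = find_view_cycle({"A": "T1", "B": "A"}, "B")
print(cyclic, acyclic)
```

Running such a check when a view is created or altered would reject the definition before any query can hit the loop.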
    
    ## How was this patch tested?
    Add new test cases in `SQLViewSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jiangxb1987/spark view

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16168.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16168
    
----
commit aa81a36dbd038d083d8240a34b45e1a1777d2ac6
Author: jiangxingbo <[email protected]>
Date:   2016-12-06T09:58:06Z

    refactor view canonicalization

----


