GitHub user jiangxb1987 opened a pull request:
https://github.com/apache/spark/pull/16168
[SPARK-18209][SQL] More robust view canonicalization without full SQL
expansion
## What changes were proposed in this pull request?
Currently we canonicalize the view definition to provide the database
context as well as star expansion. This is fragile for the following
reasons:
1. It is non-trivial to guarantee that the generated SQL is correct without
being extremely verbose, given the current set of operators.
2. We need extensive testing for all combinations of operators.
3. Whenever we introduce a new logical plan operator, we need to be super
careful because it might break SQL generation.
Instead, we use a late binding approach to view canonicalization, which
resolves the dependent views every time the view is used. This PR is expected
to achieve the following goals:
1. Views should reflect changes to the underlying tables and intermediate
views. For example:
```
CREATE TABLE T1(a int, b string)
CREATE TABLE T2(a int, b string, c int)
CREATE VIEW A AS SELECT * FROM T1
CREATE VIEW B AS SELECT * FROM A
ALTER VIEW A AS SELECT * FROM T2
SELECT * FROM B
```
will return data from table T2 (note that we still need to validate the
schema).
2. Make views more robust by not using SQL generation to store the view's
logical plan.
3. Views generated by older versions of Spark or Hive should still work,
or throw a meaningful error if the view definition is invalid.
4. A view should throw a meaningful error when the view is in an invalid state
due to an underlying change. Such errors should be thrown as early as possible.
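To make goals 1 and 4 concrete, here is a toy model of late binding in Python. It is not Spark code and does not reflect how this patch is implemented: the `Catalog` class, its methods, and `ViewResolutionError` are hypothetical names invented for illustration. Each view stores only a reference to the relation it reads from, plus the column names seen at creation time, and the reference is resolved again on every use.

```python
class ViewResolutionError(Exception):
    """Raised when a view cannot be resolved against the current catalog."""


class Catalog:
    """Toy catalog (hypothetical): views are resolved lazily on every use."""

    def __init__(self):
        self.tables = {}  # table name -> list of column names
        self.views = {}   # view name -> (source relation, columns seen at creation)

    def create_table(self, name, columns):
        self.tables[name] = list(columns)

    def create_or_alter_view(self, name, source):
        # Models `CREATE/ALTER VIEW name AS SELECT * FROM source`.
        # Fails early if `source` cannot be resolved right now.
        _base, columns = self.resolve(source)
        self.views[name] = (source, list(columns))

    def resolve(self, name):
        """Follow view references down to a base table; no cycle detection here."""
        if name in self.tables:
            return name, self.tables[name]
        if name not in self.views:
            raise ViewResolutionError(f"relation '{name}' not found")
        source, _recorded = self.views[name]
        return self.resolve(source)

    def query(self, name):
        """Return the base table a relation currently reads from, after validation."""
        base, columns = self.resolve(name)
        if name in self.views:
            recorded = self.views[name][1]
            missing = [c for c in recorded if c not in columns]
            if missing:
                raise ViewResolutionError(
                    f"view '{name}' expects columns {missing} "
                    f"that '{base}' no longer provides")
        return base


# Mirrors the SQL example above.
catalog = Catalog()
catalog.create_table("T1", ["a", "b"])
catalog.create_table("T2", ["a", "b", "c"])
catalog.create_or_alter_view("A", "T1")
catalog.create_or_alter_view("B", "A")
before = catalog.query("B")
catalog.create_or_alter_view("A", "T2")
after = catalog.query("B")
```

In this toy, `B` follows `A` to `T2` after the `ALTER VIEW` without `B` being redefined, and dropping `T2` afterwards would make `query("B")` raise `ViewResolutionError` rather than return stale results. The schema check is deliberately simplistic (column names only); what the real validation covers is up to the patch itself.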
Follow-up work will include:
1. Remove the SQL generation (in particular the SQLBuilder class) for Spark
SQL operators, along with its test suites and golden files.
2. Disallow cyclic view references (see
[SPARK-18389](https://issues.apache.org/jira/browse/SPARK-18389)).
3. View dependency management similar to Postgres.
4. Underlying table schema change when star is used.
5. Support creating permanent views on non-SQL DataFrames.
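As an illustration of follow-up item 2 only, a cyclic view reference can be detected with a depth-first walk over view dependencies. This is a toy sketch in Python, not the approach tracked in SPARK-18389; the `view_deps` mapping and the `find_cycle` function are hypothetical names.

```python
def find_cycle(view_deps, start):
    """Return the views forming a cycle reachable from `start`, or None.

    `view_deps` maps a view name to the relations it reads from; names that
    are not keys of the mapping are treated as base tables.
    """
    path = []        # current depth-first path, in visiting order
    on_path = set()  # same contents as `path`, for fast membership checks
    done = set()     # views fully explored without finding a cycle

    def visit(name):
        if name in on_path:
            # Found a back edge: report the loop, closing it with `name`.
            return path[path.index(name):] + [name]
        if name in done or name not in view_deps:
            return None
        path.append(name)
        on_path.add(name)
        for dep in view_deps[name]:
            cycle = visit(dep)
            if cycle is not None:
                return cycle
        path.pop()
        on_path.remove(name)
        done.add(name)
        return None

    return visit(start)


# Hypothetical state after `ALTER VIEW A AS SELECT * FROM B` when B reads from A.
cycle = find_cycle({"A": ["B"], "B": ["A"]}, "A")

# The dependency chain from the example in this description has no cycle.
acyclic = find_cycle({"A": ["T2"], "B": ["A"]}, "B")
```

A check like this could run when a view is created or altered, so that the offending statement is rejected instead of failing later at query time.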
## How was this patch tested?
Added new test cases in `SQLViewSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jiangxb1987/spark view
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16168.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16168
----
commit aa81a36dbd038d083d8240a34b45e1a1777d2ac6
Author: jiangxingbo <[email protected]>
Date: 2016-12-06T09:58:06Z
refactor view canonicalization
----