Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/16233
### Background
I think we should view this PR in the proper context. It is the first in a
series of PRs to get the following things done:
- Properly support nested views. The main advantage is that if you update
an underlying view, the current view also gets updated.
- Get rid of SQL generation.
### Approach
This PR is the first in a series of three. It lays the groundwork by
introducing a `View` node as a concept, and by making the analyzer resolve
(potentially nested) views. This approach has a number of advantages:
#### Explicit management of the view's default database
A view will have the concept of a default database (using Yin's proposed
terminology here). We need a solid way of managing this for nested views. There
are a couple of options here:
- We could transform the view tree in `SessionCatalog.lookupRelation(...)`,
and make sure all `UnresolvedRelation`s without a database defined are assigned
the view's default database. This is basically what @cloud-fan proposed. The
problem is that this breaks Common Table Expressions.
- We could use the `SessionCatalog.currentDB`, and set/reset this as soon
as we hit a view node while resolving relations. The major downside to this is
that sessions can be shared between different users, and this might cause weird
behavior.
- We could use an analysis context, and set the view's default database in
that context. This is similar to the second option, with the benefit that it
won't be as visible. The downside is that we either need to make the analyzer
stateful (note that the session catalog already is) using a thread-local, or
that we need to pass this context to every analyzer rule. This approach seems
quite heavyweight, and could require a lot of code changes.
- We can make `ResolveRelations` view-aware, and make it keep track of the
default databases (plural, in case of nested views). The default database will
be the one of the nearest enclosing view. This approach makes it trivial to
limit the depth of nested views (which might be needed at some point);
alternatively, we can resolve only one layer of nested views at a time and use
the analyzer's `maxIterations` as an implicit limit.
I am in favor of the last option.
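To make the last option concrete, here is a minimal, self-contained sketch of a resolution rule that tracks the default database of the nearest enclosing view. All types and names below (`Plan`, `View`, `UnresolvedRelation`, and friends) are toy stand-ins for illustration, not Spark's actual classes:

```scala
// Toy plan tree: unqualified relations pick up the default database of the
// nearest enclosing view, or the session's current database at top level.
sealed trait Plan
case class UnresolvedRelation(table: String, db: Option[String]) extends Plan
case class View(defaultDatabase: String, child: Plan) extends Plan
case class Join(left: Plan, right: Plan) extends Plan
case class ResolvedRelation(db: String, table: String) extends Plan

object ResolveRelations {
  // `currentDefaultDb` is the database to use for unqualified relations in
  // this subtree; entering a View swaps in that view's default database.
  def resolve(plan: Plan, currentDefaultDb: String): Plan = plan match {
    case UnresolvedRelation(table, db) =>
      ResolvedRelation(db.getOrElse(currentDefaultDb), table)
    case View(viewDb, child) =>
      // Nested views each install their own default database for their
      // subtree, shadowing the enclosing one.
      View(viewDb, resolve(child, viewDb))
    case Join(l, r) =>
      Join(resolve(l, currentDefaultDb), resolve(r, currentDefaultDb))
    case resolved => resolved
  }
}
```

With this shape, an unqualified relation inside a view resolves against that view's default database, while the same relation name outside the view still resolves against the session's current database.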
#### Decoupling output and its underlying structure
Introducing a view node with its own attributes allows us to decouple the
view's output from its underlying structure. This means we can decouple
planning of the query from planning of the view, which in turn allows us to
cache resolved views. Note that this is not in scope for the current series of
PRs; it is a nice-to-have.
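As a hedged illustration of what this decoupling would buy (again with toy stand-in types, not Spark's): because the view owns its declared output, a re-planned child whose output has been reordered or widened can be bridged back to the promised attributes with a simple name-based projection:

```scala
// Toy sketch: a view's declared output stays fixed; a projection maps the
// child plan's (possibly reordered or widened) output, matched by name,
// onto the attributes the view promised its consumers.
case class Attr(name: String)
case class Projection(from: Seq[Attr], to: Seq[Attr])

def bridge(viewOutput: Seq[Attr], childOutput: Seq[Attr]): Projection = {
  val byName = childOutput.map(a => a.name -> a).toMap
  // Pick the matching child attribute for each declared view attribute;
  // extra child columns are simply dropped.
  Projection(viewOutput.map(a => byName(a.name)), viewOutput)
}
```

Consumers of the view keep referencing the same declared attributes even after the underlying plan is re-resolved.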
#### View visibility
I added a field to `SubqueryAlias` so it is easier to see where a given part
of the plan originated. This is a bit of a hack, and a view node is a natural
replacement.
### State of PR
I think we need to be more aggressive when it comes to view nodes. They
should have their own attributes, and should also reference the view
description (`CatalogTable`) they are based on. Something like this:
```scala
case class View(
    desc: CatalogTable,
    output: Seq[Attribute],
    child: Option[LogicalPlan] = None)
  extends LogicalPlan with MultiInstanceRelation {

  def this(desc: CatalogTable) = this(desc, desc.schema.toAttributes, None)

  override lazy val resolved: Boolean = child.exists(_.resolved)

  override def children: Seq[LogicalPlan] = child.toSeq

  override def newInstance(): LogicalPlan =
    copy(output = output.map(_.newInstance()))
}
```
`SessionCatalog.lookupRelation(...)` should just return a basic view node.
All resolution should be moved into the analyzer.
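A minimal sketch of that split, using illustrative stand-in types and names rather than Spark's actual classes: the catalog returns an unresolved view carrying only its description, and an analyzer rule parses the stored view text and attaches the child plan. Expanding one layer per rule application is what lets the analyzer's fixed-point iteration bound nested-view depth implicitly:

```scala
// Toy catalog entry: identifier, stored view text, and declared schema.
case class CatalogTable(identifier: String, viewText: String, schema: Seq[String])

sealed trait LogicalPlan
case class View(desc: CatalogTable, child: Option[LogicalPlan]) extends LogicalPlan {
  def resolved: Boolean = child.isDefined
}
// Stand-in for the plan produced by parsing the view's SQL text.
case class Parsed(sql: String) extends LogicalPlan

object SessionCatalog {
  // lookupRelation just wraps the catalog entry; no parsing happens here.
  def lookupRelation(desc: CatalogTable): LogicalPlan = View(desc, None)
}

object ResolveViews {
  // Analyzer rule: expand one layer of unresolved views per application.
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case v @ View(desc, None) => v.copy(child = Some(Parsed(desc.viewText)))
    case other => other
  }
}
```

A view looked up from the catalog starts unresolved and becomes resolved only after the analyzer rule has run, which keeps all resolution logic out of the catalog.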