Github user hvanhovell commented on the issue:
https://github.com/apache/spark/pull/16233
### Background
I think we should view this PR in the proper context. It is the first in a
series of PRs to get the following things done:
- Properly support nested views. The main advantage is that if you update
an underlying view, the current view also gets updated.
- Get rid of SQL generation.
### Approach
This PR is the first in a series of three. It lays the groundwork by
introducing a `View` node as a concept, and by making the analyzer resolve
(potentially nested) views. This approach has a number of advantages:
#### Explicit management of the view's default database
A view will have the concept of a default database (using Yin's proposed
terminology here). We need a solid way of managing this for nested views. There
are a couple of options here:
- We could transform the view tree in `SessionCatalog.lookupRelation(...)`,
and make sure all `UnresolvedRelation`s without a database defined are assigned
the view's default database. This is basically what @cloud-fan proposed. The
problem is that this breaks Common Table Expressions.
- We could use the `SessionCatalog.currentDB`, and set/reset this as soon
as we hit a view node while resolving relations. The major downside to this is
that sessions can be shared between different users, and this might cause weird
behavior.
- We could use an analysis context, and set the view's default database in
that context. This is similar to the second option, with the benefit that it
won't be as visible. The downside is that we either need to make the analyzer
stateful (note that the session catalog already is) using a thread-local, or
that we need to pass this context to every analyzer rule. This approach seems
quite heavyweight, and could require a lot of code changes.
- We can make `ResolveRelations` view-aware, and make it keep track of the
default databases (plural, in case of nested views). The default database will
be the one of the nearest enclosing view. This approach makes it trivial to
limit the depth of nested views (which might be needed at some point);
alternatively, we can resolve only one layer of nested views at a time and use
the analyzer's `maxIterations` as an implicit limit.
I am in favor of the last option.
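To make the last option concrete, here is a minimal, self-contained sketch of a resolution rule that tracks the default database of the nearest enclosing view. All types and names below (`Plan`, `View`, `UnresolvedRelation`, and friends) are toy stand-ins for illustration, not Spark's actual classes:

```scala
// Toy plan tree: unqualified relations pick up the default database of the
// nearest enclosing view, or the session's current database at top level.
sealed trait Plan
case class UnresolvedRelation(table: String, db: Option[String]) extends Plan
case class View(defaultDatabase: String, child: Plan) extends Plan
case class Join(left: Plan, right: Plan) extends Plan
case class ResolvedRelation(db: String, table: String) extends Plan

object ResolveRelations {
  // `currentDefaultDb` is the database to use for unqualified relations in
  // this subtree; entering a View swaps in that view's default database.
  def resolve(plan: Plan, currentDefaultDb: String): Plan = plan match {
    case UnresolvedRelation(table, db) =>
      ResolvedRelation(db.getOrElse(currentDefaultDb), table)
    case View(viewDb, child) =>
      // Nested views each install their own default database for their
      // subtree, shadowing the enclosing one.
      View(viewDb, resolve(child, viewDb))
    case Join(l, r) =>
      Join(resolve(l, currentDefaultDb), resolve(r, currentDefaultDb))
    case resolved => resolved
  }
}
```

With this shape, an unqualified relation inside a view resolves against that view's default database, while the same relation name outside the view still resolves against the session's current database.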
#### Decoupling output and its underlying structure
Introducing a view node with its own attributes allows us to decouple the
view's output from its underlying structure. This means we can decouple
planning of the query from planning of the view, which in turn allows us to
cache resolved views. Note that this is not in scope for the current series of
PRs; it is a nice-to-have.
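As a hedged illustration of what this decoupling would buy (again with toy stand-in types, not Spark's): because the view owns its declared output, a re-planned child whose output has been reordered or widened can be bridged back to the promised attributes with a simple name-based projection:

```scala
// Toy sketch: a view's declared output stays fixed; a projection maps the
// child plan's (possibly reordered or widened) output, matched by name,
// onto the attributes the view promised its consumers.
case class Attr(name: String)
case class Projection(from: Seq[Attr], to: Seq[Attr])

def bridge(viewOutput: Seq[Attr], childOutput: Seq[Attr]): Projection = {
  val byName = childOutput.map(a => a.name -> a).toMap
  // Pick the matching child attribute for each declared view attribute;
  // extra child columns are simply dropped.
  Projection(viewOutput.map(a => byName(a.name)), viewOutput)
}
```

Consumers of the view keep referencing the same declared attributes even after the underlying plan is re-resolved.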
#### View visibility
I added a field to `SubqueryAlias` so it is easier to see where a given part
of the plan originated. This is a bit of a hack, and a view node is a natural
replacement.
### State of PR
I think we need to be more aggressive when it comes to view nodes. They
should have their own attributes, and should also reference the view
description (`CatalogTable`) they are based on. Something like this:
```scala
case class View(
    desc: CatalogTable,
    output: Seq[Attribute],
    child: Option[LogicalPlan] = None)
  extends LogicalPlan with MultiInstanceRelation {

  def this(desc: CatalogTable) = this(desc, desc.schema.toAttributes, None)

  override lazy val resolved: Boolean = child.exists(_.resolved)

  override def children: Seq[LogicalPlan] = child.toSeq

  override def newInstance(): LogicalPlan =
    copy(output = output.map(_.newInstance()))
}
```
`SessionCatalog.lookupRelation(...)` should just return a basic view node.
All resolution should be moved into the analyzer.
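A minimal sketch of that split, using illustrative stand-in types and names rather than Spark's actual classes: the catalog returns an unresolved view carrying only its description, and an analyzer rule parses the stored view text and attaches the child plan. Expanding one layer per rule application is what lets the analyzer's fixed-point iteration bound nested-view depth implicitly:

```scala
// Toy catalog entry: identifier, stored view text, and declared schema.
case class CatalogTable(identifier: String, viewText: String, schema: Seq[String])

sealed trait LogicalPlan
case class View(desc: CatalogTable, child: Option[LogicalPlan]) extends LogicalPlan {
  def resolved: Boolean = child.isDefined
}
// Stand-in for the plan produced by parsing the view's SQL text.
case class Parsed(sql: String) extends LogicalPlan

object SessionCatalog {
  // lookupRelation just wraps the catalog entry; no parsing happens here.
  def lookupRelation(desc: CatalogTable): LogicalPlan = View(desc, None)
}

object ResolveViews {
  // Analyzer rule: expand one layer of unresolved views per application.
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case v @ View(desc, None) => v.copy(child = Some(Parsed(desc.viewText)))
    case other => other
  }
}
```

A view looked up from the catalog starts unresolved and becomes resolved only after the analyzer rule has run, which keeps all resolution logic out of the catalog.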