Wenchen, sorry for the delay, I will post an update shortly. On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> Any updates here? I agree that a new View API is better, but we need a > solution to avoid performance regression. We need to elaborate on the cache > idea. > > On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote: > >> I think it is a good idea to keep tables and views separate. >> >> The main two arguments I’ve heard for combining lookup into a single >> function are the ones brought up in this thread. First, an identifier in a >> catalog must be either a view or a table and should not collide. Second, a >> single lookup is more likely to require a single RPC. I think the RPC >> concern is well addressed by caching, which we already do in the Spark >> catalog, so I’ll primarily focus on the first. >> >> Table/view name collision is unlikely to be a problem. Metastores that >> support both today store them in a single namespace, so this is not a >> concern for even a naive implementation that talks to the Hive MetaStore. I >> know that a new metastore catalog could choose to implement both >> ViewCatalog and TableCatalog and store the two sets separately, but that >> would be a very strange choice: if the metastore itself has different >> namespaces for tables and views, then it makes much more sense to expose >> them through separate catalogs because Spark will always prefer one over >> the other. >> >> In a similar line of reasoning, catalogs that expose both views and >> tables are much more rare than catalogs that only expose one. For example, >> v2 catalogs for JDBC and Cassandra expose data through the Table interface >> and implementing ViewCatalog would make little sense. Exposing new data >> sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are >> likely to be the same. Say I have a way to convert Pig statements or some >> other representation into a SQL view. It would make little sense to combine >> that with some other TableCatalog. >> >> I also don’t think there is benefit from an API perspective to justify >> combining the Table and View interfaces. The two share only schema and >> properties, and are handled very differently internally — a View’s SQL >> query is parsed and substituted into the plan, while a Table is wrapped in >> a relation that eventually becomes a Scan node using SupportsRead. A view’s >> SQL also needs additional context to be resolved correctly: the current >> catalog and namespace from the time the view was created. >> >> Query planning is distinct between tables and views, so Spark doesn’t >> benefit from combining them. I think it has actually caused problems that >> both were resolved by the same method in v1: the resolution rule grew >> extremely complicated trying to look up a reference just once because it >> had to parse a view plan and resolve relations within it using the view’s >> context (current database). In contrast, John’s new view substitution rules >> are cleaner and can stay within the substitution batch. >> >> People implementing views would also not benefit from combining the two >> interfaces: >> >> - There is little overlap between View and Table, only schema and >> properties >> - Most catalogs won’t implement both interfaces, so returning a >> ViewOrTable is more difficult for implementations >> - TableCatalog assumes that ViewCatalog will be added separately like >> John proposes, so we would have to break or replace that API >> >> I understand the initial appeal of combining TableCatalog and ViewCatalog >> since it is done that way in the existing interfaces. But I think that Hive >> chose to do that mostly on the fact that the two were already stored >> together, and not because it made sense for users of the API, or any other >> implementer of the API. >> >> rb >> >> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote: >> >>> >>> >>> >>>> > AFAIK view schema is only used by DESCRIBE. >>>> >>>> Correction: Spark adds a new Project at the top of the parsed plan from >>>> view, based on the stored schema, to make sure the view schema doesn't >>>> change. >>>> >>> >>> Thanks Wenchen! I thought I forgot something :) Yes it is the validation >>> done in *checkAnalysis*: >>> >>> // If the view output doesn't have the same number of columns >>> neither with the child >>> // output, nor with the query column names, throw an >>> AnalysisException. >>> // If the view's child output can't up cast to the view output, >>> // throw an AnalysisException, too. >>> >>> The view output comes from the schema: >>> >>> val child = View( >>> desc = metadata, >>> output = metadata.schema.toAttributes, >>> child = parser.parsePlan(viewText)) >>> >>> So it is a validation (here) or cache (in DESCRIBE) nice to have but not >>> "required" or "should be frozen". Thanks Ryan and Burak for pointing that >>> out in SPIP. I will add a new paragraph accordingly. >>> >> >> >> -- >> Ryan Blue >> Software Engineer >> Netflix >> > -- John Zhuge