The SPIP <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> has been updated. Please review.
On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzh...@apache.org> wrote:

> Wenchen, sorry for the delay, I will post an update shortly.
>
> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Any updates here? I agree that a new View API is better, but we need a solution to avoid a performance regression. We need to elaborate on the cache idea.
>>
>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> I think it is a good idea to keep tables and views separate.
>>>
>>> The two main arguments I've heard for combining lookup into a single function are the ones brought up in this thread. First, an identifier in a catalog must be either a view or a table and should not collide. Second, a single lookup is more likely to require a single RPC. I think the RPC concern is well addressed by caching, which we already do in the Spark catalog, so I'll primarily focus on the first.
>>>
>>> Table/view name collision is unlikely to be a problem. Metastores that support both today store them in a single namespace, so this is not a concern even for a naive implementation that talks to the Hive MetaStore. I know that a new metastore catalog could choose to implement both ViewCatalog and TableCatalog and store the two sets separately, but that would be a very strange choice: if the metastore itself has different namespaces for tables and views, then it makes much more sense to expose them through separate catalogs, because Spark will always prefer one over the other.
>>>
>>> By a similar line of reasoning, catalogs that expose both views and tables are much rarer than catalogs that expose only one. For example, the v2 catalogs for JDBC and Cassandra expose data through the Table interface, and implementing ViewCatalog for them would make little sense. Exposing new data sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to be the same: say I have a way to convert Pig statements or some other representation into a SQL view; it would make little sense to combine that with some other TableCatalog.
>>>
>>> I also don't think there is a benefit from an API perspective that would justify combining the Table and View interfaces. The two share only schema and properties, and they are handled very differently internally: a View's SQL query is parsed and substituted into the plan, while a Table is wrapped in a relation that eventually becomes a Scan node using SupportsRead. A view's SQL also needs additional context to be resolved correctly: the current catalog and namespace from the time the view was created.
>>>
>>> Query planning is distinct between tables and views, so Spark doesn't benefit from combining them. I think it has actually caused problems that both were resolved by the same method in v1: the resolution rule grew extremely complicated trying to look up a reference just once, because it had to parse a view plan and resolve relations within it using the view's context (current database). In contrast, John's new view substitution rules are cleaner and can stay within the substitution batch.
>>> People implementing views would also not benefit from combining the two interfaces:
>>>
>>> - There is little overlap between View and Table, only schema and properties.
>>> - Most catalogs won't implement both interfaces, so returning a ViewOrTable is more difficult for implementations.
>>> - TableCatalog assumes that ViewCatalog will be added separately, as John proposes, so we would have to break or replace that API.
>>>
>>> I understand the initial appeal of combining TableCatalog and ViewCatalog, since it is done that way in the existing interfaces. But I think Hive chose that mostly because the two were already stored together, not because it made sense for users of the API or for any other implementer of the API.
>>>
>>> rb
>>>
>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>
>>>>> Correction: Spark adds a new Project at the top of the parsed plan from the view, based on the stored schema, to make sure the view schema doesn't change.
>>>>
>>>> Thanks Wenchen! I thought I had forgotten something :) Yes, it is the validation done in *checkAnalysis*:
>>>>
>>>> // If the view output doesn't have the same number of columns neither with the child
>>>> // output, nor with the query column names, throw an AnalysisException.
>>>> // If the view's child output can't up cast to the view output,
>>>> // throw an AnalysisException, too.
>>>>
>>>> The view output comes from the schema:
>>>>
>>>> val child = View(
>>>>   desc = metadata,
>>>>   output = metadata.schema.toAttributes,
>>>>   child = parser.parsePlan(viewText))
>>>>
>>>> So the stored schema is nice to have, for validation (here) or as a cache (in DESCRIBE), but it is not "required", nor does it mean the view schema "should be frozen". Thanks Ryan and Burak for pointing that out in the SPIP. I will add a new paragraph accordingly.
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix

> --
> John Zhuge

--
John Zhuge
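
To make the limited overlap Ryan describes concrete, below is a minimal sketch of what a separate view API could look like. The names and signatures are illustrative assumptions for this discussion, not the SPIP's actual definitions: the only members shared with Table are the schema and the properties map, while the SQL text and the catalog/namespace captured at creation time are view-specific.

    // Illustrative sketch only; not the SPIP's actual interfaces.
    import org.apache.spark.sql.connector.catalog.Identifier
    import org.apache.spark.sql.types.StructType

    trait View {
      def name: String                              // fully qualified view name, for error messages
      def query: String                             // the view's SQL text, parsed when the view is read
      def currentCatalog: String                    // catalog that was current when the view was created
      def currentNamespace: Array[String]           // namespace that was current when the view was created
      def schema: StructType                        // stored schema (overlaps with Table)
      def properties: java.util.Map[String, String] // free-form metadata (overlaps with Table)
    }

    trait ViewCatalog {
      def loadView(ident: Identifier): View         // fails if the identifier does not resolve to a view
      def viewExists(ident: Identifier): Boolean
    }

With a split like this, a metastore-backed catalog that keeps views and tables in one namespace can implement both ViewCatalog and TableCatalog on the same class, while catalogs such as the JDBC or Cassandra ones simply never implement the view side.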
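
Likewise, the checkAnalysis behavior John quotes boils down to roughly the following. This is a simplified sketch that compares schemas directly; Spark's actual rule works on the analyzed plan's attributes and also checks the stored query column names, and Cast.canUpCast is used here only to mirror the "up cast" requirement from the quoted comment.

    // Simplified sketch of the view-output validation quoted above, not the real rule.
    import org.apache.spark.sql.catalyst.expressions.Cast
    import org.apache.spark.sql.types.StructType

    object ViewSchemaCheck {
      def validate(viewSchema: StructType, childSchema: StructType): Unit = {
        // The view output must have the same number of columns as the child (query) output.
        require(viewSchema.length == childSchema.length,
          s"View defines ${viewSchema.length} columns but its query produced ${childSchema.length}")
        // Each query column must be up-castable to the corresponding view column.
        viewSchema.fields.zip(childSchema.fields).foreach { case (viewCol, childCol) =>
          require(Cast.canUpCast(childCol.dataType, viewCol.dataType),
            s"Cannot up-cast column ${childCol.name} from ${childCol.dataType} to ${viewCol.dataType}")
        }
      }
    }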