Yeah, let's move forward first. We can discuss the caching approach and the TableViewCatalog approach during the PR review.
On Tue, May 25, 2021 at 1:48 AM John Zhuge <jzh...@apache.org> wrote:

> Hi everyone,
>
> Is there any more discussion before we start a vote on ViewCatalog? With FunctionCatalog merged, I hope this feature can complete the offerings of catalog plugins in 3.2.
>
> Once approved, I will refresh the WIP PR. Implementation details can be ironed out during review.
>
> Thanks,
>
> On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> An extra RPC call is a concern for the catalog implementation. It is simple to cache the result of a call to avoid a second one if the catalog chooses.
>>
>> I don't think that an extra RPC that can be easily avoided is a reasonable justification for adding caches in Spark. For one thing, it doesn't solve the problem, because the proposed API still requires separate lookups for tables and views.
>>
>> The only solution that would help is to use a combined trait, but that has issues. For one, view substitution is much cleaner when it happens well before table resolution. Also, View and Table are very different objects; returning Object from this API doesn't make much sense.
>>
>> One extra RPC is not unreasonable, and the choice should be left to sources. That's the easiest place to cache results from the underlying store.
>>
>> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Moving the discussion back to this thread. The current argument is how to avoid extra RPC calls for catalogs that support both tables and views. There are several options:
>>>
>>> 1. Ignore it, as extra RPC calls are cheap compared to the query execution
>>> 2. Have a per-session cache for loaded tables/views
>>> 3. Have a per-query cache for loaded tables/views
>>> 4. Add a new trait TableViewCatalog
>>>
>>> I think it's important to avoid perf regressions with new APIs. RPC calls can be significant for short queries.
>>> We may also double the RPC traffic, which is bad for the metastore service. Normally I would not recommend caching, as cache invalidation is a hard problem. Personally I prefer option 4, as it only affects catalogs that support both tables and views, and it fits the Hive catalog very well.
>>>
>>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> The SPIP <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> has been updated. Please review.
>>>>
>>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzh...@apache.org> wrote:
>>>>
>>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>>
>>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea.
>>>>>>
>>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>>
>>>>>>> The main two arguments I've heard for combining lookup into a single function are the ones brought up in this thread. First, an identifier in a catalog must be either a view or a table and should not collide. Second, a single lookup is more likely to require a single RPC. I think the RPC concern is well addressed by caching, which we already do in the Spark catalog, so I'll primarily focus on the first.
>>>>>>>
>>>>>>> Table/view name collision is unlikely to be a problem. Metastores that support both today store them in a single namespace, so this is not a concern for even a naive implementation that talks to the Hive MetaStore.
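The caching options sketched in the list above (per-session or per-query caches, options 2 and 3) can be illustrated with a short, self-contained sketch. All names here (CachingCatalog, Entry, the probe order) are hypothetical and chosen for illustration; this is not Spark's actual catalog API. The idea is that a catalog backing both tables and views can memoize the single metastore lookup, so that Spark's separate view and table probes for the same identifier cost only one RPC:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch (not Spark's actual API): a catalog that serves both
// tables and views memoizes the metastore lookup per query, so separate
// loadView()/loadTable() probes for one identifier cost a single RPC.
public class CachingCatalog {
    enum Kind { TABLE, VIEW }
    record Entry(Kind kind, String definition) {}

    private final Map<String, Entry> metastore = new HashMap<>();
    private final Map<String, Optional<Entry>> perQueryCache = new HashMap<>();
    int rpcCount = 0; // exposed only for the demo below

    CachingCatalog() {
        metastore.put("db.t", new Entry(Kind.TABLE, "parquet"));
        metastore.put("db.v", new Entry(Kind.VIEW, "SELECT * FROM db.t"));
    }

    // Simulates one RPC to the metastore; memoized per query.
    private Optional<Entry> lookup(String ident) {
        return perQueryCache.computeIfAbsent(ident, k -> {
            rpcCount++;
            return Optional.ofNullable(metastore.get(k));
        });
    }

    // Spark probes views first (view substitution), then tables.
    Optional<Entry> loadView(String ident) {
        return lookup(ident).filter(e -> e.kind() == Kind.VIEW);
    }

    Optional<Entry> loadTable(String ident) {
        return lookup(ident).filter(e -> e.kind() == Kind.TABLE);
    }

    public static void main(String[] args) {
        CachingCatalog c = new CachingCatalog();
        // "db.t" is probed as a view first (miss), then as a table (hit);
        // both probes share one cached metastore call.
        boolean isView = c.loadView("db.t").isPresent();   // false
        boolean isTable = c.loadTable("db.t").isPresent(); // true
        System.out.println(isView + " " + isTable + " " + c.rpcCount);
    }
}
```

Under this sketch, option 4 (TableViewCatalog) would instead make the combined lookup part of the API contract rather than an implementation detail of the catalog.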
>>>>>>> I know that a new metastore catalog could choose to implement both ViewCatalog and TableCatalog and store the two sets separately, but that would be a very strange choice: if the metastore itself has different namespaces for tables and views, then it makes much more sense to expose them through separate catalogs, because Spark will always prefer one over the other.
>>>>>>>
>>>>>>> In a similar line of reasoning, catalogs that expose both views and tables are much rarer than catalogs that expose only one. For example, v2 catalogs for JDBC and Cassandra expose data through the Table interface, and implementing ViewCatalog would make little sense. Exposing new data sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to be the same. Say I have a way to convert Pig statements or some other representation into a SQL view. It would make little sense to combine that with some other TableCatalog.
>>>>>>>
>>>>>>> I also don't think there is benefit from an API perspective to justify combining the Table and View interfaces. The two share only schema and properties, and they are handled very differently internally: a View's SQL query is parsed and substituted into the plan, while a Table is wrapped in a relation that eventually becomes a Scan node using SupportsRead. A view's SQL also needs additional context to be resolved correctly: the current catalog and namespace from the time the view was created.
>>>>>>>
>>>>>>> Query planning is distinct between tables and views, so Spark doesn't benefit from combining them.
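The point that a view's SQL needs the catalog and namespace captured at creation time can be shown with a minimal sketch. The names here (ViewDef, qualify) are illustrative assumptions, not Spark's API: an unqualified table reference inside the view body must resolve against the creator's namespace, not the reader's current one.

```java
// Illustrative sketch (assumed names, not Spark's API) of creation-time view
// context: unqualified references in the view body resolve against the
// catalog/namespace stored with the view definition.
public class ViewContext {
    record ViewDef(String sql, String currentCatalog, String[] currentNamespace) {}

    // Qualify a relation referenced inside the view body using the stored context.
    static String qualify(String relation, ViewDef view) {
        if (relation.contains(".")) return relation; // already qualified: keep as written
        return view.currentCatalog() + "."
            + String.join(".", view.currentNamespace()) + "." + relation;
    }

    public static void main(String[] args) {
        // View created while the session was in catalog "prod", namespace "sales".
        ViewDef v = new ViewDef("SELECT * FROM events", "prod", new String[] {"sales"});
        System.out.println(qualify("events", v));    // prod.sales.events
        System.out.println(qualify("ops.audit", v)); // ops.audit
    }
}
```

Because this context lives with the view, substitution can happen early (in the substitution batch), before any table resolution runs.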
>>>>>>> I think it has actually caused problems that both were resolved by the same method in v1: the resolution rule grew extremely complicated trying to look up a reference just once, because it had to parse a view plan and resolve relations within it using the view's context (current database). In contrast, John's new view substitution rules are cleaner and can stay within the substitution batch.
>>>>>>>
>>>>>>> People implementing views would also not benefit from combining the two interfaces:
>>>>>>>
>>>>>>> - There is little overlap between View and Table, only schema and properties
>>>>>>> - Most catalogs won't implement both interfaces, so returning a ViewOrTable is more difficult for implementations
>>>>>>> - TableCatalog assumes that ViewCatalog will be added separately, like John proposes, so we would have to break or replace that API
>>>>>>>
>>>>>>> I understand the initial appeal of combining TableCatalog and ViewCatalog since it is done that way in the existing interfaces. But I think that Hive chose to do that mostly because the two were already stored together, and not because it made sense for users of the API, or any other implementer of the API.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>>>
>>>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>>>
>>>>>>>>> Correction: Spark adds a new Project at the top of the parsed plan from view, based on the stored schema, to make sure the view schema doesn't change.
>>>>>>>>
>>>>>>>> Thanks Wenchen!
>>>>>>>> I thought I forgot something :) Yes, it is the validation done in *checkAnalysis*:
>>>>>>>>
>>>>>>>> // If the view output doesn't have the same number of columns neither with the child
>>>>>>>> // output, nor with the query column names, throw an AnalysisException.
>>>>>>>> // If the view's child output can't up cast to the view output,
>>>>>>>> // throw an AnalysisException, too.
>>>>>>>>
>>>>>>>> The view output comes from the schema:
>>>>>>>>
>>>>>>>> val child = View(
>>>>>>>>   desc = metadata,
>>>>>>>>   output = metadata.schema.toAttributes,
>>>>>>>>   child = parser.parsePlan(viewText))
>>>>>>>>
>>>>>>>> So it is a validation (here) or a cache (in DESCRIBE), nice to have but not "required" or "should be frozen". Thanks Ryan and Burak for pointing that out in the SPIP. I will add a new paragraph accordingly.
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>
> --
> John Zhuge
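The checkAnalysis validation quoted above can be approximated by a small standalone sketch. The helper names and the toy up-cast rule are assumptions for illustration, not Spark's implementation: the stored view schema must match the parsed child plan's output in column count, and each child column must be safely castable to the corresponding view column type.

```java
import java.util.List;

// Simplified, illustrative version of the view-schema validation (assumed
// names, not Spark's code): check column counts, then per-column up-cast.
public class ViewSchemaCheck {
    record Column(String name, String type) {}

    // Toy up-cast rule: identical types, or a safe numeric widening.
    static boolean canUpCast(String from, String to) {
        if (from.equals(to)) return true;
        return (from.equals("int") && (to.equals("long") || to.equals("double")))
            || (from.equals("long") && to.equals("double"));
    }

    static void check(List<Column> viewOutput, List<Column> childOutput) {
        if (viewOutput.size() != childOutput.size()) {
            throw new IllegalStateException("AnalysisException: column count mismatch");
        }
        for (int i = 0; i < viewOutput.size(); i++) {
            if (!canUpCast(childOutput.get(i).type(), viewOutput.get(i).type())) {
                throw new IllegalStateException(
                    "AnalysisException: cannot up-cast " + childOutput.get(i).name());
            }
        }
    }

    public static void main(String[] args) {
        List<Column> view = List.of(new Column("id", "long"), new Column("amt", "double"));
        List<Column> child = List.of(new Column("id", "int"), new Column("amt", "double"));
        check(view, child); // passes: int safely up-casts to long
        System.out.println("ok");
    }
}
```

This is why the stored schema is useful for validation (and for DESCRIBE) without having to be treated as frozen.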