Hi everyone,

Is there any more discussion before we start a vote on ViewCatalog? With FunctionCatalog merged, I hope this feature can complete the offerings of catalog plugins in 3.2.
Once approved, I will refresh the WIP PR. Implementation details can be ironed out during review.

Thanks,

On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> An extra RPC call is a concern for the catalog implementation. It is simple to cache the result of a call to avoid a second one if the catalog chooses.
>
> I don't think that an extra RPC that can be easily avoided is a reasonable justification to add caches in Spark. For one thing, it doesn't solve the problem, because the proposed API still requires separate lookups for tables and views.
>
> The only solution that would help is to use a combined trait, but that has issues. For one, view substitution is much cleaner when it happens well before table resolution. And View and Table are very different objects; returning Object from this API doesn't make much sense.
>
> One extra RPC is not unreasonable, and the choice should be left to sources. That's the easiest place to cache results from the underlying store.
>
> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Moving the discussion back to this thread. The current argument is about how to avoid extra RPC calls for catalogs that support both tables and views. There are several options:
>> 1. Ignore it, as extra RPC calls are cheap compared to the query execution.
>> 2. Keep a per-session cache for loaded tables/views.
>> 3. Keep a per-query cache for loaded tables/views.
>> 4. Add a new trait TableViewCatalog.
>>
>> I think it's important to avoid performance regressions with new APIs. RPC calls can be significant for short queries, and we may also double the RPC traffic, which is bad for the metastore service. Normally I would not recommend caching, as cache invalidation is a hard problem. Personally I prefer option 4, as it only affects catalogs that support both tables and views, and it fits the Hive catalog very well.
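[For illustration only: a minimal Scala sketch of the source-side caching Ryan describes, showing how per-identifier memoization also avoids the doubled RPC traffic Wenchen worries about. MetastoreClient, RawEntity, and CachingCatalog are hypothetical names, not part of the proposal.]

    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical stand-ins for a metastore client and its raw response type.
    case class RawEntity(name: String, isView: Boolean)
    trait MetastoreClient { def fetch(ident: String): Option[RawEntity] }

    // Memoize the raw metastore response per identifier, so a view lookup
    // followed by a table lookup for the same name costs one RPC, not two.
    class CachingCatalog(client: MetastoreClient) {
      private val entities = new ConcurrentHashMap[String, Option[RawEntity]]()

      private def lookup(ident: String): Option[RawEntity] =
        entities.computeIfAbsent(ident, id => client.fetch(id))

      def loadView(ident: String): Option[RawEntity] = lookup(ident).filter(_.isView)
      def loadTable(ident: String): Option[RawEntity] = lookup(ident).filterNot(_.isView)

      // Cache invalidation stays the source's problem, as noted in the thread.
      def invalidate(ident: String): Unit = entities.remove(ident)
    }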
>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jzh...@apache.org> wrote:
>>
>>> The SPIP <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> has been updated. Please review.
>>>
>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>
>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea.
>>>>>
>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>
>>>>>> The two main arguments I've heard for combining lookup into a single function are the ones brought up in this thread. First, an identifier in a catalog must be either a view or a table and should not collide. Second, a single lookup is more likely to require a single RPC. I think the RPC concern is well addressed by caching, which we already do in the Spark catalog, so I'll primarily focus on the first.
>>>>>>
>>>>>> Table/view name collision is unlikely to be a problem. Metastores that support both today store them in a single namespace, so this is not a concern even for a naive implementation that talks to the Hive MetaStore. I know that a new metastore catalog could choose to implement both ViewCatalog and TableCatalog and store the two sets separately, but that would be a very strange choice: if the metastore itself has different namespaces for tables and views, then it makes much more sense to expose them through separate catalogs, because Spark will always prefer one over the other.
>>>>>>
>>>>>> In a similar line of reasoning, catalogs that expose both views and tables are much rarer than catalogs that expose only one. For example, the v2 catalogs for JDBC and Cassandra expose data through the Table interface, and implementing ViewCatalog would make little sense there. Exposing new data sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to be the same: say I have a way to convert Pig statements or some other representation into a SQL view; it would make little sense to combine that with some other TableCatalog.
>>>>>>
>>>>>> I also don't think there is a benefit from an API perspective that justifies combining the Table and View interfaces. The two share only schema and properties, and they are handled very differently internally: a View's SQL query is parsed and substituted into the plan, while a Table is wrapped in a relation that eventually becomes a Scan node using SupportsRead. A view's SQL also needs additional context to be resolved correctly: the current catalog and namespace from the time the view was created.
>>>>>>
>>>>>> Query planning is distinct between tables and views, so Spark doesn't benefit from combining them. I think it has actually caused problems that both were resolved by the same method in v1: the resolution rule grew extremely complicated trying to look up a reference just once, because it had to parse a view plan and resolve relations within it using the view's context (current database). In contrast, John's new view substitution rules are cleaner and can stay within the substitution batch.
>>>>>>
>>>>>> People implementing views would also not benefit from combining the two interfaces:
>>>>>>
>>>>>> - There is little overlap between View and Table, only schema and properties
>>>>>> - Most catalogs won't implement both interfaces, so returning a ViewOrTable is more difficult for implementations
>>>>>> - TableCatalog assumes that ViewCatalog will be added separately, as John proposes, so we would have to break or replace that API
>>>>>>
>>>>>> I understand the initial appeal of combining TableCatalog and ViewCatalog, since it is done that way in the existing interfaces. But I think Hive chose that mostly because the two were already stored together, not because it made sense for users of the API or any other implementer of it.
>>>>>>
>>>>>> rb
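[For illustration only: a rough Scala sketch of the separation Ryan describes, including the creation-time context a view needs for resolution. The interface shapes are paraphrased from the discussion, not the exact SPIP signatures, and substituteView is a hypothetical helper.]

    // Paraphrased shapes: a view carries its SQL text plus the catalog and
    // namespace captured when it was created.
    trait View {
      def query: String                   // the view's SQL text
      def currentCatalog: String          // captured at CREATE VIEW time
      def currentNamespace: Array[String] // captured at CREATE VIEW time
    }

    trait ViewCatalog {
      def loadView(ident: String): Option[View]
    }

    object ViewSubstitution {
      // Substitution runs before table resolution: the view text is parsed and
      // spliced into the plan, and relations inside it must resolve against the
      // view's stored context rather than the current session's.
      def substituteView(catalog: ViewCatalog, ident: String): Option[String] =
        catalog.loadView(ident).map { v =>
          s"parse(${v.query}) with catalog=${v.currentCatalog}, " +
            s"namespace=${v.currentNamespace.mkString(".")}"
        }
    }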
>>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>>
>>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>>
>>>>>>>> Correction: Spark adds a new Project at the top of the parsed plan from the view, based on the stored schema, to make sure the view schema doesn't change.
>>>>>>>
>>>>>>> Thanks Wenchen! I thought I had forgotten something :) Yes, it is the validation done in *checkAnalysis*:
>>>>>>>
>>>>>>>   // If the view output doesn't have the same number of columns neither with the child
>>>>>>>   // output, nor with the query column names, throw an AnalysisException.
>>>>>>>   // If the view's child output can't up cast to the view output,
>>>>>>>   // throw an AnalysisException, too.
>>>>>>>
>>>>>>> The view output comes from the schema:
>>>>>>>
>>>>>>>   val child = View(
>>>>>>>     desc = metadata,
>>>>>>>     output = metadata.schema.toAttributes,
>>>>>>>     child = parser.parsePlan(viewText))
>>>>>>>
>>>>>>> So it is a validation (here) or a cache (in DESCRIBE) that is nice to have, but not "required" or "should be frozen". Thanks Ryan and Burak for pointing that out in the SPIP. I will add a new paragraph accordingly.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>
>>>> --
>>>> John Zhuge
>>>
>>> --
>>> John Zhuge
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
John Zhuge
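[For illustration only: a simplified Scala sketch of the checkAnalysis validation John quotes above, where the View node's output built from the stored schema is compared against the child plan's output. Attribute and canUpCast are stand-ins for Spark's internal attribute class and Cast.canUpCast, not the real implementation.]

    // Simplified stand-in for Spark's attribute class.
    case class Attribute(name: String, dataType: String)

    object ViewOutputCheck {
      // Placeholder for Cast.canUpCast: accept identical types or int-to-long.
      def canUpCast(from: String, to: String): Boolean =
        from == to || (from == "int" && to == "long")

      // A column-count mismatch or an unsafe cast between the child output and
      // the stored view output is what raises the AnalysisException in the
      // real rule.
      def checkViewOutput(viewOutput: Seq[Attribute], childOutput: Seq[Attribute]): Unit = {
        require(viewOutput.length == childOutput.length,
          s"view has ${viewOutput.length} columns but its child produces ${childOutput.length}")
        viewOutput.zip(childOutput).foreach { case (v, c) =>
          require(canUpCast(c.dataType, v.dataType),
            s"cannot up-cast ${c.name}: ${c.dataType} to ${v.name}: ${v.dataType}")
        }
      }
    }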