Great! I will start a vote thread.

On Mon, May 24, 2021 at 10:54 AM Wenchen Fan <cloud0...@gmail.com> wrote:
Yea let's move forward first. We can discuss the caching approach and the TableViewCatalog approach during the PR review.

On Tue, May 25, 2021 at 1:48 AM John Zhuge <jzh...@apache.org> wrote:

Hi everyone,

Is there any more discussion before we start a vote on ViewCatalog? With FunctionCatalog merged, I hope this feature can complete the offerings of catalog plugins in 3.2.

Once approved, I will refresh the WIP PR. Implementation details can be ironed out during review.

Thanks,

On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

An extra RPC call is a concern for the catalog implementation. It is simple to cache the result of a call to avoid a second one if the catalog chooses.

I don't think that an extra RPC that can be easily avoided is a reasonable justification to add caches in Spark. For one thing, it doesn't solve the problem because the proposed API still requires separate lookups for tables and views.

The only solution that would help is to use a combined trait, but that has issues. For one, view substitution is much cleaner when it happens well before table resolution. And View and Table are very different objects; returning Object from this API doesn't make much sense.

One extra RPC is not unreasonable, and the choice should be left to sources. That's the easiest place to cache results from the underlying store.

On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cloud0...@gmail.com> wrote:

Moving the discussion back to this thread. The current argument is how to avoid extra RPC calls for catalogs supporting both tables and views. There are several options:

1. Ignore it, as extra RPC calls are cheap compared to the query execution.
2. Have a per-session cache for loaded tables/views.
3. Have a per-query cache for loaded tables/views.
4. Add a new trait TableViewCatalog.

I think it's important to avoid perf regression with new APIs. RPC calls can be significant for short queries. We may also double the RPC traffic, which is bad for the metastore service. Normally I would not recommend caching, as cache invalidation is a hard problem. Personally I prefer option 4, as it only affects catalogs that support both tables and views, and it fits the Hive catalog very well.
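For concreteness, option 4 might look roughly like the Scala sketch below. Only the trait name comes from Wenchen's list; the slimmed-down View and ViewCatalog stand-ins and the loadTableOrView method are illustrative assumptions, not actual Spark APIs.

    import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}

    // Stand-ins for the interfaces proposed in the SPIP, reduced to what
    // this sketch needs; the real proposal has more methods.
    trait View { def query: String }
    trait ViewCatalog { def loadView(ident: Identifier): View }

    // Option 4: one trait for catalogs that back both tables and views,
    // so a single metastore round trip can resolve an identifier.
    trait TableViewCatalog extends TableCatalog with ViewCatalog {
      // Returns Left(table) or Right(view); an implementation backed by
      // the Hive Metastore can answer with one RPC because tables and
      // views share a namespace there.
      def loadTableOrView(ident: Identifier): Either[Table, View]
    }

Note that Ryan's objection above would still apply: Spark needs separate handling for each side of the Either, since tables and views take different planning paths.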
On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jzh...@apache.org> wrote:

The SPIP <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> has been updated. Please review.

On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzh...@apache.org> wrote:

Wenchen, sorry for the delay, I will post an update shortly.

On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0...@gmail.com> wrote:

Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea.

On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:

I think it is a good idea to keep tables and views separate.

The main two arguments I've heard for combining lookup into a single function are the ones brought up in this thread. First, an identifier in a catalog must be either a view or a table and should not collide. Second, a single lookup is more likely to require a single RPC. I think the RPC concern is well addressed by caching, which we already do in the Spark catalog, so I'll primarily focus on the first.

Table/view name collision is unlikely to be a problem. Metastores that support both today store them in a single namespace, so this is not a concern for even a naive implementation that talks to the Hive Metastore. I know that a new metastore catalog could choose to implement both ViewCatalog and TableCatalog and store the two sets separately, but that would be a very strange choice: if the metastore itself has different namespaces for tables and views, then it makes much more sense to expose them through separate catalogs, because Spark will always prefer one over the other.

By a similar line of reasoning, catalogs that expose both views and tables are much rarer than catalogs that expose only one. For example, v2 catalogs for JDBC and Cassandra expose data through the Table interface, and implementing ViewCatalog for them would make little sense. Exposing new data sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to be the same: say I have a way to convert Pig statements or some other representation into a SQL view; it would make little sense to combine that with some other TableCatalog.

I also don't think there is benefit from an API perspective to justify combining the Table and View interfaces. The two share only schema and properties, and they are handled very differently internally: a View's SQL query is parsed and substituted into the plan, while a Table is wrapped in a relation that eventually becomes a Scan node using SupportsRead. A view's SQL also needs additional context to be resolved correctly: the current catalog and namespace from the time the view was created.

Query planning is distinct between tables and views, so Spark doesn't benefit from combining them. I think it has actually caused problems that both were resolved by the same method in v1: the resolution rule grew extremely complicated trying to look up a reference just once, because it had to parse a view plan and resolve relations within it using the view's context (current database). In contrast, John's new view substitution rules are cleaner and can stay within the substitution batch.

People implementing views would also not benefit from combining the two interfaces:

- There is little overlap between View and Table, only schema and properties.
- Most catalogs won't implement both interfaces, so returning a ViewOrTable is more difficult for implementations.
- TableCatalog assumes that ViewCatalog will be added separately, as John proposes, so we would have to break or replace that API.

I understand the initial appeal of combining TableCatalog and ViewCatalog, since it is done that way in the existing interfaces. But I think Hive chose that mostly because the two were already stored together, not because it made sense for users of the API or any other implementer of the API.

rb
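A minimal sketch of the View shape Ryan describes, assuming method names that may differ from the SPIP: the only overlap with Table is schema and properties, while the rest is the SQL text plus the creation-time context needed to resolve it.

    import java.util
    import org.apache.spark.sql.types.StructType

    // Sketch of a standalone View interface per the discussion above;
    // the SPIP defines the actual proposal and these names are assumptions.
    trait View {
      def name: String
      def query: String                   // view SQL text, parsed during substitution
      def currentCatalog: String          // catalog captured at CREATE VIEW time
      def currentNamespace: Array[String] // namespace captured at CREATE VIEW time
      def schema: StructType              // shared with Table
      def properties: util.Map[String, String] // shared with Table
    }

Under this shape, substitution can parse query and resolve it against currentCatalog and currentNamespace well before table resolution, which is the separation Ryan describes.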
On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:

> > AFAIK view schema is only used by DESCRIBE.
>
> Correction: Spark adds a new Project at the top of the plan parsed from
> the view, based on the stored schema, to make sure the view schema
> doesn't change.

Thanks Wenchen! I thought I forgot something :) Yes, it is the validation done in *checkAnalysis*:

    // If the view output doesn't have the same number of columns neither with the child
    // output, nor with the query column names, throw an AnalysisException.
    // If the view's child output can't up cast to the view output,
    // throw an AnalysisException, too.

The view output comes from the schema:

    val child = View(
      desc = metadata,
      output = metadata.schema.toAttributes,
      child = parser.parsePlan(viewText))

So the stored schema is nice to have for validation (here) or as a cache (in DESCRIBE), but it is not "required" and the schema need not be frozen. Thanks Ryan and Burak for pointing that out in the SPIP. I will add a new paragraph accordingly.
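To make Wenchen's correction concrete, here is a simplified sketch (not the exact Spark code) of how a Project built from the stored schema can pin the view's output, reusing the metadata, parser, and viewText values from John's snippet above:

    import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
    import org.apache.spark.sql.catalyst.expressions.{Alias, UpCast}
    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

    // Pin the view's output to the stored schema: project each stored
    // column over the parsed plan, with an up-cast to the stored type.
    // checkAnalysis then rejects plans whose child output no longer
    // matches in arity or cannot up-cast to these columns.
    val parsed: LogicalPlan = parser.parsePlan(viewText)
    val frozen: LogicalPlan = Project(
      metadata.schema.map { field =>
        Alias(UpCast(UnresolvedAttribute(field.name), field.dataType), field.name)()
      },
      parsed)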