Yeah, let's move forward first. We can discuss the caching approach and the TableViewCatalog approach during the PR review.
On Tue, May 25, 2021 at 1:48 AM John Zhuge <jzh...@apache.org> wrote:

> Hi everyone,
>
> Is there any more discussion before we start a vote on ViewCatalog? With FunctionCatalog merged, I hope this feature can complete the offerings of catalog plugins in 3.2.
>
> Once approved, I will refresh the WIP PR. Implementation details can be ironed out during review.
>
> Thanks,
>
> On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> An extra RPC call is a concern for the catalog implementation. It is simple to cache the result of a call to avoid a second one if the catalog chooses.
>>
>> I don't think that an extra RPC that can be easily avoided is a reasonable justification for adding caches in Spark. For one thing, it doesn't solve the problem, because the proposed API still requires separate lookups for tables and views.
>>
>> The only solution that would help is to use a combined trait, but that has issues. For one, view substitution is much cleaner when it happens well before table resolution. Also, View and Table are very different objects; returning Object from this API doesn't make much sense.
>>
>> One extra RPC is not unreasonable, and the choice should be left to sources. That's the easiest place to cache results from the underlying store.
>>
>> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Moving the discussion back to this thread. The current argument is how to avoid extra RPC calls for catalogs that support both tables and views. There are several options:
>>>
>>> 1. Ignore it, as extra RPC calls are cheap compared to the query execution
>>> 2. Have a per-session cache for loaded tables/views
>>> 3. Have a per-query cache for loaded tables/views
>>> 4. Add a new trait TableViewCatalog
>>>
>>> I think it's important to avoid perf regressions with new APIs. RPC calls can be significant for short queries.
>>> We may also double the RPC traffic, which is bad for the metastore service. Normally I would not recommend caching, as cache invalidation is a hard problem. Personally I prefer option 4, as it only affects catalogs that support both tables and views, and it fits the Hive catalog very well.
>>>
>>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> The SPIP <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> has been updated. Please review.
>>>>
>>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzh...@apache.org> wrote:
>>>>
>>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>>
>>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea.
>>>>>>
>>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>>
>>>>>>> The main two arguments I've heard for combining lookup into a single function are the ones brought up in this thread. First, an identifier in a catalog must be either a view or a table and should not collide. Second, a single lookup is more likely to require a single RPC. I think the RPC concern is well addressed by caching, which we already do in the Spark catalog, so I'll primarily focus on the first.
>>>>>>>
>>>>>>> Table/view name collision is unlikely to be a problem. Metastores that support both today store them in a single namespace, so this is not a concern for even a naive implementation that talks to the Hive MetaStore.
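The caching options sketched in the list above (per-session or per-query caches, options 2 and 3) can be illustrated with a short, self-contained sketch. All names here (CachingCatalog, Entry, the probe order) are hypothetical and chosen for illustration; this is not Spark's actual catalog API. The idea is that a catalog backing both tables and views can memoize the single metastore lookup, so that Spark's separate view and table probes for the same identifier cost only one RPC:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch (not Spark's actual API): a catalog that serves both
// tables and views memoizes the metastore lookup per query, so separate
// loadView()/loadTable() probes for one identifier cost a single RPC.
public class CachingCatalog {
    enum Kind { TABLE, VIEW }
    record Entry(Kind kind, String definition) {}

    private final Map<String, Entry> metastore = new HashMap<>();
    private final Map<String, Optional<Entry>> perQueryCache = new HashMap<>();
    int rpcCount = 0; // exposed only for the demo below

    CachingCatalog() {
        metastore.put("db.t", new Entry(Kind.TABLE, "parquet"));
        metastore.put("db.v", new Entry(Kind.VIEW, "SELECT * FROM db.t"));
    }

    // Simulates one RPC to the metastore; memoized per query.
    private Optional<Entry> lookup(String ident) {
        return perQueryCache.computeIfAbsent(ident, k -> {
            rpcCount++;
            return Optional.ofNullable(metastore.get(k));
        });
    }

    // Spark probes views first (view substitution), then tables.
    Optional<Entry> loadView(String ident) {
        return lookup(ident).filter(e -> e.kind() == Kind.VIEW);
    }

    Optional<Entry> loadTable(String ident) {
        return lookup(ident).filter(e -> e.kind() == Kind.TABLE);
    }

    public static void main(String[] args) {
        CachingCatalog c = new CachingCatalog();
        // "db.t" is probed as a view first (miss), then as a table (hit);
        // both probes share one cached metastore call.
        boolean isView = c.loadView("db.t").isPresent();   // false
        boolean isTable = c.loadTable("db.t").isPresent(); // true
        System.out.println(isView + " " + isTable + " " + c.rpcCount);
    }
}
```

Under this sketch, option 4 (TableViewCatalog) would instead make the combined lookup part of the API contract rather than an implementation detail of the catalog.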
>>>>>>> I know that a new metastore catalog could choose to implement both ViewCatalog and TableCatalog and store the two sets separately, but that would be a very strange choice: if the metastore itself has different namespaces for tables and views, then it makes much more sense to expose them through separate catalogs, because Spark will always prefer one over the other.
>>>>>>>
>>>>>>> In a similar line of reasoning, catalogs that expose both views and tables are much rarer than catalogs that expose only one. For example, v2 catalogs for JDBC and Cassandra expose data through the Table interface, and implementing ViewCatalog would make little sense. Exposing new data sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to be the same. Say I have a way to convert Pig statements or some other representation into a SQL view. It would make little sense to combine that with some other TableCatalog.
>>>>>>>
>>>>>>> I also don't think there is benefit from an API perspective to justify combining the Table and View interfaces. The two share only schema and properties, and they are handled very differently internally: a View's SQL query is parsed and substituted into the plan, while a Table is wrapped in a relation that eventually becomes a Scan node using SupportsRead. A view's SQL also needs additional context to be resolved correctly: the current catalog and namespace from the time the view was created.
>>>>>>>
>>>>>>> Query planning is distinct between tables and views, so Spark doesn't benefit from combining them.
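The point that a view's SQL needs the catalog and namespace captured at creation time can be shown with a minimal sketch. The names here (ViewDef, qualify) are illustrative assumptions, not Spark's API: an unqualified table reference inside the view body must resolve against the creator's namespace, not the reader's current one.

```java
// Illustrative sketch (assumed names, not Spark's API) of creation-time view
// context: unqualified references in the view body resolve against the
// catalog/namespace stored with the view definition.
public class ViewContext {
    record ViewDef(String sql, String currentCatalog, String[] currentNamespace) {}

    // Qualify a relation referenced inside the view body using the stored context.
    static String qualify(String relation, ViewDef view) {
        if (relation.contains(".")) return relation; // already qualified: keep as written
        return view.currentCatalog() + "."
            + String.join(".", view.currentNamespace()) + "." + relation;
    }

    public static void main(String[] args) {
        // View created while the session was in catalog "prod", namespace "sales".
        ViewDef v = new ViewDef("SELECT * FROM events", "prod", new String[] {"sales"});
        System.out.println(qualify("events", v));    // prod.sales.events
        System.out.println(qualify("ops.audit", v)); // ops.audit
    }
}
```

Because this context lives with the view, substitution can happen early (in the substitution batch), before any table resolution runs.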
>>>>>>> I think it has actually caused problems that both were resolved by the same method in v1: the resolution rule grew extremely complicated trying to look up a reference just once, because it had to parse a view plan and resolve relations within it using the view's context (current database). In contrast, John's new view substitution rules are cleaner and can stay within the substitution batch.
>>>>>>>
>>>>>>> People implementing views would also not benefit from combining the two interfaces:
>>>>>>>
>>>>>>> - There is little overlap between View and Table, only schema and properties
>>>>>>> - Most catalogs won't implement both interfaces, so returning a ViewOrTable is more difficult for implementations
>>>>>>> - TableCatalog assumes that ViewCatalog will be added separately, like John proposes, so we would have to break or replace that API
>>>>>>>
>>>>>>> I understand the initial appeal of combining TableCatalog and ViewCatalog since it is done that way in the existing interfaces. But I think that Hive chose to do that mostly because the two were already stored together, and not because it made sense for users of the API, or any other implementer of the API.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>>>
>>>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>>>
>>>>>>>>> Correction: Spark adds a new Project at the top of the parsed plan from view, based on the stored schema, to make sure the view schema doesn't change.
>>>>>>>>
>>>>>>>> Thanks Wenchen!
>>>>>>>> I thought I forgot something :) Yes, it is the validation done in *checkAnalysis*:
>>>>>>>>
>>>>>>>> // If the view output doesn't have the same number of columns neither with the child
>>>>>>>> // output, nor with the query column names, throw an AnalysisException.
>>>>>>>> // If the view's child output can't up cast to the view output,
>>>>>>>> // throw an AnalysisException, too.
>>>>>>>>
>>>>>>>> The view output comes from the schema:
>>>>>>>>
>>>>>>>> val child = View(
>>>>>>>>   desc = metadata,
>>>>>>>>   output = metadata.schema.toAttributes,
>>>>>>>>   child = parser.parsePlan(viewText))
>>>>>>>>
>>>>>>>> So it is a validation (here) or a cache (in DESCRIBE), nice to have but not "required" or "should be frozen". Thanks Ryan and Burak for pointing that out in the SPIP. I will add a new paragraph accordingly.
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
>
> --
> John Zhuge
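The checkAnalysis validation quoted above can be approximated by a small standalone sketch. The helper names and the toy up-cast rule are assumptions for illustration, not Spark's implementation: the stored view schema must match the parsed child plan's output in column count, and each child column must be safely castable to the corresponding view column type.

```java
import java.util.List;

// Simplified, illustrative version of the view-schema validation (assumed
// names, not Spark's code): check column counts, then per-column up-cast.
public class ViewSchemaCheck {
    record Column(String name, String type) {}

    // Toy up-cast rule: identical types, or a safe numeric widening.
    static boolean canUpCast(String from, String to) {
        if (from.equals(to)) return true;
        return (from.equals("int") && (to.equals("long") || to.equals("double")))
            || (from.equals("long") && to.equals("double"));
    }

    static void check(List<Column> viewOutput, List<Column> childOutput) {
        if (viewOutput.size() != childOutput.size()) {
            throw new IllegalStateException("AnalysisException: column count mismatch");
        }
        for (int i = 0; i < viewOutput.size(); i++) {
            if (!canUpCast(childOutput.get(i).type(), viewOutput.get(i).type())) {
                throw new IllegalStateException(
                    "AnalysisException: cannot up-cast " + childOutput.get(i).name());
            }
        }
    }

    public static void main(String[] args) {
        List<Column> view = List.of(new Column("id", "long"), new Column("amt", "double"));
        List<Column> child = List.of(new Column("id", "int"), new Column("amt", "double"));
        check(view, child); // passes: int safely up-casts to long
        System.out.println("ok");
    }
}
```

This is why the stored schema is useful for validation (and for DESCRIBE) without having to be treated as frozen.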