Hi everyone,

Is there any more discussion before we start a vote on ViewCatalog? With FunctionCatalog merged, I hope this feature can complete the offerings of catalog plugins in 3.2.
Once approved, I will refresh the WIP PR. Implementation details can be ironed out during review.

Thanks,

On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> An extra RPC call is a concern for the catalog implementation. It is simple to cache the result of a call to avoid a second one if the catalog chooses.
>
> I don't think that an extra RPC that can be easily avoided is a reasonable justification to add caches in Spark. For one thing, it doesn't solve the problem, because the proposed API still requires separate lookups for tables and views.
>
> The only solution that would help is to use a combined trait, but that has issues. For one, view substitution is much cleaner when it happens well before table resolution. And View and Table are very different objects; returning Object from this API doesn't make much sense.
>
> One extra RPC is not unreasonable, and the choice should be left to sources. That's the easiest place to cache results from the underlying store.
>
> On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> Moving the discussion back to this thread. The current argument is about how to avoid extra RPC calls for catalogs that support both tables and views. There are several options:
>> 1. Ignore it, as extra RPC calls are cheap compared to the query execution.
>> 2. Keep a per-session cache for loaded tables/views.
>> 3. Keep a per-query cache for loaded tables/views.
>> 4. Add a new trait TableViewCatalog.
>>
>> I think it's important to avoid performance regressions with new APIs. RPC calls can be significant for short queries, and we may also double the RPC traffic, which is bad for the metastore service. Normally I would not recommend caching, as cache invalidation is a hard problem. Personally I prefer option 4, as it only affects catalogs that support both tables and views, and it fits the Hive catalog very well.
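[For illustration only: a minimal Scala sketch of the source-side caching Ryan describes, showing how per-identifier memoization also avoids the doubled RPC traffic Wenchen worries about. MetastoreClient, RawEntity, and CachingCatalog are hypothetical names, not part of the proposal.]

    import java.util.concurrent.ConcurrentHashMap

    // Hypothetical stand-ins for a metastore client and its raw response type.
    case class RawEntity(name: String, isView: Boolean)
    trait MetastoreClient { def fetch(ident: String): Option[RawEntity] }

    // Memoize the raw metastore response per identifier, so a view lookup
    // followed by a table lookup for the same name costs one RPC, not two.
    class CachingCatalog(client: MetastoreClient) {
      private val entities = new ConcurrentHashMap[String, Option[RawEntity]]()

      private def lookup(ident: String): Option[RawEntity] =
        entities.computeIfAbsent(ident, id => client.fetch(id))

      def loadView(ident: String): Option[RawEntity] = lookup(ident).filter(_.isView)
      def loadTable(ident: String): Option[RawEntity] = lookup(ident).filterNot(_.isView)

      // Cache invalidation stays the source's problem, as noted in the thread.
      def invalidate(ident: String): Unit = entities.remove(ident)
    }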
>> On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jzh...@apache.org> wrote:
>>
>>> The SPIP <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> has been updated. Please review.
>>>
>>> On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> Wenchen, sorry for the delay, I will post an update shortly.
>>>>
>>>> On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea.
>>>>>
>>>>> On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> I think it is a good idea to keep tables and views separate.
>>>>>>
>>>>>> The two main arguments I've heard for combining lookup into a single function are the ones brought up in this thread. First, an identifier in a catalog must be either a view or a table and should not collide. Second, a single lookup is more likely to require a single RPC. I think the RPC concern is well addressed by caching, which we already do in the Spark catalog, so I'll primarily focus on the first.
>>>>>>
>>>>>> Table/view name collision is unlikely to be a problem. Metastores that support both today store them in a single namespace, so this is not a concern even for a naive implementation that talks to the Hive MetaStore. I know that a new metastore catalog could choose to implement both ViewCatalog and TableCatalog and store the two sets separately, but that would be a very strange choice: if the metastore itself has different namespaces for tables and views, then it makes much more sense to expose them through separate catalogs, because Spark will always prefer one over the other.
>>>>>>
>>>>>> In a similar line of reasoning, catalogs that expose both views and tables are much rarer than catalogs that expose only one. For example, the v2 catalogs for JDBC and Cassandra expose data through the Table interface, and implementing ViewCatalog would make little sense there. Exposing new data sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to be the same: say I have a way to convert Pig statements or some other representation into a SQL view; it would make little sense to combine that with some other TableCatalog.
>>>>>>
>>>>>> I also don't think there is a benefit from an API perspective that justifies combining the Table and View interfaces. The two share only schema and properties, and they are handled very differently internally: a View's SQL query is parsed and substituted into the plan, while a Table is wrapped in a relation that eventually becomes a Scan node using SupportsRead. A view's SQL also needs additional context to be resolved correctly: the current catalog and namespace from the time the view was created.
>>>>>>
>>>>>> Query planning is distinct between tables and views, so Spark doesn't benefit from combining them. I think it has actually caused problems that both were resolved by the same method in v1: the resolution rule grew extremely complicated trying to look up a reference just once, because it had to parse a view plan and resolve relations within it using the view's context (current database). In contrast, John's new view substitution rules are cleaner and can stay within the substitution batch.
>>>>>>
>>>>>> People implementing views would also not benefit from combining the two interfaces:
>>>>>>
>>>>>> - There is little overlap between View and Table, only schema and properties
>>>>>> - Most catalogs won't implement both interfaces, so returning a ViewOrTable is more difficult for implementations
>>>>>> - TableCatalog assumes that ViewCatalog will be added separately, as John proposes, so we would have to break or replace that API
>>>>>>
>>>>>> I understand the initial appeal of combining TableCatalog and ViewCatalog, since it is done that way in the existing interfaces. But I think Hive chose that mostly because the two were already stored together, not because it made sense for users of the API or any other implementer of it.
>>>>>>
>>>>>> rb
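[For illustration only: a rough Scala sketch of the separation Ryan describes, including the creation-time context a view needs for resolution. The interface shapes are paraphrased from the discussion, not the exact SPIP signatures, and substituteView is a hypothetical helper.]

    // Paraphrased shapes: a view carries its SQL text plus the catalog and
    // namespace captured when it was created.
    trait View {
      def query: String                   // the view's SQL text
      def currentCatalog: String          // captured at CREATE VIEW time
      def currentNamespace: Array[String] // captured at CREATE VIEW time
    }

    trait ViewCatalog {
      def loadView(ident: String): Option[View]
    }

    object ViewSubstitution {
      // Substitution runs before table resolution: the view text is parsed and
      // spliced into the plan, and relations inside it must resolve against the
      // view's stored context rather than the current session's.
      def substituteView(catalog: ViewCatalog, ident: String): Option[String] =
        catalog.loadView(ident).map { v =>
          s"parse(${v.query}) with catalog=${v.currentCatalog}, " +
            s"namespace=${v.currentNamespace.mkString(".")}"
        }
    }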
>>>>>> On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:
>>>>>>
>>>>>>>> > AFAIK view schema is only used by DESCRIBE.
>>>>>>>>
>>>>>>>> Correction: Spark adds a new Project at the top of the parsed plan from the view, based on the stored schema, to make sure the view schema doesn't change.
>>>>>>>
>>>>>>> Thanks Wenchen! I thought I had forgotten something :) Yes, it is the validation done in *checkAnalysis*:
>>>>>>>
>>>>>>>   // If the view output doesn't have the same number of columns neither with the child
>>>>>>>   // output, nor with the query column names, throw an AnalysisException.
>>>>>>>   // If the view's child output can't up cast to the view output,
>>>>>>>   // throw an AnalysisException, too.
>>>>>>>
>>>>>>> The view output comes from the schema:
>>>>>>>
>>>>>>>   val child = View(
>>>>>>>     desc = metadata,
>>>>>>>     output = metadata.schema.toAttributes,
>>>>>>>     child = parser.parsePlan(viewText))
>>>>>>>
>>>>>>> So it is a validation (here) or a cache (in DESCRIBE) that is nice to have, but not "required" or "should be frozen". Thanks Ryan and Burak for pointing that out in the SPIP. I will add a new paragraph accordingly.
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>
>>>> --
>>>> John Zhuge
>>>
>>> --
>>> John Zhuge
>
> --
> Ryan Blue
> Software Engineer
> Netflix

--
John Zhuge
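[For illustration only: a simplified Scala sketch of the checkAnalysis validation John quotes above, where the View node's output built from the stored schema is compared against the child plan's output. Attribute and canUpCast are stand-ins for Spark's internal attribute class and Cast.canUpCast, not the real implementation.]

    // Simplified stand-in for Spark's attribute class.
    case class Attribute(name: String, dataType: String)

    object ViewOutputCheck {
      // Placeholder for Cast.canUpCast: accept identical types or int-to-long.
      def canUpCast(from: String, to: String): Boolean =
        from == to || (from == "int" && to == "long")

      // A column-count mismatch or an unsafe cast between the child output and
      // the stored view output is what raises the AnalysisException in the
      // real rule.
      def checkViewOutput(viewOutput: Seq[Attribute], childOutput: Seq[Attribute]): Unit = {
        require(viewOutput.length == childOutput.length,
          s"view has ${viewOutput.length} columns but its child produces ${childOutput.length}")
        viewOutput.zip(childOutput).foreach { case (v, c) =>
          require(canUpCast(c.dataType, v.dataType),
            s"cannot up-cast ${c.name}: ${c.dataType} to ${v.name}: ${v.dataType}")
        }
      }
    }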