Great! I will start a vote thread.

On Mon, May 24, 2021 at 10:54 AM Wenchen Fan <cloud0...@gmail.com> wrote:
Yea let's move forward first. We can discuss the caching approach and the TableViewCatalog approach during the PR review.

On Tue, May 25, 2021 at 1:48 AM John Zhuge <jzh...@apache.org> wrote:

Hi everyone,

Is there any more discussion before we start a vote on ViewCatalog? With FunctionCatalog merged, I hope this feature can complete the offerings of catalog plugins in 3.2.

Once approved, I will refresh the WIP PR. Implementation details can be ironed out during review.

Thanks,

On Tue, Nov 10, 2020 at 5:23 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

An extra RPC call is a concern for the catalog implementation. It is simple to cache the result of a call to avoid a second one if the catalog chooses.

I don't think that an extra RPC that can be easily avoided is a reasonable justification to add caches in Spark. For one thing, it doesn't solve the problem because the proposed API still requires separate lookups for tables and views.

The only solution that would help is to use a combined trait, but that has issues. For one, view substitution is much cleaner when it happens well before table resolution. And View and Table are very different objects; returning Object from this API doesn't make much sense.

One extra RPC is not unreasonable, and the choice should be left to sources. That's the easiest place to cache results from the underlying store.

On Mon, Nov 9, 2020 at 8:18 PM Wenchen Fan <cloud0...@gmail.com> wrote:

Moving the discussion back to this thread. The current argument is how to avoid extra RPC calls for catalogs supporting both tables and views. There are several options:

1. Ignore it, as extra RPC calls are cheap compared to the query execution.
2. Have a per-session cache for loaded tables/views.
3. Have a per-query cache for loaded tables/views.
4. Add a new trait TableViewCatalog.

I think it's important to avoid perf regression with new APIs. RPC calls can be significant for short queries. We may also double the RPC traffic, which is bad for the metastore service. Normally I would not recommend caching, as cache invalidation is a hard problem. Personally I prefer option 4, as it only affects catalogs that support both tables and views, and it fits the Hive catalog very well.
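For concreteness, option 4 might look roughly like the Scala sketch below. Only the trait name comes from Wenchen's list; the slimmed-down View and ViewCatalog stand-ins and the loadTableOrView method are illustrative assumptions, not actual Spark APIs.

    import org.apache.spark.sql.connector.catalog.{Identifier, Table, TableCatalog}

    // Stand-ins for the interfaces proposed in the SPIP, reduced to what
    // this sketch needs; the real proposal has more methods.
    trait View { def query: String }
    trait ViewCatalog { def loadView(ident: Identifier): View }

    // Option 4: one trait for catalogs that back both tables and views,
    // so a single metastore round trip can resolve an identifier.
    trait TableViewCatalog extends TableCatalog with ViewCatalog {
      // Returns Left(table) or Right(view); an implementation backed by
      // the Hive Metastore can answer with one RPC because tables and
      // views share a namespace there.
      def loadTableOrView(ident: Identifier): Either[Table, View]
    }

Note that Ryan's objection above would still apply: Spark needs separate handling for each side of the Either, since tables and views take different planning paths.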
On Fri, Sep 4, 2020 at 4:21 PM John Zhuge <jzh...@apache.org> wrote:

The SPIP <https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing> has been updated. Please review.

On Thu, Sep 3, 2020 at 9:22 AM John Zhuge <jzh...@apache.org> wrote:

Wenchen, sorry for the delay, I will post an update shortly.

On Thu, Sep 3, 2020 at 2:00 AM Wenchen Fan <cloud0...@gmail.com> wrote:

Any updates here? I agree that a new View API is better, but we need a solution to avoid performance regression. We need to elaborate on the cache idea.

On Thu, Aug 20, 2020 at 7:43 AM Ryan Blue <rb...@netflix.com> wrote:

I think it is a good idea to keep tables and views separate.

The main two arguments I've heard for combining lookup into a single function are the ones brought up in this thread. First, an identifier in a catalog must be either a view or a table and should not collide. Second, a single lookup is more likely to require a single RPC. I think the RPC concern is well addressed by caching, which we already do in the Spark catalog, so I'll primarily focus on the first.

Table/view name collision is unlikely to be a problem. Metastores that support both today store them in a single namespace, so this is not a concern for even a naive implementation that talks to the Hive Metastore. I know that a new metastore catalog could choose to implement both ViewCatalog and TableCatalog and store the two sets separately, but that would be a very strange choice: if the metastore itself has different namespaces for tables and views, then it makes much more sense to expose them through separate catalogs, because Spark will always prefer one over the other.

By a similar line of reasoning, catalogs that expose both views and tables are much rarer than catalogs that expose only one. For example, v2 catalogs for JDBC and Cassandra expose data through the Table interface, and implementing ViewCatalog for them would make little sense. Exposing new data sources to Spark requires TableCatalog, not ViewCatalog. View catalogs are likely to be the same: say I have a way to convert Pig statements or some other representation into a SQL view; it would make little sense to combine that with some other TableCatalog.

I also don't think there is benefit from an API perspective to justify combining the Table and View interfaces. The two share only schema and properties, and they are handled very differently internally: a View's SQL query is parsed and substituted into the plan, while a Table is wrapped in a relation that eventually becomes a Scan node using SupportsRead. A view's SQL also needs additional context to be resolved correctly: the current catalog and namespace from the time the view was created.

Query planning is distinct between tables and views, so Spark doesn't benefit from combining them. I think it has actually caused problems that both were resolved by the same method in v1: the resolution rule grew extremely complicated trying to look up a reference just once, because it had to parse a view plan and resolve relations within it using the view's context (current database). In contrast, John's new view substitution rules are cleaner and can stay within the substitution batch.

People implementing views would also not benefit from combining the two interfaces:

- There is little overlap between View and Table, only schema and properties.
- Most catalogs won't implement both interfaces, so returning a ViewOrTable is more difficult for implementations.
- TableCatalog assumes that ViewCatalog will be added separately, as John proposes, so we would have to break or replace that API.

I understand the initial appeal of combining TableCatalog and ViewCatalog, since it is done that way in the existing interfaces. But I think Hive chose that mostly because the two were already stored together, not because it made sense for users of the API or any other implementer of the API.

rb
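A minimal sketch of the View shape Ryan describes, assuming method names that may differ from the SPIP: the only overlap with Table is schema and properties, while the rest is the SQL text plus the creation-time context needed to resolve it.

    import java.util
    import org.apache.spark.sql.types.StructType

    // Sketch of a standalone View interface per the discussion above;
    // the SPIP defines the actual proposal and these names are assumptions.
    trait View {
      def name: String
      def query: String                   // view SQL text, parsed during substitution
      def currentCatalog: String          // catalog captured at CREATE VIEW time
      def currentNamespace: Array[String] // namespace captured at CREATE VIEW time
      def schema: StructType              // shared with Table
      def properties: util.Map[String, String] // shared with Table
    }

Under this shape, substitution can parse query and resolve it against currentCatalog and currentNamespace well before table resolution, which is the separation Ryan describes.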
On Tue, Aug 18, 2020 at 9:46 AM John Zhuge <jzh...@apache.org> wrote:

> > AFAIK view schema is only used by DESCRIBE.
>
> Correction: Spark adds a new Project at the top of the plan parsed from
> the view, based on the stored schema, to make sure the view schema
> doesn't change.

Thanks Wenchen! I thought I forgot something :) Yes, it is the validation done in *checkAnalysis*:

    // If the view output doesn't have the same number of columns neither with the child
    // output, nor with the query column names, throw an AnalysisException.
    // If the view's child output can't up cast to the view output,
    // throw an AnalysisException, too.

The view output comes from the schema:

    val child = View(
      desc = metadata,
      output = metadata.schema.toAttributes,
      child = parser.parsePlan(viewText))

So the stored schema is nice to have for validation (here) or as a cache (in DESCRIBE), but it is not "required" and the schema need not be frozen. Thanks Ryan and Burak for pointing that out in the SPIP. I will add a new paragraph accordingly.
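To make Wenchen's correction concrete, here is a simplified sketch (not the exact Spark code) of how a Project built from the stored schema can pin the view's output, reusing the metadata, parser, and viewText values from John's snippet above:

    import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
    import org.apache.spark.sql.catalyst.expressions.{Alias, UpCast}
    import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

    // Pin the view's output to the stored schema: project each stored
    // column over the parsed plan, with an up-cast to the stored type.
    // checkAnalysis then rejects plans whose child output no longer
    // matches in arity or cannot up-cast to these columns.
    val parsed: LogicalPlan = parser.parsePlan(viewText)
    val frozen: LogicalPlan = Project(
      metadata.schema.map { field =>
        Alias(UpCast(UnresolvedAttribute(field.name), field.dataType), field.name)()
      },
      parsed)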