Thanks Ryan.

The ViewCatalog API mimics the TableCatalog API, including how the shared
namespace is handled:

   - The doc for createView
   <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R109>
   states "it will throw ViewAlreadyExistsException when a view or table
   already exists for the identifier."
   - The doc for loadView
   <https://github.com/apache/spark/pull/28147/files#diff-24f7e7a09707492d3e65d549002e5849R75>
   states "If the catalog supports tables and contains a table for the
   identifier and not a view, this must throw NoSuchViewException."

Agree it is good to explicitly specify the order of resolution. I will add
a section in ViewCatalog javadoc to summarize the behavior for "shared
namespace". The loadView doc will also be updated to spell out the order of
resolution.
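
To make the shared-namespace rules above concrete, here is a minimal
in-memory sketch, written in Python for brevity rather than against the
actual Java interfaces. Everything in it is a hypothetical stand-in: the
class DualCatalogSketch, its method names, the resolve helper, and the
exception classes only mirror the behavior quoted from the createView and
loadView docs, plus the views-first resolution order discussed in this
thread. It is not the PR's implementation.

```python
class ViewAlreadyExistsError(Exception):
    """Stand-in for ViewAlreadyExistsException."""


class TableAlreadyExistsError(Exception):
    """Stand-in for TableAlreadyExistsException."""


class NoSuchViewError(Exception):
    """Stand-in for NoSuchViewException."""


class NoSuchRelationError(Exception):
    """Hypothetical: neither a view nor a table exists for the identifier."""


class DualCatalogSketch:
    """Hypothetical catalog that holds tables and views in one namespace."""

    def __init__(self):
        self._tables = {}  # identifier -> table schema (assumed: a string)
        self._views = {}   # identifier -> view SQL text

    def create_view(self, ident, sql):
        # Fails when a view OR a table already exists for the identifier.
        if ident in self._views or ident in self._tables:
            raise ViewAlreadyExistsError(ident)
        self._views[ident] = sql

    def create_table(self, ident, schema):
        # Symmetric rule: fails when a table OR a view already exists.
        if ident in self._tables or ident in self._views:
            raise TableAlreadyExistsError(ident)
        self._tables[ident] = schema

    def load_view(self, ident):
        # A table with this identifier does not count as a view.
        if ident not in self._views:
            raise NoSuchViewError(ident)
        return self._views[ident]

    def resolve(self, ident):
        # Assumed resolution order: views first, then tables.
        if ident in self._views:
            return ("view", self._views[ident])
        if ident in self._tables:
            return ("table", self._tables[ident])
        raise NoSuchRelationError(ident)
```

With these rules a well-behaved catalog never holds a table and a view under
the same identifier, so the views-first order in resolve only matters for a
catalog that fails to enforce them.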

On Thu, Aug 13, 2020 at 1:41 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> I agree with Wenchen that we need to be clear about resolution and
> behavior. For example, I think that we would agree that CREATE VIEW
> catalog.schema.name should fail when there is a table named
> catalog.schema.name. We’ve already included this behavior in the
> documentation for the TableCatalog API
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/TableCatalog.html#createTable-org.apache.spark.sql.connector.catalog.Identifier-org.apache.spark.sql.types.StructType-org.apache.spark.sql.connector.expressions.Transform:A-java.util.Map->,
> where create should fail if a view exists for the identifier.
>
> I think it was simply assumed that we would use the same approach — the
> API requires that table and view names share a namespace. But it would be
> good to specifically note either the order in which resolution will happen
> (views are resolved first) or note that it is not allowed and behavior is
> not guaranteed. I prefer the first option.
>
> On Wed, Aug 12, 2020 at 5:14 PM John Zhuge <jzh...@apache.org> wrote:
>
>> Hi Wenchen,
>>
>> Thanks for the feedback!
>>
>> 1. Add a new View API. How to avoid name conflicts between table and
>>> view? When resolving relation, shall we lookup table catalog first or view
>>> catalog?
>>
>>
>>  See clarification in SPIP section "Proposed Changes - Namespace":
>>
>>    - The proposed new view substitution rule and the changes to
>>    ResolveCatalogs should ensure the view catalog is looked up first for a
>>    "dual" catalog.
>>    - The implementation for a "dual" catalog plugin should ensure:
>>       -  Creating a view in view catalog when a table of the same name
>>       exists should fail.
>>       -  Creating a table in table catalog when a view of the same name
>>       exists should fail as well.
>>
>> Agree with you that a new View API is more flexible. A couple of notes:
>>
>>    - We actually started a common view prototype using the single
>>    catalog approach, but as we added more and more view metadata, storing
>>    it in table properties became unmanageable, especially for features
>>    like "versioning". Eventually we opted for a view backend of S3 JSON
>>    files.
>>    - We'd like to move away from Hive metastore
>>
>> For more details and discussion, see SPIP section "Background and
>> Motivation".
>>
>> Thanks,
>> John
>>
>> On Wed, Aug 12, 2020 at 10:15 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> Hi John,
>>>
>>> Thanks for working on this! View support is very important to the
>>> catalog plugin API.
>>>
>>> After reading your doc, I have one high-level question: should views be a
>>> separate API, or are they just a special type of table?
>>>
>>> AFAIK in most databases, tables and views share the same namespace. You
>>> can't create a view if a same-name table exists. In Hive, view is just a
>>> special type of table, so they are in the same namespace naturally. If we
>>> have both table catalog and view catalog, we need a mechanism to make sure
>>> there are no name conflicts.
>>>
>>> On the other hand, the view metadata is simple enough to be put in
>>> table properties. I'd like to see more thoughts to evaluate these 2
>>> approaches:
>>> 1. *Add a new View API*. How to avoid name conflicts between table and
>>> view? When resolving relation, shall we lookup table catalog first or view
>>> catalog?
>>> 2. *Reuse the Table API*. How to indicate it's a view? What if we do
>>> want to store table and views separately?
>>>
>>> I think a new View API is more flexible. I'd vote for it if we can come
>>> up with a good mechanism to avoid name conflicts.
>>>
>>> On Wed, Aug 12, 2020 at 6:20 AM John Zhuge <jzh...@apache.org> wrote:
>>>
>>>> Hi Spark devs,
>>>>
>>>> I'd like to bring more attention to this SPIP. As Dongjoon indicated in
>>>> the email "Apache Spark 3.1 Feature Expectation (Dec. 2020)", this feature
>>>> can be considered for 3.2 or even 3.1.
>>>>
>>>> View catalog builds on top of the catalog plugin system introduced in
>>>> DataSourceV2. It adds the “ViewCatalog” API to load, create, alter, and
>>>> drop views. A catalog plugin can naturally implement both ViewCatalog and
>>>> TableCatalog.
>>>>
>>>> Our internal implementation has been in production for over 8 months.
>>>> Recently we extended it to support materialized views, for the read path
>>>> initially.
>>>>
>>>> The PR has conflicts that I will resolve shortly.
>>>>
>>>> Thanks,
>>>>
>>>> On Wed, Apr 22, 2020 at 12:24 AM John Zhuge <jzh...@apache.org> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> In order to disassociate view metadata from Hive Metastore and support
>>>>> different storage backends, I am proposing a new view catalog API to load,
>>>>> create, alter, and drop views.
>>>>>
>>>>> Document:
>>>>> https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing
>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-31357
>>>>> WIP PR: https://github.com/apache/spark/pull/28147
>>>>>
>>>>> As part of a project to support common views across query engines like
>>>>> Spark and Presto, my team used the view catalog API in Spark
>>>>> implementation. The project has been in production over three months.
>>>>>
>>>>> Thanks,
>>>>> John Zhuge
>>>>>
>>>>
>>>>
>>>> --
>>>> John Zhuge
>>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
John Zhuge
