Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-14 Thread Rui Wang
There were some extra discussions that happened at
https://github.com/apache/spark/pull/37105.

As of now we agreed to have a "soft deprecation" that
1. document the limitation of four API and suggest to use alternatives in
the API doc
2. do not use the @deprecation annotation.

Please let us know if you don't agree.


-Rui

On Fri, Jul 8, 2022 at 11:18 AM Rui Wang  wrote:

> Yes. The current goal is a pure educational deprecation.
>
> So given the proposal:
> 1. existing users or users who do not care about catalog names in table
> identifiers can still use all the API that maintain their past behavior.
> 2. new users who intend to use table identifiers with catalog names
> get warned by the annotation (and maybe a bit more comments on the API
> surface) that 4 API will not serve their usage.
>
> I believe this proposal is conservative: do not intend to cause troubles
> for existing users; do not intend to force user migration; do not intend to
> delete APIs; do not intend to hurt supportability. If there is anything
> that we can make this goal clear, I can do it.
>
> Ultimately, the 4 API in this thread has the problem that it is not
> compatible with 3 layer namespace thus not the same as other API who is
> supporting 3 layer namespace. For people who want to include catalog names,
> the problem itself will stand and we probably have to do something about
> it.
>
> -Rui
>
> On Fri, Jul 8, 2022 at 7:24 AM Wenchen Fan  wrote:
>
>> It's better to keep all APIs working. But in this case, I really have no
>> idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
>> String, tableName: String) currently checks if table "dbName.tableName"
>> exists in the Hive metastore, and does not work with v2 catalogs at all.
>> It's not only a "not needed" API, but also a confusing API. We need a
>> mechanism to move users away from confusing APIs.
>>
>> I agree that we should not abuse deprecation. I think a general principle
>> to use deprecation is you have the intention to remove it eventually, which
>> is exactly the case here. We should remove these 4 APIs when most users
>> have moved away.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for starting the official discussion, Rui.
>>>
>>> 'Unneeded API' doesn't sound like a good frame for this discussion
>>> because it ignores the existing users and codes completely.
>>> Technically, the above mentioned reasons look irrelevant to any
>>> specific existing bugs or future maintenance cost saving. Instead, the
>>> deprecation already causes costs to the community (your PR, the future
>>> migration guide, and the communication with the customers like Q&A)
>>> and to the users for the actual migration to new API and validations.
>>> Given that, for now, the goal of this proposal looks like a pure
>>> educational purpose to advertise new APIs to Apache Spark 3.4+ users.
>>>
>>> Can we be more conservative at Apache Spark deprecation and allow
>>> users to use both APIs freely without any concern of uncertain
>>> insupportability? I simply want to avoid the situation where the pure
>>> educational deprecation itself becomes `Unneeded Deprecation` in the
>>> community.
>>>
>>> Dongjoon.
>>>
>>> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
>>> >
>>> > I want to highlight in case I missed this in the original email:
>>> >
>>> > The 4 API will not be deleted. They will just be marked as deprecated
>>> annotations and we encourage users to use their alternatives.
>>> >
>>> >
>>> > -Rui
>>> >
>>> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
>>> >>
>>> >> Hi Community,
>>> >>
>>> >> Proposal:
>>> >> I want to discuss a proposal to deprecate the following Catalog API:
>>> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
>>> >> def getTable(dbName: String, tableName: String): Table
>>> >> def getFunction(dbName: String, functionName: String): Function
>>> >> def tableExists(dbName: String, tableName: String): Boolean
>>> >>
>>> >>
>>> >> Context:
>>> >> We have been adding table identifier with catalog name (aka 3 layer
>>> namespace) support to Catalog API in
>>> https://issues.apache.org/jira/browse/SPARK-39235.
>>> >> The basic idea is, if an API accepts:
>>> >> 1. only tableName:String, we allow it accepts "a.b.c" and goes
>>> analyzer which treats a as catalog name, b namespace name and c table name.
>>> >> 2. only dbName:String, we allow it accepts "a.b" and goes analyzer
>>> which treats a as catalog name, b namespace name.
>>> >> Meanwhile we still maintain the backwards compatibility for such API
>>> to make sure past behavior remains the same. E.g. If you only use tableName
>>> it is still recognized by the session catalog.
>>> >>
>>> >> With this effort ongoing, the above 4 API becomes not fully
>>> compatible with the 3 layer namespace.
>>> >>
>>> >> use tableExists(dbName: String, tableName: String) as an example,
>>> given that it takes two parameters but leaves n

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-08 Thread Rui Wang
Yes. The current goal is a pure educational deprecation.

So given the proposal:
1. existing users or users who do not care about catalog names in table
identifiers can still use all the API that maintain their past behavior.
2. new users who intend to use table identifiers with catalog names
get warned by the annotation (and maybe a bit more comments on the API
surface) that 4 API will not serve their usage.

I believe this proposal is conservative: do not intend to cause troubles
for existing users; do not intend to force user migration; do not intend to
delete APIs; do not intend to hurt supportability. If there is anything
that we can make this goal clear, I can do it.

Ultimately, the 4 API in this thread has the problem that it is not
compatible with 3 layer namespace thus not the same as other API who is
supporting 3 layer namespace. For people who want to include catalog names,
the problem itself will stand and we probably have to do something about
it.

-Rui

On Fri, Jul 8, 2022 at 7:24 AM Wenchen Fan  wrote:

> It's better to keep all APIs working. But in this case, I really have no
> idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
> String, tableName: String) currently checks if table "dbName.tableName"
> exists in the Hive metastore, and does not work with v2 catalogs at all.
> It's not only a "not needed" API, but also a confusing API. We need a
> mechanism to move users away from confusing APIs.
>
> I agree that we should not abuse deprecation. I think a general principle
> to use deprecation is you have the intention to remove it eventually, which
> is exactly the case here. We should remove these 4 APIs when most users
> have moved away.
>
> Thanks,
> Wenchen
>
> On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun 
> wrote:
>
>> Thank you for starting the official discussion, Rui.
>>
>> 'Unneeded API' doesn't sound like a good frame for this discussion
>> because it ignores the existing users and codes completely.
>> Technically, the above mentioned reasons look irrelevant to any
>> specific existing bugs or future maintenance cost saving. Instead, the
>> deprecation already causes costs to the community (your PR, the future
>> migration guide, and the communication with the customers like Q&A)
>> and to the users for the actual migration to new API and validations.
>> Given that, for now, the goal of this proposal looks like a pure
>> educational purpose to advertise new APIs to Apache Spark 3.4+ users.
>>
>> Can we be more conservative at Apache Spark deprecation and allow
>> users to use both APIs freely without any concern of uncertain
>> insupportability? I simply want to avoid the situation where the pure
>> educational deprecation itself becomes `Unneeded Deprecation` in the
>> community.
>>
>> Dongjoon.
>>
>> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
>> >
>> > I want to highlight in case I missed this in the original email:
>> >
>> > The 4 API will not be deleted. They will just be marked as deprecated
>> annotations and we encourage users to use their alternatives.
>> >
>> >
>> > -Rui
>> >
>> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
>> >>
>> >> Hi Community,
>> >>
>> >> Proposal:
>> >> I want to discuss a proposal to deprecate the following Catalog API:
>> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
>> >> def getTable(dbName: String, tableName: String): Table
>> >> def getFunction(dbName: String, functionName: String): Function
>> >> def tableExists(dbName: String, tableName: String): Boolean
>> >>
>> >>
>> >> Context:
>> >> We have been adding table identifier with catalog name (aka 3 layer
>> namespace) support to Catalog API in
>> https://issues.apache.org/jira/browse/SPARK-39235.
>> >> The basic idea is, if an API accepts:
>> >> 1. only tableName:String, we allow it accepts "a.b.c" and goes
>> analyzer which treats a as catalog name, b namespace name and c table name.
>> >> 2. only dbName:String, we allow it accepts "a.b" and goes analyzer
>> which treats a as catalog name, b namespace name.
>> >> Meanwhile we still maintain the backwards compatibility for such API
>> to make sure past behavior remains the same. E.g. If you only use tableName
>> it is still recognized by the session catalog.
>> >>
>> >> With this effort ongoing, the above 4 API becomes not fully compatible
>> with the 3 layer namespace.
>> >>
>> >> use tableExists(dbName: String, tableName: String) as an example,
>> given that it takes two parameters but leaves no room for the extra catalog
>> name. Also if we want to reuse the two parameters, which one will be the
>> one that takes more than one name part?
>> >>
>> >>
>> >> How?
>> >> So how to improve the above 4 API? There are two options:
>> >> a. Expand those four API to let those API accept catalog names. For
>> example, tableExists(catalogName: String, dbName: String, tableName:
>> String).
>> >> b. mark those API as `deprecated`.
>> >>
>> >> I am proposing to follow option B which does API de

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-08 Thread Wenchen Fan
It's better to keep all APIs working. But in this case, I really have no
idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
String, tableName: String) currently checks if table "dbName.tableName"
exists in the Hive metastore, and does not work with v2 catalogs at all.
It's not only a "not needed" API, but also a confusing API. We need a
mechanism to move users away from confusing APIs.

I agree that we should not abuse deprecation. I think a general principle
to use deprecation is you have the intention to remove it eventually, which
is exactly the case here. We should remove these 4 APIs when most users
have moved away.

Thanks,
Wenchen

On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun 
wrote:

> Thank you for starting the official discussion, Rui.
>
> 'Unneeded API' doesn't sound like a good frame for this discussion
> because it ignores the existing users and codes completely.
> Technically, the above mentioned reasons look irrelevant to any
> specific existing bugs or future maintenance cost saving. Instead, the
> deprecation already causes costs to the community (your PR, the future
> migration guide, and the communication with the customers like Q&A)
> and to the users for the actual migration to new API and validations.
> Given that, for now, the goal of this proposal looks like a pure
> educational purpose to advertise new APIs to Apache Spark 3.4+ users.
>
> Can we be more conservative at Apache Spark deprecation and allow
> users to use both APIs freely without any concern of uncertain
> insupportability? I simply want to avoid the situation where the pure
> educational deprecation itself becomes `Unneeded Deprecation` in the
> community.
>
> Dongjoon.
>
> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
> >
> > I want to highlight in case I missed this in the original email:
> >
> > The 4 API will not be deleted. They will just be marked as deprecated
> annotations and we encourage users to use their alternatives.
> >
> >
> > -Rui
> >
> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
> >>
> >> Hi Community,
> >>
> >> Proposal:
> >> I want to discuss a proposal to deprecate the following Catalog API:
> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
> >> def getTable(dbName: String, tableName: String): Table
> >> def getFunction(dbName: String, functionName: String): Function
> >> def tableExists(dbName: String, tableName: String): Boolean
> >>
> >>
> >> Context:
> >> We have been adding table identifier with catalog name (aka 3 layer
> namespace) support to Catalog API in
> https://issues.apache.org/jira/browse/SPARK-39235.
> >> The basic idea is, if an API accepts:
> >> 1. only tableName:String, we allow it accepts "a.b.c" and goes analyzer
> which treats a as catalog name, b namespace name and c table name.
> >> 2. only dbName:String, we allow it accepts "a.b" and goes analyzer
> which treats a as catalog name, b namespace name.
> >> Meanwhile we still maintain the backwards compatibility for such API to
> make sure past behavior remains the same. E.g. If you only use tableName it
> is still recognized by the session catalog.
> >>
> >> With this effort ongoing, the above 4 API becomes not fully compatible
> with the 3 layer namespace.
> >>
> >> use tableExists(dbName: String, tableName: String) as an example, given
> that it takes two parameters but leaves no room for the extra catalog name.
> Also if we want to reuse the two parameters, which one will be the one that
> takes more than one name part?
> >>
> >>
> >> How?
> >> So how to improve the above 4 API? There are two options:
> >> a. Expand those four API to let those API accept catalog names. For
> example, tableExists(catalogName: String, dbName: String, tableName:
> String).
> >> b. mark those API as `deprecated`.
> >>
> >> I am proposing to follow option B which does API deprecation.
> >>
> >> Why?
> >> 1. Reduce unneeded API. The existing API can support the same behavior
> given SPARK-39235. For example, tableExists(dbName, tableName) can be
> replaced to use tableExists("dbName.tableName").
> >> 2. Reduce incomplete API. The proposed API to deprecate does not
> support 3 layer namespace now, and it is hard to do so (where to take 3
> part names)?
> >> 3. Deprecation suggests users to migrate their usage on API.
> >> 4. There was existing practice that we deprecated CreateExternalTable
> API when adding CreateTable API:
> https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
> >>
> >>
> >> What do you think?
> >>
> >> Thanks,
> >> Rui Wang
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-07 Thread Dongjoon Hyun
Thank you for starting the official discussion, Rui.

'Unneeded API' doesn't sound like a good frame for this discussion
because it ignores the existing users and codes completely.
Technically, the above mentioned reasons look irrelevant to any
specific existing bugs or future maintenance cost saving. Instead, the
deprecation already causes costs to the community (your PR, the future
migration guide, and the communication with the customers like Q&A)
and to the users for the actual migration to new API and validations.
Given that, for now, the goal of this proposal looks like a pure
educational purpose to advertise new APIs to Apache Spark 3.4+ users.

Can we be more conservative at Apache Spark deprecation and allow
users to use both APIs freely without any concern of uncertain
insupportability? I simply want to avoid the situation where the pure
educational deprecation itself becomes `Unneeded Deprecation` in the
community.

Dongjoon.

On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
>
> I want to highlight in case I missed this in the original email:
>
> The 4 API will not be deleted. They will just be marked as deprecated 
> annotations and we encourage users to use their alternatives.
>
>
> -Rui
>
> On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
>>
>> Hi Community,
>>
>> Proposal:
>> I want to discuss a proposal to deprecate the following Catalog API:
>> def listColumns(dbName: String, tableName: String): Dataset[Column]
>> def getTable(dbName: String, tableName: String): Table
>> def getFunction(dbName: String, functionName: String): Function
>> def tableExists(dbName: String, tableName: String): Boolean
>>
>>
>> Context:
>> We have been adding table identifier with catalog name (aka 3 layer 
>> namespace) support to Catalog API in 
>> https://issues.apache.org/jira/browse/SPARK-39235.
>> The basic idea is, if an API accepts:
>> 1. only tableName:String, we allow it accepts "a.b.c" and goes analyzer 
>> which treats a as catalog name, b namespace name and c table name.
>> 2. only dbName:String, we allow it accepts "a.b" and goes analyzer which 
>> treats a as catalog name, b namespace name.
>> Meanwhile we still maintain the backwards compatibility for such API to make 
>> sure past behavior remains the same. E.g. If you only use tableName it is 
>> still recognized by the session catalog.
>>
>> With this effort ongoing, the above 4 API becomes not fully compatible with 
>> the 3 layer namespace.
>>
>> use tableExists(dbName: String, tableName: String) as an example, given that 
>> it takes two parameters but leaves no room for the extra catalog name. Also 
>> if we want to reuse the two parameters, which one will be the one that takes 
>> more than one name part?
>>
>>
>> How?
>> So how to improve the above 4 API? There are two options:
>> a. Expand those four API to let those API accept catalog names. For example, 
>> tableExists(catalogName: String, dbName: String, tableName: String).
>> b. mark those API as `deprecated`.
>>
>> I am proposing to follow option B which does API deprecation.
>>
>> Why?
>> 1. Reduce unneeded API. The existing API can support the same behavior given 
>> SPARK-39235. For example, tableExists(dbName, tableName) can be replaced to 
>> use tableExists("dbName.tableName").
>> 2. Reduce incomplete API. The proposed API to deprecate does not support 3 
>> layer namespace now, and it is hard to do so (where to take 3 part names)?
>> 3. Deprecation suggests users to migrate their usage on API.
>> 4. There was existing practice that we deprecated CreateExternalTable API 
>> when adding CreateTable API: 
>> https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
>>
>>
>> What do you think?
>>
>> Thanks,
>> Rui Wang
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-07 Thread Rui Wang
I want to highlight in case I missed this in the original email:

The 4 API will not be deleted. They will just be marked as deprecated
annotations and we encourage users to use their alternatives.


-Rui

On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:

> Hi Community,
>
> Proposal:
> I want to discuss a proposal to deprecate the following Catalog API:
> def listColumns(dbName: String, tableName: String): Dataset[Column]
> def getTable(dbName: String, tableName: String): Table
> def getFunction(dbName: String, functionName: String): Function
> def tableExists(dbName: String, tableName: String): Boolean
>
>
> Context:
> We have been adding table identifier with catalog name (aka 3 layer
> namespace) support to Catalog API in
> https://issues.apache.org/jira/browse/SPARK-39235.
> The basic idea is, if an API accepts:
> 1. only tableName:String, we allow it accepts "a.b.c" and
> goes analyzer which treats a as catalog name, b namespace name and c table
> name.
> 2. only dbName:String, we allow it accepts "a.b" and goes analyzer which
> treats a as catalog name, b namespace name.
> Meanwhile we still maintain the backwards compatibility for such API to
> make sure past behavior remains the same. E.g. If you only use tableName it
> is still recognized by the session catalog.
>
> With this effort ongoing, the above 4 API becomes not fully
> compatible with the 3 layer namespace.
>
> use tableExists(dbName: String, tableName: String) as an example, given
> that it takes two parameters but leaves no room for the extra catalog name.
> Also if we want to reuse the two parameters, which one will be the one that
> takes more than one name part?
>
>
> How?
> So how to improve the above 4 API? There are two options:
> a. Expand those four API to let those API accept catalog names. For
> example, tableExists(catalogName: String, dbName: String, tableName:
> String).
> b. mark those API as `deprecated`.
>
> I am proposing to follow option B which does API deprecation.
>
> Why?
> 1. Reduce unneeded API. The existing API can support the same behavior
> given SPARK-39235. For example, tableExists(dbName, tableName) can be
> replaced to use tableExists("dbName.tableName").
> 2. Reduce incomplete API. The proposed API to deprecate does not support 3
> layer namespace now, and it is hard to do so (where to take 3 part names)?
> 3. Deprecation suggests users to migrate their usage on API.
> 4. There was existing practice that we deprecated CreateExternalTable API
> when adding CreateTable API:
> https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
>
>
> What do you think?
>
> Thanks,
> Rui Wang
>
>
>


[DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-07 Thread Rui Wang
Hi Community,

Proposal:
I want to discuss a proposal to deprecate the following Catalog API:
def listColumns(dbName: String, tableName: String): Dataset[Column]
def getTable(dbName: String, tableName: String): Table
def getFunction(dbName: String, functionName: String): Function
def tableExists(dbName: String, tableName: String): Boolean


Context:
We have been adding table identifier with catalog name (aka 3 layer
namespace) support to Catalog API in
https://issues.apache.org/jira/browse/SPARK-39235.
The basic idea is, if an API accepts:
1. only tableName:String, we allow it accepts "a.b.c" and
goes analyzer which treats a as catalog name, b namespace name and c table
name.
2. only dbName:String, we allow it accepts "a.b" and goes analyzer which
treats a as catalog name, b namespace name.
Meanwhile we still maintain the backwards compatibility for such API to
make sure past behavior remains the same. E.g. If you only use tableName it
is still recognized by the session catalog.

With this effort ongoing, the above 4 API becomes not fully compatible with
the 3 layer namespace.

use tableExists(dbName: String, tableName: String) as an example, given
that it takes two parameters but leaves no room for the extra catalog name.
Also if we want to reuse the two parameters, which one will be the one that
takes more than one name part?


How?
So how to improve the above 4 API? There are two options:
a. Expand those four API to let those API accept catalog names. For
example, tableExists(catalogName: String, dbName: String, tableName:
String).
b. mark those API as `deprecated`.

I am proposing to follow option B which does API deprecation.

Why?
1. Reduce unneeded API. The existing API can support the same behavior
given SPARK-39235. For example, tableExists(dbName, tableName) can be
replaced to use tableExists("dbName.tableName").
2. Reduce incomplete API. The proposed API to deprecate does not support 3
layer namespace now, and it is hard to do so (where to take 3 part names)?
3. Deprecation suggests users to migrate their usage on API.
4. There was existing practice that we deprecated CreateExternalTable API
when adding CreateTable API:
https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220


What do you think?

Thanks,
Rui Wang