Re: Re: [VOTE] SPARK-44444: Use ANSI SQL mode by default

2024-04-15 Thread Rui Wang
+1, non-binding.

Thanks Dongjoon for driving this!


-Rui

On Mon, Apr 15, 2024 at 10:10 AM Xinrong Meng  wrote:

> +1
>
> Thank you @Dongjoon Hyun  !
>
> On Mon, Apr 15, 2024 at 6:33 AM beliefer  wrote:
>
>> +1
>>
>>
>> On 2024-04-15 15:54:07, "Peter Toth" wrote:
>>
>> +1
>>
>> Wenchen Fan wrote (on Mon, Apr 15, 2024 at 9:08):
>>
>>> +1
>>>
>>> On Sun, Apr 14, 2024 at 6:28 AM Dongjoon Hyun 
>>> wrote:
>>>
 I'll start from my +1.

 Dongjoon.

 On 2024/04/13 22:22:05 Dongjoon Hyun wrote:
 > Please vote on SPARK-44444 to use ANSI SQL mode by default.
 > The technical scope is defined in the following PR which is
 > one line of code change and one line of migration guide.
 >
 > - DISCUSSION:
 > https://lists.apache.org/thread/ztlwoz1v1sn81ssks12tb19x37zozxlz
 > - JIRA: https://issues.apache.org/jira/browse/SPARK-44444
 > - PR: https://github.com/apache/spark/pull/46013
 >
 > The vote is open until April 17th 1AM (PST) and passes
 > if a majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
 >
 > [ ] +1 Use ANSI SQL mode by default
 > [ ] -1 Do not use ANSI SQL mode by default because ...
 >
 > Thank you in advance.
 >
 > Dongjoon
 >

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
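
For context, a minimal sketch of what the proposed default flip means in
practice (an illustrative PySpark example, not part of the vote text; the
session setup is an assumption):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # ANSI mode is controlled by the spark.sql.ansi.enabled configuration;
  # the PR above changes its default value from false to true.
  spark.conf.set("spark.sql.ansi.enabled", "true")

  # With ANSI mode on, invalid operations raise errors instead of silently
  # returning NULL, e.g. integer division by zero:
  spark.sql("SELECT 1/0").collect()   # raises a DIVIDE_BY_ZERO error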




Re: Welcome to Our New Apache Spark Committer and PMCs

2023-10-03 Thread Rui Wang
Congratulations! Well deserved!

-Rui


On Mon, Oct 2, 2023 at 10:32 PM Gengliang Wang  wrote:

> Congratulations to all! Well deserved!
>
> On Mon, Oct 2, 2023 at 10:16 PM Xiao Li  wrote:
>
>> Hi all,
>>
>> The Spark PMC is delighted to announce that we have voted to add one new
>> committer and two new PMC members. These individuals have consistently
>> contributed to the project and have clearly demonstrated their expertise.
>>
>> New Committer:
>> - Jiaan Geng (focusing on Spark Connect and Spark SQL)
>>
>> New PMCs:
>> - Yuanjian Li
>> - Yikun Jiang
>>
>> Please join us in extending a warm welcome to them in their new roles!
>>
>> Sincerely,
>> The Spark PMC
>>
>


Re: [ANNOUNCE] Apache Spark 3.4.0 released

2023-04-19 Thread Rui Wang
Thank you, Xinrong!


-Rui

On Mon, Apr 17, 2023 at 10:05 AM Xinrong Meng 
wrote:

> Thank you, Dongjoon!
>
> On Sat, Apr 15, 2023 at 9:04 AM Dongjoon Hyun 
> wrote:
>
>> Nice catch, Xiao!
>>
>> All `latest` tags are updated to v3.4.0 now.
>>
>> https://hub.docker.com/r/apache/spark/tags
>> https://hub.docker.com/r/apache/spark-py/tags
>> https://hub.docker.com/r/apache/spark-r/tags
>>
>> Dongjoon.
>>
>>
>> On Fri, Apr 14, 2023 at 8:38 PM Xiao Li  wrote:
>>
>>> @Dongjoon Hyun  Thank you!
>>>
>>> Could you also help update the latest tag ?
>>> https://hub.docker.com/r/apache/spark/tags
>>>
>>> Xiao
>>>
>>> Dongjoon Hyun wrote on Fri, Apr 14, 2023 at 16:23:
>>>
 Apache Spark Docker images are published too.

 docker pull apache/spark:v3.4.0
 docker pull apache/spark-py:v3.4.0
 docker pull apache/spark-r:v3.4.0

 Thanks,
 Dongjoon


 On Fri, Apr 14, 2023 at 2:56 PM Dongjoon Hyun 
 wrote:

> Thank you, Xinrong!
>
> Dongjoon.
>
>
> On Fri, Apr 14, 2023 at 1:37 PM Xiao Li  wrote:
>
>> Thank you Xinrong!
>>
>> Congratulations everyone! This is a great release with tons of new
>> features!
>>
>>
>>
>> Gengliang Wang wrote on Fri, Apr 14, 2023 at 13:04:
>>
>>> Congratulations everyone!
>>> Thank you Xinrong for driving the release!
>>>
>>> On Fri, Apr 14, 2023 at 12:47 PM Xinrong Meng <
>>> xinrong.apa...@gmail.com> wrote:
>>>
 Hi All,

 We are happy to announce the availability of *Apache Spark 3.4.0*!

 Apache Spark 3.4.0 is the fifth release of the 3.x line.

 To download Spark 3.4.0, head over to the download page:
 https://spark.apache.org/downloads.html

 To view the release notes:
 https://spark.apache.org/releases/spark-release-3-4-0.html

 We would like to acknowledge all community members for contributing
 to this
 release. This release would not have been possible without you.

 Thanks,

 Xinrong Meng

>>>


Re: The Spark email setting should be updated

2023-04-19 Thread Rui Wang
I am replying now and the default address is dev@spark.apache.org.


-Rui

On Mon, Apr 17, 2023 at 4:27 AM Jia Fan  wrote:

> Hi, everyone.
>
> I find that every time I reply to the dev mailing list, the default reply
> address is the sender of the mail, not dev@spark.apache.org. This caused me
> to think several times that my reply to dev had gone through, but it hadn't.
> This does not seem to be a common problem, because when I reply to emails
> from other communities, the default reply address is d...@xxx.apache.org.
> Can Spark modify the corresponding settings to reduce the chance of
> developers replying incorrectly?
>
> Thanks
>
>
> 
>
>
> Jia Fan
>
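
The behavior requested above is typically implemented with a list-level reply
header; a hypothetical sketch of the relevant RFC 5322 header on a delivered
message (the exact mechanism depends on the list software, which is an
assumption here):

  Reply-To: dev@spark.apache.org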


Re: Time for Spark 3.4.0 release?

2023-01-03 Thread Rui Wang
+1 to cutting the branch on a workday!

Great to see this is happening!

Thanks Xinrong!

-Rui

On Tue, Jan 3, 2023 at 9:21 PM 416161...@qq.com 
wrote:

> +1, thank you Xinrong for driving this release!
>
> --
> Ruifeng Zheng
> ruife...@foxmail.com
>
> 
>
>
>
> -- Original --
> *From:* "Hyukjin Kwon" ;
> *Date:* Wed, Jan 4, 2023 01:15 PM
> *To:* "Xinrong Meng";
> *Cc:* "dev";
> *Subject:* Re: Time for Spark 3.4.0 release?
>
> SGTM +1
>
> On Wed, Jan 4, 2023 at 2:13 PM Xinrong Meng 
> wrote:
>
>> Hi All,
>>
>> Shall we cut *branch-3.4* on *January 16th, 2023*? We proposed January
>> 15th per
>> https://spark.apache.org/versioning-policy.html, but I would suggest we
>> postpone one day since January 15th is a Sunday.
>>
>> I would like to volunteer as the release manager for *Apache Spark 3.4.0*
>> .
>>
>> Thanks,
>>
>> Xinrong Meng
>>
>>


Re: Welcome Yikun Jiang as a Spark committer

2022-10-18 Thread Rui Wang
Well deserved! Congrats!


-Rui

On Mon, Oct 10, 2022 at 9:07 AM Xinrong Meng 
wrote:

> Congratulations, Yikun! Well deserved.
>
> On Sun, Oct 9, 2022 at 9:36 PM John Zhuge  wrote:
>
>> Congratulations, Yikun!
>>
>> On Sun, Oct 9, 2022 at 8:52 PM Senthil Kumar  wrote:
>>
>>> Congratulations Yikun
>>>
>>> On Mon, 10 Oct 2022, 09:11 Xiao Li,  wrote:
>>>
 Congratulations, Yikun!

 Xiao

 Yikun Jiang wrote on Sun, Oct 9, 2022 at 19:34:

> Thank you all!
>
> Regards,
> Yikun
>
>
> On Mon, Oct 10, 2022 at 3:18 AM Chao Sun  wrote:
>
>> Congratulations Yikun!
>>
>> On Sun, Oct 9, 2022 at 11:14 AM vaquar khan 
>> wrote:
>>
>>> Congratulations.
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Sun, Oct 9, 2022, 6:46 AM 叶先进  wrote:
>>>
 Congrats

 On Oct 9, 2022, at 16:44, XiDuo You  wrote:

 Congratulations, Yikun !

 Maxim Gekk wrote on Sun, Oct 9, 2022 at 15:59:

> Keep up the great work, Yikun!
>
> On Sun, Oct 9, 2022 at 10:52 AM Gengliang Wang 
> wrote:
>
>> Congratulations, Yikun!
>>
>> On Sun, Oct 9, 2022 at 12:33 AM 416161...@qq.com <
>> ruife...@foxmail.com> wrote:
>>
>>> Congrats, Yikun!
>>>
>>> --
>>> Ruifeng Zheng
>>> ruife...@foxmail.com
>>>
>>> 
>>>
>>>
>>>
>>> -- Original --
>>> *From:* "Martin Grigorov" ;
>>> *Date:* Sun, Oct 9, 2022 05:01 AM
>>> *To:* "Hyukjin Kwon";
>>> *Cc:* "dev";"Yikun Jiang"<
>>> yikunk...@gmail.com>;
>>> *Subject:* Re: Welcome Yikun Jiang as a Spark committer
>>>
>>> Congratulations, Yikun!
>>>
>>> On Sat, Oct 8, 2022 at 7:41 AM Hyukjin Kwon 
>>> wrote:
>>>
 Hi all,

 The Spark PMC recently added Yikun Jiang as a committer on the
 project.
 Yikun is the major contributor to the infrastructure and GitHub
 Actions in Apache Spark, as well as Kubernetes and PySpark.
 He has put a lot of effort into stabilizing and optimizing the
 builds so we all can work together in Apache Spark more
 efficiently and effectively. He's also driving the SPIP for the
 Docker official image in Apache Spark, for users and
 developers.
 Please join me in welcoming Yikun!


 --
>> John Zhuge
>>
>


Re: [DISCUSS] [Spark SQL, PySpark] Combining StructTypes into a new StructType

2022-08-09 Thread Rui Wang
Thanks for the idea!

I am thinking that the usage of "combined = StructType(a.fields +
b.fields)" is still good because
1) it is not horrible to merge a and b in this way.
2) it clarifies the intention, which is to merge two structs' fields to
construct a new struct.
3) you also have room to apply more complicated operations when merging
fields, for example removing duplicate fields with the same name, or
using a.fields but removing some fields if they are also in b.

Downsides of overloading "+" could be:
1. It's ambiguous what this plus is doing.
2. If you define + as concatenation of the fields, then it's limited to
only doing concatenation. What about other operations, like extracting
fields from a based on b? Maybe overloading "-"? In this case the list
of operators will grow.

-Rui

On Tue, Aug 9, 2022 at 1:10 PM Tim  wrote:

> Hi all,
>
> this is my first message to the Spark mailing list, so please bear with
> me if I don't fully meet your communication standards.
> I just wanted to discuss one aspect that I've stumbled across several
> times over the past few weeks.
> When working with Spark, I often run into the problem of having to merge
> two (or more) existing StructTypes into a new one to define a schema.
> Usually this looks similar (in Python) to the following simplified
> example:
>
>  a = StructType([StructField("field_a", StringType())])
>  b = StructType([StructField("field_b", IntegerType())])
>
>  combined = StructType(a.fields + b.fields)
>
> My idea, which I would like to discuss, is to shorten the above example
> in Python as follows by supporting Python's add operator for
> StructTypes:
>
>  combined = a + b
>
>
> What do you think of this idea? Are there any reasons why this is not
> yet part of StructType's functionality?
> If you support this idea, I could create a first PR for further and
> deeper discussion.
>
> Best
> Tim
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
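
A runnable sketch of the patterns discussed in this thread (a minimal PySpark
example; the dedup rule is one of the refinements Rui mentions, not an
agreed-upon semantic):

  from pyspark.sql.types import IntegerType, StringType, StructField, StructType

  a = StructType([StructField("field_a", StringType())])
  b = StructType([StructField("field_b", IntegerType())])

  # Today's idiom: concatenate the field lists explicitly.
  combined = StructType(a.fields + b.fields)

  # A possible refinement: keep a's fields and add only those fields of b
  # whose names do not already appear in a.
  names_in_a = {f.name for f in a.fields}
  deduped = StructType(a.fields + [f for f in b.fields if f.name not in names_in_a])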


Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Rui Wang
Congrats Xinrong!


-Rui

On Tue, Aug 9, 2022 at 8:57 PM Xingbo Jiang  wrote:

> Congratulations!
>
> Yuanjian Li wrote on Tue, Aug 9, 2022 at 20:31:
>
>> Congratulations, Xinrong!
>>
>> XiDuo You wrote on Tue, Aug 9, 2022 at 19:18:
>>
>>> Congratulations!
>>>
>>> Haejoon Lee wrote on Wed, Aug 10, 2022 at 09:30:
>>> >
>>> > Congrats, Xinrong!!
>>> >
>>> > On Tue, Aug 9, 2022 at 5:12 PM Hyukjin Kwon 
>>> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> The Spark PMC recently added Xinrong Meng as a committer on the
>>> project. Xinrong is the major contributor of PySpark especially Pandas API
>>> on Spark. She has guided a lot of new contributors enthusiastically. Please
>>> join me in welcoming Xinrong!
>>> >>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: Update Spark 3.4 Release Window?

2022-07-21 Thread Rui Wang
+1 for code freeze in Jan 2023 then QA period and then RC in Feb 2023.


-Rui

On Thu, Jul 21, 2022 at 11:46 AM Chao Sun  wrote:

> +1 for Jan 2023 (Code freeze) and Feb 2023 (RC).
>
> Chao
>
> On Thu, Jul 21, 2022 at 11:43 AM L. C. Hsieh  wrote:
> >
> > I'm also +1 for Feb. 2023 (RC) and Jan. 2023 (Code freeze).
> >
> > Liang-Chi
> >
> > On Wed, Jul 20, 2022 at 2:02 PM Dongjoon Hyun 
> wrote:
> > >
> > > I fixed typos :)
> > >
> > > +1 for February 2023 (Release Candidate) and January 2023 (Code
> freeze).
> > >
> > > On 2022/07/20 20:59:30 Dongjoon Hyun wrote:
> > > > Thank you for initiating this discussion, Xinrong. I also agree with
> Sean.
> > > >
> > > > +1 for February 2023 (Release Candidate) and January 2021 (Code
> freeze).
> > > >
> > > > Dongjoon.
> > > >
> > > > On Wed, Jul 20, 2022 at 1:42 PM Sean Owen  wrote:
> > > > >
> > > > > I don't know any better than others when it will actually happen,
> though historically, it's more like 7-8 months between minor releases. I
> might therefore expect a release more like February 2023, and work
> backwards from there. Doesn't really matter, this is just a public guess
> and can be changed.
> > > > >
> > > > > On Wed, Jul 20, 2022 at 3:27 PM Xinrong Meng <
> xinrong.apa...@gmail.com> wrote:
> > > > >>
> > > > >> Hi All,
> > > > >>
> > > > >> Since Spark 3.3.0 was released on June 16, 2022, shall we update
> the release window https://spark.apache.org/versioning-policy.html for
> Spark 3.4?
> > > > >>
> > > > >> A proposal is as follows:
> > > > >>
> > > > >> | October 15th 2022 | Code freeze. Release branch cut.
> > > > >> | Late October 2022 | QA period. Focus on bug fixes, tests,
> stability and docs. Generally, no new features merged.
> > > > >> | November 2022 | Release candidates (RC), voting, etc. until
> final release passes
> > > > >>
> > > > >> Thanks!
> > > > >>
> > > > >> Xinrong Meng
> > > > >>
> > > >
> > > > -
> > > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-14 Thread Rui Wang
There was some further discussion at
https://github.com/apache/spark/pull/37105.

As of now we have agreed on a "soft deprecation" that will
1. document the limitations of the four APIs and suggest alternatives in
the API doc, and
2. not use the @deprecated annotation.

Please let us know if you don't agree.


-Rui

On Fri, Jul 8, 2022 at 11:18 AM Rui Wang  wrote:

> Yes. The current goal is a pure educational deprecation.
>
> So given the proposal:
> 1. existing users, or users who do not care about catalog names in table
> identifiers, can still use all the APIs, which maintain their past behavior.
> 2. new users who intend to use table identifiers with catalog names are
> warned by the annotation (and perhaps additional comments on the API
> surface) that the 4 APIs will not serve their usage.
>
> I believe this proposal is conservative: it does not intend to cause trouble
> for existing users; it does not intend to force user migration; it does not
> intend to delete APIs; it does not intend to hurt supportability. If there
> is anything I can do to make this goal clearer, I will.
>
> Ultimately, the 4 APIs in this thread have the problem that they are not
> compatible with the 3 layer namespace, unlike the other APIs that support
> it. For people who want to include catalog names, the problem will stand
> and we probably have to do something about it.
>
> -Rui
>
> On Fri, Jul 8, 2022 at 7:24 AM Wenchen Fan  wrote:
>
>> It's better to keep all APIs working. But in this case, I really have no
>> idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
>> String, tableName: String) currently checks if table "dbName.tableName"
>> exists in the Hive metastore, and does not work with v2 catalogs at all.
>> It's not only a "not needed" API, but also a confusing API. We need a
>> mechanism to move users away from confusing APIs.
>>
>> I agree that we should not abuse deprecation. I think a general principle
>> to use deprecation is you have the intention to remove it eventually, which
>> is exactly the case here. We should remove these 4 APIs when most users
>> have moved away.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for starting the official discussion, Rui.
>>>
>>> 'Unneeded API' doesn't sound like a good frame for this discussion
>>> because it ignores the existing users and code completely.
>>> Technically, the above mentioned reasons look irrelevant to any
>>> specific existing bugs or future maintenance cost saving. Instead, the
>>> deprecation already causes costs to the community (your PR, the future
>>> migration guide, and the communication with the customers like Q)
>>> and to the users for the actual migration to new API and validations.
>>> Given that, for now, the goal of this proposal looks like a pure
>>> educational purpose to advertise new APIs to Apache Spark 3.4+ users.
>>>
>>> Can we be more conservative at Apache Spark deprecation and allow
>>> users to use both APIs freely without any concern of uncertain
>>> insupportability? I simply want to avoid the situation where the pure
>>> educational deprecation itself becomes `Unneeded Deprecation` in the
>>> community.
>>>
>>> Dongjoon.
>>>
>>> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
>>> >
>>> > I want to highlight in case I missed this in the original email:
>>> >
>>> > The 4 APIs will not be deleted. They will just be marked with deprecation
>>> annotations, and we encourage users to use their alternatives.
>>> >
>>> >
>>> > -Rui
>>> >
>>> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
>>> >>
>>> >> Hi Community,
>>> >>
>>> >> Proposal:
>>> >> I want to discuss a proposal to deprecate the following Catalog API:
>>> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
>>> >> def getTable(dbName: String, tableName: String): Table
>>> >> def getFunction(dbName: String, functionName: String): Function
>>> >> def tableExists(dbName: String, tableName: String): Boolean
>>> >>
>>> >>
>>> >> Context:
>>> >> We have been adding table identifier with catalog name (aka 3 layer
>>> namespace) support to Catalog API in
>>> https://issues.apache.org/jira/browse/SPARK-39235.
>>> >> The basic idea is, if an 

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-08 Thread Rui Wang
Yes. The current goal is a pure educational deprecation.

So given the proposal:
1. existing users, or users who do not care about catalog names in table
identifiers, can still use all the APIs, which maintain their past behavior.
2. new users who intend to use table identifiers with catalog names are
warned by the annotation (and perhaps additional comments on the API
surface) that the 4 APIs will not serve their usage.

I believe this proposal is conservative: it does not intend to cause trouble
for existing users; it does not intend to force user migration; it does not
intend to delete APIs; it does not intend to hurt supportability. If there
is anything I can do to make this goal clearer, I will.

Ultimately, the 4 APIs in this thread have the problem that they are not
compatible with the 3 layer namespace, unlike the other APIs that support
it. For people who want to include catalog names, the problem will stand
and we probably have to do something about it.

-Rui

On Fri, Jul 8, 2022 at 7:24 AM Wenchen Fan  wrote:

> It's better to keep all APIs working. But in this case, I really have no
> idea how to make these 4 APIs reasonable. For example, tableExists(dbName:
> String, tableName: String) currently checks if table "dbName.tableName"
> exists in the Hive metastore, and does not work with v2 catalogs at all.
> It's not only a "not needed" API, but also a confusing API. We need a
> mechanism to move users away from confusing APIs.
>
> I agree that we should not abuse deprecation. I think a general principle
> to use deprecation is you have the intention to remove it eventually, which
> is exactly the case here. We should remove these 4 APIs when most users
> have moved away.
>
> Thanks,
> Wenchen
>
> On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun 
> wrote:
>
>> Thank you for starting the official discussion, Rui.
>>
>> 'Unneeded API' doesn't sound like a good frame for this discussion
>> because it ignores the existing users and code completely.
>> Technically, the above mentioned reasons look irrelevant to any
>> specific existing bugs or future maintenance cost saving. Instead, the
>> deprecation already causes costs to the community (your PR, the future
>> migration guide, and the communication with the customers like Q)
>> and to the users for the actual migration to new API and validations.
>> Given that, for now, the goal of this proposal looks like a pure
>> educational purpose to advertise new APIs to Apache Spark 3.4+ users.
>>
>> Can we be more conservative at Apache Spark deprecation and allow
>> users to use both APIs freely without any concern of uncertain
>> insupportability? I simply want to avoid the situation where the pure
>> educational deprecation itself becomes `Unneeded Deprecation` in the
>> community.
>>
>> Dongjoon.
>>
>> On Thu, Jul 7, 2022 at 2:26 PM Rui Wang  wrote:
>> >
>> > I want to highlight in case I missed this in the original email:
>> >
>> > The 4 APIs will not be deleted. They will just be marked with deprecation
>> annotations, and we encourage users to use their alternatives.
>> >
>> >
>> > -Rui
>> >
>> > On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:
>> >>
>> >> Hi Community,
>> >>
>> >> Proposal:
>> >> I want to discuss a proposal to deprecate the following Catalog API:
>> >> def listColumns(dbName: String, tableName: String): Dataset[Column]
>> >> def getTable(dbName: String, tableName: String): Table
>> >> def getFunction(dbName: String, functionName: String): Function
>> >> def tableExists(dbName: String, tableName: String): Boolean
>> >>
>> >>
>> >> Context:
>> >> We have been adding table identifier with catalog name (aka 3 layer
>> namespace) support to Catalog API in
>> https://issues.apache.org/jira/browse/SPARK-39235.
>> >> The basic idea is, if an API accepts:
>> >> 1. only tableName: String, we allow it to accept "a.b.c", which goes
>> through the analyzer, treating a as the catalog name, b as the namespace
>> name, and c as the table name.
>> >> 2. only dbName: String, we allow it to accept "a.b", which goes through
>> the analyzer, treating a as the catalog name and b as the namespace name.
>> >> Meanwhile we still maintain backwards compatibility for such APIs
>> to make sure past behavior remains the same. E.g. if you only use tableName,
>> it is still recognized by the session catalog.
>> >>
>> >> With this effort ongoing, the above 4 APIs become not fully compatible
>> with the 3 layer namespace.
>> >>
>> >> use table

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-07 Thread Rui Wang
I want to highlight in case I missed this in the original email:

The 4 APIs will not be deleted. They will just be marked with deprecation
annotations, and we encourage users to use their alternatives.


-Rui

On Thu, Jul 7, 2022 at 2:23 PM Rui Wang  wrote:

> Hi Community,
>
> Proposal:
> I want to discuss a proposal to deprecate the following Catalog API:
> def listColumns(dbName: String, tableName: String): Dataset[Column]
> def getTable(dbName: String, tableName: String): Table
> def getFunction(dbName: String, functionName: String): Function
> def tableExists(dbName: String, tableName: String): Boolean
>
>
> Context:
> We have been adding table identifier with catalog name (aka 3 layer
> namespace) support to Catalog API in
> https://issues.apache.org/jira/browse/SPARK-39235.
> The basic idea is, if an API accepts:
> 1. only tableName: String, we allow it to accept "a.b.c", which goes
> through the analyzer, treating a as the catalog name, b as the namespace
> name, and c as the table name.
> 2. only dbName: String, we allow it to accept "a.b", which goes through the
> analyzer, treating a as the catalog name and b as the namespace name.
> Meanwhile we still maintain backwards compatibility for such APIs to
> make sure past behavior remains the same. E.g. if you only use tableName, it
> is still recognized by the session catalog.
>
> With this effort ongoing, the above 4 APIs become not fully
> compatible with the 3 layer namespace.
>
> Take tableExists(dbName: String, tableName: String) as an example: it
> takes two parameters but leaves no room for the extra catalog name.
> Also, if we want to reuse the two parameters, which one should take more
> than one name part?
>
>
> How?
> So how do we improve the above 4 APIs? There are two options:
> a. Expand the four APIs to accept catalog names. For
> example, tableExists(catalogName: String, dbName: String, tableName:
> String).
> b. Mark the four APIs as `deprecated`.
>
> I am proposing to follow option B which does API deprecation.
>
> Why?
> 1. Reduce unneeded APIs. The existing APIs can support the same behavior
> given SPARK-39235. For example, tableExists(dbName, tableName) can be
> replaced by tableExists("dbName.tableName").
> 2. Reduce incomplete APIs. The APIs proposed for deprecation do not support
> the 3 layer namespace now, and it is hard to make them do so (where would
> the 3 name parts go?).
> 3. Deprecation encourages users to migrate their usage to the new APIs.
> 4. There is existing precedent: we deprecated the CreateExternalTable API
> when adding the CreateTable API:
> https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220
>
>
> What do you think?
>
> Thanks,
> Rui Wang
>
>
>


[DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-07 Thread Rui Wang
Hi Community,

Proposal:
I want to discuss a proposal to deprecate the following Catalog API:
def listColumns(dbName: String, tableName: String): Dataset[Column]
def getTable(dbName: String, tableName: String): Table
def getFunction(dbName: String, functionName: String): Function
def tableExists(dbName: String, tableName: String): Boolean


Context:
We have been adding table identifier with catalog name (aka 3 layer
namespace) support to Catalog API in
https://issues.apache.org/jira/browse/SPARK-39235.
The basic idea is, if an API accepts:
1. only tableName: String, we allow it to accept "a.b.c", which goes
through the analyzer, treating a as the catalog name, b as the namespace
name, and c as the table name.
2. only dbName: String, we allow it to accept "a.b", which goes through the
analyzer, treating a as the catalog name and b as the namespace name.
Meanwhile we still maintain backwards compatibility for such APIs to
make sure past behavior remains the same. E.g. if you only use tableName, it
is still recognized by the session catalog.

With this effort ongoing, the above 4 APIs become not fully compatible with
the 3 layer namespace.

Take tableExists(dbName: String, tableName: String) as an example: it
takes two parameters but leaves no room for the extra catalog name.
Also, if we want to reuse the two parameters, which one should take more
than one name part?


How?
So how do we improve the above 4 APIs? There are two options:
a. Expand the four APIs to accept catalog names. For
example, tableExists(catalogName: String, dbName: String, tableName:
String).
b. Mark the four APIs as `deprecated`.

I am proposing to follow option B which does API deprecation.

Why?
1. Reduce unneeded APIs. The existing APIs can support the same behavior
given SPARK-39235. For example, tableExists(dbName, tableName) can be
replaced by tableExists("dbName.tableName").
2. Reduce incomplete APIs. The APIs proposed for deprecation do not support
the 3 layer namespace now, and it is hard to make them do so (where would
the 3 name parts go?).
3. Deprecation encourages users to migrate their usage to the new APIs.
4. There is existing precedent: we deprecated the CreateExternalTable API
when adding the CreateTable API:
https://github.com/apache/spark/blob/7dcb4bafd02dd43213d3cc4a936c170bda56ddc5/sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala#L220


What do you think?

Thanks,
Rui Wang
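
A minimal PySpark sketch of the migration this proposal suggests (the
catalog, database, and table names are illustrative assumptions, and the
qualified-name forms presume the support added by SPARK-39235):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # The Scala signature proposed for deprecation is the two-parameter form
  # tableExists(dbName: String, tableName: String).
  #
  # The suggested alternative passes a single qualified name instead:
  spark.catalog.tableExists("my_db.my_table")           # session catalog
  spark.catalog.tableExists("my_cat.my_db.my_table")    # catalog.namespace.table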