Re: [CONNECT] New Clients for Go and Rust

2023-05-26 Thread Maciej
It might be a good idea to have a discussion about how new connect 
clients fit into the overall process we have. In particular:


 * Under what conditions do we consider adding a new language to the
   official channels?  What process do we follow?
 * What guarantees do we offer with respect to these clients? Is adding a
   new client the same type of commitment as for the core API? In other
   words, do we commit to maintaining such clients "forever" or do we
   separate the "official" and "contrib" clients, with the latter being
   governed by the ASF, but not guaranteed to be maintained in the future?
 * Do we follow the same release schedule as for the core project, or
   rather release each client separately, after the main release is
   completed?

Also, the elephant in the room is the future of the current API in Spark 
4 and onwards. As useful as Connect is, it is not exactly a replacement 
for many existing deployments. Furthermore, it doesn't make extending 
Spark much easier and the current ecosystem is, subjectively speaking, a 
bit brittle.


--
Best regards,
Maciej


On 5/26/23 07:26, Martin Grund wrote:
Thanks everyone for your feedback! I will work on figuring out what it 
takes to get started with a repo for the Go client.


On Thu, May 25, 2023 at 21:51 Chao Sun wrote:

+1 on separate repo too

On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun wrote:
>
> +1 for starting on a separate repo.
>
> Dongjoon.
>
> On Thu, May 25, 2023 at 9:53 AM yangjie01 wrote:
>>
>> +1 on starting this with a separate repo.
>>
>> Which new clients can be placed in the main repo should be
>> discussed after they are mature enough.
>>
>>
>>
>> Yang Jie
>>
>>
>>
>> From: Denny Lee
>> Date: Wednesday, May 24, 2023, 21:31
>> To: Hyukjin Kwon
>> Cc: Maciej, "dev@spark.apache.org"
>> Subject: Re: [CONNECT] New Clients for Go and Rust
>>
>>
>>
>> +1 on separate repo allowing different APIs to run at different
>> speeds and ensuring they get community support.
>>
>>
>>
>> On Wed, May 24, 2023 at 00:37 Hyukjin Kwon wrote:
>>
>> I think we can just start this with a separate repo.
>> I am fine with the second option too, but in this case we would
>> have to triage which languages to add to the main repo.
>>
>>
>>
>> On Fri, 19 May 2023 at 22:28, Maciej wrote:
>>
>> Hi,
>>
>>
>>
>> Personally, I'm strongly against the second option and have
>> some preference towards the third one (or maybe a mix of the first
>> one and the third one).
>>
>>
>>
>> The project is already pretty large as-is and, with an
>> extremely conservative approach towards removal of APIs, it only
>> tends to grow over time. Making it even larger is not going to
>> make things more maintainable and is likely to create an entry
>> barrier for new contributors (that's similar to Jia's arguments).
>>
>>
>>
>> Moreover, we've seen quite a few different language clients
>> over the years, and only one or two have survived, none of them
>> particularly active, as far as I'm aware.  Taking responsibility
>> for more clients, without being sure that we have resources to
>> maintain them and there is enough community around them to make
>> such effort worthwhile, doesn't seem like a good idea.
>>
>>
>>
>> --
>>
>> Best regards,
>>
>> Maciej Szymkiewicz
>>
>>
>>
>> Web: https://zero323.net
>>
>> PGP: A30CEF0C31A501EC
>>
>>
>>
>>
>>
>> On 5/19/23 14:57, Jia Fan wrote:
>>
>> Hi,
>>
>>
>>
>> Thanks for the contribution!
>>
>> I prefer (1), for several reasons:
>>
>>
>>
>> 1. Separate repositories can maintain independent versions and
>> release schedules, and can ship bug-fix releases faster.
>>
>>
>>
>> 2. Different languages have different build tools. Putting them
>> in one repository will make the main repository more and more
>> complicated, and it will become extremely difficult to perform a
>> complete build in the main repository.
>>
>>
>>
>> 3. Separate repositories make CI configuration and execution
>> easier, and the PR and commit lists will be clearer.
>>
>>
>>
>> 4. Other projects also govern their clients in separate
>> repositories. ClickHouse, for example, uses separate repositories
>> for JDBC, ODBC, and C++. Please refer to:
>>
>> https://github.com/ClickHouse/clickhouse-java
>>
>> https://github.com/ClickHouse/clickhouse-odbc
>>
>> https://github.com/ClickHouse/clickhouse-cpp
>>
>>
>>
>> PS: I'm looking forward to the JavaScript Connect client!
>>
>>
>>
>> Thanks and regards,
>>
>> Jia Fan
>>
>>
>>
>> On Fri, May 19, 2023 at 20:03, Martin Grund wrote:
>>
>> Hi folks,
>>
>>
>>
>> When Bo 

Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-26 Thread Maciej
Weren't some of these functions provided only for compatibility and 
intentionally left out of the language APIs?


--
Best regards,
Maciej

On 5/25/23 23:21, Hyukjin Kwon wrote:
I don't think it'd be a release blocker. I think we can implement 
them across multiple releases.


On Fri, May 26, 2023 at 1:01 AM Dongjoon Hyun wrote:


Thank you for the proposal.

I'm wondering if we are going to consider them as release blockers
or not.

In general, I don't think making those SQL functions available
in all languages should be a release blocker
(especially in R or new Spark Connect languages like Go and Rust).

If they are not release blockers, we may allow some existing or
future community PRs only before feature freeze (= branch cut).

Thanks,
Dongjoon.


On Wed, May 24, 2023 at 7:09 PM Jia Fan wrote:

+1
It is important that different APIs can be used to call the
same function.

On Thu, May 25, 2023 at 01:48, Ryan Berti wrote:

During my recent experience developing functions, I found
that identifying the locations (sql + connect
functions.scala + functions.py, FunctionRegistry, +
whatever is required for R) and the standards for adding
function signatures was not straightforward (should you
use optional args or overload functions? which col/lit
helpers should be used when?). Are there docs describing
all of the locations + standards for defining a function?
If not, that'd be great to have too.

Ryan Berti






On Wed, May 24, 2023 at 12:44 AM Enrico Minack wrote:

+1

Functions available in SQL (or, more generally, in any one API)
should be available in all APIs. I am very much in
favor of this.

Enrico


On 2023-05-24 at 09:41, Hyukjin Kwon wrote:


Hi all,

I would like to discuss adding all SQL functions to the
Scala, Python and R APIs.
Around 175 SQL functions do not exist in the Scala,
Python and R APIs.
For example, we don’t have
|pyspark.sql.functions.percentile| but you can invoke
it as a SQL function, e.g., |SELECT percentile(...)|.

The reason why we do not have all functions in the
first place is that we wanted to
add only commonly used functions; see also
https://github.com/apache/spark/pull/21318 (which I
agreed with at the time).

However, this has been raised multiple times over
the years, by the OSS community, on the dev mailing list,
in JIRAs, on Stack Overflow, etc.
It seems to be confusing which functions are
available and which are not.

Yes, we have a workaround. We can call all
expressions by |expr("...")| or |call_udf("...",
Columns ...)|.
Still, this does not seem very user-friendly,
because users expect these functions to be available
under the functions namespace.

Therefore, I would like to propose adding all
expressions to all languages so that Spark is
simpler and less confusing, e.g., about which functions
are in the functions namespace and which are not.

Any thoughts?






