Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-31 Thread Hyukjin Kwon
Thanks all. I created a JIRA at
https://issues.apache.org/jira/browse/SPARK-43907.

On Mon, 29 May 2023 at 09:12, Hyukjin Kwon  wrote:

> Yes, some were cases like you mentioned.
> But I found myself explaining that reason to a lot of people, not only
> developers but also users - I was asked at conferences, over email, on Slack,
> both internally and externally.
> Then I realised that maybe we're doing something wrong. This is based on my
> experience, so I wanted to open a discussion and see what others think about
> this :-).
>
>
>
>
> On Sat, 27 May 2023 at 00:19, Maciej  wrote:
>
>> Weren't some of these functions provided only for compatibility and
>> intentionally left out of the language APIs?
>>
>> --
>> Best regards,
>> Maciej
>>
>> On 5/25/23 23:21, Hyukjin Kwon wrote:
>>
>> I don't think it'd be a release blocker ... I think we can implement them
>> across multiple releases.
>>
>> On Fri, May 26, 2023 at 1:01 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you for the proposal.
>>>
>>> I'm wondering if we are going to consider them as release blockers or
>>> not.
>>>
>>> In general, I don't think those SQL functions should be available in all
>>> languages as release blockers.
>>> (Especially in R or new Spark Connect languages like Go and Rust).
>>>
>>> If they are not release blockers, we may allow some existing or future
>>> community PRs only before feature freeze (= branch cut).
>>>
>>> Thanks,
>>> Dongjoon.
>>>
>>>
>>> On Wed, May 24, 2023 at 7:09 PM Jia Fan  wrote:
>>>
 +1
 It is important that different APIs can be used to call the same
 function.

 Ryan Berti  wrote on Thu, 25 May 2023 at 01:48:

> During my recent experience developing functions, I found that
> identifying locations (sql + connect functions.scala + functions.py,
> FunctionRegistry, + whatever is required for R) and standards for adding
> function signatures was not straightforward (should you use optional args
> or overload functions? which col/lit helpers should be used when?). Are
> there docs describing all of the locations + standards for defining a
> function? If not, that'd be great to have too.
>
> Ryan Berti
>
> Senior Data Engineer  |  Ads DE
>
>
>
>
> On Wed, May 24, 2023 at 12:44 AM Enrico Minack 
> wrote:
>
>> +1
>>
>> Functions available in SQL (more general in one API) should be
>> available in all APIs. I am very much in favor of this.
>>
>> Enrico
>>
>>
>> On 24.05.2023 at 09:41, Hyukjin Kwon wrote:
>>
>> Hi all,
>>
>> I would like to discuss adding all SQL functions into the Scala, Python
>> and R APIs.
>> We have around 175 SQL functions that do not exist in Scala, Python and R.
>> For example, we don’t have pyspark.sql.functions.percentile, but you
>> can invoke it as a SQL function, e.g., SELECT percentile(...).
>>
>> The reason we do not have all functions in the first place is that we
>> wanted to add only commonly used functions; see also
>> https://github.com/apache/spark/pull/21318 (which I agreed with at the
>> time).
>>
>> However, this has been raised multiple times over the years, from the OSS
>> community, the dev mailing list, JIRAs, Stack Overflow, etc.
>> It seems confusing which functions are available and which are not.
>>
>> Yes, we have a workaround: we can call any expression via expr("...")
>> or call_udf("...", Columns ...).
>> But it's still not very user-friendly, because users expect these functions
>> to be available under the functions namespace.
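A minimal Scala sketch of the workaround described above, using percentile as the example of a SQL-only function (the functions.percentile helper mentioned at the end is hypothetical at the time of this thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SqlOnlyFunctionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sql-only-fn").getOrCreate()
    val df = spark.range(100).toDF("v")

    // Today: percentile is registered only as a SQL expression, so it has to go
    // through expr() (or plain SQL) instead of a typed functions.* helper.
    df.select(expr("percentile(v, 0.5)").as("median")).show()

    df.createOrReplaceTempView("t")
    spark.sql("SELECT percentile(v, 0.5) AS median FROM t").show()

    // Proposed: expose the same function directly under the functions namespace,
    // e.g. a hypothetical functions.percentile(col("v"), lit(0.5)).
    spark.stop()
  }
}
```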
>>
>> Therefore, I would like to propose adding all of these expressions to all
>> languages so that Spark is simpler and less confusing about which
>> functions are available in the functions namespace and which are not.
>>
>> Any thoughts?
>>
>>
>>
>>


Apache Spark 4.0 Timeframe?

2023-05-31 Thread Dongjoon Hyun
Hi, All.

I'd like to propose that we start preparing Apache Spark 4.0 after creating
branch-3.5 on July 16th.

- https://spark.apache.org/versioning-policy.html

Historically, Apache Spark releases have had the following timeframes,
and we already have a Spark 3.5 plan which will be maintained up to 2026.

Spark 1: 2014.05 (1.0.0) ~ 2016.11 (1.6.3)
Spark 2: 2016.07 (2.0.0) ~ 2021.05 (2.4.8)
Spark 3: 2020.06 (3.0.0) ~ 2026.xx (3.5.x)
Spark 4: 2024.06 (4.0.0, NEW)

As we discussed in the previous email thread, `Apache Spark 3.5.0
Expectations`, we cannot deliver some features without Apache Spark 4.

- "I wonder if it’s safer to do it in Spark 4 (which I believe will be
discussed soon)."
- "I would make it the default at 4.0, myself."

Although there are other features as well, let's focus on the Scala language
support history.

Spark 2.0: SPARK-6363 Make Scala 2.11 the default Scala version (2016.07)
Spark 3.0: SPARK-25956 Make Scala 2.12 as default Scala version in Spark
3.0 (2020.06)

In addition, the Scala community released Scala 3.3.0 LTS yesterday.

- https://scala-lang.org/blog/2023/05/30/scala-3.3.0-released.html

If we decide to start, I believe we can support Scala 2.13 or Scala 3.3
next year with Apache Spark 4 while supporting Spark 3.4 and 3.5 for Scala
2.12 users.

WDYT?

Thanks,
Dongjoon.


Re: [CONNECT] New Clients for Go and Rust

2023-05-31 Thread bo yang
Just saw the discussions here! I really appreciate Martin and other folks
helping on my previous Golang Spark Connect PR (
https://github.com/apache/spark/pull/41036)!

Great to see we have a new repo for the Spark Connect Go client.
Thanks Hyukjin!
I am thinking of migrating my PR to this new repo. I would like to hear any
feedback or suggestions before I open the new PR :)

Thanks,
Bo



On Tue, May 30, 2023 at 3:38 AM Martin Grund 
wrote:

> Hi folks,
>
> Thanks a lot for the help from Hyukjin! We've created
> https://github.com/apache/spark-connect-go as the first contrib
> repository for Spark Connect under the Apache Spark project. We will move
> the development of the Golang client to this repository and make it very
> clear in the README file that this is an experimental client.
>
> Looking forward to all your contributions!
>
> On Tue, May 30, 2023 at 11:50 AM Martin Grund 
> wrote:
>
>> I think it makes sense to split this discussion into two pieces. On the
>> contribution side, my personal perspective is that these new clients are
>> explicitly marked as experimental and unsupported until we deem them mature
>> enough to be supported using the standard release process etc. However, the
>> goal should be that the main contributors of these clients aim to
>> follow the same release and maintenance schedule. I think we should
>> encourage the community to contribute to the Spark Connect clients, and as
>> such we should explicitly not make it unnecessarily hard to get started
>> (and for that reason we reserve the right to abandon a client).
>>
>> How exactly the release schedule is going to look will probably require
>> some experimentation, because it's a new area for Spark and its
>> ecosystem. I don't think it requires us to have all the answers upfront.
>>
>> > Also, an elephant in the room is the future of the current API in Spark
>> 4 and onwards. As useful as connect is, it is not exactly a replacement for
>> many existing deployments. Furthermore, it doesn't make extending Spark
>> much easier and the current ecosystem is, subjectively speaking, a bit
>> brittle.
>>
>> The goal of Spark Connect is not to replace the way users are currently
>> deploying Spark; it's not meant to be that. Users should continue deploying
>> Spark in exactly the way they prefer. Spark Connect brings more
>> interactivity and connectivity to Spark. While Spark Connect extends Spark,
>> most new language consumers will not try to extend Spark, but simply
>> provide the existing surface in their native language. So the goal is not
>> so much extensibility as availability. For example, I believe it
>> would be awesome if the Livy community found a way to integrate with
>> Spark Connect and provide the routing capabilities for a stable DNS
>> endpoint in front of all the different Spark deployments.
>>
>> > [...] the current ecosystem is, subjectively speaking, a bit brittle.
>>
>> Can you help me understand that a bit better? Do you mean the Spark
>> ecosystem or the Spark Connect ecosystem?
>>
>>
>>
>> Martin
>>
>>
>> On Fri, May 26, 2023 at 5:39 PM Maciej  wrote:
>>
>>> It might be a good idea to have a discussion about how new connect
>>> clients fit into the overall process we have. In particular:
>>>
>>> - Under what conditions do we consider adding a new language to the
>>>   official channels? What process do we follow?
>>> - What guarantees do we offer with respect to these clients? Is adding
>>>   a new client the same type of commitment as for the core API? In other
>>>   words, do we commit to maintaining such clients "forever", or do we
>>>   separate "official" and "contrib" clients, with the latter being
>>>   governed by the ASF but not guaranteed to be maintained in the future?
>>> - Do we follow the same release schedule as for the core project, or
>>>   rather release each client separately, after the main release is
>>>   completed?
>>>
>>> Also, an elephant in the room is the future of the current API in Spark
>>> 4 and onwards. As useful as connect is, it is not exactly a replacement for
>>> many existing deployments. Furthermore, it doesn't make extending Spark
>>> much easier and the current ecosystem is, subjectively speaking, a bit
>>> brittle.
>>>
>>> --
>>> Best regards,
>>> Maciej
>>>
>>>
>>> On 5/26/23 07:26, Martin Grund wrote:
>>>
>>> Thanks everyone for your feedback! I will work on figuring out what it
>>> takes to get started with a repo for the go client.
>>>
>>> On Thu 25. May 2023 at 21:51 Chao Sun  wrote:
>>>
 +1 on separate repo too

 On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun 
 wrote:
 >
 > +1 for starting on a separate repo.
 >
 > Dongjoon.
 >
 > On Thu, May 25, 2023 at 9:53 AM yangjie01 
 wrote:
 >>
 >> +1 on starting this with a separate repo.
 >>
 >> Which new clients can be placed in the main repo should be discussed
 after they are mature enough.
 >>
 >>
 

Re: Apache Spark 3.5.0 Expectations (?)

2023-05-31 Thread Dongjoon Hyun
Thank you all for your replies.

1. Thank you, Jia, for those JIRAs.

2. Sounds great for "Scala 2.13 for Spark 4.0". I'll initiate a new thread
for that.
  - "I wonder if it’s safer to do it in Spark 4 (which I believe will be
discussed soon)."
  - "I would make it the default at 4.0, myself."
  - "Shall we initiate a new discussion thread for Scala 2.13 by default?"

3. Thanks. Did you try the pre-built one, Mich?
  - "I spent a day compiling  Spark 3.4.0 code against Scala 2.13.8 with
maven"

  -
https://downloads.apache.org/spark/spark-3.2.4/spark-3.2.4-bin-hadoop3.2-scala2.13.tgz
  -
https://downloads.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3-scala2.13.tgz
  -
https://downloads.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3-scala2.13.tgz

4. Good suggestion, Bjørn. Instead, we had better add daily jobs (like we do
for Java 17), because Apache Spark 3.4 already added Python 3.11 support via
SPARK-41454.
- "First, we are currently conducting tests with Python versions 3.8 and
3.9."
- "Should we consider replacing 3.9 with 3.11?"

5. For Guava, I'm also tracking the ongoing discussion.

Thanks,
Dongjoon.


Spark writing API

2023-05-31 Thread Andrew Melo
Hi all

For some time I've been developing a Spark DSv2 plugin, "Laurelin" (
https://github.com/spark-root/laurelin
), to read the ROOT (https://root.cern) file format (which is used in high
energy physics). I recently presented my work at a conference (
https://indico.jlab.org/event/459/contributions/11603/).

All of that to say,

A) Is there any reason the built-in (e.g. Parquet) data sources can't
consume the external APIs? It's hard to write a plugin that has to use a
specific API when you're competing with another source that gets access to
the internals directly.

B) What is the Spark-approved API to code against for writes? There is a
mess of *ColumnWriter classes in the Java namespace, and since there is no
documentation, it's unclear which one is preferred by the core (maybe
ArrowWriterColumnVector?). We could provide a zero-copy write if the API
described it.

C) Putting everything above aside, is there a way to hint to downstream
writers how many rows they should expect? Any smart writer will use
off-heap memory to write to disk/memory, so the current API, which pushes
rows in one at a time, doesn't do the trick. You don't want to keep
reallocating buffers constantly.
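For reference, a minimal sketch (assuming the Spark 3.x DSv2 interfaces; this is not Laurelin's actual code) of the row-at-a-time write path being described, where the writer never learns how many rows to expect and so can only grow its buffer on demand:

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// Rows arrive one at a time via write(); there is no up-front row-count hint,
// so any buffering strategy has to reallocate as the data grows.
class BufferingDataWriter(partitionId: Int, taskId: Long) extends DataWriter[InternalRow] {
  private val buffer = ArrayBuffer.empty[InternalRow]

  override def write(record: InternalRow): Unit = buffer += record.copy()

  override def commit(): WriterCommitMessage = {
    // Encode `buffer` into the target format and flush it to storage here.
    new WriterCommitMessage {}
  }

  override def abort(): Unit = buffer.clear()

  override def close(): Unit = buffer.clear()
}
```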

D) What is Spark's plan for using Arrow-based columnar data representations? I
see that there are a lot of external efforts whose only option is to inject
themselves into the CLASSPATH. The regular DSv2 API is already crippled for
reads, and for writes it's even worse. Is there a commitment from the Spark
core to bring the API to parity? Or is it instead just a YMMV commitment?

Thanks!
Andrew





-- 
It's dark in this basement.


Re: Late materialization?

2023-05-31 Thread Alex Cruise
Just to clarify briefly, in hopes that future searchers will find this
thread... ;)

IIUC at the moment, partition pruning and column pruning are
all-or-nothing: every partition and every column either is, or is not, used
for a query.

Late materialization would mean that only the values needed for filtering &
aggregation would be read in the scan+filter stages, and any expressions
requested by the user but not needed for filtering and aggregation would
only be read/computed afterward.
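As a concrete illustration of the kind of query this would help, a sketch assuming a hypothetical wide Parquet table where only `ts` and `user_id` feed the filter and aggregation, while `payload` is merely projected:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LateMaterializationShape {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("late-mat").getOrCreate()
    import spark.implicits._

    // Hypothetical wide table: only `ts` and `user_id` are needed to filter and
    // aggregate; `payload` is requested by the user but, with late materialization,
    // could be fetched after the filter has discarded most rows instead of during
    // the initial scan.
    val events = spark.read.parquet("/data/events") // hypothetical path

    events
      .filter($"ts" > lit("2023-01-01"))
      .groupBy($"user_id")
      .agg(count(lit(1)).as("n"), first($"payload").as("sample_payload"))
      .show()

    spark.stop()
  }
}
```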

I can see how this will invite sequential consistency problems in data
sources where mutations like DML or compactions are happening behind the
query's back, but presumably Spark users already have this class of
problem; it's just less serious when the end-to-end execution time of a
query is shorter.

WDYT?

-0xe1a

On Wed, May 31, 2023 at 11:03 AM Alex Cruise  wrote:

> Hey folks, I'm building a Spark connector for my company's proprietary
> data lake... That project is going fine despite the near total lack of
> documentation. ;)
>
> In parallel, I'm also trying to figure out a better story for when humans
> inevitably `select * from 100_trillion_rows`, glance at the first page,
> then walk away forever. The traditional RDBMS approach seems to be to keep
> a lot of state in server-side cursors, so they can eagerly fetch only the
> first few pages of results and go to sleep until the user advances the
> cursor, at which point we wake up and fetch a few more pages.
>
> After some cursory googling about how Trino handles this nightmare
> scenario, I found https://github.com/trinodb/trino/issues/49 and its
> child https://github.com/trinodb/trino/pull/602, which appear to be based
> on the paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf, which is
> what HyPerDB (never open source, acquired by Tableau) was based on.
>
> IIUC this kind of optimization isn't really feasible in Spark at present,
> due to the sharp distinction between transforms, which are always lazy, and
> actions, which are always eager. However, given the very desirable
> performance/efficiency benefits, I think it's worth starting this
> conversation: if we wanted to do something like this, where would we start?
>
> Thanks!
>
> -0xe1a
>


Late materialization?

2023-05-31 Thread Alex Cruise
Hey folks, I'm building a Spark connector for my company's proprietary data
lake... That project is going fine despite the near total lack of
documentation. ;)

In parallel, I'm also trying to figure out a better story for when humans
inevitably `select * from 100_trillion_rows`, glance at the first page,
then walk away forever. The traditional RDBMS approach seems to be to keep
a lot of state in server-side cursors, so they can eagerly fetch only the
first few pages of results and go to sleep until the user advances the
cursor, at which point we wake up and fetch a few more pages.

After some cursory googling about how Trino handles this nightmare
scenario, I found https://github.com/trinodb/trino/issues/49 and its child
https://github.com/trinodb/trino/pull/602, which appear to be based on the
paper http://www.vldb.org/pvldb/vol4/p539-neumann.pdf, which is what
HyPerDB (never open source, acquired by Tableau) was based on.

IIUC this kind of optimization isn't really feasible in Spark at present,
due to the sharp distinction between transforms, which are always lazy, and
actions, which are always eager. However, given the very desirable
performance/efficiency benefits, I think it's worth starting this
conversation: if we wanted to do something like this, where would we start?
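For readers less familiar with the transform/action split mentioned above, a minimal illustration of standard Spark behaviour:

```scala
import org.apache.spark.sql.SparkSession

object LazyVsEager {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lazy-vs-eager").getOrCreate()

    // Transforms are lazy: nothing is read or computed yet.
    val filtered = spark.range(1000000L).filter("id % 7 = 0")

    // Actions are eager: the whole plan runs to completion here, which is why a
    // cursor-style "fetch a page, then pause" flow doesn't map naturally onto it.
    println(filtered.count())

    spark.stop()
  }
}
```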

Thanks!

-0xe1a


Re: Apache Spark 3.5.0 Expectations (?)

2023-05-31 Thread Bjørn Jørgensen
@Cheng Pan

https://issues.apache.org/jira/browse/HIVE-22126

On Wed, 31 May 2023 at 03:58, Cheng Pan  wrote:

> @Bjørn Jørgensen
>
> I did some investigation on upgrading Guava after Spark dropped Hadoop 2
> support, but unfortunately Hive still depends on it. The worse thing is
> that Guava's classes are marked as shared in IsolatedClientLoader[1],
> which means Spark cannot upgrade Guava, even after upgrading the built-in
> Hive from the current 2.3.9 to a newer version that does not stick to an
> old Guava, without breaking old versions of the Hive Metastore client.
>
> I couldn't find any clues as to why Guava classes need to be marked as
> shared - can anyone provide some background?
>
> [1]
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L215
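Paraphrased, the shared-class handling at that location looks roughly like the sketch below (a sketch, not the literal source): class names matching certain prefixes, including com.google.*, are served by Spark's own classloader rather than the isolated Hive client classloader, which is what pins Guava.

```scala
// A paraphrased sketch, not the literal Spark source: IsolatedClientLoader treats
// classes whose names match a set of prefixes as "shared", i.e. loaded from Spark's
// classloader instead of the isolated Hive client classloader.
object SharedClassSketch {
  def isSharedClass(name: String): Boolean =
    name.startsWith("org.slf4j") ||
      name.startsWith("org.apache.log4j") ||
      name.startsWith("org.apache.spark.") ||
      name.startsWith("scala.") ||
      name.startsWith("com.google") // <- Guava marked as shared, which pins its version

  def main(args: Array[String]): Unit = {
    // Guava classes resolve as shared, so the Hive client sees Spark's Guava version.
    println(isSharedClass("com.google.common.cache.CacheBuilder"))     // true
    println(isSharedClass("org.apache.hadoop.hive.ql.metadata.Hive"))  // false
  }
}
```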
>
> Thanks,
> Cheng Pan
>
>
> > On May 31, 2023, at 03:49, Bjørn Jørgensen 
> wrote:
> >
> > @Dongjoon Hyun Thank you.
> >
> > I have two points to discuss.
> > First, we are currently conducting tests with Python versions 3.8 and
> 3.9.
> > Should we consider replacing 3.9 with 3.11?
> >
> > Secondly, I'd like to know the status of Google Guava.
> > With Hadoop version 2 no longer being utilized, is there any other
> factor that is posing a blockage for this?
> >
> > On Tue, 30 May 2023 at 10:39, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
> > I don't know whether it is related, but Scala 2.12.17 is fine for the
> Spark 3 family (compile and run). I spent a day compiling Spark 3.4.0
> code against Scala 2.13.8 with Maven and was getting all sorts of weird and
> wonderful errors at runtime.
> >
> > HTH
> >
> > Mich Talebzadeh,
> > Lead Solutions Architect/Engineering Lead
> > Palantir Technologies Limited
> > London
> > United Kingdom
> >
> >
> >
> > On Tue, 30 May 2023 at 01:59, Jungtaek Lim 
> wrote:
> > Shall we initiate a new discussion thread for Scala 2.13 by default?
> While I'm not an expert in this area, it sounds like the change is major
> and (probably) breaking. It seems to be worth having a separate discussion
> thread rather than just treating it like one of 25 items.
> >
> > On Tue, May 30, 2023 at 9:54 AM Sean Owen  wrote:
> > It does seem risky; there are still likely libs out there that don't
> cross compile for 2.13. I would make it the default at 4.0, myself.
> >
> > On Mon, May 29, 2023 at 7:16 PM Hyukjin Kwon 
> wrote:
> > While I support going forward with a higher version, actually using
> Scala 2.13 by default is a big deal, especially because:
> > • Users would likely download the built-in version assuming that
> it’s backward binary compatible.
> > • PyPI doesn't allow specifying the Scala version, meaning that
> users wouldn’t have a way to 'pip install pyspark' based on Scala 2.12.
> > I wonder if it’s safer to do it in Spark 4 (which I believe will be
> discussed soon).
> >
> >
> > On Mon, 29 May 2023 at 13:21, Jia Fan  wrote:
> > Thanks Dongjoon!
> > There are some tickets I want to share:
> > SPARK-39420 Support ANALYZE TABLE on v2 tables
> > SPARK-42750 Support INSERT INTO by name
> > SPARK-43521 Support CREATE TABLE LIKE FILE
> >
> > Dongjoon Hyun  wrote on Mon, 29 May 2023 at 08:42:
> > Hi, All.
> >
> > Apache Spark 3.5.0 is scheduled for August (1st Release Candidate), and
> currently a few notable things are under discussion on the mailing list.
> >
> > I believe it's a good time to share a short summary list (containing
> both completed and in-progress items) to highlight them in advance and to
> collect your targets too.
> >
> > Please share your expectations or working items if you want the
> community to prioritize them in the Apache Spark 3.5.0 timeframe.
> >
> > (Sorted by ID)
> > SPARK-40497 Upgrade Scala 2.13.11
> > SPARK-42452 Remove hadoop-2 profile from Apache Spark 3.5.0
> > SPARK-42913 Upgrade to Hadoop 3.3.5 (aws-java-sdk-bundle: 1.12.262 ->
> 1.12.316)
> > SPARK-43024 Upgrade Pandas to 2.0.0
> > SPARK-43200 Remove Hadoop 2 reference in docs
> > SPARK-43347 Remove Python 3.7 Support
> > SPARK-43348 Support Python 3.8 in PyPy3
> > SPARK-43351 Add Spark Connect Go prototype code and example
> > SPARK-43379 Deprecate old Java 8 versions prior to 8u371
> > SPARK-43394 Upgrade to Maven 3.8.8
> > SPARK-43436 Upgrade to RocksDbjni 8.1.1.1
> > SPARK-43446 Upgrade to Apache Arrow 12.0.0
> > SPARK-43447 Support R 4.3.0
> > SPARK-43489 Remove protobuf 2.5.0
> > SPARK-43519 Bump Parquet to 1.13.1
> > SPARK-43581 Upgrade kubernetes-client to 6.6.2
> > SPARK-43588 Upgrade to ASM 9.5
> > SPARK-43600 Update K8s doc to recommend K8s