Re: push-based external shuffle service on K8S - Spark 4.0? Earlier versions?

2024-06-06 Thread Ye Zhou
Hi Ofir.
Right now, push-based shuffle within Spark is only supported for Spark on
YARN, with the external shuffle service running as an auxiliary service in
the NodeManager; it is not supported natively on K8s.
As far as I know, there are no current plans to add native support for Spark
on K8s.

For question 2, are you looking for how to set up push-based shuffle for
Spark on YARN or Spark on K8s? For Spark on YARN, as documented here,
you need to enable the merged shuffle file manager in the shuffle service,
and also enable push-based shuffle on the client.
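
For reference, here is a minimal client-side sketch (the config names are the
ones I believe the Spark 3.2+ docs use; the merged shuffle file manager itself
is configured on the NodeManager side):

    import org.apache.spark.sql.SparkSession

    // Minimal client-side sketch for push-based shuffle on YARN (Spark 3.2+).
    val spark = SparkSession.builder()
      .master("yarn")
      .config("spark.shuffle.service.enabled", "true") // external shuffle service is required
      .config("spark.shuffle.push.enabled", "true")    // opt in to push-based shuffle
      .getOrCreate()

    // On the server side, each NodeManager's Spark shuffle service needs the
    // merged shuffle file manager enabled, e.g. (per my reading of the docs):
    //   spark.shuffle.push.server.mergedShuffleFileManagerImpl =
    //     org.apache.spark.network.shuffle.RemoteBlockPushResolver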

Thanks.
Ye.

On Thu, Jun 6, 2024 at 7:28 AM Keyong Zhou  wrote:

> Hi Ofir,
>
> I can provide some information about use cases for Apache Celeborn.
>
> Apache Celeborn can be deployed on K8s or standalone; both modes are widely
> used in production by users. The largest cluster I know of contains
> more than 1,000 Celeborn workers.
>
> Celeborn is especially beneficial for large-scale shuffle with high
> parallelism, which usually causes long fetch wait times or even fetch
> failures. We have seen several-times speedups for jobs with large-scale
> shuffle.
>
> Besides, with Celeborn, Spark on K8s can achieve better Dynamic Resource
> Allocation, because executors don't need to store shuffle data locally and
> the pods don't need large disk space.
>
> Celeborn is relatively easy to operate, especially thanks to its graceful
> rolling upgrades and backward compatibility (across two successive versions).
>
> You can find more information, including user feedback, here [1]. I
> recommend trying it out, and the community is happy to help :)
>
> Regards,
> Keyong Zhou
>
> [1]
> https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn
>
> On 2024/06/06 09:08:31 Ofir Manor wrote:
> > Hi,
> > Regarding the external shuffle service on K8S and especially the
> push-based variant that was merged in 3.2:
> >
> >   1.
> > Are there plans to make it supported and work out-of-the-box in 4.0?
> >   2.
> > Did anyone make it work for themselves in 3.5 or earlier? If so, can you
> share your experience and what was needed to make it work?
> >
> > As a fallback, is anyone using one of the newer shuffle projects with K8S,
> > such as Apache Uniffle or Apache Celeborn, who can share some feedback?
> > Performance, stability, added complexity, etc.?
> > Thanks,
> >Ofir
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 

*Zhou, Ye  **周晔*


Re: push-based external shuffle service on K8S - Spark 4.0? Earlier versions?

2024-06-06 Thread Keyong Zhou
Hi Ofir,

I can provide some information about use cases for Apache Celeborn.

Apache Celeborn can be deployed on K8s or standalone; both modes are widely
used in production by users. The largest cluster I know of contains
more than 1,000 Celeborn workers.

Celeborn is especially beneficial for large-scale shuffle with high
parallelism, which usually causes long fetch wait times or even fetch
failures. We have seen several-times speedups for jobs with large-scale
shuffle.

Besides, with Celeborn, Spark on K8s can achieve better Dynamic Resource
Allocation, because executors don't need to store shuffle data locally and
the pods don't need large disk space.
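
(For the curious, a rough sketch of what this looks like on the Spark side;
the Celeborn shuffle manager class and the spark.celeborn.* config names below
are taken from my reading of the Celeborn docs and may differ by version:)

    // Rough sketch: Spark on K8s using Celeborn as the remote shuffle service,
    // so dynamic allocation can work without the external shuffle service.
    // The Celeborn-specific names here are assumptions, not verified settings.
    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.shuffle.manager",
        "org.apache.spark.shuffle.celeborn.SparkShuffleManager")         // assumed Celeborn class
      .config("spark.celeborn.master.endpoints", "celeborn-master:9097") // assumed endpoint config
      .config("spark.shuffle.service.enabled", "false")  // no external shuffle service needed
      .config("spark.dynamicAllocation.enabled", "true")
      .getOrCreate()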

Celeborn is relatively easy to operate, especially thanks to its graceful
rolling upgrades and backward compatibility (across two successive versions).

You can find more information, including user feedback, here [1]. I recommend
trying it out, and the community is happy to help :)

Regards,
Keyong Zhou

[1] 
https://news.apache.org/foundation/entry/apache-software-foundation-announces-new-top-level-project-apache-celeborn

On 2024/06/06 09:08:31 Ofir Manor wrote:
> Hi,
> Regarding the external shuffle service on K8S and especially the push-based 
> variant that was merged in 3.2:
> 
>   1.
> Are there plans to make it supported and work out-of-the-box in 4.0?
>   2.
> Did anyone make it work for themselves in 3.5 or earlier? If so, can you 
> share your experience and what was needed to make it work?
> 
> As a fallback, is anyone using one of the newer shuffle projects with K8S,
> such as Apache Uniffle or Apache Celeborn, who can share some feedback?
> Performance, stability, added complexity, etc.?
> Thanks,
>Ofir
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



push-based external shuffle service on K8S - Spark 4.0? Earlier versions?

2024-06-06 Thread Ofir Manor
Hi,
Regarding the external shuffle service on K8S and especially the push-based 
variant that was merged in 3.2:

  1.
Are there plans to make it supported and work out-of-the-box in 4.0?
  2.
Did anyone make it work for themselves in 3.5 or earlier? If so, can you share 
your experience and what was needed to make it work?

As a fallback, is anyone using one of the newer shuffle projects with K8S,
such as Apache Uniffle or Apache Celeborn, who can share some feedback?
Performance, stability, added complexity, etc.?
Thanks,
   Ofir


Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Matthew Powers
I am a huge fan of the Apache Spark docs and I regularly look at the
analytics on this page to see how well they are doing.  Great work to
everyone who has contributed to the docs over the years.

We've been chipping away at improvements over the past year and have
made good progress.  For example, lots of the pages were missing canonical
links.  A canonical link is a special type of link that is
extremely important for any site that has duplicate content.  Versioned
documentation sites have lots of duplicate pages, so getting these
canonical links added was important.  It wasn't really easy to make this
change, though.

The current site is confusing Google a bit.  If you do a "spark rocksdb"
Google search for example, you get the Spark 3.2 Structured Streaming
Programming Guide as the first result (because Google isn't properly
indexing the docs).  You need to Control+F and search for "rocksdb" to
navigate to the relevant section which says: "As of Spark 3.2, we add a new
built-in state store implementation...", which is what you'd expect in a
versionless docs site in any case.

There are two different user experiences:

* Option A: push Spark 3.1 Structured Streaming users to the Spark 3.1
Structured Streaming Programming guide that doesn't mention RocksDB
* Option B: push Spark Structured Streaming users to the latest Structured
Streaming Programming Guide, which mentions RocksDB, but caveat that this
feature was added in Spark 3.2

I think Option B provides Spark 3.1 users a better experience overall.
It's better to let users know they can access RocksDB by upgrading than to
hide this info from them, IMO.
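
(For reference, the Spark 3.2+ feature those users are searching for boils
down to a single config; a minimal sketch, assuming the provider class the
guide documents:)

    // Minimal sketch: enable the RocksDB state store introduced in Spark 3.2.
    val spark = org.apache.spark.sql.SparkSession.builder()
      .config("spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()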

Now if we want Option A, then we'd need to give users a reasonable way to
actually navigate to the Spark 3.1 docs.  From what I can tell, the only
way to navigate from the latest Structured Streaming Programming Guide
to a different version is by manually updating the URL.

I was just skimming over the Structured Streaming Programming guide and
noticing again how lots of the Python code snippets aren't PEP 8
compliant.  It seems like our current docs publishing process would prevent
us from improving the old docs pages.

In this conversation, let's make sure we distinguish between "programming
guides" and "API documentation".  API docs should be versioned and there is
no question there.  Programming guides are higher level conceptual
overviews, like the Polars user guide, and should
be relevant across many versions.

I would also like to point out that the current programming guides are not
consistent:

* The Structured Streaming programming guide is one giant page
* The SQL programming guide is split across many pages
* The PySpark programming guide takes you to a whole different URL structure
and makes it so you can't even navigate to the other programming guides
anymore

I am looking forward to collaborating with the community and improving the
docs to 1. delight existing users and 2. attract new users.  Docs are a
"website problem" and we're big data people, but I'm confident we'll be
able to work together and find a good path forward here.


On Wed, Jun 5, 2024 at 3:22 PM Neil Ramaswamy  wrote:

> Thanks all for the responses. Let me try to address everything.
>
> > the programming guides are also different between versions since
> features are being added, configs are being added/ removed/ changed,
> defaults are being changed etc.
>
> I agree that this is the case. But I think it's fine to mention what
> version a feature is available in. In fact, I would argue that mentioning
> an improvement that a version brings motivates users to upgrade more than
> keeping docs improvement to "new releases to keep the community updating".
> Users should upgrade to get a better Spark, not better Spark documentation.
>
> > having a programming guide that refers to features or API methods that
> does not exist in that version is confusing and detrimental
>
> I don't think that we'd do this. Again, programming guides should teach
> fundamentals that do not change version-to-version. TypeScript (which
> has one of the best DX's and docs) does this exceptionally well. Their
> guides are refined, versionless pages, new features are elaborated upon in
> release notes (analogous to our version-specific docs), and for the
> occasional caveat for a version, it is called out in the guides.
>
>  I agree with Wenchen's 3 points. I don't think we need to say that they
> *have* to go to the old 

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Neil Ramaswamy
Thanks all for the responses. Let me try to address everything.

> the programming guides are also different between versions since features
are being added, configs are being added/ removed/ changed, defaults are
being changed etc.

I agree that this is the case. But I think it's fine to mention what
version a feature is available in. In fact, I would argue that mentioning
an improvement that a version brings motivates users to upgrade more than
keeping docs improvement to "new releases to keep the community updating".
Users should upgrade to get a better Spark, not better Spark documentation.

> having a programming guide that refers to features or API methods that
does not exist in that version is confusing and detrimental

I don't think that we'd do this. Again, programming guides should teach
fundamentals that do not change version-to-version. TypeScript (which
has one of the best DX's and docs) does this exceptionally well. Their
guides are refined, versionless pages, new features are elaborated upon in
release notes (analogous to our version-specific docs), and for the
occasional caveat for a version, it is called out in the guides.

 I agree with Wenchen's 3 points. I don't think we need to say that they
*have* to go to the old page, but that if they want to, they can.

Neil

On Wed, Jun 5, 2024 at 12:04 PM Wenchen Fan  wrote:

> I agree with the idea of a versionless programming guide. But one thing we
> need to make sure of is we give clear messages for things that are only
> available in a new version. My proposal is:
>
>1. keep the old versions' programming guide unchanged. For example,
>people can still access
>https://spark.apache.org/docs/3.3.4/quick-start.html
>2. In the new versionless programming guide, we mention at the
>beginning that for Spark versions before 4.0, go to the versioned doc site
>to read the programming guide.
>3. Revisit the programming guide of Spark 4.0 (compare it with the one
>of 3.5), and adjust the content to mention version-specific changes (API
>change, new features, etc.)
>
> Then we can have a versionless programming guide starting from Spark 4.0.
> We can also revisit programming guides of all versions and combine them
> into one with version-specific notes, but that's probably too much work.
>
> Any thoughts?
>
> Wenchen
>
> On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson <
> martin.anders...@kambi.com> wrote:
>
>> While I have no practical knowledge of how documentation is maintained in
>> the spark project, I must agree with Nimrod. For users on older versions,
>> having a programming guide that refers to features or API methods that does
>> not exist in that version is confusing and detrimental.
>>
>> Surely there must be a better way to allow updating documentation more
>> often?
>>
>> Best Regards,
>> Martin
>>
>> --
>> *From:* Nimrod Ofek 
>> *Sent:* Wednesday, June 5, 2024 08:26
>> *To:* Neil Ramaswamy 
>> *Cc:* Praveen Gattu ; dev <
>> dev@spark.apache.org>
>> *Subject:* Re: [DISCUSS] Versionless Spark Programming Guide Proposal
>>
>> Hi Neil,
>>
>>
>> While you wrote you don't mean the api docs (of course), the programming
>> guides are also different between versions since features are being added,
>> configs are being added/ removed/ changed, defaults are being changed etc.
>>
>> I know of "backport hell" - which is why I wrote that once a version is
>> released it's frozen and the documentation will be updated for the new
>> version only.
>>
>> I think of it as facing forward and keeping older versions but focusing
>> on the new releases to keep the community updating.
>> While Spark has a support window of 18 months until EOL, we can have only a
>> 6-month support cycle until EOL for documentation - there are no major
>> security concerns for documentation...
>>
>> Nimrod
>>
>> On Wed, Jun 5, 2024, 08:28, Neil Ramaswamy <n...@ramaswamy.org> wrote:
>>
>> Hi Nimrod,
>>
>> Quick clarification—my proposal will not touch API-specific
>> documentation for the specific reasons you mentioned (signatures, behavior,
>> etc.). It just aims to make the *programming guides *versionless.
>> Programming guides should teach fundamentals of Spark, and the fundamentals
>> of Spark should not change between releases.
>>
>> There are a few issues with updating documentation multiple times after
>> Spark releases. First, fixes that apply to all existing versions'
>> programming guides need backport PRs. For example, this change
>>  applies to all the
>> versions of the SS programming guide, but is likely to be fixed only in
>> Spark 4.0. Additionally, any such update within a Spark release will 

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Wenchen Fan
I agree with the idea of a versionless programming guide. But one thing we
need to make sure of is we give clear messages for things that are only
available in a new version. My proposal is:

   1. keep the old versions' programming guide unchanged. For example,
   people can still access
   https://spark.apache.org/docs/3.3.4/quick-start.html
   2. In the new versionless programming guide, we mention at the beginning
   that for Spark versions before 4.0, go to the versioned doc site to read
   the programming guide.
   3. Revisit the programming guide of Spark 4.0 (compare it with the one
   of 3.5), and adjust the content to mention version-specific changes (API
   change, new features, etc.)

Then we can have a versionless programming guide starting from Spark 4.0.
We can also revisit programming guides of all versions and combine them
into one with version-specific notes, but that's probably too much work.

Any thoughts?

Wenchen

On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson 
wrote:

> While I have no practical knowledge of how documentation is maintained in
> the spark project, I must agree with Nimrod. For users on older versions,
> having a programming guide that refers to features or API methods that does
> not exist in that version is confusing and detrimental.
>
> Surely there must be a better way to allow updating documentation more
> often?
>
> Best Regards,
> Martin
>
> --
> *From:* Nimrod Ofek 
> *Sent:* Wednesday, June 5, 2024 08:26
> *To:* Neil Ramaswamy 
> *Cc:* Praveen Gattu ; dev <
> dev@spark.apache.org>
> *Subject:* Re: [DISCUSS] Versionless Spark Programming Guide Proposal
>
> Hi Neil,
>
>
> While you wrote you don't mean the api docs (of course), the programming
> guides are also different between versions since features are being added,
> configs are being added/ removed/ changed, defaults are being changed etc.
>
> I know of "backport hell" - which is why I wrote that once a version is
> released it's frozen and the documentation will be updated for the new
> version only.
>
> I think of it as facing forward and keeping older versions but focusing on
> the new releases to keep the community updating.
> While Spark has a support window of 18 months until EOL, we can have only a
> 6-month support cycle until EOL for documentation - there are no major
> security concerns for documentation...
>
> Nimrod
>
> On Wed, Jun 5, 2024, 08:28, Neil Ramaswamy <n...@ramaswamy.org> wrote:
>
> Hi Nimrod,
>
> Quick clarification—my proposal will not touch API-specific documentation
> for the specific reasons you mentioned (signatures, behavior, etc.). It
> just aims to make the *programming guides *versionless. Programming
> guides should teach fundamentals of Spark, and the fundamentals of Spark
> should not change between releases.
>
> There are a few issues with updating documentation multiple times after
> Spark releases. First, fixes that apply to all existing versions'
> programming guides need backport PRs. For example, this change
>  applies to all the
> versions of the SS programming guide, but is likely to be fixed only in
> Spark 4.0. Additionally, any such update within a Spark release will require
> re-building the static sites in the spark repo, and copying those files to
> spark-website via a commit in spark-website. Making a typo fix like the one
> I linked would then require  + 1 PRs,
> as opposed to 1 PR in the versionless programming guide world.
>
> Neil
>
> On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek  wrote:
>
> Hi,
>
>> While I think the documentation needs a lot of improvement and
>> important details are missing - and detaching the documentation from the
>> main project can help us iterate faster on documentation-specific tasks -
>> I don't think we can or should move to versionless documentation.
>>
>> Documentation is version specific: parameters are added and removed, new
>> features are added, behaviours sometimes change, etc.
>>
>> I think the documentation should be version specific - but separate from
>> the Spark release cadence - and can be updated multiple times after a
>> Spark release.
>> The way I see it, the documentation should be updated only for the latest
>> version; some time before a new release it should be archived, and the
>> updated documentation should reflect the new version.
>
> Thanks,
> Nimrod
>
> On Tue, Jun 4, 2024, 18:34, Praveen Gattu wrote:
>
> +1. This helps for greater velocity in improving docs. However, we might
> still need a way to provide version specific information isn't it, i.e.
> what features are available in which version etc.
>
> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy  wrote:
>
> Hi all,
>
> I've written up a proposal to migrate all the Apache Spark programming
> guides to 

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Martin Andersson
While I have no practical knowledge of how documentation is maintained in the 
spark project, I must agree with Nimrod. For users on older versions, having a 
programming guide that refers to features or API methods that does not exist in 
that version is confusing and detrimental.

Surely there must be a better way to allow updating documentation more often?

Best Regards,
Martin


From: Nimrod Ofek 
Sent: Wednesday, June 5, 2024 08:26
To: Neil Ramaswamy 
Cc: Praveen Gattu ; dev 

Subject: Re: [DISCUSS] Versionless Spark Programming Guide Proposal


Hi Neil,


While you wrote you don't mean the api docs (of course), the programming guides 
are also different between versions since features are being added, configs are 
being added/ removed/ changed, defaults are being changed etc.

I know of "backport hell" - which is why I wrote that once a version is
released it's frozen and the documentation will be updated for the new version
only.

I think of it as facing forward and keeping older versions but focusing on the
new releases to keep the community updating.
While Spark has a support window of 18 months until EOL, we can have only a
6-month support cycle until EOL for documentation - there are no major
security concerns for documentation...

Nimrod

On Wed, Jun 5, 2024, 08:28, Neil Ramaswamy <n...@ramaswamy.org> wrote:
Hi Nimrod,

Quick clarification—my proposal will not touch API-specific documentation for 
the specific reasons you mentioned (signatures, behavior, etc.). It just aims 
to make the programming guides versionless. Programming guides should teach 
fundamentals of Spark, and the fundamentals of Spark should not change between 
releases.

There are a few issues with updating documentation multiple times after Spark 
releases. First, fixes that apply to all existing versions' programming guides 
need backport PRs. For example, this 
change applies to all the 
versions of the SS programming guide, but is likely to be fixed only in Spark 
4.0. Additionally, any such update within a Spark release will require 
re-building the static sites in the spark repo, and copying those files to 
spark-website via a commit in spark-website. Making a typo fix like the one I 
linked would then require  + 1 PRs, 
as opposed to 1 PR in the versionless programming guide world.

Neil

On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek 
mailto:ofek.nim...@gmail.com>> wrote:
Hi,

While I think the documentation needs a lot of improvement and important
details are missing - and detaching the documentation from the main project
can help us iterate faster on documentation-specific tasks - I don't think we
can or should move to versionless documentation.

Documentation is version specific: parameters are added and removed, new
features are added, behaviours sometimes change, etc.

I think the documentation should be version specific - but separate from the
Spark release cadence - and can be updated multiple times after a Spark
release. The way I see it, the documentation should be updated only for the
latest version; some time before a new release it should be archived, and the
updated documentation should reflect the new version.

Thanks,
Nimrod

On Tue, Jun 4, 2024, 18:34, Praveen Gattu wrote:
+1. This helps for greater velocity in improving docs. However, we might still 
need a way to provide version specific information isn't it, i.e. what features 
are available in which version etc.

On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy 
mailto:n...@ramaswamy.org>> wrote:
Hi all,

I've written up a proposal to migrate all the Apache Spark programming guides 
to be versionless. You can find the proposal 
here.
 Please leave comments, or reply in this DISCUSS thread.

TLDR: by making the programming guides versionless, we can make updates to them 
whenever we'd like, instead of at the Spark release cadence. This increased 
update velocity will enable us to make gradual improvements, including breaking 
up the Structured Streaming programming guide into smaller sub-guides. The 
proposal does not break any existing URLs, and it does not affect our versioned 
API docs in any way.

Thanks!
Neil

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Nimrod Ofek
Hi Neil,


While you wrote you don't mean the api docs (of course), the programming
guides are also different between versions since features are being added,
configs are being added/ removed/ changed, defaults are being changed etc.

I know of "backport hell" - which is why I wrote that once a version is
released it's frozen and the documentation will be updated for the new
version only.

I think of it as facing forward and keeping older versions but focusing on
the new releases to keep the community updating.
While Spark has a support window of 18 months until EOL, we can have only a
6-month support cycle until EOL for documentation - there are no major
security concerns for documentation...

Nimrod

On Wed, Jun 5, 2024, 08:28, Neil Ramaswamy wrote:

> Hi Nimrod,
>
> Quick clarification—my proposal will not touch API-specific documentation
> for the specific reasons you mentioned (signatures, behavior, etc.). It
> just aims to make the *programming guides *versionless. Programming
> guides should teach fundamentals of Spark, and the fundamentals of Spark
> should not change between releases.
>
> There are a few issues with updating documentation multiple times after
> Spark releases. First, fixes that apply to all existing versions'
> programming guides need backport PRs. For example, this change
>  applies to all the
> versions of the SS programming guide, but is likely to be fixed only in
> Spark 4.0. Additionally, any such update within a Spark release will require
> re-building the static sites in the spark repo, and copying those files to
> spark-website via a commit in spark-website. Making a typo fix like the one
> I linked would then require  + 1 PRs,
> as opposed to 1 PR in the versionless programming guide world.
>
> Neil
>
> On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek  wrote:
>
>> Hi,
>>
>> While I think the documentation needs a lot of improvement and
>> important details are missing - and detaching the documentation from the
>> main project can help us iterate faster on documentation-specific tasks -
>> I don't think we can or should move to versionless documentation.
>>
>> Documentation is version specific: parameters are added and removed, new
>> features are added, behaviours sometimes change, etc.
>>
>> I think the documentation should be version specific - but separate from
>> the Spark release cadence - and can be updated multiple times after a
>> Spark release.
>> The way I see it, the documentation should be updated only for the latest
>> version; some time before a new release it should be archived, and the
>> updated documentation should reflect the new version.
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, Jun 4, 2024, 18:34, Praveen Gattu wrote:
>>
>>> +1. This helps for greater velocity in improving docs. However, we might
>>> still need a way to provide version specific information isn't it, i.e.
>>> what features are available in which version etc.
>>>
>>> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy 
>>> wrote:
>>>
 Hi all,

 I've written up a proposal to migrate all the Apache Spark programming
 guides to be versionless. You can find the proposal here
 .
 Please leave comments, or reply in this DISCUSS thread.

 TLDR: by making the programming guides versionless, we can make updates
 to them whenever we'd like, instead of at the Spark release cadence. This
 increased update velocity will enable us to make gradual improvements,
 including breaking up the Structured Streaming programming guide into
 smaller sub-guides. The proposal does not break *any *existing URLs,
 and it does not affect our versioned API docs in any way.

 Thanks!
 Neil

>>>


Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Neil Ramaswamy
Hi Nimrod,

Quick clarification—my proposal will not touch API-specific documentation
for the specific reasons you mentioned (signatures, behavior, etc.). It
just aims to make the *programming guides *versionless. Programming guides
should teach fundamentals of Spark, and the fundamentals of Spark should
not change between releases.

There are a few issues with updating documentation multiple times after
Spark releases. First, fixes that apply to all existing versions'
programming guides need backport PRs. For example, this change
 applies to all the
versions of the SS programming guide, but is likely to be fixed only in
Spark 4.0. Additionally, any such update within a Spark release will require
re-building the static sites in the spark repo, and copying those files to
spark-website via a commit in spark-website. Making a typo fix like the one
I linked would then require  + 1 PRs,
as opposed to 1 PR in the versionless programming guide world.

Neil

On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek  wrote:

> Hi,
>
> While I think the documentation needs a lot of improvement and
> important details are missing - and detaching the documentation from the
> main project can help us iterate faster on documentation-specific tasks -
> I don't think we can or should move to versionless documentation.
>
> Documentation is version specific: parameters are added and removed, new
> features are added, behaviours sometimes change, etc.
>
> I think the documentation should be version specific - but separate from
> the Spark release cadence - and can be updated multiple times after a
> Spark release.
> The way I see it, the documentation should be updated only for the latest
> version; some time before a new release it should be archived, and the
> updated documentation should reflect the new version.
>
> Thanks,
> Nimrod
>
> On Tue, Jun 4, 2024, 18:34, Praveen Gattu wrote:
>
>> +1. This helps for greater velocity in improving docs. However, we might
>> still need a way to provide version specific information isn't it, i.e.
>> what features are available in which version etc.
>>
>> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy  wrote:
>>
>>> Hi all,
>>>
>>> I've written up a proposal to migrate all the Apache Spark programming
>>> guides to be versionless. You can find the proposal here
>>> .
>>> Please leave comments, or reply in this DISCUSS thread.
>>>
>>> TLDR: by making the programming guides versionless, we can make updates
>>> to them whenever we'd like, instead of at the Spark release cadence. This
>>> increased update velocity will enable us to make gradual improvements,
>>> including breaking up the Structured Streaming programming guide into
>>> smaller sub-guides. The proposal does not break *any *existing URLs,
>>> and it does not affect our versioned API docs in any way.
>>>
>>> Thanks!
>>> Neil
>>>
>>


Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Nimrod Ofek
Hi,

While I think the documentation needs a lot of improvement and
important details are missing - and detaching the documentation from the
main project can help us iterate faster on documentation-specific tasks -
I don't think we can or should move to versionless documentation.

Documentation is version specific: parameters are added and removed, new
features are added, behaviours sometimes change, etc.

I think the documentation should be version specific - but separate from
the Spark release cadence - and can be updated multiple times after a Spark
release. The way I see it, the documentation should be updated only for the
latest version; some time before a new release it should be archived, and
the updated documentation should reflect the new version.

Thanks,
Nimrod

On Tue, Jun 4, 2024, 18:34, Praveen Gattu wrote:

> +1. This helps for greater velocity in improving docs. However, we might
> still need a way to provide version specific information isn't it, i.e.
> what features are available in which version etc.
>
> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy  wrote:
>
>> Hi all,
>>
>> I've written up a proposal to migrate all the Apache Spark programming
>> guides to be versionless. You can find the proposal here
>> .
>> Please leave comments, or reply in this DISCUSS thread.
>>
>> TLDR: by making the programming guides versionless, we can make updates
>> to them whenever we'd like, instead of at the Spark release cadence. This
>> increased update velocity will enable us to make gradual improvements,
>> including breaking up the Structured Streaming programming guide into
>> smaller sub-guides. The proposal does not break *any *existing URLs, and
>> it does not affect our versioned API docs in any way.
>>
>> Thanks!
>> Neil
>>
>


Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Praveen Gattu
+1. This helps for greater velocity in improving docs. However, we might
still need a way to provide version specific information isn't it, i.e.
what features are available in which version etc.

On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy  wrote:

> Hi all,
>
> I've written up a proposal to migrate all the Apache Spark programming
> guides to be versionless. You can find the proposal here
> .
> Please leave comments, or reply in this DISCUSS thread.
>
> TLDR: by making the programming guides versionless, we can make updates to
> them whenever we'd like, instead of at the Spark release cadence. This
> increased update velocity will enable us to make gradual improvements,
> including breaking up the Structured Streaming programming guide into
> smaller sub-guides. The proposal does not break *any *existing URLs, and
> it does not affect our versioned API docs in any way.
>
> Thanks!
> Neil
>


[ANNOUNCE] Announcing Apache Spark 4.0.0-preview1

2024-06-03 Thread Wenchen Fan
Hi all,

To enable wide-scale community testing of the upcoming Spark 4.0 release,
the Apache Spark community has posted a preview release of Spark 4.0. This
preview is not a stable release in terms of either API or functionality,
but it is meant to give the community early access to try the code that
will become Spark 4.0. If you would like to test the release, please
download it, and send feedback using either the mailing lists or JIRA.

There are a lot of exciting new features added to Spark 4.0, including ANSI
mode by default, Python data source, polymorphic Python UDTF, string
collation support, new VARIANT data type, streaming state store data
source, structured logging, Java 17 by default, and many more.

We'd like to thank our contributors and users for their contributions and
early feedback to this release. This release would not have been possible
without you.

To download Spark 4.0.0-preview1, head over to the download page:
https://archive.apache.org/dist/spark/spark-4.0.0-preview1 . It's also
available in PyPI, with version name "4.0.0.dev1".

Thanks,

Wenchen


[DISCUSS] Variant shredding specification

2024-06-03 Thread Gene Pang
Hi all,

We have been working on the Variant data type, which is designed to store
and process semi-structured data efficiently, even with heterogeneous
values. Users can store and process semi-structured data in a flexible way,
without having to specify or know any fixed schema on write. Variant data
is encoded in a self-describing format, and
the binary format uses offset-based encoding to speed up the navigation
performance.
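
As a small usage sketch of what this looks like from the query side (assuming
the parse_json / variant_get expressions that ship with the Variant type in
Spark 4.0; shredding itself is a storage-layer optimization and should not
change the query):

    // Minimal sketch: build a VARIANT value and pull typed fields back out.
    // Assumes an active SparkSession named `spark`.
    val df = spark.sql(
      """SELECT parse_json('{"user": {"id": 7, "name": "ada"}}') AS v""")
    df.selectExpr(
      "variant_get(v, '$.user.id', 'int') AS user_id",
      "variant_get(v, '$.user.name', 'string') AS user_name"
    ).show()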

To further improve performance, we are also working on shredding, which is
the process of extracting some of the Variant fields from the binary, and
storing them in separate columns. We have written a specification for Variant
shredding  to augment the
existing Variant specification.

The shredding benefits include:
- more compact data encoding
- min/max statistics for data skipping
- I/O and CPU savings from pruning unnecessary fields not accessed by a
query

Please take a look at the shredding specification PR
 and leave github comments and
suggestions. Your feedback would be greatly appreciated!

Thanks,
Gene


Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-06-02 Thread Wenchen Fan
The vote passes with 6 +1s (4 binding +1s).

(* = binding)
+1:
Wenchen Fan (*)
Kent Yao
Cheng Pan
Xiao Li (*)
Gengliang Wang (*)
Tathagata Das (*)


Thanks all!

On Fri, May 31, 2024 at 6:07 PM Tathagata Das 
wrote:

> +1
> - Tested RC3 with Delta Lake. All our Scala and Python tests pass.
>
> On Fri, May 31, 2024 at 3:24 PM Xiao Li  wrote:
>
>> +1
>>
>> On Thu, May 30, 2024 at 09:48, Cheng Pan wrote:
>>
>>> +1 (non-binding)
>>>
>>> - All links are valid
>>> - Run some basic queries using YARN client mode with Apache Hadoop
>>> v3.3.6, HMS 2.3.9
>>> - Pass integration tests with Apache Kyuubi v1.9.1 RC0
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> On May 29, 2024, at 02:48, Wenchen Fan  wrote:
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 4.0.0-preview1.
>>>
>>> The vote is open until May 31 PST and passes if a majority +1 PMC votes
>>> are cast, with
>>> a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v4.0.0-preview1-rc2 (commit
>>> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
>>> https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1456/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
>>>
>>> The list of bug fixes going into 4.0.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out of date RC going forward).
>>>
>>>
>>>


Unsubscribe

2024-05-31 Thread Ashish Singh



Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-31 Thread Tathagata Das
+1
- Tested RC3 with Delta Lake. All our Scala and Python tests pass.

On Fri, May 31, 2024 at 3:24 PM Xiao Li  wrote:

> +1
>
> On Thu, May 30, 2024 at 09:48, Cheng Pan wrote:
>
>> +1 (non-binding)
>>
>> - All links are valid
>> - Run some basic queries using YARN client mode with Apache Hadoop v3.3.6,
>> HMS 2.3.9
>> - Pass integration tests with Apache Kyuubi v1.9.1 RC0
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On May 29, 2024, at 02:48, Wenchen Fan  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 4.0.0-preview1.
>>
>> The vote is open until May 31 PST and passes if a majority +1 PMC votes
>> are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v4.0.0-preview1-rc2 (commit
>> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
>> https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1456/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
>>
>> The list of bug fixes going into 4.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>>
>>


Unsubscribe

2024-05-31 Thread Ashish


Sent from my iPhone

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-31 Thread Gengliang Wang
+1

On Fri, May 31, 2024 at 11:06 AM Xiao Li  wrote:

> +1
>
> On Thu, May 30, 2024 at 09:48, Cheng Pan wrote:
>
>> +1 (non-binding)
>>
>> - All links are valid
>> - Run some basic queries using YARN client mode with Apache Hadoop v3.3.6,
>> HMS 2.3.9
>> - Pass integration tests with Apache Kyuubi v1.9.1 RC0
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On May 29, 2024, at 02:48, Wenchen Fan  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 4.0.0-preview1.
>>
>> The vote is open until May 31 PST and passes if a majority +1 PMC votes
>> are cast, with
>> a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v4.0.0-preview1-rc2 (commit
>> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
>> https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1456/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
>>
>> The list of bug fixes going into 4.0.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>>
>>


Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-31 Thread Xiao Li
+1

On Thu, May 30, 2024 at 09:48, Cheng Pan wrote:

> +1 (non-binding)
>
> - All links are valid
> - Run some basic queries using YARN client mode with Apache Hadoop v3.3.6,
> HMS 2.3.9
> - Pass integration tests with Apache Kyuubi v1.9.1 RC0
>
> Thanks,
> Cheng Pan
>
>
> On May 29, 2024, at 02:48, Wenchen Fan  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 4.0.0-preview1.
>
> The vote is open until May 31 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v4.0.0-preview1-rc2 (commit
> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1456/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
>
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
>
>


Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-30 Thread Cheng Pan
+1 (non-binding)

- All links are valid
- Run some basic queries using YARN client mode with Apache Hadoop v3.3.6, HMS
2.3.9
- Pass integration tests with Apache Kyuubi v1.9.1 RC0

Thanks,
Cheng Pan


> On May 29, 2024, at 02:48, Wenchen Fan  wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 4.0.0-preview1.
> 
> The vote is open until May 31 PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
> The tag to be voted on is v4.0.0-preview1-rc2 (commit 
> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1456/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
> 
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).



Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-30 Thread Kent Yao
+1 (non-binding), I have checked:

- Download links are fine
- Signatures and integrity checks are fine
- Build from source
- run-example runs successfully with some example code
- No blocking issues from my side
- Duplicated jars [1][2] found in both hive-jackson and examples/jars; the
latter does not seem necessary.

Thanks,
Kent Yao

[1] jackson-core-asl-1.9.13.jar
[2] jackson-mapper-asl-1.9.13.jar


On 2024/05/28 18:52:32 Wenchen Fan wrote:
> one correction: "The tag to be voted on is v4.0.0-preview1-rc2 (commit
> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66)" should be "The tag to be voted
> on is v4.0.0-preview1-rc3 (commit
> 7a7a8bc4bab591ac8b98b2630b38c57adf619b82):"
> 
> On Tue, May 28, 2024 at 11:48 AM Wenchen Fan  wrote:
> 
> > Please vote on releasing the following candidate as Apache Spark version
> > 4.0.0-preview1.
> >
> > The vote is open until May 31 PST and passes if a majority +1 PMC votes
> > are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v4.0.0-preview1-rc2 (commit
> > 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> > https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1456/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
> >
> > The list of bug fixes going into 4.0.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353359
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Unsubscribe

2024-05-29 Thread Jang tao
Unsubscribe


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-28 Thread Wenchen Fan
Hi all,

I've created a PR to put the behavior change guideline on the Spark
website: https://github.com/apache/spark-website/pull/518 . Please leave
comments if you have any, thanks!

On Wed, May 15, 2024 at 1:41 AM Wenchen Fan  wrote:

> Thanks all for the feedback here! Let me put up a new version, which
> clarifies the definition of "users":
>
> Behavior changes mean user-visible functional changes in a new release via
> public APIs. The "user" here is not only the user who writes queries and/or
> develops Spark plugins, but also the user who deploys and/or manages Spark
> clusters. New features, and even bug fixes that eliminate NPE or correct
> query results, are behavior changes. Things like performance improvement,
> code refactoring, and changes to unreleased APIs/features are not. All
> behavior changes should be called out in the PR description. We need to
> write an item in the migration guide (and probably legacy config) for those
> that may break users when upgrading:
>
>- Bug fixes that change query results. Users may need to do backfill
>to correct the existing data and must know about these correctness fixes.
>- Bug fixes that change query schema. Users may need to update the
>schema of the tables in their data pipelines and must know about these
>changes.
>- Remove configs
>- Rename error class/condition
>- Any non-additive change to the public Python/SQL/Scala/Java/R APIs
>(including developer APIs): rename function, remove parameters, add
>parameters, rename parameters, change parameter default values, etc. These
>changes should be avoided in general, or done in a binary-compatible
>way like deprecating and adding a new function instead of renaming.
>- Any non-additive change to the way Spark should be deployed and
>managed.
>
> The list above is not supposed to be comprehensive. Anyone can raise your
> concern when reviewing PRs and ask the PR author to add migration guide if
> you believe the change is risky and may break users.
>
> On Thu, May 2, 2024 at 10:25 PM Will Raschkowski <
> wraschkow...@palantir.com> wrote:
>
>> To add some user perspective, I wanted to share our experience from
>> automatically upgrading tens of thousands of jobs from Spark 2 to 3 at
>> Palantir:
>>
>>
>>
>> We didn't mind "loud" changes that threw exceptions. We have some infra
>> to try run jobs with Spark 3 and fallback to Spark 2 if there's an
>> exception. E.g., the datetime parsing and rebasing migration in Spark 3 was
>> great: Spark threw a helpful exception but never silently changed results.
>> Similarly, for things listed in the migration guide as silent changes
>> (e.g., add_months's handling of last-day-of-month), we wrote custom check
>> rules to throw unless users acknowledged the change through config.
>>
>>
>>
>> Silent changes *not* in the migration guide were really bad for us:
>> Trusting the migration guide to be exhaustive, we automatically upgraded
>> jobs which then “succeeded” but wrote incorrect results. For example, some
>> expression increased timestamp precision in Spark 3; a query implicitly
>> relied on the reduced precision, and then produced bad results on upgrade.
>> It’s a silly query but a note in the migration guide would have helped.
>>
>>
>>
>> To summarize: the migration guide was invaluable, we appreciated every
>> entry, and we'd appreciate Wenchen's stricter definition of "behavior
>> changes" (especially for silent ones).
>>
>>
>>
>> *From: *Nimrod Ofek 
>> *Date: *Thursday, 2 May 2024 at 11:57
>> *To: *Wenchen Fan 
>> *Cc: *Erik Krogen , Spark dev list <
>> dev@spark.apache.org>
>> *Subject: *Re: [DISCUSS] clarify the definition of behavior changes
>>
>> *CAUTION:* This email originates from an external party (outside of
>> Palantir). If you believe this message is suspicious in nature, please use
>> the "Report Message" button built into Outlook.
>>
>>
>>
>> Hi Erik and Wenchen,
>>
>>
>>
>> I think a good practice with public APIs, and with internal APIs that have
>> a big impact and a lot of usage, is to ease in changes by giving new
>> parameters defaults that keep the former behaviour, keeping a method with
>> the previous signature and a deprecation notice, and deleting that
>> deprecated function in the next release - so the actual break happens in
>> the next release, after all libraries have had the chance to align with
>> the API, and upgrades can be done while already using the new version.
>>
>>
>>
>> Another thing is that we should probably examine which private APIs are
>> used externally, to provide a better experience, and provide proper public
>> APIs to meet those needs (for instance, applicative metrics and some way
>> of creating custom behaviour columns).
>>
>>
>>
>> Thanks,
>>
>> Nimrod
>>
>>
>>
>> On Thu, May 2, 2024, 03:51, Wenchen Fan wrote:
>>
>> Hi Erik,
>>
>>
>>
>> Thanks for sharing your thoughts! Note: developer APIs are also public
>> APIs (such as Data Source V2 API, Spark 

unsubscribe

2024-05-28 Thread Lucas De Jaeger
unsubscribe


Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-28 Thread Wenchen Fan
one correction: "The tag to be voted on is v4.0.0-preview1-rc2 (commit
7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66)" should be "The tag to be voted
on is v4.0.0-preview1-rc3 (commit
7a7a8bc4bab591ac8b98b2630b38c57adf619b82):"

On Tue, May 28, 2024 at 11:48 AM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 4.0.0-preview1.
>
> The vote is open until May 31 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v4.0.0-preview1-rc2 (commit
> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> https://github.com/apache/spark/tree/v4.0.0-preview1-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1456/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/
>
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>


[VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-28 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview1.

The vote is open until May 31 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v4.0.0-preview1-rc2 (commit
7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
https://github.com/apache/spark/tree/v4.0.0-preview1-rc3

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1456/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc3-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).
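
For the Java/Scala route, a rough sbt sketch of what adding the staging
repository to your project's resolvers can look like (the build file, the
module choice, and the version string below are illustrative assumptions,
not part of the release instructions):

// build.sbt of a throwaway test project pulling the RC from the staging repo
resolvers += "Apache Spark 4.0.0-preview1 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1456/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0-preview1"

// Remember to clear the local artifact caches (e.g. ~/.ivy2/cache,
// ~/.cache/coursier) before and after testing, so later builds don't
// keep resolving against this out-of-date RC.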


Re: [VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Wenchen Fan
Thanks for the quick reply! I'm cutting RC3 now.

On Tue, May 28, 2024 at 2:28 AM Kent Yao  wrote:

> -1
>
> You've updated your key in [2]  with a new one [1]. I believe you should
> add your new key without removing the old one. Otherwise, users cannot
> verify those archived releases you published.
>
> Thanks,
> Kent Yao
>
> [1] https://dist.apache.org/repos/dist/dev/spark/KEYS
> [2] https://downloads.apache.org/spark/KEYS
>
> On 2024/05/28 07:52:45 Yi Wu wrote:
> > -1
> > I think we should include this bug fix
> >
> https://github.com/apache/spark/commit/6cd1ccc56321dfa52672cd25f4cfdf2bbc86b3ea
> .
> > The bug can lead to the unrecoverable job failure.
> >
> > Thanks,
> > Yi
> >
> > On Tue, May 28, 2024 at 3:45 PM Wenchen Fan  wrote:
> >
> > > Please vote on releasing the following candidate as Apache Spark
> version
> > > 4.0.0-preview1.
> > >
> > > The vote is open until May 31 PST and passes if a majority +1 PMC votes
> > > are cast, with
> > > a minimum of 3 +1 votes.
> > >
> > > [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see http://spark.apache.org/
> > >
> > > The tag to be voted on is v4.0.0-preview1-rc2 (commit
> > > 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> > > https://github.com/apache/spark/tree/v4.0.0-preview1-rc2
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-bin/
> > >
> > > Signatures used for Spark RCs can be found in this file:
> > > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >
> > > The staging repository for this release can be found at:
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1455/
> > >
> > > The documentation corresponding to this release can be found at:
> > > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-docs/
> > >
> > > The list of bug fixes going into 4.0.0 can be found at the following
> URL:
> > > https://issues.apache.org/jira/projects/SPARK/versions/12353359
> > >
> > > FAQ
> > >
> > > =
> > > How can I help test this release?
> > > =
> > >
> > > If you are a Spark user, you can help us test this release by taking
> > > an existing Spark workload and running on this release candidate, then
> > > reporting any regressions.
> > >
> > > If you're working in PySpark you can set up a virtual env and install
> > > the current RC and see if anything important breaks, in the Java/Scala
> > > you can add the staging repository to your projects resolvers and test
> > > with the RC (make sure to clean up the artifact cache before/after so
> > > you don't end up building with an out of date RC going forward).
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Kent Yao
-1

You've updated your key in [2]  with a new one [1]. I believe you should add 
your new key without removing the old one. Otherwise, users cannot verify those 
archived releases you published.

Thanks,
Kent Yao

[1] https://dist.apache.org/repos/dist/dev/spark/KEYS
[2] https://downloads.apache.org/spark/KEYS

On 2024/05/28 07:52:45 Yi Wu wrote:
> -1
> I think we should include this bug fix
> https://github.com/apache/spark/commit/6cd1ccc56321dfa52672cd25f4cfdf2bbc86b3ea.
> The bug can lead to the unrecoverable job failure.
> 
> Thanks,
> Yi
> 
> On Tue, May 28, 2024 at 3:45 PM Wenchen Fan  wrote:
> 
> > Please vote on releasing the following candidate as Apache Spark version
> > 4.0.0-preview1.
> >
> > The vote is open until May 31 PST and passes if a majority +1 PMC votes
> > are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v4.0.0-preview1-rc2 (commit
> > 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> > https://github.com/apache/spark/tree/v4.0.0-preview1-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1455/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-docs/
> >
> > The list of bug fixes going into 4.0.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353359
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Yi Wu
-1
I think we should include this bug fix
https://github.com/apache/spark/commit/6cd1ccc56321dfa52672cd25f4cfdf2bbc86b3ea.
The bug can lead to an unrecoverable job failure.

Thanks,
Yi

On Tue, May 28, 2024 at 3:45 PM Wenchen Fan  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 4.0.0-preview1.
>
> The vote is open until May 31 PST and passes if a majority +1 PMC votes
> are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v4.0.0-preview1-rc2 (commit
> 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
> https://github.com/apache/spark/tree/v4.0.0-preview1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1455/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-docs/
>
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>


[VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview1.

The vote is open until May 31 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v4.0.0-preview1-rc2 (commit
7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66):
https://github.com/apache/spark/tree/v4.0.0-preview1-rc2

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1455/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc2-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).


ArrowUtilSuite Fails with "NoSuchFieldError: chunkSize"

2024-05-28 Thread Senthil Kumar
Hello Team,

We are seeing the ArrowUtilsSuite test fail with a "NoSuchFieldError:
chunkSize" error.

java.lang.NoSuchFieldError: Class io.netty.buffer.PoolArena does not have
member field 'int chunkSize'.

The Netty library does not have the field 'int chunkSize' in 4.1.72/74/82/84,
nor in higher versions. But Arrow, versions 7.0.0 to 12.0.1, still refers to
the "chunkSize" field.


Error:
{{
[error] Uncaught exception when running
org.apache.spark.sql.util.ArrowUtilsSuite: java.lang.NoSuchFieldError:
Class io.netty.buffer.PoolArena does not have member field 'int chunkSize'
[error] sbt.ForkMain$ForkError: java.lang.NoSuchFieldError: Class
io.netty.buffer.PoolArena does not have member field 'int chunkSize'
[error] at
io.netty.buffer.PooledByteBufAllocatorL$InnerAllocator.(PooledByteBufAllocatorL.java:153)
[error] at
io.netty.buffer.PooledByteBufAllocatorL.(PooledByteBufAllocatorL.java:49)
[error] at
org.apache.arrow.memory.NettyAllocationManager.(NettyAllocationManager.java:51)
[error] at
org.apache.arrow.memory.DefaultAllocationManagerFactory.(DefaultAllocationManagerFactory.java:26)
[error] at java.base/java.lang.Class.forName0(Native Method)
[error] at java.base/java.lang.Class.forName(Class.java:421)
[error] at java.base/java.lang.Class.forName(Class.java:412)
[error] at
org.apache.arrow.memory.DefaultAllocationManagerOption.getFactory(DefaultAllocationManagerOption.java:108)
}}


Source code in Arrow:
{{
try {
  Field f = PooledByteBufAllocator.class.getDeclaredField("directArenas");
  f.setAccessible(true);
  this.directArenas = (PoolArena[]) f.get(this);
} catch (Exception e) {
  throw new RuntimeException(
      "Failure while initializing allocator. Unable to retrieve direct arenas field.", e);
}

this.chunkSize = directArenas[0].chunkSize;

if (memoryLogger.isTraceEnabled()) {
  statusThread = new MemoryStatusThread();
  statusThread.start();
}
}}
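
Not part of the original report, but a small Scala sketch that can help
confirm a classpath mismatch by printing which jars supply the classes
involved (it assumes you can run it from a REPL or test on the same
classpath as ArrowUtilsSuite):

// Print the jar locations of the Netty allocator class and the Arrow class
// that trips over the missing PoolArena field.
val nettyJar = classOf[io.netty.buffer.PooledByteBufAllocator]
  .getProtectionDomain.getCodeSource.getLocation
val arrowJar = Class.forName("org.apache.arrow.memory.NettyAllocationManager")
  .getProtectionDomain.getCodeSource.getLocation
println(s"netty-buffer jar: $nettyJar")
println(s"arrow-memory-netty jar: $arrowJar")
// If the Netty jar is newer than what this Arrow version was compiled
// against, the two versions need to be aligned (pin Netty or upgrade Arrow).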


Is this known issue in Spark test suite ArrowUtilSuite?


-- 
Senthil kumar


Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Sorry, I thought I gave an explanation.

The issue you are encountering with incorrect record numbers in the
"Shuffle Write Size/Records" column in the Spark DAG UI when data is read
from cache/persist is a known limitation. This discrepancy arises due to
the way Spark handles and reports shuffle data when caching is involved.

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge
but of course cannot be guaranteed . It is essential to note that, as with
any advice, quote "one test result is worth one-thousand expert opinions
(Werner Von Braun)".



On Sun, 26 May 2024 at 21:16, Prem Sahoo  wrote:

> Can anyone please assist me ?
>
> On Fri, May 24, 2024 at 12:29 AM Prem Sahoo  wrote:
>
>> Does anyone have a clue ?
>>
>> On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:
>>
>>> Hello Team,
>>> in spark DAG UI , we have Stages tab. Once you click on each stage you
>>> can view the tasks.
>>>
>>> In each task we have a column "ShuffleWrite Size/Records " that column
>>> prints wrong data when it gets the data from cache/persist . it
>>> typically will show the wrong record number though the data size is correct
>>> for e.g  3.2G/ 7400 which is wrong .
>>>
>>> please advise.
>>>
>>


Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Just to further clarify: the "Shuffle Write Size/Records" column in
the Spark UI can be misleading when working with cached/persisted data
because it reflects the shuffled data size and record count, not the
entire cached/persisted data. So it is fair to say that this is a
limitation of the UI's display, not necessarily a bug in the Spark
framework itself.

HTH

Mich Talebzadeh,

Technologist | Architect | Data Engineer  | Generative AI | FinCrime

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".



On Sun, 26 May 2024 at 16:45, Mich Talebzadeh  wrote:
>
> Yep, the Spark UI's Shuffle Write Size/Records" column can sometimes show 
> incorrect record counts when data is retrieved from cache or persisted data. 
> This happens because the record count reflects the number of records written 
> to disk for shuffling, and not the actual number of records in the cached or 
> persisted data itself. Add to it, because of lazy evaluation:, Spark may only 
> materialize a portion of the cached or persisted data when a task needs it. 
> The "Shuffle Write Size/Records" might only reflect the materialized portion, 
> not the total number of records in the cache/persistence. While the "Shuffle 
> Write Size/Records" might be inaccurate for cached/persisted data, the 
> "Shuffle Read Size/Records" column can be more reliable. This metric shows 
> the number of records read from shuffle by the following stage, which should 
> be closer to the actual number of records processed.
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my knowledge 
> but of course cannot be guaranteed . It is essential to note that, as with 
> any advice, quote "one test result is worth one-thousand expert opinions 
> (Werner Von Braun)".
>
>
>
> On Thu, 23 May 2024 at 17:45, Prem Sahoo  wrote:
>>
>> Hello Team,
>> in spark DAG UI , we have Stages tab. Once you click on each stage you can 
>> view the tasks.
>>
>> In each task we have a column "ShuffleWrite Size/Records " that column 
>> prints wrong data when it gets the data from cache/persist . it typically 
>> will show the wrong record number though the data size is correct for e.g  
>> 3.2G/ 7400 which is wrong .
>>
>> please advise.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show
incorrect record counts *when data is retrieved from cache or persisted
data*. This happens because the record count reflects the number of records
written to disk for shuffling, not the actual number of records in the
cached or persisted data itself. On top of that, because of lazy evaluation,
Spark may only materialize a portion of the cached or persisted data when a
task needs it. The "Shuffle Write Size/Records" might only reflect the
materialized portion, not the total number of records in the
cache/persistence. While the "Shuffle Write Size/Records" might be
inaccurate for cached/persisted data, the "Shuffle Read Size/Records"
column can be more reliable. This metric shows the number of records read
from shuffle by the following stage, which should be closer to the actual
number of records processed.
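
For completeness, a rough Scala sketch of pulling the same shuffle write
metrics from task-end events with a SparkListener (illustrative only; the
aggregation and how you compare it against your data are assumptions):

import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sums the shuffle write records/bytes reported by finished tasks; this is
// the same source the "Shuffle Write Size/Records" column is built from.
class ShuffleWriteTracker extends SparkListener {
  val records = new LongAdder
  val bytes = new LongAdder
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      records.add(metrics.shuffleWriteMetrics.recordsWritten)
      bytes.add(metrics.shuffleWriteMetrics.bytesWritten)
    }
  }
}

// val tracker = new ShuffleWriteTracker()
// spark.sparkContext.addSparkListener(tracker)
// ...run the cached/persisted query, then compare tracker.records.sum()
// with df.count() to see that the column reflects shuffled records only.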

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Thu, 23 May 2024 at 17:45, Prem Sahoo  wrote:

> Hello Team,
> in spark DAG UI , we have Stages tab. Once you click on each stage you can
> view the tasks.
>
> In each task we have a column "ShuffleWrite Size/Records " that column
> prints wrong data when it gets the data from cache/persist . it
> typically will show the wrong record number though the data size is correct
> for e.g  3.2G/ 7400 which is wrong .
>
> please advise.
>


Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me ?

On Fri, May 24, 2024 at 12:29 AM Prem Sahoo  wrote:

> Does anyone have a clue ?
>
> On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:
>
>> Hello Team,
>> in spark DAG UI , we have Stages tab. Once you click on each stage you
>> can view the tasks.
>>
>> In each task we have a column "ShuffleWrite Size/Records " that column
>> prints wrong data when it gets the data from cache/persist . it
>> typically will show the wrong record number though the data size is correct
>> for e.g  3.2G/ 7400 which is wrong .
>>
>> please advise.
>>
>


Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue ?

On Thu, May 23, 2024 at 11:40 AM Prem Sahoo  wrote:

> Hello Team,
> in spark DAG UI , we have Stages tab. Once you click on each stage you can
> view the tasks.
>
> In each task we have a column "ShuffleWrite Size/Records " that column
> prints wrong data when it gets the data from cache/persist . it
> typically will show the wrong record number though the data size is correct
> for e.g  3.2G/ 7400 which is wrong .
>
> please advise.
>


BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Hello Team,
In the Spark DAG UI we have the Stages tab, and once you click on a stage you
can view its tasks.

Each task has a "Shuffle Write Size/Records" column, and that column prints
wrong data when the data comes from cache/persist. It typically shows the
wrong record number even though the data size is correct, e.g. 3.2G / 7400,
which is wrong.

Please advise.


Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Nicholas Chammas
[dev list to bcc]

This is a question for the user list  
or for Stack Overflow 
. The dev list is for 
discussions related to the development of Spark itself.

Nick


> On May 21, 2024, at 6:58 AM, Prem Sahoo  wrote:
> 
> Hello Vibhor,
> Thanks for the suggestion .
> I am looking for some other alternatives where I can use the same dataframe 
> can be written to two destinations without re execution and cache or persist .
> 
> Can some one help me in scenario 2 ?
> How to make spark write to MinIO faster ?
> Sent from my iPhone
> 
>> On May 21, 2024, at 1:18 AM, Vibhor Gupta  wrote:
>> 
>> 
>> Hi Prem,
>>  
>> You can try to write to HDFS then read from HDFS and write to MinIO.
>>  
>> This will prevent duplicate transformation.
>>  
>> You can also try persisting the dataframe using the DISK_ONLY level.
>>  
>> Regards,
>> Vibhor
>> From: Prem Sahoo 
>> Date: Tuesday, 21 May 2024 at 8:16 AM
>> To: Spark dev list 
>> Subject: EXT: Dual Write to HDFS and MinIO in faster way
>> 
>> EXTERNAL: Report suspicious emails to Email Abuse.
>> 
>> Hello Team,
>> I am planning to write to two datasource at the same time . 
>>  
>> Scenario:-
>>  
>> Writing the same dataframe to HDFS and MinIO without re-executing the 
>> transformations and no cache(). Then how can we make it faster ?
>>  
>> Read the parquet file and do a few transformations and write to HDFS and 
>> MinIO.
>>  
>> here in both write spark needs execute the transformation again. Do we know 
>> how we can avoid re-execution of transformation  without cache()/persist ?
>>  
>> Scenario2 :-
>> I am writing 3.2G data to HDFS and MinIO which takes ~6mins.
>> Do we have any way to make writing this faster ?
>>  
>> I don't want to do repartition and write as repartition will have overhead 
>> of shuffling .
>>  
>> Please provide some inputs. 



Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
Hello Vibhor,
Thanks for the suggestion.
I am looking for another alternative where the same dataframe can be written to
two destinations without re-execution and without cache or persist.

Can someone help me with scenario 2?
How can we make Spark write to MinIO faster?
Sent from my iPhone

> On May 21, 2024, at 1:18 AM, Vibhor Gupta  wrote:
> 
> 
> Hi Prem,
>  
> You can try to write to HDFS then read from HDFS and write to MinIO.
>  
> This will prevent duplicate transformation.
>  
> You can also try persisting the dataframe using the DISK_ONLY level.
>  
> Regards,
> Vibhor
> From: Prem Sahoo 
> Date: Tuesday, 21 May 2024 at 8:16 AM
> To: Spark dev list 
> Subject: EXT: Dual Write to HDFS and MinIO in faster way
> 
> EXTERNAL: Report suspicious emails to Email Abuse.
> 
> Hello Team,
> I am planning to write to two datasource at the same time . 
>  
> Scenario:-
>  
> Writing the same dataframe to HDFS and MinIO without re-executing the 
> transformations and no cache(). Then how can we make it faster ?
>  
> Read the parquet file and do a few transformations and write to HDFS and 
> MinIO.
>  
> here in both write spark needs execute the transformation again. Do we know 
> how we can avoid re-execution of transformation  without cache()/persist ?
>  
> Scenario2 :-
> I am writing 3.2G data to HDFS and MinIO which takes ~6mins.
> Do we have any way to make writing this faster ?
>  
> I don't want to do repartition and write as repartition will have overhead of 
> shuffling .
>  
> Please provide some inputs. 
>  
>  


Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-20 Thread Vibhor Gupta
Hi Prem,

You can try to write to HDFS then read from HDFS and write to MinIO.

This will prevent duplicate transformation.

You can also try persisting the dataframe using the DISK_ONLY level.
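
For illustration, a rough sketch of both approaches (the SparkSession, the
paths, and the s3a:// settings used for MinIO are assumptions, not tested
code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("dual-write-sketch").getOrCreate()

// Option A: materialize the transformed dataframe to local disk once, so
// both writes reuse it instead of recomputing the transformations.
val df = spark.read.parquet("hdfs:///data/input")   // plus your transformations
  .persist(StorageLevel.DISK_ONLY)
df.write.mode("overwrite").parquet("hdfs:///data/output")
df.write.mode("overwrite").parquet("s3a://bucket/output")  // MinIO via the S3A connector
df.unpersist()

// Option B: write to HDFS first, then read that output back and copy it
// to MinIO, so the transformations also run only once.
spark.read.parquet("hdfs:///data/output")
  .write.mode("overwrite").parquet("s3a://bucket/output")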

Regards,
Vibhor
From: Prem Sahoo 
Date: Tuesday, 21 May 2024 at 8:16 AM
To: Spark dev list 
Subject: EXT: Dual Write to HDFS and MinIO in faster way
EXTERNAL: Report suspicious emails to Email Abuse.
Hello Team,
I am planning to write to two datasource at the same time .

Scenario:-

Writing the same dataframe to HDFS and MinIO without re-executing the 
transformations and no cache(). Then how can we make it faster ?

Read the parquet file and do a few transformations and write to HDFS and MinIO.

here in both write spark needs execute the transformation again. Do we know how 
we can avoid re-execution of transformation  without cache()/persist ?

Scenario2 :-
I am writing 3.2G data to HDFS and MinIO which takes ~6mins.
Do we have any way to make writing this faster ?

I don't want to do repartition and write as repartition will have overhead of 
shuffling .

Please provide some inputs.




Dual Write to HDFS and MinIO in faster way

2024-05-20 Thread Prem Sahoo
Hello Team,
I am planning to write to two datasources at the same time.

Scenario 1:

Writing the same dataframe to HDFS and MinIO without re-executing the
transformations and without cache(). How can we make it faster?

Read the parquet file, do a few transformations, and write to HDFS and
MinIO.

Here, for both writes, Spark needs to execute the transformations again. Do we
know how we can avoid re-executing the transformations without cache()/persist()?

Scenario 2:
I am writing 3.2G of data to HDFS and MinIO, which takes ~6 mins.
Do we have any way to make this write faster?

I don't want to do a repartition before the write, as repartition has the
overhead of shuffling.

Please provide some inputs.


[VOTE][RESULT] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
The vote passes with 13+1s (8 binding +1s) and 1+0.

(* = binding)
+1:
Chao Sun (*)
Liang-Chi Hsieh (*)
Huaxin Gao (*)
Bo Yang
Dongjoon Hyun (*)
Kent Yao
Wenchen Fan (*)
Ryan Blue
Anton Okolnychyi
Zhou Jiang
Gengliang Wang (*)
Xiao Li (*)
Hyukjin Kwon (*)

+0:
Mich Talebzadeh


-1: None

Thanks all.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
Hi all,

Thanks all for participating and your support! The vote has been passed.
I'll send out the result in a separate thread.

On Wed, May 15, 2024 at 4:44 PM Hyukjin Kwon  wrote:
>
> +1
>
> On Tue, 14 May 2024 at 16:39, Wenchen Fan  wrote:
>>
>> +1
>>
>> On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:
>>>
>>> +1 (non-binding)
>>>
>>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:

 Hi all,

 I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.

 Please also refer to:

- Discussion thread:
 https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
- SPIP doc: 
 https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/


 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …


 Thank you!

 Liang-Chi Hsieh

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

>>>
>>>
>>> --
>>> Zhou JIANG
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread Hyukjin Kwon
+1

On Tue, 14 May 2024 at 16:39, Wenchen Fan  wrote:

> +1
>
> On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:
>
>> +1 (non-binding)
>>
>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>>
>>> Thank you!
>>>
>>> Liang-Chi Hsieh
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> *Zhou JIANG*
>>
>>


Re: [DISCUSS] clarify the definition of behavior changes

2024-05-15 Thread Wenchen Fan
Thanks all for the feedback here! Let me put up a new version, which
clarifies the definition of "users":

Behavior changes mean user-visible functional changes in a new release via
public APIs. The "user" here is not only the user who writes queries and/or
develops Spark plugins, but also the user who deploys and/or manages Spark
clusters. New features, and even bug fixes that eliminate NPE or correct
query results, are behavior changes. Things like performance improvement,
code refactoring, and changes to unreleased APIs/features are not. All
behavior changes should be called out in the PR description. We need to
write an item in the migration guide (and probably legacy config) for those
that may break users when upgrading:

   - Bug fixes that change query results. Users may need to do backfill to
   correct the existing data and must know about these correctness fixes.
   - Bug fixes that change query schema. Users may need to update the
   schema of the tables in their data pipelines and must know about these
   changes.
   - Remove configs
   - Rename error class/condition
   - Any non-additive change to the public Python/SQL/Scala/Java/R APIs
   (including developer APIs): rename function, remove parameters, add
   parameters, rename parameters, change parameter default values, etc. These
   changes should be avoided in general, or done in a binary-compatible
   way like deprecating and adding a new function instead of renaming.
   - Any non-additive change to the way Spark should be deployed and
   managed.

The list above is not meant to be comprehensive. Anyone can raise a concern
when reviewing PRs and ask the PR author to add a migration guide entry if
they believe the change is risky and may break users.
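
To make the "deprecating and adding a new function" route above concrete, a
small illustrative sketch (hypothetical names, not a real Spark API):

// Keep the old entry point working while adding the new one, so callers get
// a deprecation warning instead of a source/binary break.
object TableStats {
  // Old signature kept and deprecated; it delegates with the value that
  // preserves the previous behavior, so existing callers keep compiling
  // and keep getting the same results.
  @deprecated("Use rowCount(table, exact) instead", "x.y.0")
  def rowCount(table: String): Long = rowCount(table, exact = false)

  // New signature; the changed behavior is opt-in via the new parameter.
  def rowCount(table: String, exact: Boolean): Long =
    if (exact) scanAndCount(table) else estimateFromStats(table)

  private def scanAndCount(table: String): Long = ???       // placeholder
  private def estimateFromStats(table: String): Long = ???  // placeholder
}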

On Thu, May 2, 2024 at 10:25 PM Will Raschkowski 
wrote:

> To add some user perspective, I wanted to share our experience from
> automatically upgrading tens of thousands of jobs from Spark 2 to 3 at
> Palantir:
>
>
>
> We didn't mind "loud" changes that threw exceptions. We have some infra to
> try run jobs with Spark 3 and fallback to Spark 2 if there's an exception.
> E.g., the datetime parsing and rebasing migration in Spark 3 was great:
> Spark threw a helpful exception but never silently changed results.
> Similarly, for things listed in the migration guide as silent changes
> (e.g., add_months's handling of last-day-of-month), we wrote custom check
> rules to throw unless users acknowledged the change through config.
>
>
>
> Silent changes *not* in the migration guide were really bad for us:
> Trusting the migration guide to be exhaustive, we automatically upgraded
> jobs which then “succeeded” but wrote incorrect results. For example, some
> expression increased timestamp precision in Spark 3; a query implicitly
> relied on the reduced precision, and then produced bad results on upgrade.
> It’s a silly query but a note in the migration guide would have helped.
>
>
>
> To summarize: the migration guide was invaluable, we appreciated every
> entry, and we'd appreciate Wenchen's stricter definition of "behavior
> changes" (especially for silent ones).
>
>
>
> *From: *Nimrod Ofek 
> *Date: *Thursday, 2 May 2024 at 11:57
> *To: *Wenchen Fan 
> *Cc: *Erik Krogen , Spark dev list <
> dev@spark.apache.org>
> *Subject: *Re: [DISCUSS] clarify the definition of behavior changes
>
>
>
>
> Hi Erik and Wenchen,
>
>
>
> I think that usually a good practice with public api and with internal api
> that has big impact and a lot of usage is to ease in changes by providing
> defaults to new parameters that will keep former behaviour in a method with
> the previous signature with deprecation notice, and deleting that
> deprecated function in the next release- so the actual break will be in the
> next release after all libraries had the chance to align with the api and
> upgrades can be done while already using the new version.
>
>
>
> Another thing is that we should probably examine what private apis are
> used externally to provide better experience and provide proper public apis
> to meet those needs (for instance, applicative metrics and some way of
> creating custom behaviour columns).
>
>
>
> Thanks,
>
> Nimrod
>
>
>
> בתאריך יום ה׳, 2 במאי 2024, 03:51, מאת Wenchen Fan ‏:
>
> Hi Erik,
>
>
>
> Thanks for sharing your thoughts! Note: developer APIs are also public
> APIs (such as Data Source V2 API, Spark Listener API, etc.), so breaking
> changes should be avoided as much as we can and new APIs should be
> mentioned in the release notes. Breaking binary compatibility is also a
> "functional change" and should be treated as a behavior change.
>
>
>
> BTW, AFAIK some downstream libraries use private APIs such as Catalyst
> Expression and LogicalPlan. It's too much work to track all the changes to
> private APIs 

Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-15 Thread Wenchen Fan
RC1 failed because of this issue. I'll cut RC2 after we downgrade Jetty to
9.x.

On Sat, May 11, 2024 at 3:37 PM Cheng Pan  wrote:

> -1 (non-binding)
>
> A small question, the tag is orphan but I suppose it should belong to the
> master branch.
>
> Seems YARN integration is broken due to javax =>  jakarta namespace
> migration, I filled SPARK-48238, and left some comments on
> https://github.com/apache/spark/pull/45154
>
> Caused by: java.lang.IllegalStateException: class
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a
> jakarta.servlet.Filter
> at
> org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> at
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625)
> ~[?:?]
> at
> java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
> ~[?:?]
> at
> java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762)
> ~[?:?]
> at
> org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
> ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
> ... 38 more
>
> Thanks,
> Cheng Pan
>
>
> > On May 11, 2024, at 13:55, Wenchen Fan  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 4.0.0-preview1.
> >
> > The vote is open until May 16 PST and passes if a majority +1 PMC votes
> are cast, with
> > a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v4.0.0-preview1-rc1 (commit
> 7dcf77c739c3854260464d732dbfb9a0f54706e7):
> > https://github.com/apache/spark/tree/v4.0.0-preview1-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1454/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/
> >
> > The list of bug fixes going into 4.0.0 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12353359
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
>
>


Community over Code EU 2024: The countdown has started!

2024-05-14 Thread Ryan Skraba
[Note: You're receiving this email because you are subscribed to one
or more project dev@ mailing lists at the Apache Software Foundation.]

We are very close to Community Over Code EU -- check out the amazing
program and the special discounts that we have for you.

Special discounts

You still have the opportunity to secure your ticket for Community
Over Code EU. Explore the various options available, including the
regular pass, the committer and groups pass, and now introducing the
one-day pass tailored for locals in Bratislava.

We also have a special discount for you to attend both Community Over
Code and Berlin Buzzwords from June 9th to 11th. Visit our website to
find out more about this opportunity and contact te...@sg.com.mx to
get the discount code.

Take advantage of the discounts and register now!
https://eu.communityovercode.org/tickets/

Check out the full program!

This year Community Over Code Europe will bring to you three days of
keynotes and sessions that cover topics of interest for ASF projects
and the greater open source ecosystem including data engineering,
performance engineering, search, Internet of Things (IoT) as well as
sessions with tips and lessons learned on building a healthy open
source community.

Check out the program: https://eu.communityovercode.org/program/

Keynote speaker highlights for Community Over Code Europe include:

* Dirk-Willem Van Gulik, VP of Public Policy at the Apache Software
Foundation, will discuss the Cyber Resiliency Act and its impact on
open source (All your code belongs to Policy Makers, Politicians, and
the Law).

* Dr. Sherae Daniel will share the results of her study on the impact
of self-promotion for open source software developers (To Toot or not
to Toot, that is the question).

* Asim Hussain, Executive Director of the Green Software Foundation
will present a framework they have developed for quantifying the
environmental impact of software (Doing for Sustainability what Open
Source did for Software).

* Ruth Ikegah will  discuss the growth of the open source movement in
Africa (From Local Roots to Global Impact: Building an Inclusive Open
Source Community in Africa)

* A discussion panel on EU policies and regulations affecting
specialists working in Open Source Program Offices

Additional activities

* Poster sessions: We invite you to stop by our poster area and see if
the ideas presented ignite a conversation within your team.

* BOF time: Don't miss the opportunity to discuss in person with your
open source colleagues on your shared interests.

* Participants reception: At the end of the first day, we will have a
reception at the event venue. All participants are welcome to attend!

* Spontaneous talks: There is a dedicated room and social space for
having spontaneous talks and sessions. Get ready to share with your
peers.

* Lightning talks: At the end of the event we will have the awaited
Lightning talks, where every participant is welcome to share and
enlighten us.

Please remember:  If you haven't applied for the visa, we will provide
the necessary letter for the process. In the unfortunate case of a
visa rejection, your ticket will be reimbursed.

See you in Bratislava,

Community Over Code EU Team

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Xiao Li
+1

Gengliang Wang  于2024年5月13日周一 16:24写道:

> +1
>
> On Mon, May 13, 2024 at 12:30 PM Zhou Jiang 
> wrote:
>
>> +1 (non-binding)
>>
>> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>>
>>> Please also refer to:
>>>
>>>- Discussion thread:
>>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>>
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>>
>>> Thank you!
>>>
>>> Liang-Chi Hsieh
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> *Zhou JIANG*
>>
>>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Wenchen Fan
+1

On Tue, May 14, 2024 at 8:19 AM Zhou Jiang  wrote:

> +1 (non-binding)
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> *Zhou JIANG*
>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Gengliang Wang
+1

On Mon, May 13, 2024 at 12:30 PM Zhou Jiang  wrote:

> +1 (non-binding)
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> *Zhou JIANG*
>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Zhou Jiang
+1 (non-binding)

On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:

> Hi all,
>
> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>
> Please also refer to:
>
>- Discussion thread:
> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
> Thank you!
>
> Liang-Chi Hsieh
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
*Zhou JIANG*


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Anton Okolnychyi
+1

On 2024/05/13 15:33:33 Ryan Blue wrote:
> +1
> 
> On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh 
> wrote:
> 
> > +0
> >
> > For reasons I outlined in the discussion thread
> >
> > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >
> > Mich Talebzadeh,
> > Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> > London
> > United Kingdom
> >
> >
> >view my Linkedin profile
> > 
> >
> >
> >  https://en.everybodywiki.com/Mich_Talebzadeh
> >
> >
> >
> > *Disclaimer:* The information provided is correct to the best of my
> > knowledge but of course cannot be guaranteed . It is essential to note
> > that, as with any advice, quote "one test result is worth one-thousand
> > expert opinions (Werner  
> > Von
> > Braun )".
> >
> >
> > On Mon, 13 May 2024 at 08:24, Wenchen Fan  wrote:
> >
> >> +1
> >>
> >> On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:
> >>
> >>> +1
> >>>
> >>> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
> >>> >
> >>> > +1
> >>> >
> >>> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
> >>> wrote:
> >>> >>
> >>> >> +1
> >>> >>
> >>> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
> >>> >>>
> >>> >>> +1
> >>> >>>
> >>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
> >>> >>> >
> >>> >>> > +1
> >>> >>> >
> >>> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
> >>> wrote:
> >>> >>> >>
> >>> >>> >> Hi all,
> >>> >>> >>
> >>> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
> >>> Catalogs.
> >>> >>> >>
> >>> >>> >> Please also refer to:
> >>> >>> >>
> >>> >>> >>- Discussion thread:
> >>> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >>> >>> >>- JIRA ticket:
> >>> https://issues.apache.org/jira/browse/SPARK-44167
> >>> >>> >>- SPIP doc:
> >>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> Please vote on the SPIP for the next 72 hours:
> >>> >>> >>
> >>> >>> >> [ ] +1: Accept the proposal as an official SPIP
> >>> >>> >> [ ] +0
> >>> >>> >> [ ] -1: I don’t think this is a good idea because …
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> Thank you!
> >>> >>> >>
> >>> >>> >> Liang-Chi Hsieh
> >>> >>> >>
> >>> >>> >>
> >>> -
> >>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>> >>
> >>> >>>
> >>> >>> -
> >>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>>
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
> >>>
> 
> -- 
> Ryan Blue
> Tabular
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Ryan Blue
+1

On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh 
wrote:

> +0
>
> For reasons I outlined in the discussion thread
>
> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 13 May 2024 at 08:24, Wenchen Fan  wrote:
>
>> +1
>>
>> On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:
>>
>>> +1
>>>
>>> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
>>> >
>>> > +1
>>> >
>>> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>>> >>>
>>> >>> +1
>>> >>>
>>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>>> >>> >
>>> >>> > +1
>>> >>> >
>>> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
>>> wrote:
>>> >>> >>
>>> >>> >> Hi all,
>>> >>> >>
>>> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
>>> Catalogs.
>>> >>> >>
>>> >>> >> Please also refer to:
>>> >>> >>
>>> >>> >>- Discussion thread:
>>> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>> >>> >>- JIRA ticket:
>>> https://issues.apache.org/jira/browse/SPARK-44167
>>> >>> >>- SPIP doc:
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>> >>> >>
>>> >>> >>
>>> >>> >> Please vote on the SPIP for the next 72 hours:
>>> >>> >>
>>> >>> >> [ ] +1: Accept the proposal as an official SPIP
>>> >>> >> [ ] +0
>>> >>> >> [ ] -1: I don’t think this is a good idea because …
>>> >>> >>
>>> >>> >>
>>> >>> >> Thank you!
>>> >>> >>
>>> >>> >> Liang-Chi Hsieh
>>> >>> >>
>>> >>> >>
>>> -
>>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>> >>
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
Ryan Blue
Tabular


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-13 Thread Wenchen Fan
Hi Nicholas,

Thanks for your help! I'm definitely interested in participating in this
unification work. Let me know how I can help.

Wenchen

On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas 
wrote:

> Re: unification
>
> We also have a long-standing problem with how we manage Python
> dependencies, something I’ve tried (unsuccessfully
> ) to fix in the past.
>
> Consider, for example, how many separate places this numpy dependency is
> installed:
>
> 1.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
> 2.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
> 3.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
> 4.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
> 5.
> https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
> 6.
> https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
> 7.
> https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
> 8.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
> 9.
> https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
> 10.
> https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
> 11.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
> 12.
> https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92
>
> None of those installations reference a unified version requirement, so
> naturally they are inconsistent across all these different lines. Some say
> `>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In
> several cases there is no version requirement specified at all.
>
> I’m interested in trying again to fix this problem, but it needs to be in
> collaboration with a committer since I cannot fully test the release
> scripts. (This testing gap is what doomed my last attempt at fixing this
> problem.)
>
> Nick
>
>
> On May 13, 2024, at 12:18 AM, Wenchen Fan  wrote:
>
> After finishing the 4.0.0-preview1 RC1, I have more experience with this
> topic now.
>
> In fact, the main job of the release process: building packages and
> documents, is tested in Github Action jobs. However, the way we test them
> is different from what we do in the release scripts.
>
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile:
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile:
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release
> process needs to set up more things so it may not be viable to use a single
> Dockerfile for both.
>
> 2. the execution code is different. Use building documents as an example:
> The release scripts:
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job:
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify
> them.
>
> It's better if we can run the release scripts as Github Action jobs, but I
> think it's more important to do the unification now.
>
> Thanks,
> Wenchen
>
>
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala  wrote:
>
>> Hello,
>>
>> I can answer some of your common questions with other Apache projects.
>>
>> > Who currently has permissions for Github actions? Is there a specific
>> owner for that today or a different volunteer each time?
>>
>> The Apache organization owns Github Actions, and committers (contributors
>> with write permissions) can retrigger/cancel a Github Actions workflow, but
>> Github Actions runners are managed by the Apache infra team.
>>
>> > What are the current limits of GitHub Actions, who set them - and what
>> is the process to change those (if possible at all, but I presume not all
>> Apache projects have the same limits)?
>>
>> For limits, I don't think there is any significant limit, especially
>> since the Apache organization has 900 donated runners used by its projects,
>> and there is an initiative from the Infra team to add self-hosted runners
>> running on Kubernetes (document
>> 
>> ).
>>
>> > Where should the artifacts be 

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Mich Talebzadeh
+0

For reasons I outlined in the discussion thread

https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 13 May 2024 at 08:24, Wenchen Fan  wrote:

> +1
>
> On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:
>
>> +1
>>
>> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
>> >
>> > +1
>> >
>> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
>> wrote:
>> >>
>> >> +1
>> >>
>> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>> >>>
>> >>> +1
>> >>>
>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>> >>> >
>> >>> > +1
>> >>> >
>> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
>> wrote:
>> >>> >>
>> >>> >> Hi all,
>> >>> >>
>> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
>> Catalogs.
>> >>> >>
>> >>> >> Please also refer to:
>> >>> >>
>> >>> >>- Discussion thread:
>> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>> >>> >>- JIRA ticket:
>> https://issues.apache.org/jira/browse/SPARK-44167
>> >>> >>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> >>> >>
>> >>> >>
>> >>> >> Please vote on the SPIP for the next 72 hours:
>> >>> >>
>> >>> >> [ ] +1: Accept the proposal as an official SPIP
>> >>> >> [ ] +0
>> >>> >> [ ] -1: I don’t think this is a good idea because …
>> >>> >>
>> >>> >>
>> >>> >> Thank you!
>> >>> >>
>> >>> >> Liang-Chi Hsieh
>> >>> >>
>> >>> >>
>> -
>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>> >>
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Nicholas Chammas
Re: unification

We also have a long-standing problem with how we manage Python dependencies, 
something I’ve tried (unsuccessfully 
) to fix in the past.

Consider, for example, how many separate places this numpy dependency is 
installed:

1. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L277
2. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L733
3. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L853
4. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/.github/workflows/build_and_test.yml#L871
5. 
https://github.com/apache/spark/blob/8094535973f19e9f0543535a97254e8ebffc1b23/.github/workflows/build_python_connect35.yml#L70
6. 
https://github.com/apache/spark/blob/553e1b85c42a60c082d33f7b9df53b0495893286/.github/workflows/maven_test.yml#L181
7. 
https://github.com/apache/spark/blob/6e5d1db9058de62a45f35d3f41e028a72f688b70/dev/requirements.txt#L5
8. 
https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L90
9. 
https://github.com/apache/spark/blob/678aeb7ef7086bd962df7ac6d1c5f39151a0515b/dev/run-pip-tests#L99
10. 
https://github.com/apache/spark/blob/9a2818820f11f9bdcc042f4ab80850918911c68c/dev/create-release/spark-rm/Dockerfile#L40
11. 
https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L89
12. 
https://github.com/apache/spark/blob/9a42610d5ad8ae0ded92fb68c7617861cfe975e1/dev/infra/Dockerfile#L92

None of those installations reference a unified version requirement, so 
naturally they are inconsistent across all these different lines. Some say 
`>=1.21`, others say `>=1.20.0`, and still others say `==1.20.3`. In several 
cases there is no version requirement specified at all.
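
As a rough illustration of how this drift could be audited, here is a small
script (the file list and regex are only examples, and the paths assume it is
run from the repository root) that prints every numpy pin it finds:

import re
from pathlib import Path

# Illustrative subset of the files listed above; extend as needed.
FILES = [
    ".github/workflows/build_and_test.yml",
    ".github/workflows/maven_test.yml",
    "dev/requirements.txt",
    "dev/run-pip-tests",
    "dev/create-release/spark-rm/Dockerfile",
    "dev/infra/Dockerfile",
]

# Matches "numpy" optionally followed by a specifier such as >=1.21 or ==1.20.3.
NUMPY_PIN = re.compile(r"numpy\s*([=<>!~]=?\s*[\d.]+)?")

for name in FILES:
    path = Path(name)
    if not path.exists():
        continue
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        match = NUMPY_PIN.search(line)
        if match:
            spec = match.group(1) or "<no version specified>"
            print(f"{name}:{lineno}: numpy {spec}")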

I’m interested in trying again to fix this problem, but it needs to be in 
collaboration with a committer since I cannot fully test the release scripts. 
(This testing gap is what doomed my last attempt at fixing this problem.)

Nick


> On May 13, 2024, at 12:18 AM, Wenchen Fan  wrote:
> 
> After finishing the 4.0.0-preview1 RC1, I have more experience with this 
> topic now.
> 
> In fact, the main job of the release process: building packages and 
> documents, is tested in Github Action jobs. However, the way we test them is 
> different from what we do in the release scripts.
> 
> 1. the execution environment is different:
> The release scripts define the execution environment with this Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
> However, Github Action jobs use a different Dockerfile: 
> https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
> We should figure out a way to unify it. The docker image for the release 
> process needs to set up more things so it may not be viable to use a single 
> Dockerfile for both.
> 
> 2. the execution code is different. Use building documents as an example:
> The release scripts: 
> https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
> The Github Action job: 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
> I don't know which one is more correct, but we should definitely unify them.
> 
> It's better if we can run the release scripts as Github Action jobs, but I 
> think it's more important to do the unification now.
> 
> Thanks,
> Wenchen
> 
> 
> On Fri, May 10, 2024 at 12:34 AM Hussein Awala  > wrote:
>> Hello,
>> 
>> I can answer some of your common questions with other Apache projects.
>> 
>> > Who currently has permissions for Github actions? Is there a specific 
>> > owner for that today or a different volunteer each time?
>> 
>> The Apache organization owns Github Actions, and committers (contributors 
>> with write permissions) can retrigger/cancel a Github Actions workflow, but 
>> Github Actions runners are managed by the Apache infra team.
>> 
>> > What are the current limits of GitHub Actions, who set them - and what is 
>> > the process to change those (if possible at all, but I presume not all 
>> > Apache projects have the same limits)?
>> 
>> For limits, I don't think there is any significant limit, especially since 
>> the Apache organization has 900 donated runners used by its projects, and 
>> there is an initiative from the Infra team to add self-hosted runners 
>> running on Kubernetes (document 
>> ).
>> 
>> > Where should the artifacts be stored?
>> 
>> Usually, we use Maven for jars, DockerHub for Docker images, and Github 
>> cache for workflow cache. But we can use Github artifacts to store any kind 
>> of package (even Docker images in the 

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Wenchen Fan
+1

On Mon, May 13, 2024 at 10:30 AM Kent Yao  wrote:

> +1
>
> Dongjoon Hyun  于2024年5月13日周一 08:39写道:
> >
> > +1
> >
> > On Sun, May 12, 2024 at 3:50 PM huaxin gao 
> wrote:
> >>
> >> +1
> >>
> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
> >>>
> >>> +1
> >>>
> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
> >>> >
> >>> > +1
> >>> >
> >>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh 
> wrote:
> >>> >>
> >>> >> Hi all,
> >>> >>
> >>> >> I’d like to start a vote for SPIP: Stored Procedures API for
> Catalogs.
> >>> >>
> >>> >> Please also refer to:
> >>> >>
> >>> >>- Discussion thread:
> >>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
> >>> >>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >>> >>
> >>> >>
> >>> >> Please vote on the SPIP for the next 72 hours:
> >>> >>
> >>> >> [ ] +1: Accept the proposal as an official SPIP
> >>> >> [ ] +0
> >>> >> [ ] -1: I don’t think this is a good idea because …
> >>> >>
> >>> >>
> >>> >> Thank you!
> >>> >>
> >>> >> Liang-Chi Hsieh
> >>> >>
> >>> >>
> -
> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >>
> >>>
> >>> -
> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Wenchen Fan
After finishing the 4.0.0-preview1 RC1, I have more experience with this
topic now.

In fact, the main jobs of the release process, building packages and
documents, are already tested in GitHub Actions jobs. However, the way we test
them there is different from what we do in the release scripts.

1. the execution environment is different:
The release scripts define the execution environment with this Dockerfile:
https://github.com/apache/spark/blob/master/dev/create-release/spark-rm/Dockerfile
However, Github Action jobs use a different Dockerfile:
https://github.com/apache/spark/blob/master/dev/infra/Dockerfile
We should figure out a way to unify it. The docker image for the release
process needs to set up more things so it may not be viable to use a single
Dockerfile for both.

2. the execution code is different. Use building documents as an example:
The release scripts:
https://github.com/apache/spark/blob/master/dev/create-release/release-build.sh#L404-L411
The Github Action job:
https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L883-L895
I don't know which one is more correct, but we should definitely unify them.

It's better if we can run the release scripts as Github Action jobs, but I
think it's more important to do the unification now.
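
As a starting point for the environment unification, a rough audit like the
sketch below could highlight where the pip pins in the two Dockerfiles have
already drifted apart (paths and parsing are illustrative, and single-character
specifiers like ">" are ignored):

import re
from pathlib import Path

def pip_pins(dockerfile):
    """Return {package: specifier} for every pin found on pip install lines."""
    pins = {}
    for line in Path(dockerfile).read_text().splitlines():
        if "pip" not in line or "install" not in line:
            continue
        for pkg, spec in re.findall(r"([A-Za-z0-9_.-]+)\s*([=<>!~]=\s*[\d.]+)", line):
            pins[pkg.lower()] = spec.replace(" ", "")
    return pins

ci_pins = pip_pins("dev/infra/Dockerfile")
release_pins = pip_pins("dev/create-release/spark-rm/Dockerfile")

for pkg in sorted(set(ci_pins) | set(release_pins)):
    if ci_pins.get(pkg) != release_pins.get(pkg):
        print(f"{pkg}: CI={ci_pins.get(pkg)}  release={release_pins.get(pkg)}")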

Thanks,
Wenchen


On Fri, May 10, 2024 at 12:34 AM Hussein Awala  wrote:

> Hello,
>
> I can answer some of your common questions with other Apache projects.
>
> > Who currently has permissions for Github actions? Is there a specific
> owner for that today or a different volunteer each time?
>
> The Apache organization owns Github Actions, and committers (contributors
> with write permissions) can retrigger/cancel a Github Actions workflow, but
> Github Actions runners are managed by the Apache infra team.
>
> > What are the current limits of GitHub Actions, who set them - and what
> is the process to change those (if possible at all, but I presume not all
> Apache projects have the same limits)?
>
> For limits, I don't think there is any significant limit, especially since
> the Apache organization has 900 donated runners used by its projects, and
> there is an initiative from the Infra team to add self-hosted runners
> running on Kubernetes (document
> 
> ).
>
> > Where should the artifacts be stored?
>
> Usually, we use Maven for jars, DockerHub for Docker images, and Github
> cache for workflow cache. But we can use Github artifacts to store any kind
> of package (even Docker images in the ghcr), which is fully accepted by
> Apache policies. Also if the project has a cloud account (AWS, GCP, Azure,
> ...), a bucket can be used to store some of the packages.
>
>
>  > Who should be permitted to sign a version - and what is the process for
> that?
>
> The Apache documentation is clear about this, by default only PMC members
> can be release managers, but we can contact the infra team to add one of
> the committers as a release manager (document
> ). The
> process of creating a new version is described in this document
> .
>
>
> On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek  wrote:
>
>> Following the conversation started with Spark 4.0.0 release, this is a
>> thread to discuss improvements to our release processes.
>>
>> I'll Start by raising some questions that probably should have answers to
>> start the discussion:
>>
>>
>>1. What is currently running in GitHub Actions?
>>2. Who currently has permissions for Github actions? Is there a
>>specific owner for that today or a different volunteer each time?
>>3. What are the current limits of GitHub Actions, who set them - and
>>what is the process to change those (if possible at all, but I presume not
>>all Apache projects have the same limits)?
>>4. What versions should we support as an output for the build?
>>5. Where should the artifacts be stored?
>>6. What should be the output? only tar or also a docker image
>>published somewhere?
>>7. Do we want to have a release on fixed dates or a manual release
>>upon request?
>>8. Who should be permitted to sign a version - and what is the
>>process for that?
>>
>>
>> Thanks!
>> Nimrod
>>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Kent Yao
+1

Dongjoon Hyun  于2024年5月13日周一 08:39写道:
>
> +1
>
> On Sun, May 12, 2024 at 3:50 PM huaxin gao  wrote:
>>
>> +1
>>
>> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>>>
>>> +1
>>>
>>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>>> >
>>> > +1
>>> >
>>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>> >>
>>> >> Please also refer to:
>>> >>
>>> >>- Discussion thread:
>>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>> >>- SPIP doc: 
>>> >> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>> >>
>>> >>
>>> >> Please vote on the SPIP for the next 72 hours:
>>> >>
>>> >> [ ] +1: Accept the proposal as an official SPIP
>>> >> [ ] +0
>>> >> [ ] -1: I don’t think this is a good idea because …
>>> >>
>>> >>
>>> >> Thank you!
>>> >>
>>> >> Liang-Chi Hsieh
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Dongjoon Hyun
+1

On Sun, May 12, 2024 at 3:50 PM huaxin gao  wrote:

> +1
>
> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>> >
>> > +1
>> >
>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>> >>
>> >> Please also refer to:
>> >>
>> >>- Discussion thread:
>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>> >>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> >>
>> >>
>> >> Please vote on the SPIP for the next 72 hours:
>> >>
>> >> [ ] +1: Accept the proposal as an official SPIP
>> >> [ ] +0
>> >> [ ] -1: I don’t think this is a good idea because …
>> >>
>> >>
>> >> Thank you!
>> >>
>> >> Liang-Chi Hsieh
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread bo yang
+1

On Sat, May 11, 2024 at 4:43 PM huaxin gao  wrote:

> +1
>
> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>> >
>> > +1
>> >
>> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>> >>
>> >> Please also refer to:
>> >>
>> >>- Discussion thread:
>> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>> >>- SPIP doc:
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> >>
>> >>
>> >> Please vote on the SPIP for the next 72 hours:
>> >>
>> >> [ ] +1: Accept the proposal as an official SPIP
>> >> [ ] +0
>> >> [ ] -1: I don’t think this is a good idea because …
>> >>
>> >>
>> >> Thank you!
>> >>
>> >> Liang-Chi Hsieh
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread huaxin gao
+1

On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh  wrote:

> +1
>
> On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
> >
> > +1
> >
> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
> >>
> >> Hi all,
> >>
> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
> >>
> >> Please also refer to:
> >>
> >>- Discussion thread:
> >> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
> >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
> >>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >>
> >>
> >> Please vote on the SPIP for the next 72 hours:
> >>
> >> [ ] +1: Accept the proposal as an official SPIP
> >> [ ] +0
> >> [ ] -1: I don’t think this is a good idea because …
> >>
> >>
> >> Thank you!
> >>
> >> Liang-Chi Hsieh
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
+1

On Sat, May 11, 2024 at 3:11 PM Chao Sun  wrote:
>
> +1
>
> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:
>>
>> Hi all,
>>
>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>>
>> Please also refer to:
>>
>>- Discussion thread:
>> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>>- SPIP doc: 
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>>
>> Thank you!
>>
>> Liang-Chi Hsieh
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Chao Sun
+1

On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh  wrote:

> Hi all,
>
> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.
>
> Please also refer to:
>
>- Discussion thread:
> https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
>- SPIP doc:
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
> Thank you!
>
> Liang-Chi Hsieh
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
Hi all,

I’d like to start a vote for SPIP: Stored Procedures API for Catalogs.

Please also refer to:

   - Discussion thread:
https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo
   - JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167
   - SPIP doc: 
https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/


Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …


Thank you!

Liang-Chi Hsieh

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Mich Talebzadeh
Thanks

In the context of stored procedures API for Catalogs, this approach
deviates from the traditional definition of stored procedures in RDBMS for
two key reasons:

   - Compilation vs. Interpretation: Traditional stored procedures are
   typically pre-compiled (into execution plans or another executable form)
   for faster execution. This approach, however, focuses on loading and
   interpreting the code on demand, similar to how scripts are run in
   languages like Python.
   - Schema Changes and Invalidation: In RDBMS, changes to the underlying
   tables can invalidate compiled procedures as they might reference
   non-existent columns or have incompatible data types. This approach aims to
   avoid invalidation by potentially adapting to minor schema changes.

So, while it leverages the concept of pre-defined procedures stored within
the database and accessible through the Catalog API, it is evident that
this approach functions more like dynamic scripts than traditional compiled
stored procedures.

HTH


Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Sat, 11 May 2024 at 19:25, Anton Okolnychyi 
wrote:

> Mich, I don't think the invalidation will be necessary in our case as
> there is no plan to preprocess or compile the procedures into executable
> objects. They will be loaded and executed on demand via the Catalog API.
>
> пт, 10 трав. 2024 р. о 10:37 Mich Talebzadeh 
> пише:
>
>> Hi,
>>
>> If the underlying table changes (DDL), if I recall from RDBMSs like
>> Oracle, the stored procedure will be invalidated as it is a compiled
>> object. How is this going to be handled? Does it follow the same mechanism?
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Sat, 20 Apr 2024 at 02:34, Anton Okolnychyi 
>> wrote:
>>
>>> Hi folks,
>>>
>>> I'd like to start a discussion on SPARK-44167 that aims to enable
>>> catalogs to expose custom routines as stored procedures. I believe this
>>> functionality will enhance Spark’s ability to interact with external
>>> connectors and allow users to perform more operations in plain SQL.
>>>
>>> SPIP [1] contains proposed API changes and parser extensions. Any
>>> feedback is more than welcome!
>>>
>>> Unlike the initial proposal for stored procedures with Python [2], this
>>> one focuses on exposing pre-defined stored procedures via the catalog API.
>>> This approach is inspired by a similar functionality in Trino and avoids
>>> the challenges of supporting user-defined routines discussed earlier [3].
>>>
>>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>>
>>> - Anton
>>>
>>> [1] -
>>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>>> [2] -
>>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
>>> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>>>
>>>
>>>
>>>


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Anton Okolnychyi
Mich, I don't think the invalidation will be necessary in our case as there
is no plan to preprocess or compile the procedures into executable objects.
They will be loaded and executed on demand via the Catalog API.
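
To make the "loaded and executed on demand" point concrete, here is a purely
illustrative sketch (all names are hypothetical; this is not the API proposed
in the SPIP): resolution happens at call time, so there is no compiled object
that a schema change could invalidate.

from typing import Callable, Dict

class ProcedureCatalog:
    """Toy registry standing in for a catalog that exposes procedures."""

    def __init__(self) -> None:
        self._procedures: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        self._procedures[name] = fn

    def call(self, name: str, **params: object) -> object:
        # Look the procedure up on every invocation; it always runs against
        # the current catalog state instead of a pre-compiled snapshot.
        return self._procedures[name](**params)

catalog = ProcedureCatalog()
catalog.register(
    "rollback_to_snapshot",
    lambda table, snapshot_id: f"rolled back {table} to snapshot {snapshot_id}",
)
print(catalog.call("rollback_to_snapshot", table="db.events", snapshot_id=42))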

пт, 10 трав. 2024 р. о 10:37 Mich Talebzadeh 
пише:

> Hi,
>
> If the underlying table changes (DDL), if I recall from RDBMSs like
> Oracle, the stored procedure will be invalidated as it is a compiled
> object. How is this going to be handled? Does it follow the same mechanism?
>
> Thanks
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Sat, 20 Apr 2024 at 02:34, Anton Okolnychyi 
> wrote:
>
>> Hi folks,
>>
>> I'd like to start a discussion on SPARK-44167 that aims to enable
>> catalogs to expose custom routines as stored procedures. I believe this
>> functionality will enhance Spark’s ability to interact with external
>> connectors and allow users to perform more operations in plain SQL.
>>
>> SPIP [1] contains proposed API changes and parser extensions. Any
>> feedback is more than welcome!
>>
>> Unlike the initial proposal for stored procedures with Python [2], this
>> one focuses on exposing pre-defined stored procedures via the catalog API.
>> This approach is inspired by a similar functionality in Trino and avoids
>> the challenges of supporting user-defined routines discussed earlier [3].
>>
>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>
>> - Anton
>>
>> [1] -
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> [2] -
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
>> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>>
>>
>>
>>


Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-11 Thread Cheng Pan
-1 (non-binding)

A small question: the tag is an orphan commit, but I suppose it should belong
to the master branch.

It seems the YARN integration is broken due to the javax => jakarta namespace
migration. I filed SPARK-48238 and left some comments on
https://github.com/apache/spark/pull/45154

Caused by: java.lang.IllegalStateException: class org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a jakarta.servlet.Filter
    at org.sparkproject.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99) ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
    at org.sparkproject.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93) ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
    at org.sparkproject.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:724) ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1625) ~[?:?]
    at java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734) ~[?:?]
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:762) ~[?:?]
    at org.sparkproject.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749) ~[spark-core_2.13-4.0.0-preview1.jar:4.0.0-preview1]
... 38 more

Thanks,
Cheng Pan


> On May 11, 2024, at 13:55, Wenchen Fan  wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 4.0.0-preview1.
> 
> The vote is open until May 16 PST and passes if a majority +1 PMC votes are 
> cast, with
> a minimum of 3 +1 votes.
> 
> [ ] +1 Release this package as Apache Spark 4.0.0-preview1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see http://spark.apache.org/
> 
> The tag to be voted on is v4.0.0-preview1-rc1 (commit 
> 7dcf77c739c3854260464d732dbfb9a0f54706e7):
> https://github.com/apache/spark/tree/v4.0.0-preview1-rc1
> 
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
> 
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1454/
> 
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/
> 
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
> 
> FAQ
> 
> =
> How can I help test this release?
> =
> 
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
> 
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version
4.0.0-preview1.

The vote is open until May 16 PST and passes if a majority +1 PMC votes are
cast, with
a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 4.0.0-preview1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v4.0.0-preview1-rc1 (commit
7dcf77c739c3854260464d732dbfb9a0f54706e7):
https://github.com/apache/spark/tree/v4.0.0-preview1-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1454/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-docs/

The list of bug fixes going into 4.0.0 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12353359

FAQ

=
How can I help test this release?
=

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running on this release candidate, then
reporting any regressions.

If you're working in PySpark you can set up a virtual env and install
the current RC and see if anything important breaks, in the Java/Scala
you can add the staging repository to your projects resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out of date RC going forward).
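
For instance, once the RC has been pip-installed into a fresh virtual env
(installing the pyspark tarball from the RC bin directory above), a minimal
smoke test along these lines can confirm the basics work; the expected version
string is an assumption about how the preview is versioned:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("rc-smoke-test")
    .getOrCreate()
)
print("Spark version:", spark.version)  # expect something like 4.0.0-preview1

# Run a trivial query end to end to confirm the session, planner, and executors work.
df = spark.range(1000).selectExpr("id", "id % 7 AS bucket")
counts = df.groupBy("bucket").count().orderBy("bucket").collect()
assert sum(row["count"] for row in counts) == 1000

spark.stop()
print("Basic DataFrame smoke test passed.")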


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-10 Thread Mich Talebzadeh
Hi,

If the underlying table changes (DDL), if I recall from RDBMSs like Oracle,
the stored procedure will be invalidated as it is a compiled object. How is
this going to be handled? Does it follow the same mechanism?

Thanks

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Sat, 20 Apr 2024 at 02:34, Anton Okolnychyi 
wrote:

> Hi folks,
>
> I'd like to start a discussion on SPARK-44167 that aims to enable catalogs
> to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
>
> SPIP [1] contains proposed API changes and parser extensions. Any feedback
> is more than welcome!
>
> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
>
> Liang-Chi was kind enough to shepherd this effort. Thanks!
>
> - Anton
>
> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>
>
>
>


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread huaxin gao
Thanks Anton for the updated proposal -- it looks great! I appreciate the
hard work put into refining it. I am looking forward to the upcoming vote
and moving forward with this initiative.

Thanks,
Huaxin

On Thu, May 9, 2024 at 7:30 PM L. C. Hsieh  wrote:

> Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and
> others if I miss those who are participating in the discussion.
>
> I suppose we have reached a consensus or close to being in the design.
>
> If you have some more comments, please let us know.
>
> If not, I will go to start a vote soon after a few days.
>
> Thank you.
>
> On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi 
> wrote:
> >
> > Thanks to everyone who commented on the design doc. I updated the
> proposal and it is ready for another look. I hope we can converge and move
> forward with this effort!
> >
> > - Anton
> >
> > пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi 
> пише:
> >>
> >> Hi folks,
> >>
> >> I'd like to start a discussion on SPARK-44167 that aims to enable
> catalogs to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
> >>
> >> SPIP [1] contains proposed API changes and parser extensions. Any
> feedback is more than welcome!
> >>
> >> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
> >>
> >> Liang-Chi was kind enough to shepherd this effort. Thanks!
> >>
> >> - Anton
> >>
> >> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> >> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
> >>
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Wenchen Fan
Thanks for leading this project! Let's move forward.

On Fri, May 10, 2024 at 10:31 AM L. C. Hsieh  wrote:

> Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and
> others if I miss those who are participating in the discussion.
>
> I suppose we have reached a consensus or close to being in the design.
>
> If you have some more comments, please let us know.
>
> If not, I will go to start a vote soon after a few days.
>
> Thank you.
>
> On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi 
> wrote:
> >
> > Thanks to everyone who commented on the design doc. I updated the
> proposal and it is ready for another look. I hope we can converge and move
> forward with this effort!
> >
> > - Anton
> >
> > пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi 
> пише:
> >>
> >> Hi folks,
> >>
> >> I'd like to start a discussion on SPARK-44167 that aims to enable
> catalogs to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
> >>
> >> SPIP [1] contains proposed API changes and parser extensions. Any
> feedback is more than welcome!
> >>
> >> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
> >>
> >> Liang-Chi was kind enough to shepherd this effort. Thanks!
> >>
> >> - Anton
> >>
> >> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> >> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> >> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
> >>
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread L. C. Hsieh
Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison, and anyone
else participating in the discussion whom I may have missed.

I suppose we have reached a consensus, or are close to one, on the design.

If you have any more comments, please let us know.

If not, I will start a vote in a few days.

Thank you.

On Thu, May 9, 2024 at 6:12 PM Anton Okolnychyi  wrote:
>
> Thanks to everyone who commented on the design doc. I updated the proposal 
> and it is ready for another look. I hope we can converge and move forward 
> with this effort!
>
> - Anton
>
> пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi  пише:
>>
>> Hi folks,
>>
>> I'd like to start a discussion on SPARK-44167 that aims to enable catalogs 
>> to expose custom routines as stored procedures. I believe this functionality 
>> will enhance Spark’s ability to interact with external connectors and allow 
>> users to perform more operations in plain SQL.
>>
>> SPIP [1] contains proposed API changes and parser extensions. Any feedback 
>> is more than welcome!
>>
>> Unlike the initial proposal for stored procedures with Python [2], this one 
>> focuses on exposing pre-defined stored procedures via the catalog API. This 
>> approach is inspired by a similar functionality in Trino and avoids the 
>> challenges of supporting user-defined routines discussed earlier [3].
>>
>> Liang-Chi was kind enough to shepherd this effort. Thanks!
>>
>> - Anton
>>
>> [1] - 
>> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
>> [2] - 
>> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
>> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>>
>>
>>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Anton Okolnychyi
Thanks to everyone who commented on the design doc. I updated the proposal
and it is ready for another look. I hope we can converge and move forward
with this effort!

- Anton

пт, 19 квіт. 2024 р. о 15:54 Anton Okolnychyi  пише:

> Hi folks,
>
> I'd like to start a discussion on SPARK-44167 that aims to enable catalogs
> to expose custom routines as stored procedures. I believe this
> functionality will enhance Spark’s ability to interact with external
> connectors and allow users to perform more operations in plain SQL.
>
> SPIP [1] contains proposed API changes and parser extensions. Any feedback
> is more than welcome!
>
> Unlike the initial proposal for stored procedures with Python [2], this
> one focuses on exposing pre-defined stored procedures via the catalog API.
> This approach is inspired by a similar functionality in Trino and avoids
> the challenges of supporting user-defined routines discussed earlier [3].
>
> Liang-Chi was kind enough to shepherd this effort. Thanks!
>
> - Anton
>
> [1] -
> https://docs.google.com/document/d/1rDcggNl9YNcBECsfgPcoOecHXYZOu29QYFrloo2lPBg/
> [2] -
> https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/
> [3] - https://lists.apache.org/thread/lkjm9r7rx7358xxn2z8yof4wdknpzg3l
>
>
>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE:

I've successfully uploaded the release packages:
https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/
(I skipped SparkR as I was not able to fix the errors, I'll get back to it
later)

However, there is a new issue with doc building:
https://github.com/apache/spark/pull/44628#discussion_r1595718574

I'll continue after the issue is fixed.

On Fri, May 10, 2024 at 12:29 AM Dongjoon Hyun 
wrote:

> Please re-try to upload, Wenchen. ASF Infra team bumped up our upload
> limit based on our request.
>
> > Your upload limit has been increased to 650MB
>
> Dongjoon.
>
>
>
> On Thu, May 9, 2024 at 8:12 AM Wenchen Fan  wrote:
>
>> I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776
>>
>> On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
>> wrote:
>>
>>> In addition, FYI, I was the latest release manager with Apache Spark
>>> 3.4.3 (2024-04-15 Vote)
>>>
>>> According to my work log, I uploaded the following binaries to SVN from
>>> EC2 (us-west-2) without any issues.
>>>
>>> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
>>> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
>>> spark-3.4.3-bin-hadoop3-scala2.13.tgz
>>> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
>>> spark-3.4.3-bin-hadoop3.tgz
>>> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
>>> spark-3.4.3-bin-without-hadoop.tgz
>>> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
>>> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>>>
>>> Since Apache Spark 4.0.0-preview doesn't have Scala 2.12 combination,
>>> the total size should be smaller than the 3.4.3 binaries.
>>>
>>> Given that, if there is any INFRA change, that could happen after 4/15.
>>>
>>> Dongjoon.
>>>
>>> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
>>> wrote:
>>>
 Could you file an INFRA JIRA issue with the error message and context
 first, Wenchen?

 As you know, if we see something, we had better file a JIRA issue
 because it could be not only an Apache Spark project issue but also all ASF
 project issues.

 Dongjoon.


 On Thu, May 9, 2024 at 12:28 AM Wenchen Fan 
 wrote:

> UPDATE:
>
> After resolving a few issues in the release scripts, I can finally
> build the release packages. However, I can't upload them to the staging 
> SVN
> repo due to a transmitting error, and it seems like a limitation from the
> server side. I tried it on both my local laptop and remote AWS instance,
> but neither works. These package binaries are like 300-400 MBs, and we 
> just
> did a release last month. Not sure if this is a new limitation due to cost
> saving.
>
> While I'm looking for help to get unblocked, I'm wondering if we can
> upload release packages to a public git repo instead, under the Apache
> account?
>
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Please re-try to upload, Wenchen. ASF Infra team bumped up our upload limit
based on our request.

> Your upload limit has been increased to 650MB

Dongjoon.



On Thu, May 9, 2024 at 8:12 AM Wenchen Fan  wrote:

> I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776
>
> On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
> wrote:
>
>> In addition, FYI, I was the latest release manager with Apache Spark
>> 3.4.3 (2024-04-15 Vote)
>>
>> According to my work log, I uploaded the following binaries to SVN from
>> EC2 (us-west-2) without any issues.
>>
>> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
>> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
>> spark-3.4.3-bin-hadoop3-scala2.13.tgz
>> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
>> spark-3.4.3-bin-hadoop3.tgz
>> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
>> spark-3.4.3-bin-without-hadoop.tgz
>> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
>> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>>
>> Since Apache Spark 4.0.0-preview doesn't have Scala 2.12 combination, the
>> total size should be smaller than the 3.4.3 binaries.
>>
>> Given that, if there is any INFRA change, that could happen after 4/15.
>>
>> Dongjoon.
>>
>> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Could you file an INFRA JIRA issue with the error message and context
>>> first, Wenchen?
>>>
>>> As you know, if we see something, we had better file a JIRA issue
>>> because it could be not only an Apache Spark project issue but also all ASF
>>> project issues.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>>>
 UPDATE:

 After resolving a few issues in the release scripts, I can finally
 build the release packages. However, I can't upload them to the staging SVN
 repo due to a transmitting error, and it seems like a limitation from the
 server side. I tried it on both my local laptop and remote AWS instance,
 but neither works. These package binaries are like 300-400 MBs, and we just
 did a release last month. Not sure if this is a new limitation due to cost
 saving.

 While I'm looking for help to get unblocked, I'm wondering if we can
 upload release packages to a public git repo instead, under the Apache
 account?

>
>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776

On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun 
wrote:

> In addition, FYI, I was the latest release manager with Apache Spark 3.4.3
> (2024-04-15 Vote)
>
> According to my work log, I uploaded the following binaries to SVN from
> EC2 (us-west-2) without any issues.
>
> -rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
> -rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
> spark-3.4.3-bin-hadoop3-scala2.13.tgz
> -rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
> spark-3.4.3-bin-hadoop3.tgz
> -rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
> spark-3.4.3-bin-without-hadoop.tgz
> -rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
> -rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz
>
> Since Apache Spark 4.0.0-preview doesn't have Scala 2.12 combination, the
> total size should be smaller than the 3.4.3 binaries.
>
> Given that, if there is any INFRA change, that could happen after 4/15.
>
> Dongjoon.
>
> On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
> wrote:
>
>> Could you file an INFRA JIRA issue with the error message and context
>> first, Wenchen?
>>
>> As you know, if we see something, we had better file a JIRA issue because
>> it could be not only an Apache Spark project issue but also all ASF project
>> issues.
>>
>> Dongjoon.
>>
>>
>> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>>
>>> UPDATE:
>>>
>>> After resolving a few issues in the release scripts, I can finally build
>>> the release packages. However, I can't upload them to the staging SVN repo
>>> due to a transmitting error, and it seems like a limitation from the server
>>> side. I tried it on both my local laptop and remote AWS instance, but
>>> neither works. These package binaries are like 300-400 MBs, and we just did
>>> a release last month. Not sure if this is a new limitation due to cost
>>> saving.
>>>
>>> While I'm looking for help to get unblocked, I'm wondering if we can
>>> upload release packages to a public git repo instead, under the Apache
>>> account?
>>>




Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
In addition, FYI, I was the latest release manager with Apache Spark 3.4.3
(2024-04-15 Vote)

According to my work log, I uploaded the following binaries to SVN from EC2
(us-west-2) without any issues.

-rw-r--r--.  1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz
-rw-r--r--.  1 centos centos 397870995 Apr 15 00:44
spark-3.4.3-bin-hadoop3-scala2.13.tgz
-rw-r--r--.  1 centos centos 388930980 Apr 15 01:29
spark-3.4.3-bin-hadoop3.tgz
-rw-r--r--.  1 centos centos 300786123 Apr 15 01:04
spark-3.4.3-bin-without-hadoop.tgz
-rw-r--r--.  1 centos centos  32219044 Apr 15 00:23 spark-3.4.3.tgz
-rw-r--r--.  1 centos centos356749 Apr 15 01:29 SparkR_3.4.3.tar.gz

Since Apache Spark 4.0.0-preview doesn't have a Scala 2.12 combination, the
total size should be smaller than the 3.4.3 binaries.

Given that, if there is any INFRA change, that could happen after 4/15.

Dongjoon.

On Thu, May 9, 2024 at 7:57 AM Dongjoon Hyun 
wrote:

> Could you file an INFRA JIRA issue with the error message and context
> first, Wenchen?
>
> As you know, if we see something, we had better file a JIRA issue because
> it could be not only an Apache Spark project issue but also all ASF project
> issues.
>
> Dongjoon.
>
>
> On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:
>
>> UPDATE:
>>
>> After resolving a few issues in the release scripts, I can finally build
>> the release packages. However, I can't upload them to the staging SVN repo
>> due to a transmitting error, and it seems like a limitation from the server
>> side. I tried it on both my local laptop and remote AWS instance, but
>> neither works. These package binaries are like 300-400 MBs, and we just did
>> a release last month. Not sure if this is a new limitation due to cost
>> saving.
>>
>> While I'm looking for help to get unblocked, I'm wondering if we can
>> upload release packages to a public git repo instead, under the Apache
>> account?
>>
>>>
>>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Could you file an INFRA JIRA issue with the error message and context
first, Wenchen?

As you know, if we see something, we had better file a JIRA issue, because it
could be not only an Apache Spark project issue but also an issue affecting
all ASF projects.

Dongjoon.


On Thu, May 9, 2024 at 12:28 AM Wenchen Fan  wrote:

> UPDATE:
>
> After resolving a few issues in the release scripts, I can finally build
> the release packages. However, I can't upload them to the staging SVN repo
> due to a transmitting error, and it seems like a limitation from the server
> side. I tried it on both my local laptop and remote AWS instance, but
> neither works. These package binaries are like 300-400 MBs, and we just did
> a release last month. Not sure if this is a new limitation due to cost
> saving.
>
> While I'm looking for help to get unblocked, I'm wondering if we can
> upload release packages to a public git repo instead, under the Apache
> account?
>
> On Thu, May 9, 2024 at 12:39 AM Holden Karau 
> wrote:
>
>> That looks cool, maybe let’s split off a thread on how to improve our
>> release processes?
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:
>>
>>> On that note, GitHub recently released (public preview) a new feature
>>> called Artifact Attestations, which may be relevant/useful here: Introducing
>>> Artifact Attestations–now in public beta - The GitHub Blog
>>> 
>>>
>>> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek 
>>> wrote:
>>>
 I have no permissions so I can't do it but I'm happy to help (although
 I am more familiar with Gitlab CICD than Github Actions).
 Is there some point of contact that can provide me needed context and
 permissions?
 I'd also love to see why the costs are high and see how we can reduce
 them...

 Thanks,
 Nimrod

 On Wed, May 8, 2024 at 8:26 AM Holden Karau 
 wrote:

> I think signing the artifacts produced from a secure CI sounds like a
> good idea. I know we’ve been asked to reduce our GitHub action usage but
> perhaps someone interested could volunteer to set that up.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
> wrote:
>
>> Hi,
>> Thanks for the reply.
>>
>> From my experience, a build on a build server would be much more
>> predictable and less error-prone than building on some laptop, and of
>> course much faster for producing builds, snapshots, early preview
>> releases, release candidates, or final releases.
>> It will enable us to have a preview version with current changes-
>> snapshot version, either automatically every day or if we need to save
>> costs (although build is really not expensive) - with a click of a 
>> button.
>>
>> Regarding keys for signing - that's what vaults are for; all across
>> the industry we use vaults (such as HashiCorp Vault). But if the build
>> is automated and the only manual step, kept for security reasons, is
>> signing the release, that would be reasonable.
>>
>> Thanks,
>> Nimrod
>>
>>
>> בתאריך יום ד׳, 8 במאי 2024, 00:54, מאת Holden Karau ‏<
>> holden.ka...@gmail.com>:
>>
>>> Indeed. We could conceivably build the release in CI/CD but the
>>> final verification / signing should be done locally to keep the keys 
>>> safe
>>> (there was some concern from earlier release processes).
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>> wrote:
>>>
 Hi,

 Sorry for the novice question, Wenchen - the release is done
 manually from a laptop? Not using a CI CD process on a build server?

 Thanks,
 Nimrod

 On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
 wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and
> get it ready for the release process (docker desktop doesn't work 
> anymore,
> my pgp key is lost, etc.). I'll start the RC process at my tomorrow. 
> Thanks
> for your patience!
>
> Wenchen

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color, we should at
least add a test job to make sure the release script can produce the
packages correctly. Today it's kind of being manually tested by the
release manager each time, which slows down the release process. It's
better if we can automate it entirely, so that making a release is a simple
click by authorized people.
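
As one concrete (and purely illustrative) example of what such a test job
could verify after running the release scripts in dry-run mode, a small check
like this could fail the job if an expected artifact is missing or
suspiciously small; the patterns and size thresholds below are assumptions,
not the actual packaging rules:

import glob
import os
import sys

# Rough lower bounds in bytes; adjust to whatever the real packages look like.
EXPECTED = {
    "spark-*-bin-hadoop3.tgz": 200 * 1024 * 1024,
    "pyspark-*.tar.gz": 100 * 1024 * 1024,
    "SparkR_*.tar.gz": 100 * 1024,
}

output_dir = sys.argv[1] if len(sys.argv) > 1 else "."
failed = False
for pattern, min_size in EXPECTED.items():
    matches = glob.glob(os.path.join(output_dir, pattern))
    if not matches:
        print(f"MISSING: no artifact matching {pattern}")
        failed = True
        continue
    for path in matches:
        size = os.path.getsize(path)
        ok = size >= min_size
        failed = failed or not ok
        print(f"{'OK' if ok else 'TOO SMALL'}: {path} ({size / (1024 * 1024):.1f} MB)")

sys.exit(1 if failed else 0)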

On Thu, May 9, 2024 at 9:48 PM Nimrod Ofek  wrote:

> Following the conversation started with Spark 4.0.0 release, this is a
> thread to discuss improvements to our release processes.
>
> I'll Start by raising some questions that probably should have answers to
> start the discussion:
>
>
>1. What is currently running in GitHub Actions?
>2. Who currently has permissions for Github actions? Is there a
>specific owner for that today or a different volunteer each time?
>3. What are the current limits of GitHub Actions, who set them - and
>what is the process to change those (if possible at all, but I presume not
>all Apache projects have the same limits)?
>4. What versions should we support as an output for the build?
>5. Where should the artifacts be stored?
>6. What should be the output? only tar or also a docker image
>published somewhere?
>7. Do we want to have a release on fixed dates or a manual release
>upon request?
>8. Who should be permitted to sign a version - and what is the process
>for that?
>
>
> Thanks!
> Nimrod
>


Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Hussein Awala
Hello,

I can answer some of your common questions with other Apache projects.

> Who currently has permissions for Github actions? Is there a specific
owner for that today or a different volunteer each time?

GitHub Actions is set up at the Apache GitHub organization level: committers
(contributors with write permissions) can retrigger/cancel a GitHub Actions
workflow, but the GitHub Actions runners are managed by the Apache Infra team.

> What are the current limits of GitHub Actions, who set them - and what is
the process to change those (if possible at all, but I presume not all
Apache projects have the same limits)?

For limits, I don't think there is any significant limit, especially since
the Apache organization has 900 donated runners used by its projects, and
there is an initiative from the Infra team to add self-hosted runners
running on Kubernetes (document

).

> Where should the artifacts be stored?

Usually, we use Maven for jars, DockerHub for Docker images, and Github
cache for workflow cache. But we can use Github artifacts to store any kind
of package (even Docker images in the ghcr), which is fully accepted by
Apache policies. Also if the project has a cloud account (AWS, GCP, Azure,
...), a bucket can be used to store some of the packages.


 > Who should be permitted to sign a version - and what is the process for
that?

The Apache documentation is clear about this: by default only PMC members
can be release managers, but we can contact the infra team to add one of
the committers as a release manager (document
). The
process of creating a new version is described in this document
.


On Thu, May 9, 2024 at 10:45 AM Nimrod Ofek  wrote:

> Following the conversation started with Spark 4.0.0 release, this is a
> thread to discuss improvements to our release processes.
>
> I'll Start by raising some questions that probably should have answers to
> start the discussion:
>
>
>1. What is currently running in GitHub Actions?
>2. Who currently has permissions for Github actions? Is there a
>specific owner for that today or a different volunteer each time?
>3. What are the current limits of GitHub Actions, who set them - and
>what is the process to change those (if possible at all, but I presume not
>all Apache projects have the same limits)?
>4. What versions should we support as an output for the build?
>5. Where should the artifacts be stored?
>6. What should be the output? only tar or also a docker image
>published somewhere?
>7. Do we want to have a release on fixed dates or a manual release
>upon request?
>8. Who should be permitted to sign a version - and what is the process
>for that?
>
>
> Thanks!
> Nimrod
>


[DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Nimrod Ofek
Following the conversation started with Spark 4.0.0 release, this is a
thread to discuss improvements to our release processes.

I'll start by raising some questions that probably should have answers to
start the discussion:


   1. What is currently running in GitHub Actions?
   2. Who currently has permissions for Github actions? Is there a specific
   owner for that today or a different volunteer each time?
   3. What are the current limits of GitHub Actions, who set them - and
   what is the process to change those (if possible at all, but I presume not
   all Apache projects have the same limits)?
   4. What versions should we support as an output for the build?
   5. Where should the artifacts be stored?
   6. What should be the output? only tar or also a docker image published
   somewhere?
   7. Do we want to have a release on fixed dates or a manual release upon
   request?
   8. Who should be permitted to sign a version - and what is the process
   for that?


Thanks!
Nimrod


Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE:

After resolving a few issues in the release scripts, I can finally build
the release packages. However, I can't upload them to the staging SVN repo
due to a transmission error, and it seems like a limitation on the server
side. I tried it on both my local laptop and a remote AWS instance, but
neither works. These package binaries are around 300-400 MB, and we just
did a release last month. Not sure if this is a new limitation due to cost
saving.

While I'm looking for help to get unblocked, I'm wondering if we can upload
release packages to a public git repo instead, under the Apache account?

On Thu, May 9, 2024 at 12:39 AM Holden Karau  wrote:

> That looks cool, maybe let’s split off a thread on how to improve our
> release processes?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:
>
>> On that note, GitHub recently released (public preview) a new feature
>> called Artifact Attestions which may be relevant/useful here: Introducing
>> Artifact Attestations–now in public beta - The GitHub Blog
>> 
>>
>> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:
>>
>>> I have no permissions so I can't do it but I'm happy to help (although I
>>> am more familiar with Gitlab CICD than Github Actions).
>>> Is there some point of contact that can provide me needed context and
>>> permissions?
>>> I'd also love to see why the costs are high and see how we can reduce
>>> them...
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
>>> wrote:
>>>
 I think signing the artifacts produced from a secure CI sounds like a
 good idea. I know we’ve been asked to reduce our GitHub action usage but
 perhaps someone interested could volunteer to set that up.

 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau


 On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
 wrote:

> Hi,
> Thanks for the reply.
>
> From my experience, a build on a build server would be much more
> predictable and less error prone than building on some laptop- and of
> course much faster to have builds, snapshots, release candidates, early
> previews releases, release candidates or final releases.
> It will enable us to have a preview version with current changes-
> snapshot version, either automatically every day or if we need to save
> costs (although build is really not expensive) - with a click of a button.
>
> Regarding keys for signing. - that's what vaults are for, all across
> the industry we are using vaults (such as hashicorp vault)- but if the
> build will be automated and the only thing which will be manual is to sign
> the release for security reasons that would be reasonable.
>
> Thanks,
> Nimrod
>
>
> בתאריך יום ד׳, 8 במאי 2024, 00:54, מאת Holden Karau ‏<
> holden.ka...@gmail.com>:
>
>> Indeed. We could conceivably build the release in CI/CD but the final
>> verification / signing should be done locally to keep the keys safe 
>> (there
>> was some concern from earlier release processes).
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry for the novice question, Wenchen - the release is done
>>> manually from a laptop? Not using a CI CD process on a build server?
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
>>> wrote:
>>>
 UPDATE:

 Unfortunately, it took me quite some time to set up my laptop and
 get it ready for the release process (docker desktop doesn't work 
 anymore,
 my pgp key is lost, etc.). I'll start the RC process at my tomorrow. 
 Thanks
 for your patience!

 Wenchen

 On Fri, May 3, 2024 at 7:47 AM yangjie01 
 wrote:

> +1
>
>
>
> *发件人**: *Jungtaek Lim 
> *日期**: *2024年5月2日 星期四 10:21
> *收件人**: *Holden Karau 
> *抄送**: *Chao Sun , Xiao Li <
> gatorsm...@gmail.com>, Tathagata Das ,
> Wenchen Fan , Cheng Pan ,
> Nicholas Chammas , Dongjoon Hyun <
> dongjoon.h...@gmail.com>, Cheng Pan , Spark
> dev 

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Very helpful!

On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh 
wrote:

> *Potential reasons*
>
>
>- Data Serialization: Spark needs to serialize the DataFrame into an
>in-memory format suitable for storage. This process can be time-consuming,
>especially for large datasets like 3.2 GB with complex schemas.
>- Shuffle Operations: If your transformations involve shuffle
>operations, Spark might need to shuffle data across the cluster to ensure
>efficient storage. Shuffling can be slow, especially on large datasets or
>limited network bandwidth or nodes..  Check Spark UI staging and executor
>tabs for info on shuffle reads and writes
>- Memory Allocation: Spark allocates memory for the cached DataFrame.
>Depending on the cluster configuration and available memory, this
>allocation can take some time.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Wed, 8 May 2024 at 13:41, Prem Sahoo  wrote:
>
>> Could any one help me here ?
>> Sent from my iPhone
>>
>> > On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
>> >
>> > 
>> > Hello Folks,
>> > in Spark I have read a file and done some transformation and finally
>> writing to hdfs.
>> >
>> > Now I am interested in writing the same dataframe to MapRFS but for
>> this Spark will execute the full DAG again  (recompute all the previous
>> steps)(all the read + transformations ).
>> >
>> > I don't want this recompute again so I decided to cache() the dataframe
>> so that 2nd/nth write won't recompute all the steps .
>> >
>> > But here is a catch: the cache() takes more time to persist the data in
>> memory.
>> >
>> > I have a question when the dataframe is in memory then just to save it
>> to another space in memory , why it will take more time (3.2 G data 6 mins)
>> >
>> > May I know what operations in cache() are taking such a long time ?
>> >
>> > I would appreciate it if someone would share the information .
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our
release processes?

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Wed, May 8, 2024 at 9:31 AM Erik Krogen  wrote:

> On that note, GitHub recently released (public preview) a new feature
> called Artifact Attestions which may be relevant/useful here: Introducing
> Artifact Attestations–now in public beta - The GitHub Blog
> 
>
> On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:
>
>> I have no permissions so I can't do it but I'm happy to help (although I
>> am more familiar with Gitlab CICD than Github Actions).
>> Is there some point of contact that can provide me needed context and
>> permissions?
>> I'd also love to see why the costs are high and see how we can reduce
>> them...
>>
>> Thanks,
>> Nimrod
>>
>> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
>> wrote:
>>
>>> I think signing the artifacts produced from a secure CI sounds like a
>>> good idea. I know we’ve been asked to reduce our GitHub action usage but
>>> perhaps someone interested could volunteer to set that up.
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek 
>>> wrote:
>>>
 Hi,
 Thanks for the reply.

 From my experience, a build on a build server would be much more
 predictable and less error prone than building on some laptop- and of
 course much faster to have builds, snapshots, release candidates, early
 previews releases, release candidates or final releases.
 It will enable us to have a preview version with current changes-
 snapshot version, either automatically every day or if we need to save
 costs (although build is really not expensive) - with a click of a button.

 Regarding keys for signing. - that's what vaults are for, all across
 the industry we are using vaults (such as hashicorp vault)- but if the
 build will be automated and the only thing which will be manual is to sign
 the release for security reasons that would be reasonable.

 Thanks,
 Nimrod


 בתאריך יום ד׳, 8 במאי 2024, 00:54, מאת Holden Karau ‏<
 holden.ka...@gmail.com>:

> Indeed. We could conceivably build the release in CI/CD but the final
> verification / signing should be done locally to keep the keys safe (there
> was some concern from earlier release processes).
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
> wrote:
>
>> Hi,
>>
>> Sorry for the novice question, Wenchen - the release is done manually
>> from a laptop? Not using a CI CD process on a build server?
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
>> wrote:
>>
>>> UPDATE:
>>>
>>> Unfortunately, it took me quite some time to set up my laptop and
>>> get it ready for the release process (docker desktop doesn't work 
>>> anymore,
>>> my pgp key is lost, etc.). I'll start the RC process at my tomorrow. 
>>> Thanks
>>> for your patience!
>>>
>>> Wenchen
>>>
>>> On Fri, May 3, 2024 at 7:47 AM yangjie01 
>>> wrote:
>>>
 +1



 *发件人**: *Jungtaek Lim 
 *日期**: *2024年5月2日 星期四 10:21
 *收件人**: *Holden Karau 
 *抄送**: *Chao Sun , Xiao Li <
 gatorsm...@gmail.com>, Tathagata Das ,
 Wenchen Fan , Cheng Pan ,
 Nicholas Chammas , Dongjoon Hyun <
 dongjoon.h...@gmail.com>, Cheng Pan , Spark
 dev list , Anish Shrigondekar <
 anish.shrigonde...@databricks.com>
 *主题**: *Re: [DISCUSS] Spark 4.0.0 release



 +1 love to see it!



 On Thu, May 2, 2024 at 10:08 AM Holden Karau <
 holden.ka...@gmail.com> wrote:

 +1 :) yay previews



 On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

 +1



 On Wed, May 1, 2024 at 5:23 PM Xiao Li 
 wrote:

 +1 for next Monday.



 We can do more previews when the other features are ready for
 preview.



 Tathagata Das  于2024年5月1日周三 08:46写道:

 Next week sounds 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Erik Krogen
On that note, GitHub recently released (public preview) a new feature
called Artifact Attestions which may be relevant/useful here: Introducing
Artifact Attestations–now in public beta - The GitHub Blog


On Wed, May 8, 2024 at 9:06 AM Nimrod Ofek  wrote:

> I have no permissions so I can't do it but I'm happy to help (although I
> am more familiar with Gitlab CICD than Github Actions).
> Is there some point of contact that can provide me needed context and
> permissions?
> I'd also love to see why the costs are high and see how we can reduce
> them...
>
> Thanks,
> Nimrod
>
> On Wed, May 8, 2024 at 8:26 AM Holden Karau 
> wrote:
>
>> I think signing the artifacts produced from a secure CI sounds like a
>> good idea. I know we’ve been asked to reduce our GitHub action usage but
>> perhaps someone interested could volunteer to set that up.
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:
>>
>>> Hi,
>>> Thanks for the reply.
>>>
>>> From my experience, a build on a build server would be much more
>>> predictable and less error prone than building on some laptop- and of
>>> course much faster to have builds, snapshots, release candidates, early
>>> previews releases, release candidates or final releases.
>>> It will enable us to have a preview version with current changes-
>>> snapshot version, either automatically every day or if we need to save
>>> costs (although build is really not expensive) - with a click of a button.
>>>
>>> Regarding keys for signing. - that's what vaults are for, all across the
>>> industry we are using vaults (such as hashicorp vault)- but if the build
>>> will be automated and the only thing which will be manual is to sign the
>>> release for security reasons that would be reasonable.
>>>
>>> Thanks,
>>> Nimrod
>>>
>>>
>>> בתאריך יום ד׳, 8 במאי 2024, 00:54, מאת Holden Karau ‏<
>>> holden.ka...@gmail.com>:
>>>
 Indeed. We could conceivably build the release in CI/CD but the final
 verification / signing should be done locally to keep the keys safe (there
 was some concern from earlier release processes).

 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau


 On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
 wrote:

> Hi,
>
> Sorry for the novice question, Wenchen - the release is done manually
> from a laptop? Not using a CI CD process on a build server?
>
> Thanks,
> Nimrod
>
> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan 
> wrote:
>
>> UPDATE:
>>
>> Unfortunately, it took me quite some time to set up my laptop and get
>> it ready for the release process (docker desktop doesn't work anymore, my
>> pgp key is lost, etc.). I'll start the RC process at my tomorrow. Thanks
>> for your patience!
>>
>> Wenchen
>>
>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>
>>> +1
>>>
>>>
>>>
>>> *发件人**: *Jungtaek Lim 
>>> *日期**: *2024年5月2日 星期四 10:21
>>> *收件人**: *Holden Karau 
>>> *抄送**: *Chao Sun , Xiao Li ,
>>> Tathagata Das , Wenchen Fan <
>>> cloud0...@gmail.com>, Cheng Pan , Nicholas
>>> Chammas , Dongjoon Hyun <
>>> dongjoon.h...@gmail.com>, Cheng Pan , Spark
>>> dev list , Anish Shrigondekar <
>>> anish.shrigonde...@databricks.com>
>>> *主题**: *Re: [DISCUSS] Spark 4.0.0 release
>>>
>>>
>>>
>>> +1 love to see it!
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>> wrote:
>>>
>>> +1 :) yay previews
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>
>>> +1 for next Monday.
>>>
>>>
>>>
>>> We can do more previews when the other features are ready for
>>> preview.
>>>
>>>
>>>
>>> Tathagata Das  于2024年5月1日周三 08:46写道:
>>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>>> wrote:
>>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about 
>>> we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>> Hey all,
>>>
>>>
>>>
>>> 

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Mich Talebzadeh
*Potential reasons*


   - Data Serialization: Spark needs to serialize the DataFrame into an
   in-memory format suitable for storage. This process can be time-consuming,
   especially for large datasets like 3.2 GB with complex schemas.
   - Shuffle Operations: If your transformations involve shuffle
   operations, Spark might need to shuffle data across the cluster to ensure
   efficient storage. Shuffling can be slow, especially with large datasets,
   limited network bandwidth, or few nodes. Check the Spark UI Stages and
   Executors tabs for info on shuffle reads and writes.
   - Memory Allocation: Spark allocates memory for the cached DataFrame.
   Depending on the cluster configuration and available memory, this
   allocation can take some time.
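
For illustration, here is a minimal sketch (the storage level, the paths and
the explicit count() are assumptions, not something from this thread) showing
where the caching cost actually materializes and how the cached data can then
serve both writes:

```scala
import org.apache.spark.storage.StorageLevel

// Assume `df` is the already-transformed DataFrame from the earlier steps.
// persist() lets the storage level be chosen explicitly; MEMORY_AND_DISK
// spills to disk instead of recomputing when memory runs short.
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)

// cache()/persist() are lazy: the data is only materialized by the first
// action, so the "6 minutes" shows up here, not at the persist() call.
cached.count()

// Subsequent writes reuse the cached blocks instead of re-running the DAG.
cached.write.mode("overwrite").parquet("hdfs:///tmp/output_hdfs")     // hypothetical path
cached.write.mode("overwrite").parquet("maprfs:///tmp/output_maprfs") // hypothetical path

cached.unpersist()
```

The storage level and the forced count() are illustrative choices only; the
right level depends on the available executor memory.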

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Wed, 8 May 2024 at 13:41, Prem Sahoo  wrote:

> Could any one help me here ?
> Sent from my iPhone
>
> > On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
> >
> > 
> > Hello Folks,
> > in Spark I have read a file and done some transformation and finally
> writing to hdfs.
> >
> > Now I am interested in writing the same dataframe to MapRFS but for this
> Spark will execute the full DAG again  (recompute all the previous
> steps)(all the read + transformations ).
> >
> > I don't want this recompute again so I decided to cache() the dataframe
> so that 2nd/nth write won't recompute all the steps .
> >
> > But here is a catch: the cache() takes more time to persist the data in
> memory.
> >
> > I have a question when the dataframe is in memory then just to save it
> to another space in memory , why it will take more time (3.2 G data 6 mins)
> >
> > May I know what operations in cache() are taking such a long time ?
> >
> > I would appreciate it if someone would share the information .
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Could anyone help me here?
Sent from my iPhone

> On May 7, 2024, at 4:30 PM, Prem Sahoo  wrote:
> 
> 
> Hello Folks,
> in Spark I have read a file and done some transformation and finally writing 
> to hdfs.
> 
> Now I am interested in writing the same dataframe to MapRFS but for this 
> Spark will execute the full DAG again  (recompute all the previous steps)(all 
> the read + transformations ).
> 
> I don't want this recompute again so I decided to cache() the dataframe so 
> that 2nd/nth write won't recompute all the steps .
> 
> But here is a catch: the cache() takes more time to persist the data in 
> memory.
> 
> I have a question when the dataframe is in memory then just to save it to 
> another space in memory , why it will take more time (3.2 G data 6 mins)
> 
> May I know what operations in cache() are taking such a long time ?
> 
> I would appreciate it if someone would share the information .

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
I have no permissions so I can't do it but I'm happy to help (although I am
more familiar with Gitlab CICD than Github Actions).
Is there some point of contact that can provide me needed context and
permissions?
I'd also love to see why the costs are high and see how we can reduce
them...

Thanks,
Nimrod

On Wed, May 8, 2024 at 8:26 AM Holden Karau  wrote:

> I think signing the artifacts produced from a secure CI sounds like a good
> idea. I know we’ve been asked to reduce our GitHub action usage but perhaps
> someone interested could volunteer to set that up.
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:
>
>> Hi,
>> Thanks for the reply.
>>
>> From my experience, a build on a build server would be much more
>> predictable and less error prone than building on some laptop- and of
>> course much faster to have builds, snapshots, release candidates, early
>> previews releases, release candidates or final releases.
>> It will enable us to have a preview version with current changes-
>> snapshot version, either automatically every day or if we need to save
>> costs (although build is really not expensive) - with a click of a button.
>>
>> Regarding keys for signing. - that's what vaults are for, all across the
>> industry we are using vaults (such as hashicorp vault)- but if the build
>> will be automated and the only thing which will be manual is to sign the
>> release for security reasons that would be reasonable.
>>
>> Thanks,
>> Nimrod
>>
>>
>> בתאריך יום ד׳, 8 במאי 2024, 00:54, מאת Holden Karau ‏<
>> holden.ka...@gmail.com>:
>>
>>> Indeed. We could conceivably build the release in CI/CD but the final
>>> verification / signing should be done locally to keep the keys safe (there
>>> was some concern from earlier release processes).
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>>
>>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>>> wrote:
>>>
 Hi,

 Sorry for the novice question, Wenchen - the release is done manually
 from a laptop? Not using a CI CD process on a build server?

 Thanks,
 Nimrod

 On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get
> it ready for the release process (docker desktop doesn't work anymore, my
> pgp key is lost, etc.). I'll start the RC process at my tomorrow. Thanks
> for your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *发件人**: *Jungtaek Lim 
>> *日期**: *2024年5月2日 星期四 10:21
>> *收件人**: *Holden Karau 
>> *抄送**: *Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas
>> Chammas , Dongjoon Hyun <
>> dongjoon.h...@gmail.com>, Cheng Pan , Spark dev
>> list , Anish Shrigondekar <
>> anish.shrigonde...@databricks.com>
>> *主题**: *Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> Tathagata Das  于2024年5月1日周三 08:46写道:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
>> wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We
>> don't need to wait for all the ongoing projects to be ready. How about we
>> do a 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard 
>> to
>> do that without a Preview release. So the sooner we make a Preview 
>> release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
I think signing the artifacts produced from a secure CI sounds like a good
idea. I know we’ve been asked to reduce our GitHub action usage but perhaps
someone interested could volunteer to set that up.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 9:43 PM Nimrod Ofek  wrote:

> Hi,
> Thanks for the reply.
>
> From my experience, a build on a build server would be much more
> predictable and less error prone than building on some laptop- and of
> course much faster to have builds, snapshots, release candidates, early
> previews releases, release candidates or final releases.
> It will enable us to have a preview version with current changes- snapshot
> version, either automatically every day or if we need to save costs
> (although build is really not expensive) - with a click of a button.
>
> Regarding keys for signing. - that's what vaults are for, all across the
> industry we are using vaults (such as hashicorp vault)- but if the build
> will be automated and the only thing which will be manual is to sign the
> release for security reasons that would be reasonable.
>
> Thanks,
> Nimrod
>
>
> בתאריך יום ד׳, 8 במאי 2024, 00:54, מאת Holden Karau ‏<
> holden.ka...@gmail.com>:
>
>> Indeed. We could conceivably build the release in CI/CD but the final
>> verification / signing should be done locally to keep the keys safe (there
>> was some concern from earlier release processes).
>>
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek 
>> wrote:
>>
>>> Hi,
>>>
>>> Sorry for the novice question, Wenchen - the release is done manually
>>> from a laptop? Not using a CI CD process on a build server?
>>>
>>> Thanks,
>>> Nimrod
>>>
>>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>>
 UPDATE:

 Unfortunately, it took me quite some time to set up my laptop and get
 it ready for the release process (docker desktop doesn't work anymore, my
 pgp key is lost, etc.). I'll start the RC process at my tomorrow. Thanks
 for your patience!

 Wenchen

 On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:

> +1
>
>
>
> *发件人**: *Jungtaek Lim 
> *日期**: *2024年5月2日 星期四 10:21
> *收件人**: *Holden Karau 
> *抄送**: *Chao Sun , Xiao Li ,
> Tathagata Das , Wenchen Fan <
> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas
> , Dongjoon Hyun ,
> Cheng Pan , Spark dev list ,
> Anish Shrigondekar 
> *主题**: *Re: [DISCUSS] Spark 4.0.0 release
>
>
>
> +1 love to see it!
>
>
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
> wrote:
>
> +1 :) yay previews
>
>
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
> +1
>
>
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
> +1 for next Monday.
>
>
>
> We can do more previews when the other features are ready for preview.
>
>
>
> Tathagata Das  于2024年5月1日周三 08:46写道:
>
> Next week sounds great! Thank you Wenchen!
>
>
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
> wrote:
>
> Yea I think a preview release won't hurt (without a branch cut). We
> don't need to wait for all the ongoing projects to be ready. How about we
> do a 4.0 preview release based on the current master branch next Monday?
>
>
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
> Hey all,
>
>
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard 
> to
> do that without a Preview release. So the sooner we make a Preview 
> release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
>
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
>
>
> Thanks!
>
>
>
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
> wrote:
>
> Thank you all for the replies!
>
>
>
> To @Nicholas Chammas  : Thanks for
> cleaning up the error terminology and documentation! I've merged the first
> PR and let's finish others before the 4.0 release.
>
> To @Dongjoon Hyun  : Thanks for driving the
> ANSI on by default effort! Now the vote has passed, let's flip the config

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi,
Thanks for the reply.

From my experience, a build on a build server would be much more
predictable and less error-prone than building on some laptop, and of
course it is much faster to have builds, snapshots, early preview
releases, release candidates or final releases.
It would enable us to have a preview version with the current changes (a
snapshot version), either automatically every day or, if we need to save
costs (although the build is really not expensive), with a click of a
button.

Regarding keys for signing - that's what vaults are for; all across the
industry we use vaults (such as HashiCorp Vault). But if the build is
automated and the only manual step is signing the release (for security
reasons), that would be reasonable.

Thanks,
Nimrod


בתאריך יום ד׳, 8 במאי 2024, 00:54, מאת Holden Karau ‏:

> Indeed. We could conceivably build the release in CI/CD but the final
> verification / signing should be done locally to keep the keys safe (there
> was some concern from earlier release processes).
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:
>
>> Hi,
>>
>> Sorry for the novice question, Wenchen - the release is done manually
>> from a laptop? Not using a CI CD process on a build server?
>>
>> Thanks,
>> Nimrod
>>
>> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>>
>>> UPDATE:
>>>
>>> Unfortunately, it took me quite some time to set up my laptop and get it
>>> ready for the release process (docker desktop doesn't work anymore, my pgp
>>> key is lost, etc.). I'll start the RC process at my tomorrow. Thanks for
>>> your patience!
>>>
>>> Wenchen
>>>
>>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>>
 +1



 *发件人**: *Jungtaek Lim 
 *日期**: *2024年5月2日 星期四 10:21
 *收件人**: *Holden Karau 
 *抄送**: *Chao Sun , Xiao Li ,
 Tathagata Das , Wenchen Fan <
 cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
 nicholas.cham...@gmail.com>, Dongjoon Hyun ,
 Cheng Pan , Spark dev list ,
 Anish Shrigondekar 
 *主题**: *Re: [DISCUSS] Spark 4.0.0 release



 +1 love to see it!



 On Thu, May 2, 2024 at 10:08 AM Holden Karau 
 wrote:

 +1 :) yay previews



 On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:

 +1



 On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:

 +1 for next Monday.



 We can do more previews when the other features are ready for preview.



 Tathagata Das  于2024年5月1日周三 08:46写道:

 Next week sounds great! Thank you Wenchen!



 On Wed, May 1, 2024 at 11:16 AM Wenchen Fan 
 wrote:

 Yea I think a preview release won't hurt (without a branch cut). We
 don't need to wait for all the ongoing projects to be ready. How about we
 do a 4.0 preview release based on the current master branch next Monday?



 On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
 tathagata.das1...@gmail.com> wrote:

 Hey all,



 Reviving this thread, but Spark master has already accumulated a huge
 amount of changes.  As a downstream project maintainer, I want to really
 start testing the new features and other breaking changes, and it's hard to
 do that without a Preview release. So the sooner we make a Preview release,
 the faster we can start getting feedback for fixing things for a great
 Spark 4.0 final release.



 So I urge the community to produce a Spark 4.0 Preview soon even if
 certain features targeting the Delta 4.0 release are still incomplete.



 Thanks!





 On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan 
 wrote:

 Thank you all for the replies!



 To @Nicholas Chammas  : Thanks for
 cleaning up the error terminology and documentation! I've merged the first
 PR and let's finish others before the 4.0 release.

 To @Dongjoon Hyun  : Thanks for driving the
 ANSI on by default effort! Now the vote has passed, let's flip the config
 and finish the DataFrame error context feature before 4.0.

 To @Jungtaek Lim  : Ack. We can treat
 the Streaming state store data source as completed for 4.0 then.

 To @Cheng Pan  : Yea we definitely should have a
 preview release. Let's collect more feedback on the ongoing projects and
 then we can propose a date for the preview release.



 On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:

 will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?

 Thanks,
 Cheng Pan


 > On Apr 15, 2024, at 09:58, Jungtaek Lim 
 wrote:
 >
 > 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
Indeed. We could conceivably build the release in CI/CD but the final
verification / signing should be done locally to keep the keys safe (there
was some concern from earlier release processes).

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, May 7, 2024 at 10:55 AM Nimrod Ofek  wrote:

> Hi,
>
> Sorry for the novice question, Wenchen - the release is done manually from
> a laptop? Not using a CI CD process on a build server?
>
> Thanks,
> Nimrod
>
> On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:
>
>> UPDATE:
>>
>> Unfortunately, it took me quite some time to set up my laptop and get it
>> ready for the release process (docker desktop doesn't work anymore, my pgp
>> key is lost, etc.). I'll start the RC process at my tomorrow. Thanks for
>> your patience!
>>
>> Wenchen
>>
>> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>>
>>> +1
>>>
>>>
>>>
>>> *发件人**: *Jungtaek Lim 
>>> *日期**: *2024年5月2日 星期四 10:21
>>> *收件人**: *Holden Karau 
>>> *抄送**: *Chao Sun , Xiao Li ,
>>> Tathagata Das , Wenchen Fan <
>>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>>> Cheng Pan , Spark dev list ,
>>> Anish Shrigondekar 
>>> *主题**: *Re: [DISCUSS] Spark 4.0.0 release
>>>
>>>
>>>
>>> +1 love to see it!
>>>
>>>
>>>
>>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>>> wrote:
>>>
>>> +1 :) yay previews
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>>
>>> +1 for next Monday.
>>>
>>>
>>>
>>> We can do more previews when the other features are ready for preview.
>>>
>>>
>>>
>>> Tathagata Das  于2024年5月1日周三 08:46写道:
>>>
>>> Next week sounds great! Thank you Wenchen!
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>>
>>> Yea I think a preview release won't hurt (without a branch cut). We
>>> don't need to wait for all the ongoing projects to be ready. How about we
>>> do a 4.0 preview release based on the current master branch next Monday?
>>>
>>>
>>>
>>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>>> tathagata.das1...@gmail.com> wrote:
>>>
>>> Hey all,
>>>
>>>
>>>
>>> Reviving this thread, but Spark master has already accumulated a huge
>>> amount of changes.  As a downstream project maintainer, I want to really
>>> start testing the new features and other breaking changes, and it's hard to
>>> do that without a Preview release. So the sooner we make a Preview release,
>>> the faster we can start getting feedback for fixing things for a great
>>> Spark 4.0 final release.
>>>
>>>
>>>
>>> So I urge the community to produce a Spark 4.0 Preview soon even if
>>> certain features targeting the Delta 4.0 release are still incomplete.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>>
>>> Thank you all for the replies!
>>>
>>>
>>>
>>> To @Nicholas Chammas  : Thanks for cleaning
>>> up the error terminology and documentation! I've merged the first PR and
>>> let's finish others before the 4.0 release.
>>>
>>> To @Dongjoon Hyun  : Thanks for driving the
>>> ANSI on by default effort! Now the vote has passed, let's flip the config
>>> and finish the DataFrame error context feature before 4.0.
>>>
>>> To @Jungtaek Lim  : Ack. We can treat the
>>> Streaming state store data source as completed for 4.0 then.
>>>
>>> To @Cheng Pan  : Yea we definitely should have a
>>> preview release. Let's collect more feedback on the ongoing projects and
>>> then we can propose a date for the preview release.
>>>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>>
>>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>>> wrote:
>>> >
>>> > W.r.t. state data source - reader (SPARK-45511), there are several
>>> follow-up tickets, but we don't plan to address them soon. The current
>>> implementation is the final shape for Spark 4.0.0, unless there are demands
>>> on the follow-up tickets.
>>> >
>>> > We may want to check the plan for transformWithState - my
>>> understanding is that we want to release the feature to 4.0.0, but there
>>> are several remaining works to be done. While the tentative timeline for
>>> releasing is June 2024, what would be the tentative timeline for the RC cut?
>>> > (cc. Anish to add more context on the plan for transformWithState)
>>> >
>>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>>> wrote:
>>> > Hi all,
>>> >
>>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>>> and I think it's time to prepare for it and discuss the ongoing projects:
>>> > •
>>> > ANSI by default
>>> > • Spark Connect GA
>>> > • Structured Logging
>>> > • Streaming state store data source
>>> > • new 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Dongjoon Hyun
Thank you so much for the update, Wenchen!

Dongjoon.

On Tue, May 7, 2024 at 10:49 AM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get it
> ready for the release process (docker desktop doesn't work anymore, my pgp
> key is lost, etc.). I'll start the RC process at my tomorrow. Thanks for
> your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *发件人**: *Jungtaek Lim 
>> *日期**: *2024年5月2日 星期四 10:21
>> *收件人**: *Holden Karau 
>> *抄送**: *Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>> Cheng Pan , Spark dev list ,
>> Anish Shrigondekar 
>> *主题**: *Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> Tathagata Das  于2024年5月1日周三 08:46写道:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>> Thank you all for the replies!
>>
>>
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>>
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>>
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>>
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>>
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > •
>> > ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>
>>
>>
>> --
>>
>> Twitter: https://twitter.com/holdenkarau
>> 
>>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> 
>>
>> YouTube Live Streams: 

caching a dataframe in Spark takes lot of time

2024-05-07 Thread Prem Sahoo
Hello Folks,
In Spark I have read a file, done some transformations, and finally
written the result to HDFS.

Now I am interested in writing the same dataframe to MapRFS, but for this
Spark will execute the full DAG again (recompute all the previous
steps: the read plus all the transformations).

I don't want this recomputation, so I decided to cache() the dataframe so
that the 2nd/nth write won't recompute all the steps.

But here is the catch: the cache() takes a long time to persist the data
in memory.

My question: if the dataframe is already in memory, why does saving it to
another space in memory take so much time (3.2 GB of data, 6 minutes)?

May I know which operations in cache() are taking such a long time?

I would appreciate it if someone would share the information.
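
For illustration, a minimal alternative sketch (the paths and the Parquet
format are assumptions): since the result is already written to HDFS, the
second write can re-read that materialized output instead of relying on
cache(), avoiding a second in-memory copy entirely:

```scala
// Materialize the transformed result once on HDFS (hypothetical path/format).
df.write.mode("overwrite").parquet("hdfs:///tmp/result")

// Re-read the already-materialized files for the second destination, so
// nothing upstream is recomputed and no executor memory is spent on a cache.
val materialized = spark.read.parquet("hdfs:///tmp/result")
materialized.write.mode("overwrite").parquet("maprfs:///tmp/result") // hypothetical path
```

Whether this beats cache() depends on the data size and cluster memory; it
simply trades executor memory for one extra read from HDFS.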


Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi,

Sorry for the novice question, Wenchen - the release is done manually from
a laptop? Not using a CI/CD process on a build server?

Thanks,
Nimrod

On Tue, May 7, 2024 at 8:50 PM Wenchen Fan  wrote:

> UPDATE:
>
> Unfortunately, it took me quite some time to set up my laptop and get it
> ready for the release process (docker desktop doesn't work anymore, my pgp
> key is lost, etc.). I'll start the RC process at my tomorrow. Thanks for
> your patience!
>
> Wenchen
>
> On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:
>
>> +1
>>
>>
>>
>> *发件人**: *Jungtaek Lim 
>> *日期**: *2024年5月2日 星期四 10:21
>> *收件人**: *Holden Karau 
>> *抄送**: *Chao Sun , Xiao Li ,
>> Tathagata Das , Wenchen Fan <
>> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
>> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
>> Cheng Pan , Spark dev list ,
>> Anish Shrigondekar 
>> *主题**: *Re: [DISCUSS] Spark 4.0.0 release
>>
>>
>>
>> +1 love to see it!
>>
>>
>>
>> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
>> wrote:
>>
>> +1 :) yay previews
>>
>>
>>
>> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>>
>> +1
>>
>>
>>
>> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>>
>> +1 for next Monday.
>>
>>
>>
>> We can do more previews when the other features are ready for preview.
>>
>>
>>
>> Tathagata Das  于2024年5月1日周三 08:46写道:
>>
>> Next week sounds great! Thank you Wenchen!
>>
>>
>>
>> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>>
>> Yea I think a preview release won't hurt (without a branch cut). We don't
>> need to wait for all the ongoing projects to be ready. How about we do a
>> 4.0 preview release based on the current master branch next Monday?
>>
>>
>>
>> On Wed, May 1, 2024 at 11:06 PM Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> Reviving this thread, but Spark master has already accumulated a huge
>> amount of changes.  As a downstream project maintainer, I want to really
>> start testing the new features and other breaking changes, and it's hard to
>> do that without a Preview release. So the sooner we make a Preview release,
>> the faster we can start getting feedback for fixing things for a great
>> Spark 4.0 final release.
>>
>>
>>
>> So I urge the community to produce a Spark 4.0 Preview soon even if
>> certain features targeting the Delta 4.0 release are still incomplete.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>>
>> Thank you all for the replies!
>>
>>
>>
>> To @Nicholas Chammas  : Thanks for cleaning
>> up the error terminology and documentation! I've merged the first PR and
>> let's finish others before the 4.0 release.
>>
>> To @Dongjoon Hyun  : Thanks for driving the
>> ANSI on by default effort! Now the vote has passed, let's flip the config
>> and finish the DataFrame error context feature before 4.0.
>>
>> To @Jungtaek Lim  : Ack. We can treat the
>> Streaming state store data source as completed for 4.0 then.
>>
>> To @Cheng Pan  : Yea we definitely should have a
>> preview release. Let's collect more feedback on the ongoing projects and
>> then we can propose a date for the preview release.
>>
>>
>>
>> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>>
>> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
>> wrote:
>> >
>> > W.r.t. state data source - reader (SPARK-45511), there are several
>> follow-up tickets, but we don't plan to address them soon. The current
>> implementation is the final shape for Spark 4.0.0, unless there are demands
>> on the follow-up tickets.
>> >
>> > We may want to check the plan for transformWithState - my understanding
>> is that we want to release the feature to 4.0.0, but there are several
>> remaining works to be done. While the tentative timeline for releasing is
>> June 2024, what would be the tentative timeline for the RC cut?
>> > (cc. Anish to add more context on the plan for transformWithState)
>> >
>> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan 
>> wrote:
>> > Hi all,
>> >
>> > It's close to the previously proposed 4.0.0 release date (June 2024),
>> and I think it's time to prepare for it and discuss the ongoing projects:
>> > •
>> > ANSI by default
>> > • Spark Connect GA
>> > • Structured Logging
>> > • Streaming state store data source
>> > • new data type VARIANT
>> > • STRING collation support
>> > • Spark k8s operator versioning
>> > Please help to add more items to this list that are missed here. I
>> would like to volunteer as the release manager for Apache Spark 4.0.0 if
>> there is no objection. Thank you all for the great work that fills Spark
>> 4.0!
>> >
>> > Wenchen Fan
>>
>>
>>
>>
>> --
>>
>> Twitter: https://twitter.com/holdenkarau
>> 
>>
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> 

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Wenchen Fan
UPDATE:

Unfortunately, it took me quite some time to set up my laptop and get it
ready for the release process (Docker Desktop doesn't work anymore, my PGP
key is lost, etc.). I'll start the RC process tomorrow (my time). Thanks for
your patience!

Wenchen

On Fri, May 3, 2024 at 7:47 AM yangjie01  wrote:

> +1
>
>
>
> *发件人**: *Jungtaek Lim 
> *日期**: *2024年5月2日 星期四 10:21
> *收件人**: *Holden Karau 
> *抄送**: *Chao Sun , Xiao Li ,
> Tathagata Das , Wenchen Fan <
> cloud0...@gmail.com>, Cheng Pan , Nicholas Chammas <
> nicholas.cham...@gmail.com>, Dongjoon Hyun ,
> Cheng Pan , Spark dev list ,
> Anish Shrigondekar 
> *主题**: *Re: [DISCUSS] Spark 4.0.0 release
>
>
>
> +1 love to see it!
>
>
>
> On Thu, May 2, 2024 at 10:08 AM Holden Karau 
> wrote:
>
> +1 :) yay previews
>
>
>
> On Wed, May 1, 2024 at 5:36 PM Chao Sun  wrote:
>
> +1
>
>
>
> On Wed, May 1, 2024 at 5:23 PM Xiao Li  wrote:
>
> +1 for next Monday.
>
>
>
> We can do more previews when the other features are ready for preview.
>
>
>
> Tathagata Das  于2024年5月1日周三 08:46写道:
>
> Next week sounds great! Thank you Wenchen!
>
>
>
> On Wed, May 1, 2024 at 11:16 AM Wenchen Fan  wrote:
>
> Yea I think a preview release won't hurt (without a branch cut). We don't
> need to wait for all the ongoing projects to be ready. How about we do a
> 4.0 preview release based on the current master branch next Monday?
>
>
>
> On Wed, May 1, 2024 at 11:06 PM Tathagata Das 
> wrote:
>
> Hey all,
>
>
>
> Reviving this thread, but Spark master has already accumulated a huge
> amount of changes.  As a downstream project maintainer, I want to really
> start testing the new features and other breaking changes, and it's hard to
> do that without a Preview release. So the sooner we make a Preview release,
> the faster we can start getting feedback for fixing things for a great
> Spark 4.0 final release.
>
>
>
> So I urge the community to produce a Spark 4.0 Preview soon even if
> certain features targeting the Delta 4.0 release are still incomplete.
>
>
>
> Thanks!
>
>
>
>
>
> On Wed, Apr 17, 2024 at 8:35 AM Wenchen Fan  wrote:
>
> Thank you all for the replies!
>
>
>
> To @Nicholas Chammas  : Thanks for cleaning
> up the error terminology and documentation! I've merged the first PR and
> let's finish others before the 4.0 release.
>
> To @Dongjoon Hyun  : Thanks for driving the ANSI
> on by default effort! Now the vote has passed, let's flip the config and
> finish the DataFrame error context feature before 4.0.
>
> To @Jungtaek Lim  : Ack. We can treat the
> Streaming state store data source as completed for 4.0 then.
>
> To @Cheng Pan  : Yea we definitely should have a
> preview release. Let's collect more feedback on the ongoing projects and
> then we can propose a date for the preview release.
>
>
>
> On Wed, Apr 17, 2024 at 1:22 PM Cheng Pan  wrote:
>
> will we have preview release for 4.0.0 like we did for 2.0.0 and 3.0.0?
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 15, 2024, at 09:58, Jungtaek Lim 
> wrote:
> >
> > W.r.t. state data source - reader (SPARK-45511), there are several
> follow-up tickets, but we don't plan to address them soon. The current
> implementation is the final shape for Spark 4.0.0, unless there are demands
> on the follow-up tickets.
> >
> > We may want to check the plan for transformWithState - my understanding
> is that we want to release the feature to 4.0.0, but there are several
> remaining works to be done. While the tentative timeline for releasing is
> June 2024, what would be the tentative timeline for the RC cut?
> > (cc. Anish to add more context on the plan for transformWithState)
> >
> > On Sat, Apr 13, 2024 at 3:15 AM Wenchen Fan  wrote:
> > Hi all,
> >
> > It's close to the previously proposed 4.0.0 release date (June 2024),
> and I think it's time to prepare for it and discuss the ongoing projects:
> > •
> > ANSI by default
> > • Spark Connect GA
> > • Structured Logging
> > • Streaming state store data source
> > • new data type VARIANT
> > • STRING collation support
> > • Spark k8s operator versioning
> > Please help to add more items to this list that are missed here. I would
> like to volunteer as the release manager for Apache Spark 4.0.0 if there is
> no objection. Thank you all for the great work that fills Spark 4.0!
> >
> > Wenchen Fan
>
>
>
>
> --
>
> Twitter: https://twitter.com/holdenkarau
> 
>
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> 
>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> 
>
>


Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks,

I wanted to check why Spark doesn't create a staging dir while doing an
insertInto on partitioned tables. I'm running the example code below:
```
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3)))
val df = spark.createDataFrame(rdd)
df.write.insertInto("testing_table") // testing table is partitioned on "_1"
```
In this scenario FileOutputCommitter considers the table path as the output path
and creates temporary folders like
`/testing_table/_temporary/0`, and then moves them to the
partition location when the job commit happens.

But if multiple parallel apps are inserting into the same
partition, this can cause race conditions when deleting the
`_temporary` dir. Ideally, each app should have a unique staging dir
where its job writes its output.
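For illustration, here is a minimal sketch of that idea (the warehouse path, partition value, and the publish step are assumptions for this example, not something Spark provides out of the box): each app stages its files under a path that only it owns, and only a small final step touches the shared table location.
```
import java.util.UUID
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.col

val partitionValue = 1
// Staging path unique to this application, so no two apps share a _temporary dir.
val stagingDir   = s"/tmp/staging/${spark.sparkContext.applicationId}-${UUID.randomUUID()}"
val partitionDir = s"/user/hive/warehouse/testing_table/_1=$partitionValue" // assumed location

val rdd = sc.parallelize(Seq((1, 5, 1), (1, 1, 2), (1, 4, 3)))
val df  = spark.createDataFrame(rdd)

// 1. Write this partition's rows (without the partition column) into the
//    app-private staging directory.
df.filter(col("_1") === partitionValue).drop("_1")
  .write.mode("overwrite").parquet(stagingDir)

// 2. Publish: move the staged files under the partition directory (prefixing
//    file names with the app id to avoid collisions) and register the partition.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.mkdirs(new Path(partitionDir))
fs.listStatus(new Path(stagingDir))
  .filter(_.getPath.getName.startsWith("part-"))
  .foreach { f =>
    fs.rename(f.getPath,
      new Path(s"$partitionDir/${spark.sparkContext.applicationId}-${f.getPath.getName}"))
  }
spark.sql(s"ALTER TABLE testing_table ADD IF NOT EXISTS PARTITION (`_1` = $partitionValue)")
```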

Is there any specific reason for this, or am I missing something here?
Thanks for your time and assistance regarding this!

Kind regards
Sanskar


Re: ASF board report draft for May

2024-05-06 Thread Matei Zaharia
I’ll mention that we’re working toward a preview release, even if the details 
are not finalized by the time we send the report.

> On May 6, 2024, at 10:52 AM, Holden Karau  wrote:
> 
> I trust Wenchen to manage the preview release effectively, but if there are 
> concerns around how to manage a developer preview release, let's split that off 
> from the board report discussion.
> 
> On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh wrote:
>> I did some historical digging on this.
>> 
>> Whilst both preview release and RCs are pre-release versions, the main 
>> difference lies in their maturity and readiness for production use. Preview 
>> releases are early versions aimed at gathering feedback, while release 
>> candidates (RCs) are nearly finished versions that undergo final testing and 
>> voting before the official release.
>> 
>> So in our case, we have two options:
>> 
>> 1. Skip mentioning the Preview and focus on "We are intending to gather 
>> feedback on version 4 by releasing an earlier version to the community for 
>> look and feel feedback, especially focused on APIs"
>> 2. Mention the Preview in the form: "There will be a Preview release with the aim 
>> of gathering feedback from the community focused on APIs"
>> IMO Preview release does not require a formal vote. Preview releases are 
>> often considered experimental or pre-alpha versions and are not expected to 
>> meet the same level of stability and completeness as release candidates or 
>> final releases.
>> 
>> HTH
>> 
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> 
>> London
>> United Kingdom
>> 
>>view my Linkedin profile 
>> 
>> 
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>  
>> Disclaimer: The information provided is correct to the best of my knowledge 
>> but of course cannot be guaranteed . It is essential to note that, as with 
>> any advice, quote "one test result is worth one-thousand expert opinions 
>> (Werner  Von Braun 
>> )".
>> 
>> 
>> On Mon, 6 May 2024 at 14:10, Mich Talebzadeh wrote:
>>> @Wenchen Fan  
>>> 
>>> Thanks for the update! To clarify, is the vote for approving a specific 
>>> preview build, or is it for moving towards an RC stage? I gather there is a 
>>> distinction between these two?
>>> 
>>> 
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> 
>>> London
>>> United Kingdom
>>> 
>>>view my Linkedin profile 
>>> 
>>> 
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>  
>>> Disclaimer: The information provided is correct to the best of my knowledge 
>>> but of course cannot be guaranteed . It is essential to note that, as with 
>>> any advice, quote "one test result is worth one-thousand expert opinions 
>>> (Werner  Von Braun 
>>> )".
>>> 
>>> 
>>> On Mon, 6 May 2024 at 13:03, Wenchen Fan wrote:
 The preview release also needs a vote. I'll try my best to cut the RC on 
 Monday, but the actual release may take some time. Hopefully, we can get 
 it out this week but if the vote fails, it will take longer as we need 
 more RCs.
 
> On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun wrote:
> +1 for Holden's comment. Yes, it would be great to mention `it` as 
> "soon". 
> (If Wenchen release it on Monday, we can simply mention the release)
> 
> In addition, Apache Spark PMC received an official notice from ASF Infra 
> team.
> 
> https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
> > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for ASF 
> > projects
> 
> To track and comply with the new ASF Infra Policy as much as possible, we 
> opened a blocker-level JIRA issue and have been working on it.
> - https://infra.apache.org/github-actions-policy.html
> 
> Please include a sentence that Apache Spark PMC is working on under the 
> following umbrella JIRA issue.
> 
> https://issues.apache.org/jira/browse/SPARK-48094
> > Reduce GitHub Action usage according to ASF project allowance
> 
> Thanks,
> Dongjoon.
> 
> 
> On Sun, May 5, 2024 at 3:45 PM Holden Karau wrote:
>> Do we want to include that we’re planning on having a preview release of 
>> Spark 4 so folks can see the APIs “soon”?
>> 
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.): 
>> https://amzn.to/2MaRAG9  

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
I trust Wenchen to manage the preview release effectively, but if there are
concerns around how to manage a developer preview release, let's split that
off from the board report discussion.

On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh 
wrote:

> I did some historical digging on this.
>
> Whilst both preview release and RCs are pre-release versions, the main
> difference lies in their maturity and readiness for production use. Preview
> releases are early versions aimed at gathering feedback, while release
> candidates (RCs) are nearly finished versions that undergo final testing
> and voting before the official release.
>
> So in our case, we have two options:
>
>
>    1. Skip mentioning the Preview and focus on "We are intending to
>    gather feedback on version 4 by releasing an earlier version to the
>    community for look and feel feedback, especially focused on APIs"
>    2. Mention the Preview in the form: "There will be a Preview release with
>    the aim of gathering feedback from the community focused on APIs"
>
> IMO Preview release does not require a formal vote. Preview releases are
> often considered experimental or pre-alpha versions and are not expected to
> meet the same level of stability and completeness as release candidates or
> final releases.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 6 May 2024 at 14:10, Mich Talebzadeh 
> wrote:
>
>> @Wenchen Fan 
>>
>> Thanks for the update! To clarify, is the vote for approving a specific
>> preview build, or is it for moving towards an RC stage? I gather there is a
>> distinction between these two?
>>
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 6 May 2024 at 13:03, Wenchen Fan  wrote:
>>
>>> The preview release also needs a vote. I'll try my best to cut the RC on
>>> Monday, but the actual release may take some time. Hopefully, we can get it
>>> out this week but if the vote fails, it will take longer as we need more
>>> RCs.
>>>
>>> On Mon, May 6, 2024 at 7:22 AM Dongjoon Hyun 
>>> wrote:
>>>
 +1 for Holden's comment. Yes, it would be great to mention `it` as
 "soon".
 (If Wenchen release it on Monday, we can simply mention the release)

 In addition, Apache Spark PMC received an official notice from ASF
 Infra team.

 https://lists.apache.org/thread/rgy1cg17tkd3yox7qfq87ht12sqclkbg
 > [NOTICE] Apache Spark's GitHub Actions usage exceeds allowances for
 ASF projects

 To track and comply with the new ASF Infra Policy as much as possible,
 we opened a blocker-level JIRA issue and have been working on it.
 - https://infra.apache.org/github-actions-policy.html

 Please include a sentence that Apache Spark PMC is working on under the
 following umbrella JIRA issue.

 https://issues.apache.org/jira/browse/SPARK-48094
 > Reduce GitHub Action usage according to ASF project allowance

 Thanks,
 Dongjoon.


 On Sun, May 5, 2024 at 3:45 PM Holden Karau 
 wrote:

> Do we want to include that we’re planning on having a preview release
> of Spark 4 so folks can see the APIs “soon”?
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Sun, May 5, 2024 at 3:24 PM Matei Zaharia 
> wrote:
>
>> It’s time for our quarterly ASF board report on Apache Spark this
>> Wednesday. Here’s a draft, feel free to suggest changes.
>>
>> 
>>
>> Description:
>>
>> Apache Spark is a fast and general purpose engine for large-scale
>> data processing. It offers high-level APIs in Java, 

Re: Why spark-submit works with package not with jar

2024-05-06 Thread Mich Talebzadeh
Thanks David. I wanted to explain the difference between Package and Jar
with comments from the community on previous discussions back a few years
ago.
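
The short version, as I understand it: --packages asks Spark (via Ivy) to resolve the artifact together with its transitive dependencies from Maven, while --jars only ships the files explicitly listed, so libraries the connector itself depends on (such as the Google HTTP client library that provides HttpRequestInitializer) never reach the classpath. A minimal sketch of how one might verify this from inside a job (class name and paths simply mirror this thread, so treat them as examples):
```
// Launched two ways (illustrative paths/coordinates from this thread):
//
//   spark-submit --jars /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar ...
//       -> only that single file is shipped; its Maven dependencies are not
//
//   spark-submit --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 ...
//       -> Ivy resolves the artifact *and* its transitive dependencies,
//          e.g. the Google HTTP client library containing HttpRequestInitializer
//
// A quick way to see which situation you are in from inside the job:
import scala.util.Try

val found = Try(Class.forName("com.google.api.client.http.HttpRequestInitializer")).isSuccess
println(s"HttpRequestInitializer on classpath: $found") // false under --jars only, true under --packages
```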

cheers


Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime


London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 6 May 2024 at 18:32, David Rabinowitz  wrote:

> Hi,
>
> It seems this library is several years old. Have you considered using the
> Google provided connector? You can find it in
> https://github.com/GoogleCloudDataproc/spark-bigquery-connector
>
> Regards,
> David Rabinowitz
>
> On Sun, May 5, 2024 at 6:07 PM Jeff Zhang  wrote:
>
>> Are you sure com.google.api.client.http.HttpRequestInitializer is in
>> the spark-bigquery-latest.jar, or might it be in a transitive dependency
>> of spark-bigquery_2.11?
>>
>> On Sat, May 4, 2024 at 7:43 PM Mich Talebzadeh 
>> wrote:
>>
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> -- Forwarded message -
>>> From: Mich Talebzadeh 
>>> Date: Tue, 20 Oct 2020 at 16:50
>>> Subject: Why spark-submit works with package not with jar
>>> To: user @spark 
>>>
>>>
>>> Hi,
>>>
>>> I have a scenario that I use in Spark submit as follows:
>>>
>>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar,
>>> */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar*
>>>
>>> As you can see the jar files needed are added.
>>>
>>>
>>> This comes back with error message as below
>>>
>>>
>>> Creating model test.weights_MODEL
>>>
>>> java.lang.NoClassDefFoundError:
>>> com/google/api/client/http/HttpRequestInitializer
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq$lzycompute(BigQuerySQLContext.scala:19)
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.bq(BigQuerySQLContext.scala:19)
>>>
>>>   at
>>> com.samelamin.spark.bigquery.BigQuerySQLContext.runDMLQuery(BigQuerySQLContext.scala:105)
>>>
>>>   ... 76 elided
>>>
>>> Caused by: java.lang.ClassNotFoundException:
>>> com.google.api.client.http.HttpRequestInitializer
>>>
>>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>>>
>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>
>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>>
>>>
>>> So there is an issue with finding the class, although the jar file used
>>>
>>>
>>> /home/hduser/jars/spark-bigquery_2.11-0.2.6.jar
>>>
>>> has it.
>>>
>>>
>>> Now if *I remove the above jar file and replace it with the same
>>> version but as a package*, it works!
>>>
>>>
>>> spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars
>>> /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar
>>> *--packages com.github.samelamin:spark-bigquery_2.11:0.2.6*
>>>
>>>
>>> I have read the write-ups about packages searching the Maven
>>> repositories etc. I'm not convinced why using the package should make so much
>>> difference between failure and success. In other words, when should one use a
>>> package rather than a jar?
>>>
>>>
>>> Any ideas will be appreciated.
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>

