Re: [External] Re: push-based external shuffle service on K8S - Spark 4.0? Earlier versions?

2024-06-07 Thread Ofir Manor
Hi Ye - I am running Spark on K8S, looking to see if someone made external shuffle service on K8S work in their environment (ex: with some out-of-tree patches or hacks), as the push-based variant seems like it would be a great fit for me. Ofir From: Ye Zhou

Re: push-based external shuffle service on K8S - Spark 4.0? Earlier versions?

2024-06-06 Thread Ye Zhou
Hi Ofir. Right now, push-based shuffle within Spark is only supported for Spark on YARN, with the external shuffle service running as an auxiliary service in the NodeManager, but not natively on K8s. As far as I know, there are no recent plans to add native support for Spark on K8s. For question

Re: push-based external shuffle service on K8S - Spark 4.0? Earlier versions?

2024-06-06 Thread Keyong Zhou
Hi Ofir, I can provide some information about use cases for Apache Celeborn. Apache Celeborn can be deployed on K8s and standalone; both are widely used in production environments. The largest cluster I know of contains more than 1,000 Celeborn workers. Celeborn is especially beneficial for

push-based external shuffle service on K8S - Spark 4.0? Earlier versions?

2024-06-06 Thread Ofir Manor
Hi, Regarding the external shuffle service on K8S and especially the push-based variant that was merged in 3.2: 1. Are there plans to make it supported and work out-of-the-box in 4.0? 2. Did anyone make it work for themselves in 3.5 or earlier? If so, can you share your experience and what

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Matthew Powers
I am a huge fan of the Apache Spark docs and I regularly look at the analytics on this page to see how well they are doing. Great work to everyone that's contributed to the

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Neil Ramaswamy
Thanks all for the responses. Let me try to address everything. > the programming guides are also different between versions since features are being added, configs are being added/ removed/ changed, defaults are being changed etc. I agree that this is the case. But I think it's fine to mention

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Wenchen Fan
I agree with the idea of a versionless programming guide. But one thing we need to make sure of is we give clear messages for things that are only available in a new version. My proposal is: 1. keep the old versions' programming guide unchanged. For example, people can still access

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Martin Andersson
While I have no practical knowledge of how documentation is maintained in the Spark project, I must agree with Nimrod. For users on older versions, having a programming guide that refers to features or API methods that do not exist in that version is confusing and detrimental. Surely there

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-05 Thread Nimrod Ofek
Hi Neil, While you wrote you don't mean the api docs (of course), the programming guides are also different between versions since features are being added, configs are being added/ removed/ changed, defaults are being changed etc. I know of "backport hell" - which is why I wrote that once a

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Neil Ramaswamy
Hi Nimrod, Quick clarification: my proposal will not touch API-specific documentation for the specific reasons you mentioned (signatures, behavior, etc.). It just aims to make the *programming guides* versionless. Programming guides should teach fundamentals of Spark, and the fundamentals of Spark

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Nimrod Ofek
Hi, While I think that the documentation needs a lot of improvement and important details are missing, and detaching the documentation from the main project can help iterate faster on documentation-specific tasks, I don't think we can or should move to versionless documentation.

Re: [DISCUSS] Versionless Spark Programming Guide Proposal

2024-06-04 Thread Praveen Gattu
+1. This helps for greater velocity in improving docs. However, we might still need a way to provide version specific information isn't it, i.e. what features are available in which version etc. On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy wrote: > Hi all, > > I've written up a proposal to

[ANNOUNCE] Announcing Apache Spark 4.0.0-preview1

2024-06-03 Thread Wenchen Fan
Hi all, To enable wide-scale community testing of the upcoming Spark 4.0 release, the Apache Spark community has posted a preview release of Spark 4.0. This preview is not a stable release in terms of either API or functionality, but it is meant to give the community early access to try the code

[DISCUSS] Variant shredding specification

2024-06-03 Thread Gene Pang
Hi all, We have been working on the Variant data type, which is designed to store and process semi-structured data efficiently, even with heterogeneous values. Users can store and process semi-structured data in a flexible way, without having to specify or know any fixed schema on write. Variant

Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-06-02 Thread Wenchen Fan
The vote passes with 6 +1s (4 binding +1s). (* = binding) +1: Wenchen Fan (*) Kent Yao Cheng Pan Xiao Li (*) Gengliang Wang (*) Tathagata Das (*) Thanks all! On Fri, May 31, 2024 at 6:07 PM Tathagata Das wrote: > +1 > - Tested RC3 with Delta Lake. All our Scala and Python tests pass. > > On

Unsubscribe

2024-05-31 Thread Ashish Singh

Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-31 Thread Tathagata Das
+1 - Tested RC3 with Delta Lake. All our Scala and Python tests pass. On Fri, May 31, 2024 at 3:24 PM Xiao Li wrote: > +1 > > On Thu, May 30, 2024 at 09:48, Cheng Pan wrote: > >> +1 (non-binding) >> >> - All links are valid >> - Run some basic queries using YARN client mode with Apache Hadoop v3.3.6, >> HMS

Unsubscribe

2024-05-31 Thread Ashish
Sent from my iPhone - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-31 Thread Gengliang Wang
+1 On Fri, May 31, 2024 at 11:06 AM Xiao Li wrote: > +1 > > On Thu, May 30, 2024 at 09:48, Cheng Pan wrote: > >> +1 (non-binding) >> >> - All links are valid >> - Run some basic queries using YARN client mode with Apache Hadoop v3.3.6, >> HMS 2.3.9 >> - Pass integration tests with Apache Kyuubi v1.9.1 RC0 >>

Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-31 Thread Xiao Li
+1 On Thu, May 30, 2024 at 09:48, Cheng Pan wrote: > +1 (non-binding) > > - All links are valid > - Run some basic queries using YARN client mode with Apache Hadoop v3.3.6, > HMS 2.3.9 > - Pass integration tests with Apache Kyuubi v1.9.1 RC0 > > Thanks, > Cheng Pan > > > On May 29, 2024, at 02:48, Wenchen Fan

Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-30 Thread Cheng Pan
+1 (non-binding) - All links are valid - Run some basic queries using YARN client mode with Apache Hadoop v3.3.6, HMS 2.3.9 - Pass integration tests with Apache Kyuubi v1.9.1 RC0 Thanks, Cheng Pan > On May 29, 2024, at 02:48, Wenchen Fan wrote: > > Please vote on releasing the following

Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-30 Thread Kent Yao
+1 (non-binding), I have checked: - Download links are fine - Signatures and integrities are fine - Build from source - run-example runs successfully with some example code - No blocking issues from my side - Duplicated jars[1][2] found in both hive-jackson and examples/jars, the latter seems not

Unsubscribe

2024-05-29 Thread Jang tao
Unsubscribe

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-28 Thread Wenchen Fan
Hi all, I've created a PR to put the behavior change guideline on the Spark website: https://github.com/apache/spark-website/pull/518 . Please leave comments if you have any, thanks! On Wed, May 15, 2024 at 1:41 AM Wenchen Fan wrote: > Thanks all for the feedback here! Let me put up a new

unsubscribe

2024-05-28 Thread Lucas De Jaeger
unsubscribe

Re: [VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-28 Thread Wenchen Fan
one correction: "The tag to be voted on is v4.0.0-preview1-rc2 (commit 7cfe5a6e44e8d7079ae29ad3e2cee7231cd3dc66)" should be "The tag to be voted on is v4.0.0-preview1-rc3 (commit 7a7a8bc4bab591ac8b98b2630b38c57adf619b82):" On Tue, May 28, 2024 at 11:48 AM Wenchen Fan wrote: > Please vote on

[VOTE] SPARK 4.0.0-preview1 (RC3)

2024-05-28 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 4.0.0-preview1. The vote is open until May 31 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 4.0.0-preview1 [ ] -1 Do not release this package
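The passing rule quoted in this vote email can be sketched as a small check. This is only an illustration: it assumes "majority +1 PMC votes" means more binding +1 votes than binding -1 votes, and the `Vote` class and `vote_passes` function are hypothetical names, not part of any Spark or ASF tooling. The sample data is the tally reported in the RC3 result message in this listing (6 +1s, 4 of them binding).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str
    value: int      # +1, 0, or -1
    binding: bool   # True for PMC members (marked "*" in the result email)


def vote_passes(votes, min_binding_plus_ones=3):
    """Assumed reading of the rule: at least 3 binding +1 votes,
    and more binding +1 votes than binding -1 votes."""
    binding_plus = sum(1 for v in votes if v.binding and v.value == 1)
    binding_minus = sum(1 for v in votes if v.binding and v.value == -1)
    return binding_plus >= min_binding_plus_ones and binding_plus > binding_minus


# Tally as reported in the RC3 result message.
rc3_votes = [
    Vote("Wenchen Fan", 1, True),
    Vote("Kent Yao", 1, False),
    Vote("Cheng Pan", 1, False),
    Vote("Xiao Li", 1, True),
    Vote("Gengliang Wang", 1, True),
    Vote("Tathagata Das", 1, True),
]

print("RC3 passes:", vote_passes(rc3_votes))
```

Non-binding votes are recorded but do not affect the outcome in this sketch, which matches how the result emails list them separately from the binding count.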

Re: [VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Wenchen Fan
Thanks for the quick reply! I'm cutting RC3 now. On Tue, May 28, 2024 at 2:28 AM Kent Yao wrote: > -1 > > You've updated your key in [2] with a new one [1]. I believe you should > add your new key without removing the old one. Otherwise, users cannot > verify those archived releases you

Re: [VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Kent Yao
-1 You've updated your key in [2] with a new one [1]. I believe you should add your new key without removing the old one. Otherwise, users cannot verify those archived releases you published. Thanks, Kent Yao [1] https://dist.apache.org/repos/dist/dev/spark/KEYS [2]

Re: [VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Yi Wu
-1 I think we should include this bug fix https://github.com/apache/spark/commit/6cd1ccc56321dfa52672cd25f4cfdf2bbc86b3ea. The bug can lead to unrecoverable job failures. Thanks, Yi On Tue, May 28, 2024 at 3:45 PM Wenchen Fan wrote: > Please vote on releasing the following candidate as

[VOTE] SPARK 4.0.0-preview1 (RC2)

2024-05-28 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 4.0.0-preview1. The vote is open until May 31 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 4.0.0-preview1 [ ] -1 Do not release this package

ArrowUtilSuite Fails with "NoSuchFieldError: chunkSize"

2024-05-28 Thread Senthil Kumar
Hello Team, We are seeing the ArrowUtilSuite test fail with a "NoSuchFieldError: chunkSize" error. java.lang.NoSuchFieldError: Class io.netty.buffer.PoolArena does not have member field 'int chunkSize'. The Netty library does not have the field 'int chunkSize' in 4.1.72/74/82/84, or even in higher versions

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Sorry, I thought I gave an explanation. The issue you are encountering with incorrect record numbers in the "ShuffleWrite Size/Records" column in the Spark DAG UI when data is read from cache/persist is a known limitation. This discrepancy arises due to the way Spark handles and reports shuffle

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Just to further clarify: the Shuffle Write Size/Records column in the Spark UI can be misleading when working with cached/persisted data because it reflects the shuffled data size and record count, not the entire cached/persisted data. So it is fair to say that this is a limitation of the

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show incorrect record counts *when data is retrieved from cache or persisted data*. This happens because the record count reflects the number of records written to disk for shuffling, and not the actual number of records in the

Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me? On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote: > Does anyone have a clue? > > On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > >> Hello Team, >> in the Spark DAG UI, we have the Stages tab. Once you click on each stage you >> can view the tasks. >> >> In each

Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue? On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > Hello Team, > in the Spark DAG UI, we have the Stages tab. Once you click on each stage you can > view the tasks. > > In each task we have a column "ShuffleWrite Size/Records". That column > prints wrong data when it gets

BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Hello Team, in the Spark DAG UI, we have the Stages tab. Once you click on each stage you can view the tasks. In each task we have a column "ShuffleWrite Size/Records". That column prints wrong data when it gets the data from cache/persist. It typically will show the wrong record number though the data

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Nicholas Chammas
[dev list to bcc] This is a question for the user list or for Stack Overflow. The dev list is for discussions related to the development of Spark itself. Nick > On May 21, 2024, at 6:58 AM,

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
Hello Vibhor, Thanks for the suggestion. I am looking for other alternatives where the same dataframe can be written to two destinations without re-execution, cache, or persist. Can someone help me with scenario 2? How can I make Spark write to MinIO faster? Sent from my iPhone

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-20 Thread Vibhor Gupta
Hi Prem, You can try to write to HDFS then read from HDFS and write to MinIO. This will prevent duplicate transformation. You can also try persisting the dataframe using the DISK_ONLY level. Regards, Vibhor From: Prem Sahoo Date: Tuesday, 21 May 2024 at 8:16 AM To: Spark dev list Subject:

Dual Write to HDFS and MinIO in faster way

2024-05-20 Thread Prem Sahoo
Hello Team, I am planning to write to two data sources at the same time. Scenario: writing the same dataframe to HDFS and MinIO without re-executing the transformations and with no cache(). Then how can we make it faster? Read the parquet file and do a few transformations and write to HDFS and

[VOTE][RESULT] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
The vote passes with 13 +1s (8 binding +1s) and 1 +0. (* = binding) +1: Chao Sun (*) Liang-Chi Hsieh (*) Huaxin Gao (*) Bo Yang Dongjoon Hyun (*) Kent Yao Wenchen Fan (*) Ryan Blue Anton Okolnychyi Zhou Jiang Gengliang Wang (*) Xiao Li (*) Hyukjin Kwon (*) +0: Mich Talebzadeh -1: None

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread L. C. Hsieh
Hi all, Thanks all for participating and for your support! The vote has passed. I'll send out the result in a separate thread. On Wed, May 15, 2024 at 4:44 PM Hyukjin Kwon wrote: > > +1 > > On Tue, 14 May 2024 at 16:39, Wenchen Fan wrote: >> >> +1 >> >> On Tue, May 14, 2024 at 8:19 AM Zhou

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-15 Thread Hyukjin Kwon
+1 On Tue, 14 May 2024 at 16:39, Wenchen Fan wrote: > +1 > > On Tue, May 14, 2024 at 8:19 AM Zhou Jiang wrote: > >> +1 (non-binding) >> >> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >>> Hi all, >>> >>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >>> >>>

Re: [DISCUSS] clarify the definition of behavior changes

2024-05-15 Thread Wenchen Fan
Thanks all for the feedback here! Let me put up a new version, which clarifies the definition of "users": Behavior changes mean user-visible functional changes in a new release via public APIs. The "user" here is not only the user who writes queries and/or develops Spark plugins, but also the

Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-15 Thread Wenchen Fan
RC1 failed because of this issue. I'll cut RC2 after we downgrade Jetty to 9.x. On Sat, May 11, 2024 at 3:37 PM Cheng Pan wrote: > -1 (non-binding) > > A small question, the tag is orphan but I suppose it should belong to the > master branch. > > Seems YARN integration is broken due to javax =>

Community over Code EU 2024: The countdown has started!

2024-05-14 Thread Ryan Skraba
[Note: You're receiving this email because you are subscribed to one or more project dev@ mailing lists at the Apache Software Foundation.] We are very close to Community Over Code EU -- check out the amazing program and the special discounts that we have for you. Special discounts You still

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Xiao Li
+1 On Mon, May 13, 2024 at 16:24, Gengliang Wang wrote: > +1 > > On Mon, May 13, 2024 at 12:30 PM Zhou Jiang > wrote: > >> +1 (non-binding) >> >> On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >>> Hi all, >>> >>> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >>> >>> Please

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Wenchen Fan
+1 On Tue, May 14, 2024 at 8:19 AM Zhou Jiang wrote: > +1 (non-binding) > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > >> Hi all, >> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >> >> Please also refer to: >> >>- Discussion thread: >>

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Gengliang Wang
+1 On Mon, May 13, 2024 at 12:30 PM Zhou Jiang wrote: > +1 (non-binding) > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > >> Hi all, >> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >> >> Please also refer to: >> >>- Discussion thread: >>

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Zhou Jiang
+1 (non-binding) On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > Hi all, > > I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. > > Please also refer to: > >- Discussion thread: > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo >- JIRA ticket:

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Anton Okolnychyi
+1 On 2024/05/13 15:33:33 Ryan Blue wrote: > +1 > > On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh > wrote: > > > +0 > > > > For reasons I outlined in the discussion thread > > > > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo > > > > Mich Talebzadeh, > > Technologist |

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Ryan Blue
+1 On Mon, May 13, 2024 at 12:31 AM Mich Talebzadeh wrote: > +0 > > For reasons I outlined in the discussion thread > > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo > > Mich Talebzadeh, > Technologist | Architect | Data Engineer | Generative AI | FinCrime > London > United

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-13 Thread Wenchen Fan
Hi Nicholas, Thanks for your help! I'm definitely interested in participating in this unification work. Let me know how I can help. Wenchen On Mon, May 13, 2024 at 1:41 PM Nicholas Chammas wrote: > Re: unification > > We also have a long-standing problem with how we manage Python >

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-13 Thread Mich Talebzadeh
+0 For reasons I outlined in the discussion thread https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime London United Kingdom view my Linkedin profile

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Nicholas Chammas
Re: unification We also have a long-standing problem with how we manage Python dependencies, something I’ve tried (unsuccessfully) to fix in the past. Consider, for example, how many separate places this numpy dependency is installed: 1.

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Wenchen Fan
+1 On Mon, May 13, 2024 at 10:30 AM Kent Yao wrote: > +1 > > On Mon, May 13, 2024 at 08:39, Dongjoon Hyun wrote: > > > > +1 > > > > On Sun, May 12, 2024 at 3:50 PM huaxin gao > wrote: > >> > >> +1 > >> > >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > >>> > >>> +1 > >>> > >>> On Sat, May 11, 2024

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-12 Thread Wenchen Fan
After finishing the 4.0.0-preview1 RC1, I have more experience with this topic now. In fact, the main job of the release process, building packages and documents, is tested in GitHub Actions jobs. However, the way we test them is different from what we do in the release scripts. 1. the execution

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Kent Yao
+1 On Mon, May 13, 2024 at 08:39, Dongjoon Hyun wrote: > > +1 > > On Sun, May 12, 2024 at 3:50 PM huaxin gao wrote: >> >> +1 >> >> On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: >>> >>> +1 >>> >>> On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: >>> > >>> > +1 >>> > >>> > On Sat, May 11, 2024 at 2:10 PM

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread Dongjoon Hyun
+1 On Sun, May 12, 2024 at 3:50 PM huaxin gao wrote: > +1 > > On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > >> +1 >> >> On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: >> > >> > +1 >> > >> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >> >> >> Hi all, >> >> >> >> I’d like to

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-12 Thread bo yang
+1 On Sat, May 11, 2024 at 4:43 PM huaxin gao wrote: > +1 > > On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > >> +1 >> >> On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: >> > >> > +1 >> > >> > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >> >> >> Hi all, >> >> >> >> I’d like to

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread huaxin gao
+1 On Sat, May 11, 2024 at 4:35 PM L. C. Hsieh wrote: > +1 > > On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: > > > > +1 > > > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > >> > >> Hi all, > >> > >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. > >> > >>

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
+1 On Sat, May 11, 2024 at 3:11 PM Chao Sun wrote: > > +1 > > On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: >> >> Hi all, >> >> I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. >> >> Please also refer to: >> >>- Discussion thread: >>

Re: [VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Chao Sun
+1 On Sat, May 11, 2024 at 2:10 PM L. C. Hsieh wrote: > Hi all, > > I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. > > Please also refer to: > >- Discussion thread: > https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo >- JIRA ticket:

[VOTE] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread L. C. Hsieh
Hi all, I’d like to start a vote for SPIP: Stored Procedures API for Catalogs. Please also refer to: - Discussion thread: https://lists.apache.org/thread/7r04pz544c9qs3gc8q2nyj3fpzfnv8oo - JIRA ticket: https://issues.apache.org/jira/browse/SPARK-44167 - SPIP doc:

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Mich Talebzadeh
Thanks In the context of stored procedures API for Catalogs, this approach deviates from the traditional definition of stored procedures in RDBMS for two key reasons: - Compilation vs. Interpretation: Traditional stored procedures are typically pre-compiled into machine code for faster

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-11 Thread Anton Okolnychyi
Mich, I don't think the invalidation will be necessary in our case as there is no plan to preprocess or compile the procedures into executable objects. They will be loaded and executed on demand via the Catalog API. On Fri, May 10, 2024 at 10:37, Mich Talebzadeh wrote: > Hi, > > If the underlying

Re: [VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-11 Thread Cheng Pan
-1 (non-binding) A small question: the tag is orphaned, but I suppose it should belong to the master branch. It seems YARN integration is broken due to the javax => jakarta namespace migration. I filed SPARK-48238 and left some comments on https://github.com/apache/spark/pull/45154 Caused by:

[VOTE] SPARK 4.0.0-preview1 (RC1)

2024-05-10 Thread Wenchen Fan
Please vote on releasing the following candidate as Apache Spark version 4.0.0-preview1. The vote is open until May 16 PST and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 4.0.0-preview1 [ ] -1 Do not release this package

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-10 Thread Mich Talebzadeh
Hi, If the underlying table changes (DDL), if I recall from RDBMSs like Oracle, the stored procedure will be invalidated as it is a compiled object. How is this going to be handled? Does it follow the same mechanism? Thanks Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread huaxin gao
Thanks Anton for the updated proposal -- it looks great! I appreciate the hard work put into refining it. I am looking forward to the upcoming vote and moving forward with this initiative. Thanks, Huaxin On Thu, May 9, 2024 at 7:30 PM L. C. Hsieh wrote: > Thanks Anton. Thank you, Wenchen,

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Wenchen Fan
Thanks for leading this project! Let's move forward. On Fri, May 10, 2024 at 10:31 AM L. C. Hsieh wrote: > Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and > others if I miss those who are participating in the discussion. > > I suppose we have reached a consensus or close to

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread L. C. Hsieh
Thanks Anton. Thank you, Wenchen, Dongjoon, Ryan, Serge, Allison and others if I miss those who are participating in the discussion. I suppose we have reached, or are close to reaching, a consensus on the design. If you have some more comments, please let us know. If not, I will start a vote soon

Re: [DISCUSS] SPIP: Stored Procedures API for Catalogs

2024-05-09 Thread Anton Okolnychyi
Thanks to everyone who commented on the design doc. I updated the proposal and it is ready for another look. I hope we can converge and move forward with this effort! - Anton On Fri, Apr 19, 2024 at 15:54, Anton Okolnychyi wrote: > Hi folks, > > I'd like to start a discussion on SPARK-44167 that

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE: I've successfully uploaded the release packages: https://dist.apache.org/repos/dist/dev/spark/v4.0.0-preview1-rc1-bin/ (I skipped SparkR as I was not able to fix the errors, I'll get back to it later) However, there is a new issue with doc building:

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Please re-try to upload, Wenchen. ASF Infra team bumped up our upload limit based on our request. > Your upload limit has been increased to 650MB Dongjoon. On Thu, May 9, 2024 at 8:12 AM Wenchen Fan wrote: > I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776 > > On

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
I've created a ticket: https://issues.apache.org/jira/browse/INFRA-25776 On Thu, May 9, 2024 at 11:06 PM Dongjoon Hyun wrote: > In addition, FYI, I was the latest release manager with Apache Spark 3.4.3 > (2024-04-15 Vote) > > According to my work log, I uploaded the following binaries to SVN

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
In addition, FYI, I was the latest release manager with Apache Spark 3.4.3 (2024-04-15 Vote) According to my work log, I uploaded the following binaries to SVN from EC2 (us-west-2) without any issues. -rw-r--r--. 1 centos centos 311384003 Apr 15 01:29 pyspark-3.4.3.tar.gz -rw-r--r--. 1 centos

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Dongjoon Hyun
Could you file an INFRA JIRA issue with the error message and context first, Wenchen? As you know, if we see something, we had better file a JIRA issue because it could be not only an Apache Spark project issue but also all ASF project issues. Dongjoon. On Thu, May 9, 2024 at 12:28 AM Wenchen

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Wenchen Fan
Thanks for starting the discussion! To add a bit more color, we should at least add a test job to make sure the release script can produce the packages correctly. Today it's kind of being manually tested by the release manager each time, which slows down the release process. It's better if we can

Re: [DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Hussein Awala
Hello, I can answer some of your common questions with other Apache projects. > Who currently has permissions for Github actions? Is there a specific owner for that today or a different volunteer each time? The Apache organization owns Github Actions, and committers (contributors with write

[DISCUSS] Spark - How to improve our release processes

2024-05-09 Thread Nimrod Ofek
Following the conversation started with the Spark 4.0.0 release, this is a thread to discuss improvements to our release processes. I'll start by raising some questions that probably should have answers to start the discussion: 1. What is currently running in GitHub Actions? 2. Who currently

Re: [DISCUSS] Spark 4.0.0 release

2024-05-09 Thread Wenchen Fan
UPDATE: After resolving a few issues in the release scripts, I can finally build the release packages. However, I can't upload them to the staging SVN repo due to a transmitting error, and it seems like a limitation from the server side. I tried it on both my local laptop and remote AWS instance,

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Very helpful! On Wed, May 8, 2024 at 9:07 AM Mich Talebzadeh wrote: > *Potential reasons* > > >- Data Serialization: Spark needs to serialize the DataFrame into an >in-memory format suitable for storage. This process can be time-consuming, >especially for large datasets like 3.2 GB

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Holden Karau
That looks cool, maybe let’s split off a thread on how to improve our release processes? Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Erik Krogen
On that note, GitHub recently released (public preview) a new feature called Artifact Attestations which may be relevant/useful here: Introducing Artifact Attestations–now in public beta - The GitHub Blog On Wed,

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Mich Talebzadeh
*Potential reasons* - Data Serialization: Spark needs to serialize the DataFrame into an in-memory format suitable for storage. This process can be time-consuming, especially for large datasets like 3.2 GB with complex schemas. - Shuffle Operations: If your transformations involve

Re: caching a dataframe in Spark takes lot of time

2024-05-08 Thread Prem Sahoo
Could anyone help me here? Sent from my iPhone > On May 7, 2024, at 4:30 PM, Prem Sahoo wrote: > > Hello Folks, > in Spark I have read a file and done some transformation and finally writing > to hdfs. > > Now I am interested in writing the same dataframe to MapRFS but for this > Spark

Re: [DISCUSS] Spark 4.0.0 release

2024-05-08 Thread Nimrod Ofek
I have no permissions so I can't do it, but I'm happy to help (although I am more familiar with GitLab CI/CD than GitHub Actions). Is there some point of contact that can provide me the needed context and permissions? I'd also love to see why the costs are high and see how we can reduce them... Thanks,

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
I think signing the artifacts produced from a secure CI sounds like a good idea. I know we’ve been asked to reduce our GitHub action usage but perhaps someone interested could volunteer to set that up. Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.):

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi, Thanks for the reply. From my experience, a build on a build server would be much more predictable and less error-prone than building on some laptop, and of course much faster for producing builds, snapshots, early preview releases, release candidates or final releases. It

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Holden Karau
Indeed. We could conceivably build the release in CI/CD but the final verification / signing should be done locally to keep the keys safe (there was some concern from earlier release processes). Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.):

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Dongjoon Hyun
Thank you so much for the update, Wenchen! Dongjoon. On Tue, May 7, 2024 at 10:49 AM Wenchen Fan wrote: > UPDATE: > > Unfortunately, it took me quite some time to set up my laptop and get it > ready for the release process (docker desktop doesn't work anymore, my pgp > key is lost, etc.). I'll

caching a dataframe in Spark takes lot of time

2024-05-07 Thread Prem Sahoo
Hello Folks, in Spark I have read a file, done some transformations, and finally written to HDFS. Now I am interested in writing the same dataframe to MapRFS, but for this Spark will execute the full DAG again (recompute all the previous steps: the read plus all transformations). I don't want

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Nimrod Ofek
Hi, Sorry for the novice question, Wenchen: is the release done manually from a laptop? Not using a CI/CD process on a build server? Thanks, Nimrod On Tue, May 7, 2024 at 8:50 PM Wenchen Fan wrote: > UPDATE: > > Unfortunately, it took me quite some time to set up my laptop and get it > ready

Re: [DISCUSS] Spark 4.0.0 release

2024-05-07 Thread Wenchen Fan
UPDATE: Unfortunately, it took me quite some time to set up my laptop and get it ready for the release process (docker desktop doesn't work anymore, my pgp key is lost, etc.). I'll start the RC process tomorrow, my time. Thanks for your patience! Wenchen On Fri, May 3, 2024 at 7:47 AM yangjie01

Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks, I wanted to check why spark doesn't create staging dir while doing an insertInto on partitioned tables. I'm running below example code – ``` spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3))) val df =

Re: ASF board report draft for May

2024-05-06 Thread Matei Zaharia
I’ll mention that we’re working toward a preview release, even if the details are not finalized by the time we send the report. > On May 6, 2024, at 10:52 AM, Holden Karau wrote: > > I trust Wenchen to manage the preview release effectively but if there are > concerns around how to manage a

Re: ASF board report draft for May

2024-05-06 Thread Holden Karau
I trust Wenchen to manage the preview release effectively, but if there are concerns around how to manage a developer preview release let's split that off from the board report discussion. On Mon, May 6, 2024 at 10:44 AM Mich Talebzadeh wrote: > I did some historical digging on this. > > Whilst
