Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Yuanjian Li
+1

Chao Sun  wrote on Mon, Apr 1, 2024 at 07:56:

> +1
>
> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
> wrote:
>
>> Oh, I didn't send the discussion thread out as it's pretty simple and
>> non-invasive, and the discussion was sort of done as part of the Spark
>> Connect initial discussion ...
>>
>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Can you point me to the SPIP’s discussion thread, please?
>>> I was not able to find it, but I was on vacation, and so might have
>>> missed this …
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>
>>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>>  wrote:
>>>
 +1

 On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
 wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI
> (Spark Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>
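
For context, here is a minimal sketch of what the proposed pure Python package would enable: connecting to a Spark Connect server without a local JVM or the full Spark distribution. The `remote()` builder API already exists in PySpark 3.4+; the install command is illustrative, since the SPIP is what decides the actual package name.

```python
# Illustrative only -- the pure-Python distribution name is decided by the
# SPIP; e.g.:  pip install <pure-python-spark-connect-package>
from pyspark.sql import SparkSession

# Connect to a running Spark Connect server; no JVM is needed client-side.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).selectExpr("id * 2 AS doubled").show()
spark.stop()
```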



Re: [VOTE] SPIP: State Data Source - Reader

2023-10-25 Thread Yuanjian Li
+1

Jungtaek Lim  wrote on Wed, Oct 25, 2023 at 01:06:

> Friendly reminder: the VOTE thread got 2 binding votes and needs 1 more
> binding vote to pass.
>
> On Wed, Oct 25, 2023 at 1:21 AM Bartosz Konieczny 
> wrote:
>
>> +1
>>
>> On Tuesday, October 24, 2023, Jia Fan  wrote:
>>
>>> +1
>>>
>>> L. C. Hsieh  wrote on Tue, Oct 24, 2023 at 13:23:
>>>
 +1

 On Mon, Oct 23, 2023 at 6:31 PM Anish Shrigondekar
  wrote:
 >
 > +1 (non-binding)
 >
 > Thanks,
 > Anish
 >
 > On Mon, Oct 23, 2023 at 5:01 PM Wenchen Fan 
 wrote:
 >>
 >> +1
 >>
 >> On Mon, Oct 23, 2023 at 4:03 PM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:
 >>>
 >>> Starting with my +1 (non-binding). Thanks!
 >>>
 >>> On Mon, Oct 23, 2023 at 1:23 PM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:
 
  Hi all,
 
  I'd like to start the vote for SPIP: State Data Source - Reader.
 
  The high-level summary of the SPIP is that we propose a new data
 source which enables reading the state store in a checkpoint via a
 batch query. This would enable two major use cases: 1) constructing
 tests that verify the state store, and 2) inspecting values in the
 state store when investigating an incident.
 
  References:
 
  JIRA ticket
  SPIP doc
  Discussion thread
 
  Please vote on the SPIP for the next 72 hours:
 
  [ ] +1: Accept the proposal as an official SPIP
  [ ] +0
  [ ] -1: I don’t think this is a good idea because …
 
  Thanks!
  Jungtaek Lim (HeartSaVioR)

 -
 To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


>>
>> --
>> Bartosz Konieczny
>> freelance data engineer
>> https://www.waitingforcode.com
>> https://github.com/bartosz25/
>> https://twitter.com/waitingforcode
>>
>>
>>
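
As a concrete illustration of the proposal under vote, here is a sketch of reading state from a checkpoint with such a data source. The format name "statestore" comes from the SPIP; the option names and output schema are illustrative and may differ from the final API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch: read a streaming query's state store as a batch DataFrame.
# Option names follow the SPIP draft and are illustrative.
state = (
    spark.read.format("statestore")
    .option("batchId", 42)      # which micro-batch's committed state to read
    .option("operatorId", 0)    # which stateful operator in the query plan
    .load("/checkpoints/my-streaming-query")
)
state.printSchema()             # typically key/value (plus partition) columns
state.show(truncate=False)
```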


Re: [DISCUSS] SPIP: State Data Source - Reader

2023-10-18 Thread Yuanjian Li
+1, I have no issues with the practicality and value of this feature itself.
I've left some comments concerning ongoing maintenance and
compatibility-related matters, which we can continue to discuss.

Jungtaek Lim  wrote on Tue, Oct 17, 2023 at 05:23:

> Thanks Bartosz and Anish for your support!
>
> I'll wait for a couple more days to see whether we can hear more voices on
> this. We could probably look for initiating a VOTE thread if there is no
> objection.
>
> On Tue, Oct 17, 2023 at 5:48 AM Anish Shrigondekar <
> anish.shrigonde...@databricks.com> wrote:
>
>> Hi Jungtaek,
>>
>> Thanks for putting this together. +1 from me and looks good overall.
>> Posted some minor comments/questions to the doc.
>>
>> Thanks,
>> Anish
>>
>> On Mon, Oct 16, 2023 at 11:25 AM Bartosz Konieczny <
>> bartkoniec...@gmail.com> wrote:
>>
>>> Thank you, Jungtaek, for your answers! It's clear now.
>>>
>>> +1 for me. It seems like a prerequisite for further ops-related
>>> improvements to state store management, especially state rebalancing,
>>> which could rely on this read+write state store API. I don't mean the
>>> dynamic state rebalancing that could probably be implemented with lower
>>> latency directly in the stateful API. Instead, I'm thinking of an
>>> offline job that rebalances the state so the stateful pipeline can later
>>> be restarted with a changed number of shuffle partitions.
>>>
>>> Best,
>>> Bartosz.
>>>
>>> On Mon, Oct 16, 2023 at 6:19 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 bump for better reach

 On Thu, Oct 12, 2023 at 4:26 PM Jungtaek Lim <
 kabhwan.opensou...@gmail.com> wrote:

> Sorry, please use this link instead for SPIP doc:
> https://docs.google.com/document/d/1_iVf_CIu2RZd3yWWF6KoRNlBiz5NbSIK0yThqG0EvPY/edit?usp=sharing
>
>
> On Thu, Oct 12, 2023 at 3:58 PM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> Hi dev,
>>
>> I'd like to start a discussion on "State Data Source - Reader".
>>
>> This proposal aims to introduce a new data source "statestore" which
>> enables reading the state rows from an existing checkpoint via an
>> offline (batch) query. This will enable users to 1) create unit tests
>> against a stateful query verifying the state values (especially
>> flatMapGroupsWithState), and 2) gather more context on the status when
>> an incident occurs, especially for incorrect output.
>>
>> *SPIP*:
>> https://docs.google.com/document/d/1HjEupRv8TRFeULtJuxRq_tEG1Wq-9UNu-ctGgCYRke0/edit?usp=sharing
>> *JIRA*: https://issues.apache.org/jira/browse/SPARK-45511
>>
>> Looking forward to your feedback!
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> ps. The scope of the project is narrowed to the reader in this SPIP,
>> since the writer requires us to consider more cases. We are planning on 
>> it.
>>
>
>>>
>>> --
>>> Bartosz Konieczny
>>> freelance data engineer
>>> https://www.waitingforcode.com
>>> https://github.com/bartosz25/
>>> https://twitter.com/waitingforcode
>>>
>>>
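
To make use case 1) concrete, here is a sketch of the kind of unit test the reader would enable: run one deterministic micro-batch of a stateful aggregation, then assert on the committed state. Only the final "statestore" read is the proposed API; everything else is standard PySpark, and the asserted count is illustrative of the SPIP's one-row-per-state-key idea.

```python
import json, os, tempfile
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Prepare a tiny input directory so the micro-batch is deterministic.
src = tempfile.mkdtemp()
with open(os.path.join(src, "events.json"), "w") as f:
    for k in (1, 1, 2):
        f.write(json.dumps({"key": k}) + "\n")

checkpoint = tempfile.mkdtemp()
counts = spark.readStream.schema("key INT").json(src).groupBy("key").count()
query = (
    counts.writeStream.format("noop")
    .outputMode("complete")
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)
    .start()
)
query.awaitTermination()

# Proposed API: read the committed state back and assert on it.
state = spark.read.format("statestore").load(checkpoint)
assert state.count() == 2   # two distinct keys held in the aggregation state
```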


Re: [ANNOUNCE] Apache Spark 3.5.0 released

2023-09-26 Thread Yuanjian Li
FYI, our request was handled by the PyPI
<https://github.com/pypi/support/issues/3175> organization yesterday, and
the upload of version 3.5.0 has just been completed. Please help verify it.
Thank you!
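
For anyone helping to verify, a minimal smoke test of the PyPI artifact might look like this (install command shown as a comment):

```python
# After: pip install pyspark==3.5.0
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("verify-3.5.0").getOrCreate()
assert spark.version == "3.5.0"
spark.range(10).selectExpr("sum(id) AS total").show()   # expect 45
spark.stop()
```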

Ruifeng Zheng  wrote on Sun, Sep 17, 2023 at 23:28:

> Thanks Yuanjian for driving this release, Congratulations!
>
> On Mon, Sep 18, 2023 at 2:16 PM Maxim Gekk
>  wrote:
>
>> Thank you for the work, Yuanjian!
>>
>> On Mon, Sep 18, 2023 at 6:28 AM beliefer  wrote:
>>
>>> Congratulations, Apache Spark!
>>>
>>>
>>>
>>> At 2023-09-16 01:01:40, "Yuanjian Li"  wrote:
>>>
>>> Hi All,
>>>
>>> We are happy to announce the availability of *Apache Spark 3.5.0*!
>>>
>>> Apache Spark 3.5.0 is the sixth release of the 3.x line.
>>>
>>> To download Spark 3.5.0, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>> (Please note: the PyPI upload is pending on a file size limit increase
>>> request; we're actively following up here
>>> <https://github.com/pypi/support/issues/3175> with the PyPI
>>> organization)
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-5-0.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Best,
>>> Yuanjian
>>>
>>>
>
> --
> Ruifeng Zheng
> E-mail: zrfli...@gmail.com
>


Re: [VOTE] Updating documentation hosted for EOL and maintenance releases

2023-09-26 Thread Yuanjian Li
+1

Denny Lee  wrote on Tue, Sep 26, 2023 at 12:07:

> +1
>
> On Tue, Sep 26, 2023 at 10:52 Maciej  wrote:
>
>> +1
>>
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> PGP: A30CEF0C31A501EC
>>
>> On 9/26/23 17:12, Michel Miotto Barbosa wrote:
>>
>> +1
>>
>> A disposição | At your disposal
>>
>> Michel Miotto Barbosa
>> https://www.linkedin.com/in/michelmiottobarbosa/
>> mmiottobarb...@gmail.com
>> +55 11 984 342 347
>>
>>
>>
>>
>> On Tue, Sep 26, 2023 at 11:44 AM Herman van Hovell
>>   wrote:
>>
>>> +1
>>>
>>> On Tue, Sep 26, 2023 at 10:39 AM yangjie01 
>>>  wrote:
>>>
 +1



 From: Yikun Jiang 
 Date: Tuesday, September 26, 2023 at 18:06
 To: dev 
 Cc: Hyukjin Kwon , Ruifeng Zheng <
 ruife...@apache.org>
 Subject: Re: [VOTE] Updating documentation hosted for EOL and
 maintenance releases



 +1. I believe it is a wise choice to update the documentation EOL
 policy based on the real demands of community users.


 Regards,

 Yikun





 On Tue, Sep 26, 2023 at 1:06 PM Ruifeng Zheng 
 wrote:

 +1



 On Tue, Sep 26, 2023 at 12:51 PM Hyukjin Kwon 
 wrote:

 Hi all,

 I would like to start the vote for updating the documentation hosted
 for EOL and maintenance releases, to improve its usability and ensure
 that end users read the proper and correct documentation.


 For discussion thread, please refer to
 https://lists.apache.org/thread/1675rzxx5x4j2x03t9x0kfph8tlys0cx
 .




 Here is one example:
 - https://github.com/apache/spark/pull/42989
 

 - https://github.com/apache/spark-website/pull/480
 



 Starting with my own +1.




[ANNOUNCE] Apache Spark 3.5.0 released

2023-09-15 Thread Yuanjian Li
Hi All,

We are happy to announce the availability of *Apache Spark 3.5.0*!

Apache Spark 3.5.0 is the sixth release of the 3.x line.

To download Spark 3.5.0, head over to the download page:
https://spark.apache.org/downloads.html
(Please note: the PyPI upload is pending on a file size limit increase
request; we're actively following up here 
with the PyPI organization)

To view the release notes:
https://spark.apache.org/releases/spark-release-3-5-0.html

We would like to acknowledge all community members for contributing to this
release. This release would not have been possible without you.

Best,
Yuanjian


[VOTE][RESULT] Release Apache Spark 3.5.0 (RC5)

2023-09-12 Thread Yuanjian Li
The vote passes with 13 +1s (8 binding +1s).
Thank you all who helped with the release!

(* = binding)
+1:
- Mridul Muralidharan (*)
- Yuanjian Li
- Xiao Li (*)
- Gengliang Wang (*)
- Hyukjin Kwon (*)
- Ruifeng Zheng (*)
- Jungtaek Lim
- Wenchen Fan (*)
- Jia Fan
- Jie Yang
- Yuming Wang (*)
- Kent Yao
- Dongjoon Hyun (*)

+0: None

-1: None


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Yuanjian Li
+1 (non-binding)

Yuanjian Li  wrote on Mon, Sep 11, 2023 at 09:36:

> @Peter Toth  I've looked into the details of this
> issue, and it appears that it's neither a regression in version 3.5.0 nor a
> correctness issue. It's a bug related to a new feature. I think we can fix
> this in 3.5.1 and list it as a known issue of the Scala client of Spark
> Connect in 3.5.0.
>
> Mridul Muralidharan  wrote on Sun, Sep 10, 2023 at 04:12:
>
>>
>> +1
>>
>> Signatures, digests, etc. check out fine.
>> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>>
>> Regards,
>> Mridul
>>
>> On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li 
>> wrote:
>>
>>> Please vote on releasing the following candidate (RC5) as Apache Spark
>>> version 3.5.0.
>>>
>>> The vote is open until 11:59pm Pacific time Sep 11th and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.5.0-rc5 (commit
>>> ce5ddad990373636e94071e7cef2f31021add07b):
>>>
>>> https://github.com/apache/spark/tree/v3.5.0-rc5
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1449
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/
>>>
>>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>
>>> This release is using the release script of the tag v3.5.0-rc5.
>>>
>>>
>>> FAQ
>>>
>>> =
>>>
>>> How can I help test this release?
>>>
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>>
>>> an existing Spark workload and running on this release candidate, then
>>>
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>>
>>> the current RC and see if anything important breaks, in the Java/Scala
>>>
>>> you can add the staging repository to your project's resolvers and test
>>>
>>> with the RC (make sure to clean up the artifact cache before/after so
>>>
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>>
>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>
>>> ===
>>>
>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.5.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>>
>>> fixes, documentation, and API tweaks that impact compatibility should
>>>
>>> be worked on immediately. Everything else please retarget to an
>>>
>>> appropriate release.
>>>
>>> ==
>>>
>>> But my bug isn't fixed?
>>>
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>>
>>> release unless the bug in question is a regression from the previous
>>>
>>> release. That being said, if there is something which is a regression
>>>
>>> that has not been correctly targeted please ping me or a committer to
>>>
>>> help target the issue.
>>>
>>> Thanks,
>>>
>>> Yuanjian Li
>>>
>>


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-11 Thread Yuanjian Li
@Peter Toth  I've looked into the details of this
issue, and it appears that it's neither a regression in version 3.5.0 nor a
correctness issue. It's a bug related to a new feature. I think we can fix
this in 3.5.1 and list it as a known issue of the Scala client of Spark
Connect in 3.5.0.

Mridul Muralidharan  wrote on Sun, Sep 10, 2023 at 04:12:

>
> +1
>
> Signatures, digests, etc. check out fine.
> Checked out tag and build/tested with -Phive -Pyarn -Pmesos -Pkubernetes
>
> Regards,
> Mridul
>
> On Sat, Sep 9, 2023 at 10:02 AM Yuanjian Li 
> wrote:
>
>> Please vote on releasing the following candidate (RC5) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59pm Pacific time Sep 11th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc5 (commit
>> ce5ddad990373636e94071e7cef2f31021add07b):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc5
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1449
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc5.
>>
>>
>> FAQ
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>>
>> an existing Spark workload and running on this release candidate, then
>>
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>>
>> the current RC and see if anything important breaks, in the Java/Scala
>>
>> you can add the staging repository to your project's resolvers and test
>>
>> with the RC (make sure to clean up the artifact cache before/after so
>>
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open tickets targeted at 3.5.0 can be found at:
>>
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.0
>>
>> Committers should look at those and triage. Extremely important bug
>>
>> fixes, documentation, and API tweaks that impact compatibility should
>>
>> be worked on immediately. Everything else please retarget to an
>>
>> appropriate release.
>>
>> ==
>>
>> But my bug isn't fixed?
>>
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>>
>> release unless the bug in question is a regression from the previous
>>
>> release. That being said, if there is something which is a regression
>>
>> that has not been correctly targeted please ping me or a committer to
>>
>> help target the issue.
>>
>> Thanks,
>>
>> Yuanjian Li
>>
>


Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-10 Thread Yuanjian Li
@ian.a.mann...@gmail.com  Thank you for your
question. Because this fix was merged before the voting period ended, we
don't want to release version 3.5.0 with a known correctness bug.

We've quickly cut RC5, and we welcome you to continue assisting with the
testing.

Ian Manning  wrote on Sat, Sep 9, 2023 at 02:27:

> This issue is not a regression and yet we fail the vote?  Couldn't this
> issue have been fixed in 3.5.1?
>
> Sorry, I am new, so maybe this is how it works?
>
> On Sat, 9 Sep 2023, 02:29 Dongjoon Hyun,  wrote:
>
>> Sorry but I'm -1 because there exists a late-arrival correctness patch
>> although it's not a regression.
>>
>> - https://issues.apache.org/jira/browse/SPARK-44805
>> "Data lost after union using
>> spark.sql.parquet.enableNestedColumnVectorizedReader=true"
>>
>> - https://github.com/apache/spark/pull/42850
>> -
>> https://github.com/apache/spark/commit/b2b2ba97d3003d25d159943ab8a4bf50e421fdab
>> (branch-3.5)
>>
>> Dongjoon.
>>
>>
>>>>>>
>>>>>> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li 
>>>>>> wrote:
>>>>>>
>>>>>> Please vote on releasing the following candidate (RC4) as Apache Spark
>>>>>> version 3.5.0.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The vote is open until 11:59pm Pacific time *Sep 8th* and passes if
>>>>>> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>>>>
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>>
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>>
>>>>>>
>>>>>> The tag to be voted on is v3.5.0-rc4 (commit
>>>>>> c2939589a29dd0d6a2d3d31a8d833877a37ee02a):
>>>>>>
>>>>>> https://github.com/apache/spark/tree/v3.5.0-rc4
>>>>>>
>>>>>>
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/
>>>>>>
>>>>>>
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>>
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1448
>>>>>>
>>>>>>
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/
>>>>>>
>>>>>>
>>>>>>
>>>>>> The list of bug fixes going into 3.5.0 can be found at the following
>>>>>> URL:
>>>>>>
>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>>>>
>>>>>>
>>>>>>
>>>>>> This release is using the release script of the tag v3.5.0-rc4.
>>>>>>
>>>>>>
>>>>>>
>>>>>> FAQ
>>>>>>
>>>>>>
>>>>>>
>>>>>> =
>>>>>>
>>>>>> How can I help test this release?
>>>>>>
>>>>>> =
>>>>>>
>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>
>>>>>> an existing Spark workload and running on this release candidate, then
>>>>>>
>>>>>> reporting any regressions.
>>>>>>
>>>>>>
>>>>>>
>>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>>>
>>>>>> the current RC and see if anything important breaks, in the Java/Scala
>>>>>>
>>>>>> you can add the staging repository to your project's resolvers and test
>>>>>>
>>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>>
>>>>>> you don't end up building with an out of date RC going forward).
>>>>>>
>>>>>>
>>>>>>
>>>>>> ===
>>>>>>
>>>>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>>>>
>>>>>> ===
>>>>>>
>>>>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>>>>
>>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>>> Version/s" = 3.5.0
>>>>>>
>>>>>>
>>>>>>
>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>
>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>>
>>>>>> be worked on immediately. Everything else please retarget to an
>>>>>>
>>>>>> appropriate release.
>>>>>>
>>>>>>
>>>>>>
>>>>>> ==
>>>>>>
>>>>>> But my bug isn't fixed?
>>>>>>
>>>>>> ==
>>>>>>
>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>
>>>>>> release unless the bug in question is a regression from the previous
>>>>>>
>>>>>> release. That being said, if there is something which is a regression
>>>>>>
>>>>>> that has not been correctly targeted please ping me or a committer to
>>>>>>
>>>>>> help target the issue.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Yuanjian Li
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>


Re: [VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-10 Thread Yuanjian Li
Yes, SPARK-44805 has been included. For the commits from RC4 to RC5, please
refer to https://github.com/apache/spark/commits/v3.5.0-rc5.

Mich Talebzadeh  wrote on Sat, Sep 9, 2023 at 08:09:

> Apologies, that should read ... release 3.5.0 (RC4) plus ...
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 9 Sept 2023 at 15:58, Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Can you please confirm that this cut is release 3.4.0 plus the resolved
>> Jira  https://issues.apache.org/jira/browse/SPARK-44805 which was
>> already fixed yesterday?
>>
>> Nothing else I believe?
>>
>> Thanks
>>
>> Mich
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 9 Sept 2023 at 15:42, Yuanjian Li  wrote:
>>
>>> Please vote on releasing the following candidate (RC5) as Apache Spark
>>> version 3.5.0.
>>>
>>> The vote is open until 11:59pm Pacific time Sep 11th and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.5.0-rc5 (commit
>>> ce5ddad990373636e94071e7cef2f31021add07b):
>>>
>>> https://github.com/apache/spark/tree/v3.5.0-rc5
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1449
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/
>>>
>>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>
>>> This release is using the release script of the tag v3.5.0-rc5.
>>>
>>>
>>> FAQ
>>>
>>> =
>>>
>>> How can I help test this release?
>>>
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>>
>>> an existing Spark workload and running on this release candidate, then
>>>
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>>
>>> the current RC and see if anything important breaks, in the Java/Scala
>>>
>>> you can add the staging repository to your project's resolvers and test
>>>
>>> with the RC (make sure to clean up the artifact cache before/after so
>>>
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>>
>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>
>>> =======
>>>
>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.5.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>>
>>> fixes, documentation, and API tweaks that impact compatibility should
>>>
>>> be worked on immediately. Everything else please retarget to an
>>>
>>> appropriate release.
>>>
>>> ==
>>>
>>> But my bug isn't fixed?
>>>
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>>
>>> release unless the bug in question is a regression from the previous
>>>
>>> release. That being said, if there is something which is a regression
>>>
>>> that has not been correctly targeted please ping me or a committer to
>>>
>>> help target the issue.
>>>
>>> Thanks,
>>>
>>> Yuanjian Li
>>>
>>


[VOTE] Release Apache Spark 3.5.0 (RC5)

2023-09-09 Thread Yuanjian Li
Please vote on releasing the following candidate (RC5) as Apache Spark
version 3.5.0.

The vote is open until 11:59pm Pacific time Sep 11th and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.5.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.5.0-rc5 (commit
ce5ddad990373636e94071e7cef2f31021add07b):

https://github.com/apache/spark/tree/v3.5.0-rc5

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1449

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-docs/

The list of bug fixes going into 3.5.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12352848

This release is using the release script of the tag v3.5.0-rc5.


FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your project's resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out of date RC going forward).
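
For PySpark testers, here is a sketch of the virtual-env flow described above; the RC artifact file name in the pip command is illustrative, so check the rc5-bin directory listing for the exact name:

```python
# Shell setup (as comments), then a small sanity check:
#
#   python -m venv rc-test && source rc-test/bin/activate
#   pip install https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc5-bin/pyspark-3.5.0.tar.gz
#
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("rc5-smoke").getOrCreate()
print(spark.version)   # expect 3.5.0
spark.range(100).selectExpr("id % 10 AS k").groupBy("k").count().show()
spark.stop()
```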

===

What should happen to JIRA tickets still targeting 3.5.0?

===

The current list of open tickets targeted at 3.5.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.5.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.

Thanks,

Yuanjian Li


Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-08 Thread Yuanjian Li
@Dongjoon Hyun  Thank you for reporting this and
for your prompt response.

The vote has failed. I'll cut RC5 tonight, PST time.

Dongjoon Hyun  wrote on Fri, Sep 8, 2023 at 15:57:

> Sorry but I'm -1 because there exists a late-arrival correctness patch
> although it's not a regression.
>
> - https://issues.apache.org/jira/browse/SPARK-44805
> "Data lost after union using
> spark.sql.parquet.enableNestedColumnVectorizedReader=true"
>
> - https://github.com/apache/spark/pull/42850
> -
> https://github.com/apache/spark/commit/b2b2ba97d3003d25d159943ab8a4bf50e421fdab
> (branch-3.5)
>
> Dongjoon.
>
>
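
For context on the class of issue cited above, here is a hypothetical repro shape (not the exact test case from SPARK-44805): nested Parquet columns read with the vectorized reader enabled, then unioned. The path and schema are illustrative; only the config key is the one named in the ticket.

```python
# Hypothetical shape of the reported issue class; this is not the exact
# repro from SPARK-44805.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

df = spark.read.parquet("/tmp/table_with_struct_columns")  # nested schema
unioned = df.union(df)
# A correctness check would compare this result against a second run with
# the nested vectorized reader disabled.
print(unioned.count())
```
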
>>>>>
>>>>> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li 
>>>>> wrote:
>>>>>
>>>>> Please vote on releasing the following candidate (RC4) as Apache Spark
>>>>> version 3.5.0.
>>>>>
>>>>>
>>>>>
>>>>> The vote is open until 11:59pm Pacific time *Sep 8th* and passes if a
>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>>
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>>>
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>>
>>>>>
>>>>> The tag to be voted on is v3.5.0-rc4 (commit
>>>>> c2939589a29dd0d6a2d3d31a8d833877a37ee02a):
>>>>>
>>>>> https://github.com/apache/spark/tree/v3.5.0-rc4
>>>>>
>>>>>
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/
>>>>>
>>>>>
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>>
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>>
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1448
>>>>>
>>>>>
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>>
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/
>>>>>
>>>>>
>>>>>
>>>>> The list of bug fixes going into 3.5.0 can be found at the following
>>>>> URL:
>>>>>
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>>>
>>>>>
>>>>>
>>>>> This release is using the release script of the tag v3.5.0-rc4.
>>>>>
>>>>>
>>>>>
>>>>> FAQ
>>>>>
>>>>>
>>>>>
>>>>> =
>>>>>
>>>>> How can I help test this release?
>>>>>
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>
>>>>> an existing Spark workload and running on this release candidate, then
>>>>>
>>>>> reporting any regressions.
>>>>>
>>>>>
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>>
>>>>> the current RC and see if anything important breaks, in the Java/Scala
>>>>>
>>>>> you can add the staging repository to your project's resolvers and test
>>>>>
>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>>
>>>>> you don't end up building with an out of date RC going forward).
>>>>>
>>>>>
>>>>>
>>>>> ===
>>>>>
>>>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>>>
>>>>> ===
>>>>>
>>>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>>>
>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>> Version/s" = 3.5.0
>>>>>
>>>>>
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>>
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>
>>>>> be worked on immediately. Everything else please retarget to an
>>>>>
>>>>> appropriate release.
>>>>>
>>>>>
>>>>>
>>>>> ==
>>>>>
>>>>> But my bug isn't fixed?
>>>>>
>>>>> ==
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>>
>>>>> release unless the bug in question is a regression from the previous
>>>>>
>>>>> release. That being said, if there is something which is a regression
>>>>>
>>>>> that has not been correctly targeted please ping me or a committer to
>>>>>
>>>>> help target the issue.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Yuanjian Li
>>>>>
>>>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: [VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-06 Thread Yuanjian Li
+1 (non-binding)

Xiao Li  wrote on Wed, Sep 6, 2023 at 15:27:

> +1
>
> Xiao
>
> Herman van Hovell  wrote on Wed, Sep 6, 2023 at 22:08:
>
>> Tested connect, and everything looks good.
>>
>> +1
>>
>> On Wed, Sep 6, 2023 at 8:11 AM Yuanjian Li 
>> wrote:
>>
>>> Please vote on releasing the following candidate (RC4) as Apache Spark
>>> version 3.5.0.
>>>
>>> The vote is open until 11:59pm Pacific time Sep 8th and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.5.0-rc4 (commit
>>> c2939589a29dd0d6a2d3d31a8d833877a37ee02a):
>>>
>>> https://github.com/apache/spark/tree/v3.5.0-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>>
>>> https://repository.apache.org/content/repositories/orgapachespark-1448
>>>
>>> The documentation corresponding to this release can be found at:
>>>
>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/
>>>
>>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>>
>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>
>>> This release is using the release script of the tag v3.5.0-rc4.
>>>
>>>
>>> FAQ
>>>
>>> =
>>>
>>> How can I help test this release?
>>>
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>>
>>> an existing Spark workload and running on this release candidate, then
>>>
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>>
>>> the current RC and see if anything important breaks, in the Java/Scala
>>>
>>> you can add the staging repository to your project's resolvers and test
>>>
>>> with the RC (make sure to clean up the artifact cache before/after so
>>>
>>> you don't end up building with an out of date RC going forward).
>>>
>>> ===
>>>
>>> What should happen to JIRA tickets still targeting 3.5.0?
>>>
>>> ===
>>>
>>> The current list of open tickets targeted at 3.5.0 can be found at:
>>>
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.5.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>>
>>> fixes, documentation, and API tweaks that impact compatibility should
>>>
>>> be worked on immediately. Everything else please retarget to an
>>>
>>> appropriate release.
>>>
>>> ==
>>>
>>> But my bug isn't fixed?
>>>
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>>
>>> release unless the bug in question is a regression from the previous
>>>
>>> release. That being said, if there is something which is a regression
>>>
>>> that has not been correctly targeted please ping me or a committer to
>>>
>>> help target the issue.
>>>
>>> Thanks,
>>>
>>> Yuanjian Li
>>>
>>


Release Note of Apache Spark 3.5.0

2023-09-06 Thread Yuanjian Li
Hi All,

Thank you all for your valuable contributions to the Spark 3.5 release so
far!

I would appreciate your review and feedback on the release note.

Please see here
<https://docs.google.com/document/d/1udQTnvRVQb4Tn9ENtpaiwlU4rp11a_Gam5pERF_Ovs8/edit#heading=h.tqu2sa6s6myz>
for the draft release note of Apache Spark 3.5.0, and feel free to add
comments if you have any.

Thanks,
Yuanjian Li


[VOTE] Release Apache Spark 3.5.0 (RC4)

2023-09-06 Thread Yuanjian Li
Please vote on releasing the following candidate (RC4) as Apache Spark
version 3.5.0.

The vote is open until 11:59pm Pacific time Sep 8th and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.5.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.5.0-rc4 (commit
c2939589a29dd0d6a2d3d31a8d833877a37ee02a):

https://github.com/apache/spark/tree/v3.5.0-rc4

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1448

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc4-docs/

The list of bug fixes going into 3.5.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12352848

This release is using the release script of the tag v3.5.0-rc4.


FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your project's resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out of date RC going forward).

===

What should happen to JIRA tickets still targeting 3.5.0?

===

The current list of open tickets targeted at 3.5.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.5.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.

Thanks,

Yuanjian Li


Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Yuanjian Li
Sure, no problem.

Holden Karau  wrote on Sat, Sep 2, 2023 at 22:10:

> Can we delay the next RC cut until after Labor Day?
>
> On Sat, Sep 2, 2023 at 9:59 PM Yuanjian Li  wrote:
>
>> Thank you for all the reports!
>> The vote has failed. I plan to cut RC4 in two days.
>>
>> @Dipayan Dev  I quickly skimmed through the
>> corresponding ticket, and it doesn't seem to be a regression introduced in
>> 3.5. Additionally, someone is asking if this is the same issue as
>> SPARK-35279.
>> @Yuming Wang  I will check the signature for RC4
>> @Jungtaek Lim  I will follow up with you
>> regarding SPARK-45045 <https://issues.apache.org/jira/browse/SPARK-45045>
>> @Wenchen Fan  Agree, we should include the
>> correctness fix in 3.5
>>
>> Jungtaek Lim  wrote on Thu, Aug 31, 2023 at 23:45:
>>
>>> My apologies, I have to add another ticket for a blocker, SPARK-45045
>>> <https://issues.apache.org/jira/browse/SPARK-45045>. That said, I'm -1
>>> (non-binding).
>>>
>>> SPARK-43183 <https://issues.apache.org/jira/browse/SPARK-43183> made a
>>> behavioral change to the StreamingQueryListener and, as a side effect,
>>> to the StreamingQuery API, while the intention was to change only the
>>> former. I just got reports that the behavioral change to the
>>> StreamingQuery API broke various tests in 3rd-party data sources. To
>>> help 3rd-party ecosystems adopt 3.5 without hassle, I'd like to see
>>> this fixed in 3.5.0.
>>>
>>> There is no fix yet, but I'm working on it and will give an update
>>> here. If I can't make progress in a couple of days, maybe we could
>>> lower the priority and let the release go, describing this as a
>>> "known issue". I'm sorry about that.
>>>
>>> Thanks,
>>> Jungtaek Lim
>>>
>>> On Fri, Sep 1, 2023 at 12:12 PM Wenchen Fan  wrote:
>>>
>>>> Sorry for the last-minute bug report, but we found a regression in 3.5:
>>>> the SQL INSERT command without a column list fills missing columns with
>>>> NULL, while Spark 3.4 does not allow it. According to the SQL standard,
>>>> this shouldn't be allowed, so it is a regression in 3.5.
>>>>
>>>> The fix has been merged but one day after the RC3 cut:
>>>> https://github.com/apache/spark/pull/42393 . I'm -1 and let's include
>>>> this fix in 3.5.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> On Thu, Aug 31, 2023 at 9:09 PM Ian Manning 
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> Using Spark Core, Spark SQL, Structured Streaming.
>>>>>
>>>>> On Tue, Aug 29, 2023 at 8:12 PM Yuanjian Li 
>>>>> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate (RC3) as Apache Spark
>>>>>> version 3.5.0.
>>>>>>
>>>>>> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
>>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>>>>
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v3.5.0-rc3 (commit
>>>>>> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>>>>>>
>>>>>> https://github.com/apache/spark/tree/v3.5.0-rc3
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>>>>>>
>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1447
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>>
>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>>>>>>
>>>>
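
To make the regression Wenchen describes above concrete, here is a minimal sketch (hypothetical table; SQL issued through the Python API):

```python
# Hypothetical illustration of the reported regression: an INSERT without
# a column list supplying fewer values than the table has columns.
spark.sql("CREATE TABLE t (a INT, b INT) USING parquet")
spark.sql("INSERT INTO t VALUES (1)")
# Spark 3.4 rejects this statement; the 3.5 RC silently filled column b
# with NULL, which the SQL standard disallows, hence the -1 on the RC.
```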

Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-09-02 Thread Yuanjian Li
Thank you for all the reports!
The vote has failed. I plan to cut RC4 in two days.

@Dipayan Dev  I quickly skimmed through the
corresponding ticket, and it doesn't seem to be a regression introduced in
3.5. Additionally, someone is asking if this is the same issue as
SPARK-35279.
@Yuming Wang  I will check the signature for RC4
@Jungtaek Lim  I will follow up with you
regarding SPARK-45045 <https://issues.apache.org/jira/browse/SPARK-45045>
@Wenchen Fan  Agree, we should include the correctness
fix in 3.5

Jungtaek Lim  wrote on Thu, Aug 31, 2023 at 23:45:

> My apologies, I have to add another ticket for a blocker, SPARK-45045
> <https://issues.apache.org/jira/browse/SPARK-45045>. That said, I'm -1
> (non-binding).
>
> SPARK-43183 <https://issues.apache.org/jira/browse/SPARK-43183> made a
> behavioral change to the StreamingQueryListener and, as a side effect, to
> the StreamingQuery API, while the intention was to change only the former.
> I just got reports that the behavioral change to the StreamingQuery API
> broke various tests in 3rd-party data sources. To help 3rd-party
> ecosystems adopt 3.5 without hassle, I'd like to see this fixed in 3.5.0.
>
> There is no fix yet, but I'm working on it and will give an update here.
> If I can't make progress in a couple of days, maybe we could lower the
> priority and let the release go, describing this as a "known issue". I'm
> sorry about that.
>
> Thanks,
> Jungtaek Lim
>
> On Fri, Sep 1, 2023 at 12:12 PM Wenchen Fan  wrote:
>
>> Sorry for the last-minute bug report, but we found a regression in 3.5:
>> the SQL INSERT command without a column list fills missing columns with
>> NULL, while Spark 3.4 does not allow it. According to the SQL standard,
>> this shouldn't be allowed, so it is a regression in 3.5.
>>
>> The fix has been merged but one day after the RC3 cut:
>> https://github.com/apache/spark/pull/42393 . I'm -1 and let's include
>> this fix in 3.5.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Aug 31, 2023 at 9:09 PM Ian Manning 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> Using Spark Core, Spark SQL, Structured Streaming.
>>>
>>> On Tue, Aug 29, 2023 at 8:12 PM Yuanjian Li 
>>> wrote:
>>>
>>>> Please vote on releasing the following candidate (RC3) as Apache Spark
>>>> version 3.5.0.
>>>>
>>>> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 3.5.0
>>>>
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v3.5.0-rc3 (commit
>>>> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>>>>
>>>> https://github.com/apache/spark/tree/v3.5.0-rc3
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>>
>>>> https://repository.apache.org/content/repositories/orgapachespark-1447
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>>
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>>>>
>>>> The list of bug fixes going into 3.5.0 can be found at the following
>>>> URL:
>>>>
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>>>
>>>> This release is using the release script of the tag v3.5.0-rc3.
>>>>
>>>>
>>>> FAQ
>>>>
>>>> =
>>>>
>>>> How can I help test this release?
>>>>
>>>> =
>>>>
>>>> If you are a Spark user, you can help us test this release by taking
>>>>
>>>> an existing Spark workload and running on this release candidate, then
>>>>
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>>

[VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Yuanjian Li
Please vote on releasing the following candidate (RC3) as Apache Spark
version 3.5.0.

The vote is open until 11:59pm Pacific time Aug 31st and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.5.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.5.0-rc3 (commit
9f137aa4dc43398aafa0c3e035ed3174182d7d6c):

https://github.com/apache/spark/tree/v3.5.0-rc3

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1447

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/

The list of bug fixes going into 3.5.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12352848

This release is using the release script of the tag v3.5.0-rc3.


FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your project's resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out of date RC going forward).

===

What should happen to JIRA tickets still targeting 3.5.0?

===

The current list of open tickets targeted at 3.5.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.5.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.

Thanks,

Yuanjian Li


Re: [VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-24 Thread Yuanjian Li
-1, do not release this package because the correctness issue
https://issues.apache.org/jira/browse/SPARK-44871 /
https://github.com/apache/spark/pull/42559 was not addressed in RC2.

The vote has failed. I plan to cut RC3 in two days.

Best,
Yuanjian

yangjie01  wrote on Sun, Aug 20, 2023 at 20:24:

> -1, due to SPARK-43646 <https://issues.apache.org/jira/browse/SPARK-43646>
> and SPARK-44784 <https://issues.apache.org/jira/browse/SPARK-44784> not
> yet being fixed.
>
>
>
> Jie Yang
>
>
>
> From: Sean Owen 
> Date: Sunday, August 20, 2023 at 04:43
> To: Yuanjian Li 
> Cc: Spark dev list 
> Subject: Re: [VOTE] Release Apache Spark 3.5.0 (RC2)
>
>
>
> +1 this looks better to me. Works with Scala 2.13 / Java 17 for me.
>
>
>
> On Sat, Aug 19, 2023 at 3:23 AM Yuanjian Li 
> wrote:
>
> Please vote on releasing the following candidate (RC2) as Apache Spark
> version 3.5.0.
>
>
>
> The vote is open until 11:59pm Pacific time *Aug 23rd* and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
>
>
> [ ] +1 Release this package as Apache Spark 3.5.0
>
> [ ] -1 Do not release this package because ...
>
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
>
> The tag to be voted on is v3.5.0-rc2 (commit
> 010c4a6a05ff290bec80c12a00cd1bdaed849242):
>
> https://github.com/apache/spark/tree/v3.5.0-rc2
>
>
>
> The release files, including signatures, digests, etc. can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc2-bin/
>
>
>
> Signatures used for Spark RCs can be found in this file:
>
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
>
>
> The staging repository for this release can be found at:
>
> https://repository.apache.org/content/repositories/orgapachespark-1446
>
>
>
> The documentation corresponding to this release can be found at:
>
> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc2-docs/
>
>
>
> The list of bug fixes going into 3.5.0 can be found at the following URL:
>
> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>
>
>
> This release is using the release script of the tag v3.5.0-rc2.
>
>
>
> FAQ
>
>
>
> =
>
> How can I help test this release?
>
> =
>
> If you are a Spark user, you can help us test this release by taking
>
> an existing Spark workload and running on this release candidate, then
>
> reporting any regressions.
>
>
>
> If you're working in PySpark you can set up a virtual env and install
>
> the current RC and see if anything important breaks, in the Java/Scala
>
> you can add the staging repository to your project's resolvers and test
>
> with the RC (make sure to clean up the artifact cache before/after so
>
> you don't end up building with an out of date RC going forward).
>
>
>
> ===
>
> What should happen to JIRA tickets still targeting 3.5.0?
>
> ===
>
> The current list of open tickets targeted at 3.5.0 can be found at:
>
> https://issues.apache.org/jira/projects/SPARK
>  and
> search for "Target Version/s" = 3.5.0
>
>
>
> Committers should look at those and triage. Extremely important bug
>
> fixes, documentation, and API tweaks that impact compatibility should
>
> be worked on immediately. Everything else please retarget to an
>
> appropriate release.
>
>
>
> ==
>
> But my bug isn't fixed?
>
> ==
>
> In order to make timely releases, we will typically not hold the
>
> release unless the bug in question is a regression from the previous
>
> release. That being said, if there is something which is a regression
>
> that has not been correctly targeted please ping me or a committer to
>
> help target the issue.
>
>
>
> Thanks,
>
> Yuanjian Li
>
>


[VOTE] Release Apache Spark 3.5.0 (RC2)

2023-08-19 Thread Yuanjian Li
Please vote on releasing the following candidate (RC2) as Apache Spark
version 3.5.0.

The vote is open until 11:59pm Pacific time Aug 23rd and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.5.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.5.0-rc2 (commit
010c4a6a05ff290bec80c12a00cd1bdaed849242):

https://github.com/apache/spark/tree/v3.5.0-rc2

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc2-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1446

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc2-docs/

The list of bug fixes going into 3.5.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12352848

This release is using the release script of the tag v3.5.0-rc2.


FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out of date RC going forward).
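
For the Java/Scala route, here is a minimal sbt sketch of adding the staging
repository to a project's resolvers. The staging URL is the one listed above;
the Scala version is an assumption (Spark 3.5 ships for both 2.12 and 2.13):

    // build.sbt -- illustrative sketch for building/testing against the RC
    ThisBuild / scalaVersion := "2.12.18"

    // Staging repository from this vote email
    resolvers += "spark-3.5.0-rc2-staging" at
      "https://repository.apache.org/content/repositories/orgapachespark-1446/"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"

Remember to clear the local artifact caches (e.g., ~/.ivy2/cache) before and
after, as noted above, so you don't keep building against a stale RC.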

===

What should happen to JIRA tickets still targeting 3.5.0?

===

The current list of open tickets targeted at 3.5.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.5.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.

Thanks,

Yuanjian Li


Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-19 Thread Yuanjian Li
Thank you for all the reports!
I just cut RC2 a few hours before Peter's report.
I will continue to monitor the details of the correctness issue and the
voting status for RC2.

Peter Toth wrote on Fri, Aug 18, 2023 at 07:57:

> Hi Yuanjian,
>
> This is a correctness issue that we should probably fix in 3.5:
> https://issues.apache.org/jira/browse/SPARK-44871 /
> https://github.com/apache/spark/pull/42559
>
> Cheers,
> Peter
>
> yangjie01 wrote (on Sat, Aug 12, 2023, 15:38):
>
>> Hi, Yuanjian,
>>
>>
>>
>> Maybe there is another issue that needs to be fixed
>>
>>
>>
>> -[SPARK-44784] <https://issues.apache.org/jira/browse/SPARK-44784>
>> Failure in testing `SparkSessionE2ESuite` using Maven
>>
>>
>>
>> Maven daily tests are still failing:
>> https://github.com/apache/spark/actions/runs/5832898984/job/15819181762
>>
>>
>>
>> I think we should address this issue before the release of Apache Spark
>> 3.5.0.
>>
>>
>>
>> Jie Yang
>>
>>
>>
>> *From:* Yuanjian Li 
>> *Date:* Saturday, August 12, 2023, 15:20
>> *To:* Yuming Wang 
>> *Cc:* yangjie01 , Sean Owen <
>> sro...@gmail.com>, Spark dev list 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC1)
>>
>>
>>
>> Thanks for all the updates!
>>
>> The vote has failed. Here is the status of known blockers:
>>
>>    - [SPARK-44719] <https://issues.apache.org/jira/browse/SPARK-44719>
>>      NoClassDefFoundError when using Hive UDF - *Resolved*
>>    - [SPARK-44653] <https://issues.apache.org/jira/browse/SPARK-44653>
>>      non-trivial DataFrame unions should not break caching - *Resolved*
>>    - [SPARK-43646] <https://issues.apache.org/jira/browse/SPARK-43646>
>>      Test failure of Connect: from_protobuf_messageClassName - *WIP*
>>
>> I'll cut RC2 once all blockers are resolved.
>>
>>
>>
>>
>>
>> Yuming Wang wrote on Tue, Aug 8, 2023 at 05:29:
>>
>> -1. I found a NoClassDefFoundError bug:
>> https://issues.apache.org/jira/browse/SPARK-44719.
>>
>>
>>
>> On Mon, Aug 7, 2023 at 11:24 AM yangjie01 
>> wrote:
>>
>>
>>
>> I submitted a PR last week to try and solve this issue:
>> https://github.com/apache/spark/pull/42236.
>>
>>
>>
>> *From:* Sean Owen 
>> *Date:* Monday, August 7, 2023, 11:05
>> *To:* Yuanjian Li 
>> *Cc:* Spark dev list 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC1)
>>
>>
>>
>>
>>
>> Let's keep testing 3.5.0 of course while that change is going in. (See
>> https://github.com/apache/spark/pull/42364#issuecomment-1666878287
>> )
>>
>>
>>
>> Otherwise testing is pretty much as usual, except I get this test failure
>> in Connect, which is new. Anyone else? This is Java 8, Scala 2.13, Debian
>> 12.
>>
>>
>>
>> - from_protobuf_messageClassName_options *** FAILED ***
>>   org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS]
>> Could not load Protobuf class with name
>> org.apache.spark.connect.proto.StorageLevel.
>> org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf
>> Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar
>> with Protobuf classes needs to be shaded (com.google.protobuf.* -->
>> org.sparkproject.spark_protobuf.protobuf.*).
>>   at
>> org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3554)
>>   at
>> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:198)
>>   at
>> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:

Re: [VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-12 Thread Yuanjian Li
Thanks for all the updates!
The vote has failed. Here is the status of known blockers:

   - [SPARK-44719] <https://issues.apache.org/jira/browse/SPARK-44719>
     NoClassDefFoundError when using Hive UDF - *Resolved*
   - [SPARK-44653] <https://issues.apache.org/jira/browse/SPARK-44653>
     non-trivial DataFrame unions should not break caching - *Resolved*
   - [SPARK-43646] <https://issues.apache.org/jira/browse/SPARK-43646>
     Test failure of Connect: from_protobuf_messageClassName - *WIP*

I'll cut RC2 once all blockers are resolved.


Yuming Wang wrote on Tue, Aug 8, 2023 at 05:29:

> -1. I found a NoClassDefFoundError bug:
> https://issues.apache.org/jira/browse/SPARK-44719.
>
> On Mon, Aug 7, 2023 at 11:24 AM yangjie01 
> wrote:
>
>>
>>
>> I submitted a PR last week to try and solve this issue:
>> https://github.com/apache/spark/pull/42236.
>>
>>
>>
>> *From:* Sean Owen 
>> *Date:* Monday, August 7, 2023, 11:05
>> *To:* Yuanjian Li 
>> *Cc:* Spark dev list 
>> *Subject:* Re: [VOTE] Release Apache Spark 3.5.0 (RC1)
>>
>>
>>
>>
>>
>> Let's keep testing 3.5.0 of course while that change is going in. (See
>> https://github.com/apache/spark/pull/42364#issuecomment-1666878287
>> )
>>
>>
>>
>> Otherwise testing is pretty much as usual, except I get this test failure
>> in Connect, which is new. Anyone else? This is Java 8, Scala 2.13, Debian
>> 12.
>>
>>
>>
>> - from_protobuf_messageClassName_options *** FAILED ***
>>   org.apache.spark.sql.AnalysisException: [CANNOT_LOAD_PROTOBUF_CLASS]
>> Could not load Protobuf class with name
>> org.apache.spark.connect.proto.StorageLevel.
>> org.apache.spark.connect.proto.StorageLevel does not extend shaded Protobuf
>> Message class org.sparkproject.spark_protobuf.protobuf.Message. The jar
>> with Protobuf classes needs to be shaded (com.google.protobuf.* -->
>> org.sparkproject.spark_protobuf.protobuf.*).
>>   at
>> org.apache.spark.sql.errors.QueryCompilationErrors$.protobufClassLoadError(QueryCompilationErrors.scala:3554)
>>   at
>> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptorFromJavaClass(ProtobufUtils.scala:198)
>>   at
>> org.apache.spark.sql.protobuf.utils.ProtobufUtils$.buildDescriptor(ProtobufUtils.scala:156)
>>   at
>> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor$lzycompute(ProtobufDataToCatalyst.scala:58)
>>   at
>> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.messageDescriptor(ProtobufDataToCatalyst.scala:57)
>>   at
>> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType$lzycompute(ProtobufDataToCatalyst.scala:43)
>>   at
>> org.apache.spark.sql.protobuf.ProtobufDataToCatalyst.dataType(ProtobufDataToCatalyst.scala:42)
>>   at
>> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:194)
>>   at
>> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:73)
>>   at scala.collection.immutable.List.map(List.scala:246)
>>
>>
>>
>> On Sat, Aug 5, 2023 at 5:42 PM Sean Owen  wrote:
>>
>> I'm still testing other combinations, but it looks like tests fail on
>> Java 17 after building with Java 8, which should be a normal supported
>> configuration.
>>
>> This is described at https://github.com/apache/spark/pull/41943
>> and looks like it is resolved by moving back to Scala 2.13.8 for now.
>>
>> Unless I'm missing something we need to fix this for 3.5 or it's not
>> clear the build will run on Java 17.
>>
>>
>>
>> On Fri, Aug 4, 2023 at 5:45 PM Yuanjian Li 
>> wrote:
>>
>> Please vote on releasing the following candidate (RC1) as Apache Spark
>> version 3.5.0.
>>
>>
>>
>> The vote is open until 11:59pm Pacific time *Aug 9th* and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>>
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/

[VOTE] Release Apache Spark 3.5.0 (RC1)

2023-08-04 Thread Yuanjian Li
Please vote on releasing the following candidate (RC1) as Apache Spark
version 3.5.0.

The vote is open until 11:59pm Pacific time Aug 9th and passes if a
majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.5.0

[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v3.5.0-rc1 (commit
7e862c01fc9a1d3b47764df8b6a4b5c4cafb0807):

https://github.com/apache/spark/tree/v3.5.0-rc1

The release files, including signatures, digests, etc. can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-bin/

Signatures used for Spark RCs can be found in this file:

https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:

https://repository.apache.org/content/repositories/orgapachespark-1444

The documentation corresponding to this release can be found at:

https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc1-docs/

The list of bug fixes going into 3.5.0 can be found at the following URL:

https://issues.apache.org/jira/projects/SPARK/versions/12352848

This release is using the release script of the tag v3.5.0-rc1.


FAQ

=

How can I help test this release?

=

If you are a Spark user, you can help us test this release by taking

an existing Spark workload and running on this release candidate, then

reporting any regressions.

If you're working in PySpark you can set up a virtual env and install

the current RC and see if anything important breaks, in the Java/Scala

you can add the staging repository to your projects resolvers and test

with the RC (make sure to clean up the artifact cache before/after so

you don't end up building with an out of date RC going forward).

===

What should happen to JIRA tickets still targeting 3.5.0?

===

The current list of open tickets targeted at 3.5.0 can be found at:

https://issues.apache.org/jira/projects/SPARK and search for "Target
Version/s" = 3.5.0

Committers should look at those and triage. Extremely important bug

fixes, documentation, and API tweaks that impact compatibility should

be worked on immediately. Everything else please retarget to an

appropriate release.

==

But my bug isn't fixed?

==

In order to make timely releases, we will typically not hold the

release unless the bug in question is a regression from the previous

release. That being said, if there is something which is a regression

that has not been correctly targeted please ping me or a committer to

help target the issue.

Thanks,

Yuanjian Li


[Reminder] Spark 3.5 RC Cut

2023-07-29 Thread Yuanjian Li
Hi everyone,

Following the release timeline, I will cut the RC on *Tuesday, Aug 1st at 1
pm PST* as scheduled.

Date            Event
July 17th 2023  Code freeze. Release branch cut.
Late July 2023  QA period. Focus on bug fixes, tests, stability and docs.
                Generally, no new features merged.
August 2023     Release candidates (RC), voting, etc. until final release
                passes

Best,
Yuanjian


Re: Spark 3.5 Branch Cut

2023-07-17 Thread Yuanjian Li
Further reminder for the release timeline:

Date            Event
July 17th 2023  Code freeze. Release branch cut.
Late July 2023  QA period. Focus on bug fixes, tests, stability and docs.
                Generally, no new features merged.
August 2023     Release candidates (RC), voting, etc. until final release
                passes

Please begin your QA against branch-3.5 now.

Thank you!

Raghu Angadi wrote on Mon, Jul 17, 2023 at 13:29:

> Thanks Yuanjian for accepting these for warmfix.
>
> Raghu.
>
> On Mon, Jul 17, 2023 at 1:04 PM Yuanjian Li 
> wrote:
>
>> Hi, all
>>
>> FYI, I cut branch-3.5 as https://github.com/apache/spark/tree/branch-3.5
>>
>> Here is the complete list of exception merge requests received before the
>> cut:
>>
>>    - SPARK-44421: Reattach to existing execute in Spark Connect (server
>>      mechanism)
>>    - SPARK-44423: Reattach to existing execute in Spark Connect (scala
>>      client)
>>    - SPARK-44424: Reattach to existing execute in Spark Connect (python
>>      client)
>>    - Sub-tasks in epic SPARK-42938
>>      <https://issues.apache.org/jira/browse/SPARK-42938>: Structured
>>      Streaming with Spark Connect
>>       - SPARK-42944: (Will mostly hit Monday deadline, just in case)
>>         Python foreachBatch
>>       - SPARK-42941: (WIP, but might slip Monday deadline): Python
>>         streaming listener
>>       - SPARK-44400: Improve session access in connect Scala
>>         StreamingQueryListener
>>       - SPARK-44432: Allow timeout of sessions when client disconnects
>>         and terminate queries
>>       - SPARK-44433: Improve termination logic for Python processes for
>>         foreachBatch & query listener
>>       - SPARK-44434: More Scala tests for foreachBatch & query listener
>>       - SPARK-44435: More Python tests for foreachBatch & query listener
>>       - SPARK-44436: Use Connect DataFrame for Scala foreachBatch in
>>         Connect
>>    - Sub-task in epic SPARK-43754
>>      <https://issues.apache.org/jira/browse/SPARK-43754>: Spark Connect
>>      Session & Query lifecycle
>>       - SPARK-44422: Fine grained interrupt in Spark Connect
>>    - SPARK-43923: [CONNECT] Post listenerBus events during
>>      ExecutePlanRequest
>>    - SPARK-44394: Add a new Spark UI page for Spark Connect
>>    - SPARK-44262: JdbcUtils hardcodes some SQL statements
>>    - SPARK-38200: Spark JDBC Savemode Supports Upsert
>>    - SPARK-44396 <https://issues.apache.org/jira/browse/SPARK-44396>:
>>      Direct Arrow Deserialization
>>    - SPARK-44449 <https://issues.apache.org/jira/browse/SPARK-44449>:
>>      Upcasting for Arrow Deserialization
>>    - SPARK-44450 <https://issues.apache.org/jira/browse/SPARK-44450>:
>>      Make direct Arrow encoding work with SQL/API.
>>
>>
>> Best,
>>
>> Yuanjian
>>
>>


Spark 3.5 Branch Cut

2023-07-17 Thread Yuanjian Li
Hi, all

FYI, I cut branch-3.5 as https://github.com/apache/spark/tree/branch-3.5

Here is the complete list of exception merge requests received before the
cut:

   - SPARK-44421: Reattach to existing execute in Spark Connect (server
     mechanism)
   - SPARK-44423: Reattach to existing execute in Spark Connect (scala
     client)
   - SPARK-44424: Reattach to existing execute in Spark Connect (python
     client)
   - Sub-tasks in epic SPARK-42938: Structured Streaming with Spark Connect
      - SPARK-42944: (Will mostly hit Monday deadline, just in case) Python
        foreachBatch
      - SPARK-42941: (WIP, but might slip Monday deadline): Python streaming
        listener
      - SPARK-44400: Improve session access in connect Scala
        StreamingQueryListener
      - SPARK-44432: Allow timeout of sessions when client disconnects and
        terminate queries
      - SPARK-44433: Improve termination logic for Python processes for
        foreachBatch & query listener
      - SPARK-44434: More Scala tests for foreachBatch & query listener
      - SPARK-44435: More Python tests for foreachBatch & query listener
      - SPARK-44436: Use Connect DataFrame for Scala foreachBatch in Connect
   - Sub-task in epic SPARK-43754: Spark Connect Session & Query lifecycle
      - SPARK-44422: Fine grained interrupt in Spark Connect
   - SPARK-43923: [CONNECT] Post listenerBus events during ExecutePlanRequest
   - SPARK-44394: Add a new Spark UI page for Spark Connect
   - SPARK-44262: JdbcUtils hardcodes some SQL statements
   - SPARK-38200: Spark JDBC Savemode Supports Upsert
   - SPARK-44396: Direct Arrow Deserialization
   - SPARK-44449: Upcasting for Arrow Deserialization
   - SPARK-44450: Make direct Arrow encoding work with SQL/API.


Best,

Yuanjian


Re: Time for Spark v3.5.0 release

2023-07-14 Thread Yuanjian Li
Thanks for raising all the requests. Let's stick to the previously agreed
branch cut time. Based on past practice, let's label the above requests as
exception features.

I have just sent out a branch cut reminder titled "[Reminder] Spark 3.5
Branch Cut." Please ensure that all your requests are included.

Best,
Yuanjian

Julek Sompolski wrote on Fri, Jul 14, 2023 at 09:07:

> I am working on SPARK-44421, SPARK-44423 and SPARK-44424 in Spark Connect
> to support execution reconnection. A week or two of warmfix grace period
> would be much appreciated for this work.
>
> Best regards,
> Juliusz Sompolski
>
> On Fri, Jul 14, 2023 at 5:40 PM Raghu Angadi
>  wrote:
>
>> We have a bunch of work in progress for Spark Connect trying to meet the
>> branch cut deadline.
>>
>> Moving to 17th is certainly welcome.
>>
>> Is it feasible to extend it by a couple more days?
>> Alternatively, we could have a relaxed warmfix process for Spark Connect
>> code for a week or two since it does not affect core Spark.
>>
>> Thank you.
>> Raghu.
>>
>> On Tue, Jul 4, 2023 at 3:42 PM Xinrong Meng  wrote:
>>
>>> +1
>>>
>>> Thank you!
>>>
>>> On Tue, Jul 4, 2023 at 3:04 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>> On Wed, Jul 5, 2023 at 2:23 AM L. C. Hsieh  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Thanks Yuanjian.
>>>>>
>>>>> On Tue, Jul 4, 2023 at 7:45 AM yangjie01  wrote:
>>>>> >
>>>>> > +1
>>>>> >
>>>>> >
>>>>> >
>>>>> > From: Maxim Gekk 
>>>>> > Date: Tuesday, July 4, 2023, 17:24
>>>>> > To: Kent Yao 
>>>>> > Cc: "dev@spark.apache.org" 
>>>>> > Subject: Re: Time for Spark v3.5.0 release
>>>>> >
>>>>> >
>>>>> >
>>>>> > +1
>>>>> >
>>>>> > On Tue, Jul 4, 2023 at 11:55 AM Kent Yao  wrote:
>>>>> >
>>>>> > +1, thank you
>>>>> >
>>>>> > Kent
>>>>> >
>>>>> > On 2023/07/04 05:32:52 Dongjoon Hyun wrote:
>>>>> > > +1
>>>>> > >
>>>>> > > Thank you, Yuanjian
>>>>> > >
>>>>> > > Dongjoon
>>>>> > >
>>>>> > > On Tue, Jul 4, 2023 at 1:03 AM Hyukjin Kwon 
>>>>> wrote:
>>>>> > >
>>>>> > > > Yeah one day postponed shouldn't be a big deal.
>>>>> > > >
>>>>> > > > On Tue, Jul 4, 2023 at 7:10 AM Yuanjian Li <
>>>>> xyliyuanj...@gmail.com> wrote:
>>>>> > > >
>>>>> > > >> Hi All,
>>>>> > > >>
>>>>> > > >> According to the Spark versioning policy at
>>>>> > > >> https://spark.apache.org/versioning-policy.html, should we cut
>>>>> > > >> *branch-3.5* on *July 17th, 2023*? (We initially proposed
>>>>> July 16th,
>>>>> > > >> but since it's a Sunday, I suggest we postpone it by one day).
>>>>> > > >>
>>>>> > > >> I would like to volunteer as the release manager for Apache
>>>>> Spark 3.5.0.
>>>>> > > >>
>>>>> > > >> Best,
>>>>> > > >> Yuanjian
>>>>> > > >>
>>>>> > > >
>>>>> > >
>>>>> >
>>>>> > -
>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>


[Reminder] Spark 3.5 Branch Cut

2023-07-14 Thread Yuanjian Li
Hi everyone,
As discussed earlier in "Time for Spark v3.5.0 release", I will cut
branch-3.5 on *Monday, July 17th at 1 pm PST* as scheduled.

Please plan your PR merge accordingly with the given timeline. Currently,
we have received the following exception merge requests:

   - SPARK-44421: Reattach to existing execute in Spark Connect (server
   mechanism)
   - SPARK-44423:  Reattach to existing execute in Spark Connect (scala
   client)
   - SPARK-44424:  Reattach to existing execute in Spark Connect (python
   client)

If there are any other exception feature requests, please reply to this
email. We will not merge any new features in 3.5 after the branch cut.

Best,
Yuanjian


Time for Spark v3.5.0 release

2023-07-03 Thread Yuanjian Li
Hi All,

According to the Spark versioning policy at
https://spark.apache.org/versioning-policy.html, should we cut *branch-3.5*
on *July 17th, 2023*? (We initially proposed July 16th, but since it's a
Sunday, I suggest we postpone it by one day).

I would like to volunteer as the release manager for Apache Spark 3.5.0.

Best,
Yuanjian


Re: Welcoming three new PMC members

2022-08-09 Thread Yuanjian Li
Congrats everyone!

L. C. Hsieh wrote on Tue, Aug 9, 2022 at 19:01:

> Congrats!
>
> On Tue, Aug 9, 2022 at 5:38 PM Chao Sun  wrote:
> >
> > Congrats everyone!
> >
> > On Tue, Aug 9, 2022 at 5:36 PM Dongjoon Hyun 
> wrote:
> > >
> > > Congrat to all!
> > >
> > > Dongjoon.
> > >
> > > On Tue, Aug 9, 2022 at 5:13 PM Takuya UESHIN 
> wrote:
> > > >
> > > > Congratulations!
> > > >
> > > > On Tue, Aug 9, 2022 at 4:57 PM Hyukjin Kwon 
> wrote:
> > > >>
> > > >> Congrats everybody!
> > > >>
> > > >> On Wed, 10 Aug 2022 at 05:50, Mridul Muralidharan 
> wrote:
> > > >>>
> > > >>>
> > > >>> Congratulations !
> > > >>> Great to have you join the PMC !!
> > > >>>
> > > >>> Regards,
> > > >>> Mridul
> > > >>>
> > > >>> On Tue, Aug 9, 2022 at 11:57 AM vaquar khan 
> wrote:
> > > 
> > >  Congratulations
> > > 
> > >  On Tue, Aug 9, 2022, 11:40 AM Xiao Li 
> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > The Spark PMC recently voted to add three new PMC members. Join
> me in welcoming them to their new roles!
> > > >
> > > > New PMC members: Huaxin Gao, Gengliang Wang and Maxim Gekk
> > > >
> > > > The Spark PMC
> > > >
> > > >
> > > >
> > > > --
> > > > Takuya UESHIN
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Yuanjian Li
Congratulations, Xinrong!

XiDuo You wrote on Tue, Aug 9, 2022 at 19:18:

> Congratulations!
>
> Haejoon Lee wrote on Wed, Aug 10, 2022 at 09:30:
> >
> > Congrats, Xinrong!!
> >
> > On Tue, Aug 9, 2022 at 5:12 PM Hyukjin Kwon  wrote:
> >>
> >> Hi all,
> >>
> >> The Spark PMC recently added Xinrong Meng as a committer on the
> project. Xinrong is a major contributor to PySpark, especially the Pandas API
> on Spark. She has guided a lot of new contributors enthusiastically. Please
> join me in welcoming Xinrong!
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] Add RocksDB StateStore

2021-04-27 Thread Yuanjian Li
Hi all,

Following the latest comments in SPARK-34198
<https://issues.apache.org/jira/browse/SPARK-34198>, Databricks decided to
donate the commercial implementation of the RocksDBStateStore. Compared
with the original decision, there’s only one topic we want to raise again
for discussion: can we directly add the RocksDBStateStoreProvider to the
sql/core module? This suggestion is based on the following reasons:

   1. The RocksDBStateStore aims to solve the problem of the original
      built-in HDFSBackedStateStore.
   2. End users can conveniently set the config to use the new
      implementation (see the sketch below).
   3. We can set the RocksDB one as the default in the future.

Regarding dependencies, I also checked the rocksdbjni package we would
introduce. As a JNI package, it should not have any dependency conflicts
with Apache Spark.
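
To make reason 2 concrete, here is a minimal sketch of how an end user would
opt in via config. The config key is the existing knob for swapping state
store implementations; the provider class name below is an assumption about
where the donated implementation would land in sql/core, so treat it as
hypothetical until the PR ships:

    import org.apache.spark.sql.SparkSession

    // Sketch only: opting in to the RocksDB-backed state store.
    // The class name is hypothetical until the donation PR is merged.
    val spark = SparkSession.builder()
      .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()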

Any suggestions are welcome!

Best,

Yuanjian

Reynold Xin wrote on Sun, Feb 14, 2021 at 6:54 AM:

> Late +1
>
>
> On Sat, Feb 13 2021 at 2:49 PM, Liang-Chi Hsieh  wrote:
>
>> Hi devs,
>>
>> Thanks for all the inputs. I think overall there are positive inputs in the
>> Spark community about having the RocksDB state store as an external module.
>> Then let's go forward with this direction to improve structured streaming. I
>> will keep updating the JIRA SPARK-34198.
>>
>> Thanks all again for the inputs and discussion.
>>
>> Liang-Chi Hsieh
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> - To
>> unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>


Re: [PSA] Please read: PR builder now runs test and build in your forked repository

2021-04-14 Thread Yuanjian Li
Awesome! Thanks for making this happen, Hyukjin!

Yi Wu wrote on Wed, Apr 14, 2021 at 2:51 PM:

> Thanks for the great work, Hyukjin!
>
> On Wed, Apr 14, 2021 at 1:00 PM Gengliang Wang  wrote:
>
>> Thanks for the amazing work, Hyukjin!
>> I created a PR for trial and it looks well so far:
>> https://github.com/apache/spark/pull/32158
>>
>> On Wed, Apr 14, 2021 at 12:47 PM Hyukjin Kwon 
>> wrote:
>>
>>> Hi all,
>>>
>>> After https://github.com/apache/spark/pull/32092 merged, now we run the
>>> GitHub Actions
>>> workflows in your forked repository.
>>>
>>> In short, please see this example HyukjinKwon#34
>>> 
>>>
>>>1. You create a PR and your repository triggers the workflow. Your
>>>PR uses the resources allocated to you for testing.
>>>2. Apache Spark repository finds your workflow, and links it in a
>>>comment in your PR
>>>
>>> Please let me know if you guys find any weird behaviour related to this.
>>>
>>>
>>> *What does that mean to contributors?*
>>>
>>> Please sync your branch to the latest master branch in Apache Spark in
>>> order for your forked repository to run the workflow, and
>>> for the main repository to detect the workflow.
>>>
>>>
>>> *What does that mean to committers?*
>>>
>>> Now, GitHub Actions will show green even when GitHub Actions builds
>>> are still running (in the contributor's forked repository).
>>> Please check the build notified by github-actions bot before merging it.
>>> There would be a followup work to reflect the status of the forked
>>> repository's build to the status of PR.
>>>
>>> On Wed, Apr 14, 2021 at 1:42 PM, Hyukjin Kwon wrote:
>>>
 Hi all,

 After https://github.com/apache/spark/pull/32092 merged, now we run
 the GitHub Actions
 workflows in your forked repository.

 In short, please see this example HyukjinKwon#34
 

1. You create a PR and your repository triggers the workflow. Your
PR uses the resources allocated to you for testing.
2. Apache Spark repository finds your workflow, and links it in a
comment in your PR

 Please let me know if you guys find any weird behaviour related to this.


 *What does that mean to contributors?*

 Please sync your branch to the latest master branch in Apache Spark in
 order for the main repository to run the workflow and detect it.


 *What does that mean to committers?*

 Now, GitHub Actions will show green even when GitHub Actions builds
 are still running (in the contributor's forked repository). Please check the build
 notified by github-actions bot before merging it.
 There would be a followup work to reflect the status of the forked
 repository's build to
 the status of PR.





Re: Welcoming six new Apache Spark committers

2021-03-28 Thread Yuanjian Li
Congrats all! Well deserved!!

Yi Wu wrote on Mon, Mar 29, 2021 at 10:01 AM:

> Thank you, everyone! Thanks for all the help!
>
> Yi
>
> On Sun, Mar 28, 2021 at 4:53 PM Gengliang Wang  wrote:
>
>> Congrats all!
>>
>> On Sun, Mar 28, 2021 at 7:09 AM Xiao Li  wrote:
>>
>>> Congratulations, everyone!
>>>
>>> Xiao
>>>
>>> Chao Sun wrote on Fri, Mar 26, 2021 at 6:30 PM:
>>>
 Congrats everyone!

 On Fri, Mar 26, 2021 at 6:23 PM Mridul Muralidharan 
 wrote:

>
> Congratulations, looking forward to more exciting contributions !
>
> Regards,
> Mridul
>
> On Fri, Mar 26, 2021 at 8:21 PM Dongjoon Hyun 
> wrote:
>
>>
>> Congratulations! :)
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Mar 26, 2021 at 5:55 PM angers zhu 
>> wrote:
>>
>>> Congratulations
>>>
>>> Prashant Sharma wrote on Sat, Mar 27, 2021 at 8:35 AM:
>>>
 Congratulations  all!!

 On Sat, Mar 27, 2021, 5:10 AM huaxin gao 
 wrote:

> Congratulations to you all!!
>
> On Fri, Mar 26, 2021 at 4:22 PM Yuming Wang 
> wrote:
>
>> Congrats!
>>
>> On Sat, Mar 27, 2021 at 7:13 AM Takeshi Yamamuro <
>> linguin@gmail.com> wrote:
>>
>>> Congrats, all~
>>>
>>> On Sat, Mar 27, 2021 at 7:46 AM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 Congrats all!

 On Sat, Mar 27, 2021 at 6:56 AM, Liang-Chi Hsieh wrote:

> Congrats! Welcome!
>
>
> Matei Zaharia wrote
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers.
> Please join me
> > in welcoming them to their new role! Our new committers are:
> >
> > - Maciej Szymkiewicz (contributor to PySpark)
> > - Max Gekk (contributor to Spark SQL)
> > - Kent Yao (contributor to Spark SQL)
> > - Attila Zsolt Piros (contributor to decommissioning and
> Spark on
> > Kubernetes)
> > - Yi Wu (contributor to Spark Core and SQL)
> > - Gabor Somogyi (contributor to Streaming and security)
> >
> > All six of them contributed to Spark 3.1 and we’re very
> excited to have
> > them join as committers.
> >
> > Matei and the Spark PMC
> >
> -
> > To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
>
>
>
>
> --
> Sent from:
> http://apache-spark-developers-list.1001551.n3.nabble.com/
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-11-26 Thread Yuanjian Li
Nice blog! Thanks for sharing, Etienne!

Let's try to raise this discussion again after the 3.1 release. I do think
more committers/contributors have realized the issue of the global watermark,
per SPARK-24634 <https://issues.apache.org/jira/browse/SPARK-24634> and
SPARK-33259 <https://issues.apache.org/jira/browse/SPARK-33259>.

Leaving some thoughts on my end:
1. Compatibility: The per-operation watermark should be compatible with the
original global one when there are no multi-aggregations.
2. Versioning: If we need to change checkpoints' format, versioning info
should be added for the first time.
3. Fix more things together: We'd better fix more issues (e.g., per-operation
output mode for multi-aggregations) together, which would require
versioning changes in the same Spark version.

Best,
Yuanjian


Etienne Chauchot wrote on Thu, Nov 26, 2020 at 5:29 PM:

> Hi,
>
> Regarding this subject I wrote a blog article that gives details about the
> watermark architecture proposal that was discussed in the design doc and in
> the PR:
>
>
> https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html
>
> Best
>
> Etienne
> On 29/09/2020 03:24, Yuanjian Li wrote:
>
> Thanks for the great discussion!
>
> I'm also interested in this feature and did some investigation before. As Arun
> mentioned, similar to the "update" mode, the "complete" mode also needs
> more design. We might need an operation-level output mode for complete
> mode support. That is to say, if we use "complete" mode for every
> aggregation operator, wrong results will be returned.
>
> SPARK-26655 would be a good start, which only considers "append"
> mode. Maybe we need more discussion on the watermark interface. I will take
> a close look at the doc and PR. Hope we will have the first version with
> limitations and fix/remove them gradually.
>
> Best,
> Yuanjian
>
Jungtaek Lim wrote on Sat, Sep 26, 2020 at 10:31 AM:
>
>> Thanks Etienne! Yeah I forgot to say nice talking with you again. And
>> sorry I forgot to send the reply (was in draft).
>>
>> Regarding investment in SS, well, unfortunately I don't know - I'm just
>> an individual. There might be various reasons to do so, most probably
>> "priority" among the stuff. There's not much I could change.
>>
>> I agree the workaround is sub-optimal, but unless I see sufficient
>> support in the community I probably couldn't make it go forward. I'll just
>> say there's an elephant in the room - as the project goes forward for more
>> than 10 years, backward compatibility is a top priority concern in the
>> project, even across the major versions along the features/APIs. It is
>> great for end users to migrate the version easily, but also blocks devs to
>> fix the bad design once it ships. I'm the one complaining about these
>> issues in the dev list, and I don't see willingness to correct them.
>>
>>
>> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot 
>> wrote:
>>
>>> Hi Jungtaek Lim,
>>>
>>> Nice to hear from you again since last time we talked :) and congrats on
>>> becoming a Spark committer in the meantime ! (if I'm not mistaking you were
>>> not at the time)
>>>
>>> I totally agree with what you're saying on merging structural parts of
>>> Spark without having a broader consensus. What I don't understand is why
>>> there is not more investment in SS. Especially because in another thread
>>> the community is discussing about deprecating the regular DStream streaming
>>> framework.
>>>
>>> Is the orientation of Spark now mostly batch ?
>>>
>>> PS: yeah I saw your update on the doc when I took a look at 3.0 preview
>>> 2 searching for this particular feature. And regarding the workaround, I'm
>>> not sure it meets my needs as it will add delays and also may mess up with
>>> watermarks.
>>>
>>> Best
>>>
>>> Etienne Chauchot
>>>
>>>
>>> On 04/09/2020 08:06, Jungtaek Lim wrote:
>>>
>>> Unfortunately I don't see enough active committers working on Structured
>>> Streaming; I don't expect major features/improvements can be brought in
>>> this situation.
>>>
>>> Technically I can review and merge the PR on major improvements in SS,
>>> but that depends on how huge the proposal is changing. If the proposal
>>> brings conceptual change, being reviewed by a committer wouldn't still be
>>> enough.
>>>
>>> So that's not due to the fact we think it's worthless. (That might be
>>> only me though.) I'd understand a

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-11 Thread Yuanjian Li
Already +1 in the PR. It would be great to mention the new config in the SS
migration guide.
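
For readers catching up, a minimal sketch of the escape hatch under
discussion. The config key is the one proposed in
https://github.com/apache/spark/pull/30210 and should be treated as tentative
until the PR is merged:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // With the check enabled (the proposed default), a query that chains
    // stateful operators under a single global watermark fails fast instead
    // of silently returning possibly-wrong results. Users who understand
    // the risk can opt out:
    spark.conf.set(
      "spark.sql.streaming.statefulOperator.checkCorrectness.enabled", "false")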

Ryan Blue wrote on Wed, Nov 11, 2020 at 7:48 AM:

> +1, I agree with Tom.
>
> On Tue, Nov 10, 2020 at 3:00 PM Dongjoon Hyun 
> wrote:
>
>> +1 for Apache Spark 3.1.0.
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 10, 2020 at 6:17 AM Tom Graves 
>> wrote:
>>
>>> +1 since its a correctness issue, I think its ok to change the behavior
>>> to make sure the user is aware of it and let them decide.
>>>
>>> Tom
>>>
>>> On Saturday, November 7, 2020, 01:00:11 AM CST, Liang-Chi Hsieh <
>>> vii...@gmail.com> wrote:
>>>
>>>
>>> Hi devs,
>>>
>>> In Spark structured streaming, chained stateful operators can possibly
>>> produce incorrect results under the global watermark. SPARK-33259
>>> (https://issues.apache.org/jira/browse/SPARK-33259) has an example
>>> demonstrating what the correctness issue could be.
>>>
>>> Currently we don't prevent users from running such queries, because the
>>> possible correctness issue in chained stateful operators in a streaming
>>> query is not straightforward for users. From a user's perspective, it will
>>> possibly be considered a Spark bug, like SPARK-33259. In the worse case,
>>> users are not aware of the correctness issue and use wrong results.
>>>
>>> IMO, it is better to disable such queries and let users choose to run the
>>> query if they understand there is such a risk, instead of implicitly
>>> running the query and letting users find out the correctness issue by
>>> themselves.
>>>
>>> I would like to propose to disable the streaming query with possible
>>> correctness issue in chained stateful operators. The behavior can be
>>> controlled by a SQL config, so if users understand the risk and still
>>> want
>>> to run the query, they can disable the check.
>>>
>>> In the PR (https://github.com/apache/spark/pull/30210), the concern I
>>> got
>>> for now is, this changes current behavior and by default it will break
>>> some
>>> existing streaming queries. But I think it is pretty easy to disable the
>>> check with the new config. In the PR currently there is no objection but
>>> suggestion to hear more voices. Please let me know if you have some
>>> thoughts.
>>>
>>> Thanks.
>>> Liang-Chi Hsieh
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: What's the root cause of not supporting multiple aggregations in structured streaming?

2020-09-28 Thread Yuanjian Li
Thanks for the great discussion!

I'm also interested in this feature and did some investigation before. As Arun
mentioned, similar to the "update" mode, the "complete" mode also needs
more design. We might need an operation-level output mode for complete
mode support. That is to say, if we use "complete" mode for every
aggregation operator, wrong results will be returned.

SPARK-26655 would be a good start, which only considers "append"
mode. Maybe we need more discussion on the watermark interface. I will take
a close look at the doc and PR. Hope we will have the first version with
limitations and fix/remove them gradually.
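
To make the limitation concrete, a minimal sketch of the unsupported query
shape (the source and window sizes are illustrative; starting such a query
today fails the unsupported-operations check with "Multiple streaming
aggregations are not supported with streaming DataFrames/Datasets"):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().getOrCreate()

    // Any streaming source with an event-time column works; rate is a stand-in.
    val events = spark.readStream.format("rate").load()

    val firstAgg = events
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()

    // A second aggregation chained on the first is rejected at query start.
    val query = firstAgg
      .groupBy()
      .max("count")
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()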

Best,
Yuanjian

Jungtaek Lim wrote on Sat, Sep 26, 2020 at 10:31 AM:

> Thanks Etienne! Yeah I forgot to say nice talking with you again. And
> sorry I forgot to send the reply (was in draft).
>
> Regarding investment in SS, well, unfortunately I don't know - I'm just an
> individual. There might be various reasons to do so, most probably
> "priority" among the stuff. There's not much I could change.
>
> I agree the workaround is sub-optimal, but unless I see sufficient support
> in the community I probably couldn't make it go forward. I'll just say
> there's an elephant in the room - as the project goes forward for more than
> 10 years, backward compatibility is a top priority concern in the project,
> even across the major versions along the features/APIs. It is great for end
> users to migrate the version easily, but also blocks devs to fix the bad
> design once it ships. I'm the one complaining about these issues in the dev
> list, and I don't see willingness to correct them.
>
>
> On Fri, Sep 4, 2020 at 5:55 PM Etienne Chauchot 
> wrote:
>
>> Hi Jungtaek Lim,
>>
>> Nice to hear from you again since last time we talked :) and congrats on
>> becoming a Spark committer in the meantime ! (if I'm not mistaking you were
>> not at the time)
>>
>> I totally agree with what you're saying on merging structural parts of
>> Spark without having a broader consensus. What I don't understand is why
>> there is not more investment in SS. Especially because in another thread
>> the community is discussing about deprecating the regular DStream streaming
>> framework.
>>
>> Is the orientation of Spark now mostly batch ?
>>
>> PS: yeah I saw your update on the doc when I took a look at 3.0 preview 2
>> searching for this particular feature. And regarding the workaround, I'm
>> not sure it meets my needs as it will add delays and also may mess up with
>> watermarks.
>>
>> Best
>>
>> Etienne Chauchot
>>
>>
>> On 04/09/2020 08:06, Jungtaek Lim wrote:
>>
>> Unfortunately I don't see enough active committers working on Structured
>> Streaming; I don't expect major features/improvements can be brought in
>> this situation.
>>
>> Technically I can review and merge the PR on major improvements in SS,
>> but that depends on how huge the proposal is changing. If the proposal
>> brings conceptual change, being reviewed by a committer wouldn't still be
>> enough.
>>
>> So that's not due to the fact we think it's worthless. (That might be
>> only me though.) I'd understand as there's not much investment on SS.
>> There's also a known workaround for multiple aggregations (I've documented
>> in the SS guide doc, in "Limitation of global watermark" section), though I
>> totally agree the workaround is bad.
>>
>> On Tue, Sep 1, 2020 at 12:28 AM Etienne Chauchot 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm also very interested in this feature but the PR is open since
>>> January 2019 and was not updated. It raised a design discussion around
>>> watermarks and a design doc was written (
>>> https://docs.google.com/document/d/1IAH9UQJPUiUCLd7H6dazRK2k1szDX38SnM6GVNZYvUo/edit#heading=h.npkueh4bbkz1).
>>> We also commented this design but no matter what it seems that the subject
>>> is still stale.
>>>
>>> Is there any interest in the community in delivering this feature or is
>>> it considered worthless ? If the latter, can you explain why ?
>>>
>>> Best
>>>
>>> Etienne
>>> On 22/05/2019 03:38, 张万新 wrote:
>>>
>>> Thanks, I'll check it out.
>>>
>>> Arun Mahadevan wrote on Tue, May 21, 2019 at 01:31:
>>>
 Here's the proposal for supporting it in "append" mode -
 https://github.com/apache/spark/pull/23576. You could see if it
 addresses your requirement and post your feedback in the PR.
 For "update" mode its going to be much harder to support this without
 first adding support for "retractions", otherwise we would end up with
 wrong results.

 - Arun


 On Mon, 20 May 2019 at 01:34, Gabor Somogyi 
 wrote:

> There is PR for this but not yet merged.
>
> On Mon, May 20, 2019 at 10:13 AM 张万新  wrote:
>
>> Hi there,
>>
>> I'd like to know what's the root reason why multiple aggregations on
>> streaming dataframe is not allowed since it's a very useful feature, and
>> flink has supported it for a long time.
>>
>> Thanks.
>>
>


Re: Welcoming some new Apache Spark committers

2020-07-15 Thread Yuanjian Li
Congratulations!!

huaxin gao wrote on Thu, Jul 16, 2020 at 6:24 AM:

> Thanks everyone! I am looking forward to working with you all in the
> future.
>
> On Tue, Jul 14, 2020 at 5:02 PM Hyukjin Kwon  wrote:
>
>> Congrats!
>>
>> On Wed, Jul 15, 2020 at 7:56 AM, Takeshi Yamamuro wrote:
>>
>>> Congrats, all!
>>>
>>> On Wed, Jul 15, 2020 at 5:15 AM Takuya UESHIN 
>>> wrote:
>>>
 Congrats and welcome!

 On Tue, Jul 14, 2020 at 1:07 PM Bryan Cutler  wrote:

> Congratulations and welcome!
>
> On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang 
> wrote:
>
>> Welcome, Huaxin, Jungtaek, and Dilip!
>>
>> Congratulations!
>>
>> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia <
>> matei.zaha...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> The Spark PMC recently voted to add several new committers. Please
>>> join me in welcoming them to their new roles! The new committers are:
>>>
>>> - Huaxin Gao
>>> - Jungtaek Lim
>>> - Dilip Biswal
>>>
>>> All three of them contributed to Spark 3.0 and we’re excited to have
>>> them join the project.
>>>
>>> Matei and the Spark PMC
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

 --
 Takuya UESHIN


>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>


Re: [DISCUSS] Drop Python 2, 3.4 and 3.5

2020-07-01 Thread Yuanjian Li
+1, especially Python 2

Holden Karau wrote on Thu, Jul 2, 2020 at 10:20 AM:

> I’m ok with us dropping Python 2, 3.4, and 3.5 in Spark 3.1 forward. It
> will be exciting to get to use more recent Python features. The most recent
> Ubuntu LTS ships with 3.7, and while the previous LTS ships with 3.5, if
> folks really can’t upgrade there’s conda.
>
> Is there anyone with a large Python 3.5 fleet who can’t use conda?
>
> On Wed, Jul 1, 2020 at 7:15 PM Hyukjin Kwon  wrote:
>
>> Yeah, sure. It will be dropped at Spark 3.1 onwards. I don't think we
>> should make such changes in maintenance releases
>>
>> On Thu, Jul 2, 2020 at 11:13 AM, Holden Karau wrote:
>>
>>> To be clear the plan is to drop them in Spark 3.1 onwards, yes?
>>>
>>> On Wed, Jul 1, 2020 at 7:11 PM Hyukjin Kwon  wrote:
>>>
 Hi all,

 I would like to discuss dropping deprecated Python versions 2, 3.4 and
 3.5 at https://github.com/apache/spark/pull/28957. I assume people
 support it in general
 but I am writing this to make sure everybody is happy.

 Fokko made a very good investigation on it, see
 https://github.com/apache/spark/pull/28957#issuecomment-652022449.
 Assuming from the statistics, I think we're pretty safe to drop them.
 Also note that dropping Python 2 was actually declared at
 https://python3statement.org/

 Roughly speaking, there are many advantages to dropping them:
   1. It removes a bunch of hacks we added, around 700 lines in PySpark.
   2. PyPy2 has a critical bug that causes a flaky test,
 https://issues.apache.org/jira/browse/SPARK-28358, given my testing and
 investigation.
   3. Users can use Python type hints with Pandas UDFs without thinking
 about the Python version.
   4. Users can leverage the latest cloudpickle,
 https://github.com/apache/spark/pull/28950. With Python 3.8+ it can
 also leverage C pickle.
   5. ...

 So it benefits both users and dev. WDYT guys?


 --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


[DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Yuanjian Li
Hi dev-list,

I’m writing this to raise the discussion about Spark 3.0.1 feasibility
since 4 blocker issues were found after Spark 3.0.0:


   1. [SPARK-31990] <https://issues.apache.org/jira/browse/SPARK-31990> The
      broken state store compatibility causes a correctness issue when a
      streaming query with `dropDuplicates` uses a checkpoint written by an
      older Spark version.
   2. [SPARK-32038] <https://issues.apache.org/jira/browse/SPARK-32038> A
      regression bug in handling NaN values in COUNT(DISTINCT); see the
      sketch after this list.
   3. [SPARK-31918] <https://issues.apache.org/jira/browse/SPARK-31918> [WIP]
      CRAN requires SparkR to work with the latest R 4.0. This makes the 3.0
      release unavailable on CRAN, as it only supports R [3.5, 4.0).
   4. [SPARK-31967] <https://issues.apache.org/jira/browse/SPARK-31967>
      Downgrade vis.js to fix the Jobs UI loading time regression.
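
To make blocker 2 concrete, a minimal sketch of the affected query shape
(values are illustrative; intBitsToFloat(-1) just manufactures a NaN with a
different bit pattern):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    // Two NaNs with different bit patterns plus one ordinary value.
    Seq(Float.NaN, java.lang.Float.intBitsToFloat(-1), 1.0f)
      .toDF("x")
      .createOrReplaceTempView("t")

    // All NaNs should normalize to a single distinct value, so the expected
    // answer is 2; the regression produced a wrong count.
    spark.sql("SELECT COUNT(DISTINCT x) FROM t").show()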


I also noticed branch-3.0 already has 39 commits after Spark 3.0.0. I think
it would be great to have Spark 3.0.1 deliver the critical fixes.

Any comments are appreciated.

Best,

Yuanjian


Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-12 Thread Yuanjian Li
Thanks Holden and Dongjoon for the help!
The bugfix for SPARK-31663 is ready for review; I hope it can be picked up in
2.4.7 if possible.
https://github.com/apache/spark/pull/28501

Best,
Yuanjian

Takeshi Yamamuro wrote on Mon, May 11, 2020 at 9:03 AM:

> I checked on my MacOS env; all the tests
> with `-Pyarn -Phadoop-2.7 -Pdocker-integration-tests -Phive
> -Phive-thriftserver -Pmesos -Pkubernetes -Psparkr`
> passed and I couldn't find any issue;
>
> maropu@~:$java -version
> java version "1.8.0_181"
> Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
>
> Bests,
> Takeshi
>
>
> On Sun, May 10, 2020 at 2:50 AM Holden Karau  wrote:
>
>> Thanks Dongjoon :)
>> So it’s not a regression, but if it won’t be a large delay I think
>> holding for the correctness fix would be good (and we can pick up the two
>> issues fixed in 2.4.7). What does everyone think?
>>
>> On Fri, May 8, 2020 at 11:40 AM Dongjoon Hyun 
>> wrote:
>>
>>> I confirmed and update the JIRA. SPARK-31663 is a correctness issue
>>> since Apache Spark 2.4.0.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Fri, May 8, 2020 at 10:26 AM Holden Karau 
>>> wrote:
>>>
>>>> Can you provide a bit more context (is it a regression?)
>>>>
>>>> On Fri, May 8, 2020 at 9:33 AM Yuanjian Li 
>>>> wrote:
>>>>
>>>>> Hi Holden,
>>>>>
>>>>> I'm working on the bugfix of SPARK-31663
>>>>> <https://issues.apache.org/jira/browse/SPARK-31663>, let me post it
>>>>> here since it's a correctness bug and also affects 2.4.6.
>>>>>
>>>>> Best,
>>>>> Yuanjian
>>>>>
>>>>> Sean Owen wrote on Fri, May 8, 2020 at 11:42 PM:
>>>>>
>>>>>> +1 from me. The usual: sigs OK, license looks as intended, tests pass
>>>>>> from a source build for me.
>>>>>>
>>>>>> On Thu, May 7, 2020 at 1:29 PM Holden Karau 
>>>>>> wrote:
>>>>>> >
>>>>>> > Please vote on releasing the following candidate as Apache Spark
>>>>>> version 2.4.6.
>>>>>> >
>>>>>> > The vote is open until February 5th 11PM PST and passes if a
>>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>> >
>>>>>> > [ ] +1 Release this package as Apache Spark 2.4.6
>>>>>> > [ ] -1 Do not release this package because ...
>>>>>> >
>>>>>> > To learn more about Apache Spark, please see
>>>>>> http://spark.apache.org/
>>>>>> >
>>>>>> > There are currently no issues targeting 2.4.6 (try project = SPARK
>>>>>> AND "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In
>>>>>> Progress"))
>>>>>> >
>>>>>> > We _may_ want to hold the 2.4.6 release for something targetted to
>>>>>> 2.4.7 ( project = SPARK AND "Target Version/s" = "2.4.7") , currently,
>>>>>> SPARK-24266 & SPARK-26908 and I believe there is some discussion on if we
>>>>>> should include SPARK-31399 in this release.
>>>>>> >
>>>>>> > The tag to be voted on is v2.4.5-rc2 (commit
>>>>>> a3cffc997035d11e1f6c092c1186e943f2f63544):
>>>>>> > https://github.com/apache/spark/tree/v2.4.6-rc1
>>>>>> >
>>>>>> > The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
>>>>>> >
>>>>>> > Signatures used for Spark RCs can be found in this file:
>>>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>> >
>>>>>> > The staging repository for this release can be found at:
>>>>>> >
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1340/
>>>>>> >
>>>>>> > The documentation corresponding to this release can be found at:
>>>>>> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
>>>>>> >
>>>>>> > The list of bug fixes going into 2.4.6 can be fou

Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-08 Thread Yuanjian Li
Hi Holden,

I'm working on the bugfix of SPARK-31663
<https://issues.apache.org/jira/browse/SPARK-31663>; let me post it here
since it's a correctness bug and it also affects 2.4.6.

Best,
Yuanjian

Sean Owen wrote on Fri, May 8, 2020 at 11:42 PM:

> +1 from me. The usual: sigs OK, license looks as intended, tests pass
> from a source build for me.
>
> On Thu, May 7, 2020 at 1:29 PM Holden Karau  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.6.
> >
> > The vote is open until February 5th 11PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.6
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > There are currently no issues targeting 2.4.6 (try project = SPARK AND
> "Target Version/s" = "2.4.6" AND status in (Open, Reopened, "In Progress"))
> >
> > We _may_ want to hold the 2.4.6 release for something targetted to 2.4.7
> ( project = SPARK AND "Target Version/s" = "2.4.7") , currently,
> SPARK-24266 & SPARK-26908 and I believe there is some discussion on if we
> should include SPARK-31399 in this release.
> >
> > The tag to be voted on is v2.4.5-rc2 (commit
> a3cffc997035d11e1f6c092c1186e943f2f63544):
> > https://github.com/apache/spark/tree/v2.4.6-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1340/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.6-rc1-docs/
> >
> > The list of bug fixes going into 2.4.6 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12346781
> >
> > This release is using the release script of the tag v2.4.6-rc1.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out of date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.6?
> > ===
> >
> > The current list of open tickets targeted at 2.4.5 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.6
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSS] PostgreSQL dialect

2019-12-04 Thread Yuanjian Li
Thanks all of you for joining the discussion.
The PR is given in https://github.com/apache/spark/pull/26763; all the
PostgreSQL dialect related PRs are linked in the description.
I hope the authors could help with reviewing.

Best,
Yuanjian

Driesprong, Fokko wrote on Sun, Dec 1, 2019 at 7:24 PM:

> +1 (non-binding)
>
> Cheers, Fokko
>
> Op do 28 nov. 2019 om 03:47 schreef Dongjoon Hyun  >:
>
>> +1
>>
>> Bests,
>> Dongjoon.
>>
>> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Yea, +1, that looks pretty reasonable to me.
>>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>>> from the codebase before it's too late. Curently we only have 3 features
>>> under PostgreSQL dialect:
>>> I personally think we could at least stop work on the Dialect until
>>> 3.0 is released.
>>>
>>>
>>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>>> gengliang.w...@databricks.com> wrote:
>>>
 +1 with the practical proposal.
 To me, the major concern is that the code base becomes complicated,
 while the PostgreSQL dialect has very limited features. I tried introducing
 one big flag `spark.sql.dialect` and isolating related code in #25697
 , but it seems hard to be
 clean.
 Furthermore, the PostgreSQL dialect configuration overlaps with the
 ANSI mode, which can be confusing sometimes.

 Gengliang

 On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:

> +1
>
>
>> One particular negative effect has been that new postgresql tests add
>> well over an hour to tests,
>
>
> Adding postgresql tests is for improving the test coverage of Spark
> SQL. We should continue to do this by importing more test cases. The
> quality of Spark highly depends on the test coverage. We can further
> parallelize the test execution to reduce the test time.
>
> Migrating PostgreSQL workloads to Spark SQL
>
>
> This should not be our current focus. In the near future, it is
> impossible to be fully compatible with PostgreSQL. We should focus on
> adding features that are useful to the Spark community. PostgreSQL is a good
> reference, but we do not need to follow it blindly. We have already closed
> multiple related JIRAs that tried to add PostgreSQL features that are
> not commonly used.
>
> Cheers,
>
> Xiao
>
>
> On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> I think it is important to distinguish between two different concepts:
>>
>>    - Adherence to standards and their well-established implementations.
>>    - Enabling migrations from some product X to Spark.
>>
>> While these two problems are related, they are independent, and one
>> can be achieved without the other.
>>
>>    - The former approach doesn't imply that all features of the SQL
>>    standard (or a specific implementation of it) are provided. It is
>>    sufficient that the commonly used features that are implemented are
>>    standard compliant. Therefore, if an end user applies some well-known
>>    pattern, things will work as expected.
>>
>>    In my personal opinion that's something that is worth the required
>>    development resources, and in general should happen within the
>>    project.
>>
>>    - The latter one is more complicated. First of all, the premise that
>>    one can "migrate PostgreSQL workloads to Spark" seems to be flawed.
>>    While both Spark and PostgreSQL evolve, and probably have more in
>>    common today than a few years ago, they're not even close enough to
>>    pretend that one can be a replacement for the other. In contrast,
>>    existing compatibility layers between major vendors make sense,
>>    because the feature disparity (at least when it comes to core
>>    functionality) is usually minimal. And that doesn't even touch the
>>    problem that PostgreSQL provides extensively used extension points
>>    that enable a broad and evolving ecosystem (what should we do about
>>    continuous queries? Should Structured Streaming provide some
>>    compatibility layer as well?).
>>
>>    More realistically, Spark could provide a compatibility layer with
>>    some analytical tools that themselves provide some PostgreSQL
>>    compatibility, but these are not always fully compatible with upstream
>>    PostgreSQL, nor do they necessarily follow the latest PostgreSQL
>>    development.
>>
>>    Furthermore, a compatibility layer can be, within certain limits
>>    (i.e. the availability of required primitives), maintained as a
>>    separate project, without putting more strain on existing resources.
>>    Effectively what we care about here is if we can 

Re: Welcoming some new committers and PMC members

2019-09-09 Thread Yuanjian Li
Congratulations!

On Tue, Sep 10, 2019 at 10:15 AM, sujith chacko wrote:

> Congratulations all.
>
> On Tue, 10 Sep 2019 at 7:27 AM, Haibo  wrote:
>
>> congratulations~
>>
>>
>>
>> On Sep 10, 2019 at 09:30, Joseph Torres wrote:
>>
>> congratulations!
>>
>> On Mon, Sep 9, 2019 at 6:27 PM 王 斐  wrote:
>>
>>> congratulations!
>>>
>>> Get Outlook for iOS
>>>
>>> --
>>> *From:* Ye Xianjin 
>>> *Sent:* Tuesday, September 10, 2019 09:26
>>> *To:* Jeff Zhang
>>> *Cc:* Saisai Shao; dev
>>> *Subject:* Re: Welcoming some new committers and PMC members
>>>
>>> Congratulations!
>>>
>>> Sent from my iPhone
>>>
>>> On Sep 10, 2019, at 9:19 AM, Jeff Zhang  wrote:
>>>
>>> Congratulations!
>>>
>>> On Tue, Sep 10, 2019 at 9:16 AM, Saisai Shao wrote:
>>>
 Congratulations!

 On Mon, Sep 9, 2019 at 6:11 PM, Jungtaek Lim wrote:

> Congratulations! Well deserved!
>
> On Tue, Sep 10, 2019 at 9:51 AM John Zhuge  wrote:
>
>> Congratulations!
>>
>> On Mon, Sep 9, 2019 at 5:45 PM Shane Knapp 
>> wrote:
>>
>>> congrats everyone!  :)
>>>
>>> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia <
>>> matei.zaha...@gmail.com> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > The Spark PMC recently voted to add several new committers and one
>>> PMC member. Join me in welcoming them to their new roles!
>>> >
>>> > New PMC member: Dongjoon Hyun
>>> >
>>> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming
>>> Wang, Weichen Xu, Ruifeng Zheng
>>> >
>>> > The new committers cover lots of important areas including ML,
>>> SQL, and data sources, so it’s great to have them here. All the best,
>>> >
>>> > Matei and the Spark PMC
>>> >
>>> >
>>> >
>>> -
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>

>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>>


Re: [SPARK-23207] Repro

2019-08-12 Thread Yuanjian Li
Hi Tyson,

Thanks for reporting this!
I reproduced it locally based on your code with some changes, which keep
only the wrong-answer job. The code is as below:

import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map { x => (x % 1000, x) }
// Kill an executor in the stage that performs repartition(239): the first
// task of the first stage attempt brings down its own JVM via pkill.
val df = res.repartition(113).cache.repartition(239).map { x =>
  if (TaskContext.get.attemptNumber == 0 &&
      TaskContext.get.partitionId < 1 &&
      TaskContext.get.stageAttemptNumber == 0) {
    throw new Exception("pkill -f -n java".!!)
  }
  x
}
val r2 = df.distinct.count()

I think the reason for the wrong answer is that, in the CachedRDDBuilder, we
fail to propagate the `isOrderSensitive` property to the newly created
MapPartitionsRDD. Jira created at:
https://issues.apache.org/jira/browse/SPARK-28699.
The fix is based on Wenchen's work in SPARK-23243. Currently, we make the
job fail when we find an indeterminate stage retry. Feel free to review.

Support for rerunning indeterminate stages in Spark will come after
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341>. If you need
indeterminate-stage handling after a cache operation right now, you can test
on this branch.
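
For intuition on why repartition is order-sensitive, here is a minimal
sketch (runnable in spark-shell; all names are local to this example) that
prints the row-to-partition assignment. Round-robin repartitioning assigns
rows to output partitions based on the order in which they arrive, so
recomputing a lost input partition in a different order can route the same
row to a different output partition:

import org.apache.spark.sql.SparkSession

val session = SparkSession.builder.master("local[4]").appName("order-sensitivity").getOrCreate()

// Collect (row, outputPartition) pairs after a round-robin repartition.
// The mapping depends on arrival order within each input partition, which
// is exactly the property `isOrderSensitive` is meant to track.
val assignment = session.range(0, 20).repartition(3)
  .rdd
  .mapPartitionsWithIndex((pid, it) => it.map(v => (v.longValue, pid)))
  .collect()
assignment.sortBy(_._1).foreach { case (v, pid) => println(s"row $v -> partition $pid") }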

Best,
Yuanjian

On Mon, Aug 12, 2019 at 8:19 PM, Wenchen Fan wrote:

> Hi Tyson,
>
> Thanks for reporting it! I quickly checked the related scheduler code but
> can't find an obvious place that can go wrong with cached RDD.
>
> Sean said that he can't produce it, but the second job fails. This is
> actually expected. We need a lot more changes to completely fix this
> problem, so currently the fix is to fail the job if the scheduler needs to
> retry an indeterminate shuffle map stage.
>
> It would be great to know if we can reproduce this bug with the master
> branch.
>
> Thanks,
> Wenchen
>
> On Sun, Aug 11, 2019 at 7:22 AM Xiao Li  wrote:
>
>> Hi, Tyson,
>>
>> Could you open a new JIRA with correctness label? SPARK-23207 might not
>> cover all the scenarios, especially when you using cache.
>>
>> Cheers,
>>
>> Xiao
>>
>> On Fri, Aug 9, 2019 at 9:26 AM  wrote:
>>
>>> Hi Sean,
>>>
>>> To finish the job, I did need to set spark.stage.maxConsecutiveAttempts
>>> to a large number (e.g., 100), per a suggestion from Jiang Xingbo.
>>>
>>> I haven't seen any recent movement/PRs on this issue, but I'll see if we
>>> can repro with a more recent version of Spark.
>>>
>>> Best regards,
>>> Tyson
>>>
>>> -Original Message-
>>> From: Sean Owen 
>>> Sent: Friday, August 9, 2019 7:49 AM
>>> To: tcon...@gmail.com
>>> Cc: dev 
>>> Subject: Re: [SPARK-23207] Repro
>>>
>>> Interesting but I'd put this on the JIRA, and also test vs master first.
>>> It's entirely possible this is something else that was subsequently fixed,
>>> and maybe even backported for 2.4.4.
>>> (I can't quite reproduce it - just makes the second job fail, which is
>>> also puzzling)
>>>
>>> On Fri, Aug 9, 2019 at 8:11 AM  wrote:
>>> >
>>> > Hi,
>>> >
>>> >
>>> >
>>> > We are able to reproduce this bug in Spark 2.4 using the following
>>> program:
>>> >
>>> >
>>> >
>>> > import scala.sys.process._
>>> >
>>> > import org.apache.spark.TaskContext
>>> >
>>> >
>>> >
>>> > val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000,
>>> > x)}.repartition(20)
>>> >
>>> > res.distinct.count
>>> >
>>> >
>>> >
>>> > // kill an executor in the stage that performs repartition(239)
>>> >
>>> > val df = res.repartition(113).cache.repartition(239).map { x =>
>>> >
>>> >   if (TaskContext.get.attemptNumber == 0 &&
>>> > TaskContext.get.partitionId < 1) {
>>> >
>>> > throw new Exception("pkill -f java".!!)
>>> >
>>> >   }
>>> >
>>> >   x
>>> >
>>> > }
>>> >
>>> > df.distinct.count()
>>> >
>>> >
>>> >
>>> > The first df.distinct.count correctly produces 1
>>> >
>>> > The second df.distinct.count incorrectly produces 9769
>>> >
>>> >
>>> >
>>> > If the cache step is removed then the bug does not reproduce.
>>> >
>>> >
>>> >
>>> > Best regards,
>>> >
>>> > Tyson
>>> >
>>> >
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> 
>>
>


Re: Welcome Jose Torres as a Spark committer

2019-01-29 Thread Yuanjian Li
Congrats Jose!

Best,
Yuanjian

On Wed, Jan 30, 2019 at 8:21 AM, Takeshi Yamamuro wrote:

> Congrats, Jose!
>
> Best,
> Takeshi
>
> On Wed, Jan 30, 2019 at 6:10 AM Jungtaek Lim  wrote:
>
>> Congrats Jose! Well deserved.
>>
>> - Jungtaek Lim (HeartSaVioR)
>>
>>> On Wed, Jan 30, 2019 at 5:19 AM, Dongjoon Hyun wrote:
>>
>>> Congrats, Jose! :)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Tue, Jan 29, 2019 at 11:41 AM Arun Mahadevan 
>>> wrote:
>>>
 Congrats Jose! Well deserved.

 On Tue, 29 Jan 2019 at 11:15, Jules Damji  wrote:

> Congrats Jose!
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Jan 29, 2019, at 11:07 AM, shane knapp  wrote:
>
> congrats, and welcome!
>
> On Tue, Jan 29, 2019 at 11:07 AM Dean Wampler 
> wrote:
>
>> Congrats, Jose!
>>
>>
>> *Dean Wampler, Ph.D.*
>>
>> *VP, Fast Data Engineering at Lightbend*
>>
>>
>> On Tue, Jan 29, 2019 at 12:52 PM Burak Yavuz 
>> wrote:
>>
>>> Congrats Jose!
>>>
>>> On Tue, Jan 29, 2019 at 10:50 AM Xiao Li 
>>> wrote:
>>>
 Congratulations!

 Xiao

 On Tue, Jan 29, 2019 at 10:48 AM, Shixiong Zhu wrote:

> Hi all,
>
> The Apache Spark PMC recently added Jose Torres as a committer on
> the project. Jose has been a major contributor to Structured 
> Streaming.
> Please join me in welcoming him!
>
> Best Regards,
>
> Shixiong Zhu
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Continuous task retry support

2018-11-04 Thread Yuanjian Li
>
> *I found that task retries are currently not supported
> 
>  in
> continuous processing mode. Is there another way to recover from continuous
> task failures currently?*

Yes, currently task-level retry is not supported in CP mode; the recovery
strategy is instead to restart the whole stage.

 *If not, are there plans to support this in a future release?*

 Actually, task-level retry in CP mode is easy to implement for map-only
operators, but it needs more discussion when we plan to support shuffled
stateful operators in CP. More discussion is in
https://github.com/apache/spark/pull/20675.
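
For context, here is a minimal sketch of a continuous-mode query (assuming a
Kafka source and sink are available; the broker address and checkpoint path
are placeholders). Under Trigger.Continuous, a failed task currently causes
the whole stage to restart from the last committed epoch rather than a
single-task retry:

import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
  .option("subscribe", "in")
  .load()
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "out")
  .option("checkpointLocation", "/tmp/cp-checkpoint") // placeholder path
  .trigger(Trigger.Continuous("1 second")) // commit an epoch every second
  .start()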

On Sat, Nov 3, 2018 at 3:09 AM, Basil Hariri wrote:

> *Hi all,*
>
>
>
> *I found that task retries are currently not supported
> 
> in continuous processing mode. Is there another way to recover from
> continuous task failures currently? If not, are there plans to support this
> in a future release?*
>
> Thanks,
>
> Basil
>


Re: [DISCUSS] SPIP: Native support of session window

2018-10-06 Thread Yuanjian Li
Cool, thanks!
Sorry for the late reply; we'll check out the UT and your design doc ASAP
once we are back from the National Day holiday.

Thanks,
Yuanjian Li

On Sat, Sep 29, 2018 at 5:21 AM, Jungtaek Lim wrote:

> Btw, just wrote up detailed design doc on existing patch:
>
> https://docs.google.com/document/d/1tUO29BDXb9127RiivUS7Hv324dC0YHuokYvyQRpurDY/edit?usp=sharing
>
> This doc is a wall of text; since I guess we can already imagine how a
> session window works (and I showed a simple example in the SPIP doc), I
> tried to avoid drawing diagrams, which would take non-trivial effort. New
> classes are linked to the actual source code so that we can read the code
> directly whenever we are curious about something.
>
> Please let me know anytime if something is unclear and need elaboration.
>
> -Jungtaek Lim (HeartSaVioR)
>
> On Fri, Sep 28, 2018 at 10:18 PM, Jungtaek Lim wrote:
>
>> Thanks for sharing your proposal as well as the implementation. It looks
>> like your proposal is more focused on design details; it may be better for
>> me to write one more doc for the design details and share it as well. Stay
>> tuned!
>>
>> Btw, I'm trying out your patch to see whether it passes the tests I've
>> added, and looks like it fails on below UT:
>>
>> https://github.com/apache/spark/blob/ad0b7466ef3f79354a99bd1b95c23e4c308502d5/sql/core/src/test/scala/org/apache/spark/sql/streaming/EventTimeWatermarkSuite.scala#L475-L573
>> Could you take a look at the UT and see whether I'm missing something here
>> or the UT is correct?
>>
>> (Actually, most of the UTs I've added fail, but some UTs are for update
>> mode, and the patch doesn't provide the same experience with select-only
>> session windows, so I'm pointing to only the one UT that tests a basic
>> session window.)
>>
>> -Jungtaek Lim (HeartSaVioR)
>>
>> On Fri, Sep 28, 2018 at 9:22 PM, Yuanjian Li wrote:
>>
>>> Hi Jungtaek:
>>>
>>> We also met this problem during the migration of streaming applications
>>> to Structured Streaming at Baidu; we solved it in our fork, and it has
>>> been running steadily in production.
>>> As per our initial plan, we are cleaning up the code and preparing to
>>> submit a SPIP in October; happy to see your proposal. Hope we can share
>>> some ideas together.
>>>Here’s the PR and doc:
>>> https://github.com/apache/spark/pull/22583
>>>
>>> https://docs.google.com/document/d/1zeAc7QKSO7J4-Yk06kc76kvldl-QHLCDJuu04d7k2bg/edit?usp=sharing
>>>
>>> Thanks,
>>> Yuanjian Li
>>>
>>>
>>> On Sep 28, 2018 at 06:22, Jungtaek Lim wrote:
>>>
>>> Hi all,
>>>
>>> I would like to initiate a discussion thread to discuss "Native support
>>> of session window".
>>> The original issue is filed as SPARK-10816 [1], but I can file another
>>> one to represent the SPIP if necessary. A WIP but working PR is available
>>> as well, so we can even test it directly, or those of us who find it more
>>> convenient can go through the source code instead of the doc.
>>>
>>> I've attached the PDF version of the SPIP to SPARK-10816, but I'm adding
>>> a Google Docs link [2] for those who find it more convenient to comment
>>> in the doc.
>>>
>>> Please let me know if we would also like to see a technical design for
>>> this. I avoided going too deep in the SPIP doc so anyone can review it and
>>> see the benefit of adopting this.
>>>
>>> Looking forward to hearing your feedback.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1. https://issues.apache.org/jira/browse/SPARK-10816
>>> 2.
>>> https://docs.google.com/document/d/1_rMLmUSyGzb62RnP2A3WX6D6uRxox8Q_7WcoI_HrTw4/edit?usp=sharing
>>> 3. https://github.com/apache/spark/pull/22482
>>>
>>>
>>>
>>>


Re: welcome a new batch of committers

2018-10-06 Thread Yuanjian Li
Congratulations to all and thanks for all your help!!

On Sat, Oct 6, 2018 at 11:38 AM, Bhupendra Mishra wrote:

> Congratulations to all of you
> Good Luck
> Regards
>
> On Wed, Oct 3, 2018 at 2:29 PM Reynold Xin  wrote:
>
>> Hi all,
>>
>> The Apache Spark PMC has recently voted to add several new committers to
>> the project, for their contributions:
>>
>> - Shane Knapp (contributor to infra)
>> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
>> - Kazuaki Ishizaki (contributor to Spark SQL)
>> - Xingbo Jiang (contributor to Spark Core and SQL)
>> - Yinan Li (contributor to Spark on Kubernetes)
>> - Takeshi Yamamuro (contributor to Spark SQL)
>>
>> Please join me in welcoming them!
>>
>>


Re: [DISCUSS] SPIP: Native support of session window

2018-09-28 Thread Yuanjian Li
Hi Jungtaek:

   We also met this problem during the migration of streaming applications
to Structured Streaming at Baidu; we solved it in our fork, and it has been
running steadily in production.
   As per our initial plan, we are cleaning up the code and preparing to
submit a SPIP in October; happy to see your proposal. Hope we can share some
ideas together.
   Here are the PR and doc:
https://github.com/apache/spark/pull/22583
https://docs.google.com/document/d/1zeAc7QKSO7J4-Yk06kc76kvldl-QHLCDJuu04d7k2bg/edit?usp=sharing
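
To make the target concrete, here is a sketch of the user-facing shape a
native session window could take (illustrative only: a built-in of this
form, session_window, did not exist at the time of this thread; a function
with this signature eventually shipped in Spark 3.2, so treat the exact name
and signature as an assumption here):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[2]").appName("session-window-sketch").getOrCreate()
import spark.implicits._

val events = Seq(
  ("user1", java.sql.Timestamp.valueOf("2018-09-28 10:00:00")),
  ("user1", java.sql.Timestamp.valueOf("2018-09-28 10:03:00")),
  ("user1", java.sql.Timestamp.valueOf("2018-09-28 10:30:00"))
).toDF("userId", "eventTime")

// Rows whose gaps are under 5 minutes collapse into one session window,
// so user1 ends up with two sessions here.
val sessions = events
  .groupBy($"userId", session_window($"eventTime", "5 minutes"))
  .count()
sessions.show(truncate = false)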

Thanks,
Yuanjian Li


> On Sep 28, 2018 at 06:22, Jungtaek Lim wrote:
> 
> Hi all,
> 
> I would like to initiate a discussion thread to discuss "Native support of
> session window".
> The original issue is filed as SPARK-10816 [1], but I can file another one
> to represent the SPIP if necessary. A WIP but working PR is available as
> well, so we can even test it directly, or those of us who find it more
> convenient can go through the source code instead of the doc.
> 
> I've attached the PDF version of the SPIP to SPARK-10816, but I'm adding a
> Google Docs link [2] for those who find it more convenient to comment in
> the doc.
>
> Please let me know if we would also like to see a technical design for
> this. I avoided going too deep in the SPIP doc so anyone can review it and
> see the benefit of adopting this.
>
> Looking forward to hearing your feedback.
> 
> Thanks,
> Jungtaek Lim (HeartSaVioR)
> 
> 1. https://issues.apache.org/jira/browse/SPARK-10816
> 2. https://docs.google.com/document/d/1_rMLmUSyGzb62RnP2A3WX6D6uRxox8Q_7WcoI_HrTw4/edit?usp=sharing
> 3. https://github.com/apache/spark/pull/22482
> 
> 



Re: Something wrong of Jenkins proxy

2018-09-23 Thread Yuanjian Li
Many thanks for your help, Shane!
https://hadrian.ist.berkeley.edu/jenkins/ works for me, and I'll share it
with the others.

On Mon, Sep 24, 2018 at 11:58 AM, shane knapp wrote:

> i don't manage the certs on the box doing the reverse proxy, so i've
> reached out to the proper party, and hopefully things will be sorted by
> early tomorrow.
>
> On Sun, Sep 23, 2018 at 8:37 PM, shane knapp  wrote:
>
>> for now, you can visit:
>>
>> https://hadrian.ist.berkeley.edu/jenkins/
>>
>> something is up w/the reverse proxy setup.
>>
>> On Sun, Sep 23, 2018 at 8:37 PM, shane knapp  wrote:
>>
>>> i just noticed this...  taking a look now.
>>>
>>> On Sun, Sep 23, 2018 at 4:38 AM, Yuanjian Li 
>>> wrote:
>>>
>>>> Hi devs,
>>>> Is there something wrong with the Jenkins proxy?
>>>> [image: image.png]
>>>> I've been getting this proxy 500 error for days.
>>>>
>>>> Thanks,
>>>> Yuanjian Li
>>>>
>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Something wrong of Jenkins proxy

2018-09-23 Thread Yuanjian Li
Hi devs,
Is there something wrong with the Jenkins proxy?
[image: image.png]
I've been getting this proxy 500 error for days.

Thanks,
Yuanjian Li


Re: [Feedback Requested] SPARK-25299: Using Distributed Storage for Persisting Shuffle Data

2018-08-31 Thread Yuanjian Li
Hi Matt,
 Thanks for the great document and proposal; I want to +1 the reliable
shuffle data idea and give some feedback.
 I think a reliable shuffle service based on a DFS is necessary for Spark,
especially when running Spark jobs in an unstable environment. For example,
when Spark is deployed alongside online services, Spark executors can be
killed at any time, and the current stage-retry strategy can make such a job
many times slower than a normal run.
 Actually, we (Baidu Inc.) solved this problem with a stable shuffle service
over Hadoop, and we are now integrating Spark with this shuffle service. The
POC work is expected to be done in October; we'll post more benchmarks and
detailed results at that time. I'm still reading your discussion document
and am happy to give more feedback in the doc.
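
To make the idea concrete, here is a minimal sketch (all names hypothetical,
not taken from the SPARK-25299 design) of the core mechanism: map-output
blocks are written to a shared filesystem keyed by (shuffleId, mapId,
reduceId), so a lost executor's output can be re-read by reducers instead of
forcing a stage retry:

import java.nio.file.{Files, Paths}

object DfsShuffleSketch {
  // Assumed mount point of the distributed filesystem.
  val root = "/mnt/dfs/shuffle"

  private def blockPath(shuffleId: Int, mapId: Long, reduceId: Int) =
    Paths.get(root, s"shuffle_${shuffleId}_${mapId}_${reduceId}.data")

  // Persist one reducer's slice of a map task's output to shared storage.
  def writeBlock(shuffleId: Int, mapId: Long, reduceId: Int, bytes: Array[Byte]): Unit = {
    val p = blockPath(shuffleId, mapId, reduceId)
    Files.createDirectories(p.getParent)
    Files.write(p, bytes)
  }

  // Reducers read blocks directly from shared storage, independent of the
  // liveness of the executor that produced them.
  def readBlock(shuffleId: Int, mapId: Long, reduceId: Int): Array[Byte] =
    Files.readAllBytes(blockPath(shuffleId, mapId, reduceId))
}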

Thanks,
Yuanjian Li

On Sat, Sep 1, 2018 at 8:42 AM, Matt Cheah wrote:

> Hi everyone,
>
>
>
> I filed SPARK-25299 <https://issues.apache.org/jira/browse/SPARK-25299>
> to promote discussion on how we can improve the shuffle operation in Spark.
> The basic premise is to discuss the ways we can leverage distributed
> storage to improve the reliability and isolation of Spark’s shuffle
> architecture.
>
>
>
> A few designs and a full problem statement are outlined in this architecture
> discussion document
> <https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40>
> .
>
>
>
> This is a complex problem and it would be great to get feedback from the
> community about the right direction to take this work in. Note that we have
> not yet committed to a specific implementation and architecture – there’s a
> lot that needs to be discussed for this improvement, so we hope to get as
> much input as possible before moving forward with a design.
>
>
>
> Please feel free to leave comments and suggestions on the JIRA ticket or
> on the discussion document.
>
>
>
> Thank you!
>
>
>
> -Matt Cheah
>


Re: [DISCUSS] Adaptive execution in Spark SQL

2018-07-31 Thread Yuanjian Li
Thanks Carson, great note!
Actually, Baidu has ported this patch into our internal fork. I collected
some user cases and the performance improvements seen during Baidu's
internal usage of this patch, summarized in the following 3 scenarios:
1. SortMergeJoin to BroadcastJoin
Transforming a SortMergeJoin into a BroadcastJoin deep in the plan tree can
bring a 50% to 200% boost in query performance, and this strategy reliably
kicks in for BI scenarios such as joining several tables with filters in
subqueries.
2. Long-running applications, or using Spark as a service
Here, a long-running application is one whose duration approaches an hour.
Using Spark as a service means keeping a spark-shell session around to
submit SQL, or using a service on top of Spark such as Zeppelin, Livy, or
our internal SQL service, Baidu BigSQL. In such scenarios, all Spark jobs
share the same partition number, so enabling AE and adding configs for the
expected task characteristics (data size, row count, min/max partition
number, etc.) brings a 50%-100% performance boost.
3. GraphFrame jobs
The last scenario is applications using GraphFrame. In this case, the user
has a two-dimensional graph with 1 billion edges and runs the connected
components algorithm in GraphFrame. With AE enabled, the duration of the app
dropped from 58 min to 32 min, almost a 2x performance improvement.

The detailed screenshots and configs are in the PDF attached to JIRA
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128>.
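
As a rough illustration, these are the kinds of settings involved (the
config names follow the Spark 2.x-era adaptive execution work; treat the
exact names as an assumption and check the patch for the authoritative set):

// Enable adaptive execution so the post-shuffle partition count is chosen
// at runtime from the actual map-output statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Target input size per post-shuffle partition (64 MB here); the reducer
// count is derived from this at runtime.
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")
// Lower bound on the number of post-shuffle partitions.
spark.conf.set("spark.sql.adaptive.minNumPostShufflePartitions", "1")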

Thanks,
Yuanjian Li

On Sat, Jul 28, 2018 at 12:49 AM, Wang, Carson wrote:

> Dear all,
>
>
>
> The initial support for adaptive execution [SPARK-9850
> <https://issues.apache.org/jira/browse/SPARK-9850>] in Spark SQL has been
> there since Spark 1.6, but there have been no further updates since then.
> One of the key features of adaptive execution is determining the number of
> reducers automatically at runtime. This is a feature required by many Spark
> users, especially infrastructure teams at many companies, as there are
> thousands of queries running on the cluster where the shuffle partition
> number may not be set properly for every query. A single shuffle partition
> number also doesn't work well for all stages in a query, because each stage
> has a different input data size. Other features of adaptive execution,
> optimizing the join strategy at runtime and handling skewed joins
> automatically, have not been implemented in Spark.
>
>
>
> In the current implementation, an exchange coordinator is used to
> determine the number of post-shuffle partitions for a stage. However, the
> exchange coordinator is added when an Exchange is added, so it actually
> lacks a global picture of all shuffle dependencies of a post-shuffle
> stage. I.e., for a 3-table join in a single stage, the same
> ExchangeCoordinator should be used across the three Exchanges, but
> currently two separate ExchangeCoordinators will be added. It also adds
> additional Exchanges in some cases. So I think it is time to rethink how to
> better support adaptive execution in Spark SQL. I have proposed a new
> approach in SPARK-23128
> <https://issues.apache.org/jira/browse/SPARK-23128>. A document about the
> idea is available here
> <https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing>.
> The idea of how to change a sort-merge join to a broadcast hash join at
> runtime is also described in a separate doc
> <https://docs.google.com/document/d/1WCJ2BmA8_dJL_jmYie_x9ZCrz7r3ZjleJSoX0dlDXaw/edit?usp=sharing>.
>
>
>
>
> The docs have been there for a while, and I also have an implementation
> based on Spark 2.3 available at
> https://github.com/Intel-bigdata/spark-adaptive. The code is split into 7
> PRs labeled AE2.3-0x if you look at the pull requests. I asked many
> partners, including Baidu, Alibaba, JD.com, etc., to evaluate the patch and
> received very good feedback. Baidu also shared their results on the Jira.
> We also finished a 100 TB TPC-DS benchmark earlier using the patch, which
> passed all queries with good performance improvements.
>
>
>
> I'd like to call for a review of the docs and even the code, and we can
> discuss further in this thread. Thanks very much!
>
>
>
> Thanks,
>
> Carson
>
>
>


Re: Design for continuous processing shuffle

2018-05-07 Thread Yuanjian Li
Hi Joseph and devs,

Happy to see the discussion of CP shuffle. As commented in
https://issues.apache.org/jira/browse/SPARK-20928?focusedCommentId=16245556&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16245556,
the team also did some design and demo work on CP shuffle; this doc shows
the detailed work:

https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI




> On May 5, 2018 at 02:27, Joseph Torres wrote:
> 
> Hi all,
> 
> A few of us have been working on a design for how to do shuffling in 
> continuous processing. Feel free to chip in if you have any comments or 
> questions.
> 
> doc:
> https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE
>
> continuous processing SPIP: https://issues.apache.org/jira/browse/SPARK-20928
>
> 
> Jose



Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Yuanjian Li
Congratulations Zhenhua!!

2018-04-02 13:28 GMT+08:00 Wenchen Fan :

> Hi all,
>
> The Spark PMC recently added Zhenhua Wang as a committer on the project.
> Zhenhua is the major contributor of the CBO project, and has been
> contributing across several areas of Spark for a while, focusing especially
> on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>
> Wenchen
>