Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Dongjoon Hyun
Thank you so much for your feedback, Koert.

Yes, SPARK-20202 was created in April 2017
and has been targeted for 3.1.0 since Nov 2019.

However, I believe Apache Spark 3.1.0 (the Hadoop 3.2/Hive 2.3 distribution)
will work with old Hadoop 2.x clusters
if you isolate the classpath via SPARK-31960.

SPARK-31960 Only populate Hadoop classpath for no-hadoop build
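
For example, a rough sketch of what I mean, in client mode on YARN (the app
name is just a placeholder, and the same flag can equally be passed with
--conf at submit time):

    // A minimal sketch, assuming a Spark 3.1 snapshot that includes SPARK-31960.
    // Setting spark.yarn.populateHadoopClasspath to false keeps the cluster's
    // Hadoop 2.x jars off the application classpath, so Spark's bundled
    // Hadoop 3.2 classes are used instead. A build with SPARK-31960 may already
    // default this way for the with-hadoop distribution.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hadoop2-cluster-smoke-test")
      .config("spark.yarn.populateHadoopClasspath", "false")
      .getOrCreate()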

Could you try with a snapshot build?

Bests,
Dongjoon.


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Koert Kuipers
it seems to me with SPARK-20202 we are no longer planning to support
hadoop2 + hive 1.2. is that correct?

so basically spark 3.1 will no longer run on say CDH 5.x or HDP2.x with
hive?

my use case is building spark 3.1 and launching on these existing clusters
that are not managed by me. e.g. i do not use the spark version provided by
cloudera.
however there are workarounds for me (using older spark version to extract
out of hive, then switch to newer spark version) so i am not too worried
about this. just making sure i understand.
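
to spell out the workaround i mean, roughly (a sketch from spark-shell,
where `spark` is the ambient session; table names and paths are made up):

    // step 1: on the cluster's existing spark (e.g. 2.4 built against hive 1.2),
    // copy the table out of the hive metastore into plain files
    spark.table("warehouse.events")
      .write.mode("overwrite")
      .parquet("hdfs:///staging/events")

    // step 2: from my spark 3.1 build, read the extracted copy directly,
    // so no hive 1.2 metastore access is needed
    val events = spark.read.parquet("hdfs:///staging/events")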

thanks


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
As pointed out by Dongjoon, the second half of December is the holiday season
in most countries. If we do the code freeze in mid-November and release the
first RC in mid-December, I am afraid the community will not be active enough
to verify the release candidates during the holiday season. Normally, the RC
stage is the most critical period for detecting defects and unexpected
behavior changes. Thus, starting the RC next January might be a good option
IMHO.

Cheers,

Xiao


Igor Dvorzhak wrote on Sun, Oct 4, 2020 at 10:35 PM:

> Why move the code freeze to early December? It seems that even according
> to the changed release cadence, the code freeze should happen in
> mid-November.

Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Xiao Li
>
> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.


I think we made a change to the release cadence starting with Spark 2.3. See the
commit:
https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa
Thus, Spark 3.1 might just follow the release cadence of Spark 2.3/2.4, if
we do not want to change the cadence again.

How about moving the code freeze of Spark 3.1 to *Early Dec 2020* and the
RC1 date to *Early Jan 2021*?

Thanks,

Xiao


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Dongjoon Hyun
Regarding Xiao's comment, I want to point out that Apache Spark 3.1.0 is
different from 2.3 or 2.4.

Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.

- Apache Spark 2.0.0 was released on July 26, 2016.
- Apache Spark 2.1.0 was released on December 28, 2016.

Bests,
Dongjoon.


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Dongjoon Hyun
Thank you all.

BTW, Xiao and Mridul, I'm wondering what date you have in mind,
specifically.

Usually, the Christmas and New Year season doesn't give us much additional
time.

If you think so, could you make a PR for the Apache Spark website according to
your expectation?

https://spark.apache.org/versioning-policy.html

Bests,
Dongjoon.


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-04 Thread Mridul Muralidharan
+1 on pushing the branch cut back to allow increased dev time, matching
previous releases.

Regards,
Mridul


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Xiao Li
Thank you for your updates.

Spark 3.0 was released on Jun 18, 2020. If Nov 1st is the target date of
the 3.1 branch cut, the feature development time window is less than 5
months. This is shorter than what we had for the Spark 2.3 and 2.4 releases.

Below are three highly desirable pieces of feature work I am watching; a small
syntax illustration for the second item follows the list. Hopefully, we can
finish them before the branch cut.

   - Support push-based shuffle to improve shuffle efficiency:
   https://issues.apache.org/jira/browse/SPARK-30602
   - Unify create table syntax:
   https://issues.apache.org/jira/browse/SPARK-31257
   - Bloom filter join: https://issues.apache.org/jira/browse/SPARK-32268
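
For the CREATE TABLE work, the gist is that the two existing forms below are
parsed and resolved through different code paths today. This is only an
illustration from spark-shell (table names are made up), not the proposed
final syntax:

    // Native/DataSource syntax:
    spark.sql("CREATE TABLE users (id BIGINT, name STRING) USING parquet")
    // Hive-compatible syntax, handled separately today
    // (needs Hive support enabled):
    spark.sql("CREATE TABLE users_hive (id BIGINT, name STRING) STORED AS parquet")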

Thanks,

Xiao


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Hyukjin Kwon
Nice summary. Thanks Dongjoon. One minor correction: I believe we dropped
support for R versions below 3.5 at branch 2.4 as well.


Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Dongjoon Hyun
Hi, All.

As of today, the master branch (Apache Spark 3.1.0) has resolved
852+ JIRA issues, and 606+ of those are 3.1.0-only patches.
According to the 3.1.0 release window, branch-3.1 will be
created on November 1st and will then enter the QA period.

Here are some notable updates I've been monitoring.

*Language*
01. SPARK-25075 Support Scala 2.13
  - Since SPARK-32926, the Scala 2.13 build test has
become a part of the GitHub Actions jobs.
  - After SPARK-33044, the Scala 2.13 tests will be
a part of the Jenkins jobs.
02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
03. SPARK-32082 Project Zen: Improving Python usability
  - 7 of 16 issues are resolved.
04. SPARK-32073 Drop R < 3.5 support
  - This is done for Spark 3.0.1 and 3.1.0.

*Dependency*
05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
  - This changes the default distribution for better cloud support.
06. SPARK-32981 Remove hive-1.2 distribution
07. SPARK-20202 Remove references to org.spark-project.hive
  - This will remove Hive 1.2.1 from the source code.
08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)

*Core*
09. SPARK-27495 Support Stage level resource conf and scheduling
  - 11 of 15 issues are resolved (see the sketch after this section)
10. SPARK-25299 Use remote storage for persisting shuffle data
  - 8 of 14 issues are resolved
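
To make (09) concrete, here is a minimal sketch of the stage-level
scheduling API; the app name and resource amounts are placeholders, and it
assumes dynamic allocation on YARN or K8s plus a cluster already configured
for GPU discovery:

    import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stage-level-sketch").getOrCreate()

    // Ask for larger executors, each with one GPU, only for this part of the job.
    val ereqs = new ExecutorResourceRequests().cores(4).memory("6g").resource("gpu", 1)
    val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1)
    val profile = new ResourceProfileBuilder().require(ereqs).require(treqs).build

    // Stages computing this RDD are scheduled with the profile above.
    val doubled = spark.sparkContext.parallelize(1 to 1000)
      .withResources(profile)
      .map(_ * 2)
      .collect()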

*Resource Manager*
11. SPARK-33005 Kubernetes GA preparation
  - It is on the way and we are waiting for more feedback.

*SQL*
12. SPARK-30648/SPARK-32346 Support filter pushdown
  to JSON/Avro (see the sketch after this section)
13. SPARK-32948/SPARK-32958 Add Json expression optimizer
14. SPARK-12312 Support JDBC Kerberos w/ keytab
  - 11 of 17 issues are resolved (see the sketch after this section)
15. SPARK-27589 DSv2 was mostly completed in 3.0
  and gained more features in 3.1, but we are still missing:
  - All built-in DataSource v2 write paths are disabled,
and the v1 write path is used instead.
  - Support for partition pruning with subqueries
  - Support for bucketing
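
To make (12) and (14) concrete, here are minimal sketches; the schema,
paths, JDBC URL, keytab, and principal below are all placeholders, and
Kerberos support depends on the connection provider for your database:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-sketches").getOrCreate()
    import spark.implicits._

    // (12) With JSON filter pushdown, this predicate can be applied while
    // parsing, so rows that fail the filter skip the rest of the work.
    val recent = spark.read
      .schema("id LONG, ts STRING, payload STRING")
      .json("hdfs:///logs/events")
      .filter($"id" > 100L)

    // (14) SPARK-12312 adds keytab/principal options so the JDBC source
    // can authenticate to the database via Kerberos.
    val users = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/appdb")
      .option("dbtable", "public.users")
      .option("keytab", "/etc/security/keytabs/app.keytab")
      .option("principal", "app@EXAMPLE.COM")
      .load()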

We still have one month before the feature freeze
and the start of QA. If you are working on something
for 3.1, please consider the timeline and share your
schedule with the Apache Spark community. Everything
else can go into the 3.2 release, scheduled for June 2021.

Last but not least, I want to emphasize (7) once again.
We need to remove the forked, unofficial Hive eventually.
Please let us know your reasons if you need to build
Apache Spark 3.1 from source with Hive 1.2.

https://github.com/apache/spark/pull/29936

As I wrote in the above PR description, for the old release
lines, Apache Spark 2.4 (LTS) and 3.0 (~2021.12) will provide
a Hive 1.2-based distribution.

Bests,
Dongjoon.