Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

Xiao Li Sun, 04 Oct 2020 18:27:14 -0700

>
> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.


I think we made a change in release cadence since Spark 2.3. See the
commit:
https://github.com/apache/spark-website/commit/88990968962e5cc47db8bc2c11a50742d2438daa
Thus, Spark 3.1 might just follow the release cadence of Spark 2.3/2.4, if
we do not want to change the release cadence?

How about moving the code freeze of Spark 3.1 to *Early Dec 2020* and the
RC1 date to *Early Jan 2021*?

Thanks,

Xiao


On Sun, Oct 4, 2020 at 12:44 PM Dongjoon Hyun <[email protected]> wrote:

> For Xiao's comment, I want to point out that Apache Spark 3.1.0 is
> different from 2.3 or 2.4.
>
> Apache Spark 3.1.0 should be compared with Apache Spark 2.1.0.
>
> - Apache Spark 2.0.0 was released on July 26, 2016.
> - Apache Spark 2.1.0 was released on December 28, 2016.
>
> Bests,
> Dongjoon.
>
>
> On Sun, Oct 4, 2020 at 10:53 AM Dongjoon Hyun <[email protected]>
> wrote:
>
>> Thank you all.
>>
>> BTW, Xiao and Mridul, I'm wondering what date you have in your mind
>> specifically.
>>
>> Usually, `Christmas and New Year season` doesn't give us much additional
>> time.
>>
>> If you think so, could you make a PR for Apache Spark website according
>> to your expectation?
>>
>> https://spark.apache.org/versioning-policy.html
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Sun, Oct 4, 2020 at 7:18 AM Mridul Muralidharan <[email protected]>
>> wrote:
>>
>>>
>>> +1 on pushing the branch cut for increased dev time to match previous
>>> releases.
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Sat, Oct 3, 2020 at 10:22 PM Xiao Li <[email protected]> wrote:
>>>
>>>> Thank you for your updates.
>>>>
>>>> Spark 3.0 got released on Jun 18, 2020. If Nov 1st is the target date
>>>> of the 3.1 branch cut, the feature development time window is less than 5
>>>> months. This is shorter than what we had for the Spark 2.3 and 2.4 releases.
>>>>
>>>> Below are three highly desirable features I am watching. Hopefully,
>>>> we can finish them before the branch cut.
>>>>
>>>>    - Support push-based shuffle to improve shuffle efficiency:
>>>>    https://issues.apache.org/jira/browse/SPARK-30602
>>>>    - Unify create table syntax:
>>>>    https://issues.apache.org/jira/browse/SPARK-31257
>>>>    - Bloom filter join:
>>>>    https://issues.apache.org/jira/browse/SPARK-32268
>>>>
>>>> Thanks,
>>>>
>>>> Xiao
>>>>
>>>>
>>>> On Sat, Oct 3, 2020 at 5:41 PM Hyukjin Kwon <[email protected]> wrote:
>>>>
>>>>> Nice summary. Thanks Dongjoon. One minor correction -> I believe we
>>>>> dropped R 3.5 and below at branch 2.4 as well.
>>>>>
>>>>> On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun, <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> As of today, master branch (Apache Spark 3.1.0) resolved
>>>>>> 852+ JIRA issues and 606+ issues are 3.1.0-only patches.
>>>>>> According to the 3.1.0 release window, branch-3.1 will be
>>>>>> created on November 1st and enter the QA period.
>>>>>>
>>>>>> Here are some notable updates I've been monitoring.
>>>>>>
>>>>>> *Language*
>>>>>> 01. SPARK-25075 Support Scala 2.13
>>>>>>       - Since SPARK-32926, Scala 2.13 build test has
>>>>>>         become a part of GitHub Action jobs.
>>>>>>       - After SPARK-33044, Scala 2.13 test will be
>>>>>>         a part of Jenkins jobs.
>>>>>> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
>>>>>> 03. SPARK-32082 Project Zen: Improving Python usability
>>>>>>       - 7 of 16 issues are resolved.
>>>>>> 04. SPARK-32073 Drop R < 3.5 support
>>>>>>       - This is done for Spark 3.0.1 and 3.1.0.
>>>>>>
>>>>>> *Dependency*
>>>>>> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>>>>>>       - This changes the default dist. for better cloud support
>>>>>> 06. SPARK-32981 Remove hive-1.2 distribution
>>>>>> 07. SPARK-20202 Remove references to org.spark-project.hive
>>>>>>       - This will remove Hive 1.2.1 from source code
>>>>>> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>>>>>>
>>>>>> *Core*
>>>>>> 09. SPARK-27495 Support Stage level resource conf and scheduling
>>>>>>       - 11 of 15 issues are resolved
>>>>>> 10. SPARK-25299 Use remote storage for persisting shuffle data
>>>>>>       - 8 of 14 issues are resolved
>>>>>>
>>>>>> *Resource Manager*
>>>>>> 11. SPARK-33005 Kubernetes GA preparation
>>>>>>       - It is on the way and we are waiting for more feedback.
>>>>>>
>>>>>> *SQL*
>>>>>> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>>>>>>       to JSON/Avro
>>>>>> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
>>>>>> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>>>>>>       - 11 of 17 issues are resolved
>>>>>> 15. SPARK-27589 DSv2 was mostly completed in 3.0, and more
>>>>>>       features were added in 3.1, but some gaps remain:
>>>>>>       - All built-in DataSource v2 write paths are disabled
>>>>>>         and v1 write is used instead.
>>>>>>       - Support partition pruning with subqueries
>>>>>>       - Support bucketing
>>>>>>
>>>>>> We still have one month before the feature freeze
>>>>>> and the start of QA. If you are working on 3.1,
>>>>>> please consider the timeline and share your schedule
>>>>>> with the Apache Spark community. For the other stuff,
>>>>>> we can put it into 3.2 release scheduled in June 2021.
>>>>>>
>>>>>> Last but not least, I want to emphasize (7) once again.
>>>>>> We need to remove the forked unofficial Hive eventually.
>>>>>> Please let us know your reasons if you need to build
>>>>>> from Apache Spark 3.1 source code for Hive 1.2.
>>>>>>
>>>>>> https://github.com/apache/spark/pull/29936
>>>>>>
>>>>>> As I wrote in the above PR description, for old releases,
>>>>>> Apache Spark 2.4 (LTS) and 3.0 (~2021.12) will provide
>>>>>> Hive 1.2-based distribution.
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>
