Thanks for the update, Max! Just a small clarification: the following should be moved to RESOLVED:
1. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
3. SPARK-37093: Inline type hints python/pyspark/streaming

On 4/28/22 14:42, Maxim Gekk wrote:
> Hello All,
>
> I am going to create the first release candidate of Spark 3.3 at the
> beginning of next week if there are no objections. Below is the list of
> allowed features and their current status. At the moment, only one
> feature is still in progress, but I guess it can be postponed to the
> next release:
>
> IN PROGRESS:
>
> 1. SPARK-28516: Data Type Formatting Functions: `to_char`
>
> IN PROGRESS but won't/couldn't be merged to branch-3.3:
>
> 1. SPARK-37650: Tell spark-env.sh the python interpreter
> 2. SPARK-36664: Log time spent waiting for cluster resources
> 3. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
> 4. SPARK-37395: Inline type hint files for files in python/pyspark/ml
> 5. SPARK-37093: Inline type hints python/pyspark/streaming
>
> RESOLVED:
>
> 1. SPARK-32268: Bloom Filter Join
> 2. SPARK-38548: New SQL function: try_sum
> 3. SPARK-38063: Support SQL split_part function
> 4. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
> 5. SPARK-34863: Support nested column in Spark Parquet vectorized readers
> 6. SPARK-38194: Make Yarn memory overhead factor configurable
> 7. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
> 8. SPARK-37831: Add task partition id in metrics
> 9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> 10. SPARK-38590: New SQL function: try_to_binary
> 11. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
> 12. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
> 13. SPARK-34659: Web UI does not correctly get appId
> 14. SPARK-38589: New SQL function: try_avg
> 15. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
> 16. SPARK-34079: Improvement CTE table scan
>
> Max Gekk
> Software Engineer
> Databricks, Inc.
>
> On Fri, Apr 15, 2022 at 4:28 PM Maxim Gekk <maxim.g...@databricks.com> wrote:
>
> Hello All,
>
> The current status of features from the allow list for branch-3.3 is:
>
> IN PROGRESS:
>
> 1. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
> 2. SPARK-28516: Data Type Formatting Functions: `to_char`
> 3. SPARK-34079: Improvement CTE table scan
>
> IN PROGRESS but won't/couldn't be merged to branch-3.3:
>
> 1. SPARK-37650: Tell spark-env.sh the python interpreter
> 2. SPARK-36664: Log time spent waiting for cluster resources
> 3. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
> 4. SPARK-37395: Inline type hint files for files in python/pyspark/ml
> 5. SPARK-37093: Inline type hints python/pyspark/streaming
>
> RESOLVED:
>
> 1. SPARK-32268: Bloom Filter Join
> 2. SPARK-38548: New SQL function: try_sum
> 3. SPARK-38063: Support SQL split_part function
> 4. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
> 5. SPARK-34863: Support nested column in Spark Parquet vectorized readers
> 6. SPARK-38194: Make Yarn memory overhead factor configurable
> 7. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
> 8. SPARK-37831: Add task partition id in metrics
> 9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> 10. SPARK-38590: New SQL function: try_to_binary
> 11. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
> 12. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
> 13. SPARK-34659: Web UI does not correctly get appId
> 14. SPARK-38589: New SQL function: try_avg
>
> Max Gekk
> Software Engineer
> Databricks, Inc.
>
> On Mon, Apr 4, 2022 at 9:27 PM Maxim Gekk <maxim.g...@databricks.com> wrote:
>
> Hello All,
>
> Below is the current status of features from the allow list:
>
> IN PROGRESS:
>
> 1. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
> 2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
> 3. SPARK-37093: Inline type hints python/pyspark/streaming
> 4. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
> 5. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
> 6. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
> 7. SPARK-28516: Data Type Formatting Functions: `to_char`
> 8. SPARK-36664: Log time spent waiting for cluster resources
> 9. SPARK-34659: Web UI does not correctly get appId
> 10. SPARK-37650: Tell spark-env.sh the python interpreter
> 11. SPARK-38589: New SQL function: try_avg
> 12. SPARK-38590: New SQL function: try_to_binary
> 13. SPARK-34079: Improvement CTE table scan
>
> RESOLVED:
>
> 1. SPARK-32268: Bloom Filter Join
> 2. SPARK-38548: New SQL function: try_sum
> 3. SPARK-38063: Support SQL split_part function
> 4. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
> 5. SPARK-34863: Support nested column in Spark Parquet vectorized readers
> 6. SPARK-38194: Make Yarn memory overhead factor configurable
> 7. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
> 8. SPARK-37831: Add task partition id in metrics
> 9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
>
> We need to decide whether we are going to wait a little bit more or close the doors.
>
> Maxim Gekk
> Software Engineer
> Databricks, Inc.
>
> On Fri, Mar 18, 2022 at 9:22 AM Maxim Gekk <maxim.g...@databricks.com> wrote:
>
> Hi All,
>
> Here is the allow list which I built based on your requests in this thread:
>
> 1. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
> 2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
> 3. SPARK-37093: Inline type hints python/pyspark/streaming
> 4. SPARK-37377: Refactor V2 Partitioning interface and remove deprecated usage of Distribution
> 5. SPARK-38085: DataSource V2: Handle DELETE commands for group-based sources
> 6. SPARK-32268: Bloom Filter Join
> 7. SPARK-38548: New SQL function: try_sum
> 8. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
> 9. SPARK-38063: Support SQL split_part function
> 10. SPARK-28516: Data Type Formatting Functions: `to_char`
> 11. SPARK-38432: Refactor framework so as JDBC dialect could compile filter by self way
> 12. SPARK-34863: Support nested column in Spark Parquet vectorized readers
> 13. SPARK-38194: Make Yarn memory overhead factor configurable
> 14. SPARK-37618: Support cleaning up shuffle blocks from external shuffle service
> 15. SPARK-37831: Add task partition id in metrics
> 16. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> 17. SPARK-36664: Log time spent waiting for cluster resources
> 18. SPARK-34659: Web UI does not correctly get appId
> 19. SPARK-37650: Tell spark-env.sh the python interpreter
> 20. SPARK-38589: New SQL function: try_avg
> 21. SPARK-38590: New SQL function: try_to_binary
> 22. SPARK-34079: Improvement CTE table scan
>
> Best regards,
> Max Gekk
>
> On Thu, Mar 17, 2022 at 4:59 PM Tom Graves <tgraves...@yahoo.com> wrote:
>
> Is the feature freeze target date March 22nd then? I saw a few dates thrown around and want to confirm what we landed on.
>
> I am trying to get the following improvements finished, reviewed, and in; if there are concerns with either, let me know:
> - [SPARK-34079][SQL] Merge non-correlated scalar subqueries <https://github.com/apache/spark/pull/32298>
> - [SPARK-37618][CORE] Remove shuffle blocks using the shuffle service for released executors <https://github.com/apache/spark/pull/35085>
>
> Tom
>
> On Thursday, March 17, 2022, 07:24:41 AM CDT, Gengliang Wang <ltn...@gmail.com> wrote:
>
> I'd like to add the following new SQL functions in the 3.3 release. These functions are useful when overflow or encoding errors occur:
>
> * [SPARK-38548][SQL] New SQL function: try_sum <https://github.com/apache/spark/pull/35848>
> * [SPARK-38589][SQL] New SQL function: try_avg <https://github.com/apache/spark/pull/35896>
> * [SPARK-38590][SQL] New SQL function: try_to_binary <https://github.com/apache/spark/pull/35897>
>
> Gengliang
>
> On Thu, Mar 17, 2022 at 7:59 AM Andrew Melo <andrew.m...@gmail.com> wrote:
>
> Hello,
>
> I've been trying for a while to get the following two PRs merged and into a release, and I'm having some difficulty moving them forward:
>
> https://github.com/apache/spark/pull/34903 - This passes the current python interpreter to spark-env.sh to allow some currently-unavailable customization to happen.
> https://github.com/apache/spark/pull/31774 - This fixes a bug in the SparkUI reverse proxy-handling code where it does a greedy match for "proxy" in the URL and will mistakenly replace the App-ID in the wrong place.
>
> I'm not exactly sure how to get attention on PRs that have been sitting around for a while, but these are really important to our use cases, and it would be nice to have them merged in.
>
> Cheers,
> Andrew
>
> On Wed, Mar 16, 2022 at 6:21 PM Holden Karau <hol...@pigscanfly.ca> wrote:
> >
> > I'd like to add/backport the logging in https://github.com/apache/spark/pull/35881 so that when users submit issues with dynamic allocation we can better debug what's going on.
> >
> > On Wed, Mar 16, 2022 at 3:45 PM Chao Sun <sunc...@apache.org> wrote:
> >>
> >> There is one item on our side that we want to backport to 3.3:
> >> - vectorized DELTA_BYTE_ARRAY/DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support (https://github.com/apache/spark/pull/35262)
> >>
> >> It's already reviewed and approved.
> >>
> >> On Wed, Mar 16, 2022 at 9:13 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:
> >> >
> >> > It looks like the version hasn't been updated on master and still shows 3.3.0-SNAPSHOT; can you please update that?
> >> >
> >> > Tom
> >> >
> >> > On Wednesday, March 16, 2022, 01:41:00 AM CDT, Maxim Gekk <maxim.g...@databricks.com.invalid> wrote:
> >> >
> >> > Hi All,
> >> >
> >> > I have created the branch for Spark 3.3:
> >> > https://github.com/apache/spark/commits/branch-3.3
> >> >
> >> > Please backport important fixes to it, and if you have some doubts, ping me in the PR. Regarding new features, we are still building the allow list for branch-3.3.
> >> >
> >> > Best regards,
> >> > Max Gekk
> >> >
> >> > On Wed, Mar 16, 2022 at 5:51 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> >> >
> >> > Yes, I agree with you on your whitelist approach for backporting. :)
> >> > Thank you for summarizing.
> >> >
> >> > Thanks,
> >> > Dongjoon.
> >> >
> >> > On Tue, Mar 15, 2022 at 4:20 PM Xiao Li <gatorsm...@gmail.com> wrote:
> >> >
> >> > I think I finally got your point. What you want to keep unchanged is the branch cut date of Spark 3.3. Today? Or this Friday? This is not a big deal.
> >> >
> >> > My major concern is whether we should keep merging feature work or dependency upgrades after the branch cut. To make our release time more predictable, I am suggesting we finalize the exception PR list first, instead of merging them in an ad hoc way. In the past, we spent a lot of time on reverts of PRs that were merged after the branch cut. I hope we can minimize unnecessary arguments in this release. Do you agree, Dongjoon?
> >> >
> >> > On Tue, Mar 15, 2022 at 15:55, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> >> >
> >> > That is not totally fine, Xiao. It sounds like you are asking for a change of plan without a proper reason.
> >> >
> >> > Although we cut the branch today according to our plan, you still can collect the list and make a list of exceptions. I'm not blocking what you want to do.
> >> >
> >> > Please let the community start to ramp down as we agreed before.
> >> >
> >> > Dongjoon
> >> >
> >> > On Tue, Mar 15, 2022 at 3:07 PM Xiao Li <gatorsm...@gmail.com> wrote:
> >> >
> >> > Please do not get me wrong. If we don't cut a branch, we are allowing all patches to land in Apache Spark 3.3. That is totally fine. After we cut the branch, we should avoid merging feature work. In the next three days, let us collect the actively developed PRs that we want to make exceptions for (i.e., merged to 3.3 after the upcoming branch cut). Does that make sense?
> >> >
> >> > On Tue, Mar 15, 2022 at 14:54, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> >> >
> >> > Xiao. You are working against what you are saying.
> >> > If you don't cut a branch, it means you are allowing all patches to land in Apache Spark 3.3. No?
> >> >
> >> > > we need to avoid backporting the feature work that are not being well discussed.
> >> >
> >> > On Tue, Mar 15, 2022 at 12:12 PM Xiao Li <gatorsm...@gmail.com> wrote:
> >> >
> >> > Cutting the branch is simple, but we need to avoid backporting feature work that has not been well discussed. Not all members are actively following the dev list. I think we should wait 3 more days to collect the PR list before cutting the branch.
> >> >
> >> > BTW, there is very little 3.4-only feature work that will be affected.
> >> >
> >> > Xiao
> >> >
> >> > On Tue, Mar 15, 2022 at 11:49, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> >> >
> >> > Hi, Max, Chao, Xiao, Holden, and all.
> >> >
> >> > I have a different idea.
> >> >
> >> > Given the situation and the small patch list, I don't think we need to postpone the branch cut for those patches. It's easier to cut branch-3.3 and allow backporting.
> >> >
> >> > As of today, we already have an obvious Apache Spark 3.4 patch in the branch. This situation only becomes worse and worse because there is no way to block other patches from landing unintentionally if we don't cut a branch.
> >> >
> >> > [SPARK-38335][SQL] Implement parser support for DEFAULT column values
> >> >
> >> > Let's cut `branch-3.3` today for Apache Spark 3.3.0 preparation.
> >> >
> >> > Best,
> >> > Dongjoon.
> >> >
> >> > On Tue, Mar 15, 2022 at 10:17 AM Chao Sun <sunc...@apache.org> wrote:
> >> >
> >> > Cool, thanks for clarifying!
> >> >
> >> > On Tue, Mar 15, 2022 at 10:11 AM Xiao Li <gatorsm...@gmail.com> wrote:
> >> >>
> >> >> For the following list:
> >> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> >> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
> >> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> >> >> Do you mean we should include them, or exclude them from 3.3?
> >> >
> >> > If possible, I hope these features can be shipped with Spark 3.3.
> >> >
> >> > On Tue, Mar 15, 2022 at 10:06, Chao Sun <sunc...@apache.org> wrote:
> >> >>
> >> >> Hi Xiao,
> >> >>
> >> >> For the following list:
> >> >>
> >> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> >> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
> >> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
> >> >>
> >> >> Do you mean we should include them, or exclude them from 3.3?
> >> >>
> >> >> Thanks,
> >> >> Chao
> >> >>
> >> >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
> >> >> >
> >> >> > The following was tested and merged a few minutes ago, so we can remove it from the list.
> >> >> >
> >> >> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
> >> >> >
> >> >> > Thanks,
> >> >> > Dongjoon.
> >> >> >
> >> >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li <gatorsm...@gmail.com> wrote:
> >> >> >>
> >> >> >> Let me clarify my above suggestion. Maybe we can wait 3 more days to collect the list of actively developed PRs that we want to merge to 3.3 after the branch cut?
> >> >> >>
> >> >> >> Please do not rush to merge the PRs that are not fully reviewed. We can cut the branch this Friday and continue merging the PRs that have been discussed in this thread. Does that make sense?
> >> >> >>
> >> >> >> Xiao
> >> >> >>
> >> >> >> On Tue, Mar 15, 2022 at 09:10, Holden Karau <hol...@pigscanfly.ca> wrote:
> >> >> >>>
> >> >> >>> May I suggest we push out one week (to the 22nd) just to give everyone a bit of breathing space? Rushed software development more often results in bugs.
> >> >> >>>
> >> >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang <yikunk...@gmail.com> wrote:
> >> >> >>>>
> >> >> >>>> > To make our release time more predictable, let us collect the PRs and wait three more days before the branch cut?
> >> >> >>>>
> >> >> >>>> For SPIP: Support Customized Kubernetes Schedulers:
> >> >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
> >> >> >>>>
> >> >> >>>> Three more days are OK for this from my view.
> >> > >> >>>> > >> > >> >>>> Regards, > >> > >> >>>> Yikun > >> > >> >>> > >> > >> >>> -- > >> > >> >>> Twitter: https://twitter.com/holdenkarau > <https://twitter.com/holdenkarau> > >> > >> >>> Books (Learning Spark, High Performance > Spark, etc.): https://amzn.to/2MaRAG9 > <https://amzn.to/2MaRAG9> > >> > >> >>> YouTube Live Streams: > https://www.youtube.com/user/holdenkarau > <https://www.youtube.com/user/holdenkarau> > > > > > > > > -- > > Twitter: https://twitter.com/holdenkarau > <https://twitter.com/holdenkarau> > > Books (Learning Spark, High Performance Spark, > etc.): https://amzn.to/2MaRAG9 <https://amzn.to/2MaRAG9> > > YouTube Live Streams: > https://www.youtube.com/user/holdenkarau > <https://www.youtube.com/user/holdenkarau> > > > --------------------------------------------------------------------- > To unsubscribe e-mail: > dev-unsubscr...@spark.apache.org > <mailto:dev-unsubscr...@spark.apache.org> > -- Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC