Re: Spark 3.0 preview release feature list and major changes

Xingbo Jiang Tue, 08 Oct 2019 14:32:53 -0700

>
>  What's the process to propose a feature to be included in the final Spark
> 3.0 release?
>


I don't know whether there is any specific process here. Normally you
just merge the feature into Spark master before the release code freeze, and
then the feature will probably be included in the release. The code freeze
date for Spark 3.0 has not been decided yet, though.

On Tue, Oct 8, 2019 at 2:14 PM, Li Jin <[email protected]> wrote:

> Thanks for the summary!
>
> I have a question that is semi-related - What's the process to propose a
> feature to be included in the final Spark 3.0 release?
>
> In particular, I am interested in
> https://issues.apache.org/jira/browse/SPARK-28006.  I am happy to do the
> work, so I want to make sure I don't miss the "cut" date.
>
> On Tue, Oct 8, 2019 at 4:53 PM Xingbo Jiang <[email protected]> wrote:
>
>> Hi all,
>>
>> Thanks for all the feedback. Here is the updated feature list:
>>
>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple
>> columns support added to various Transformers: StringIndexer
>>
>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>> Implement Dynamic Partition Pruning
>>
>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support
>> Tree-Based Feature Transformation
>>
>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>> MultilabelClassificationEvaluator
>>
>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>> sample weights to decision trees
>>
>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing
>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>
>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API
>> for Power Iteration Clustering
>>
>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve
>> logic for timing out executors in dynamic allocation
>>
>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>> Eliminate unnecessary shuffle with adjacent Window expressions
>>
>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire
>> new executors to avoid hang because of blacklisting
>>
>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple
>> columns support added to various Transformers: PySpark QuantileDiscretizer
>>
>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
>> approach to do adaptive execution in Spark SQL
>>
>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply
>> custom log URL pattern for executor log URLs in SHS
>>
>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>> support for Kafka headers
>>
>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>> Spark ML Listener for Tracking ML Pipeline Status
>>
>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade
>> the built-in Hive to 2.3.5 for hadoop-3.2
>>
>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit
>> with validation set to Gradient Boosted Trees: Python API
>>
>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build
>> and Run Spark on JDK11
>>
>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>> Accelerator-aware task scheduling for Spark
>>
>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
>> sharing Netty's memory pool allocators
>>
>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race
>> condition with tasks running when new attempt for same stage is created
>> leads to other task in the next attempt running on the same partition id
>> retry multiple times
>>
>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support
>> rolling back a shuffle map stage and re-generate the shuffle files
>>
>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>> source for binary files
>>
>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>> kafka delegation token support
>>
>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>> Generalize Nested Column Pruning
>>
>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove
>> support for Scala 2.11 in Spark 3.0.0
>>
>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define
>> reserved keywords after SQL standard
>>
>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow
>> Pandas UDF to take an iterator of pd.DataFrames
>>
>> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
>> optimization in SparkR's interoperability
>>
>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
>> source v2 API refactor: streaming write
>>
>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848>
>> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>>
>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove
>> streaming output mode from data source v2 APIs
>>
>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create
>> StreamingWrite at the beginning of streaming execution
>>
>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not
>> infer schema when reading Hive serde table with native data source
>>
>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225>
>> Implement join strategy hints
>>
>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
>> pandas DataFrame for struct type argument in Scalar Pandas UDF
>>
>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
>> deadlock between TaskMemoryManager and
>> UnsafeExternalSorter$SpillableIterator
>>
>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public
>> APIs for extended Columnar Processing Support
>>
>> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support
>> Dataframe Cogroup via Pandas UDFs
>>
>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
>> Re-implement file sources with data source V2 API
>>
>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
>> Disk-persisted RDD blocks served by shuffle service, and ignored for
>> Dynamic Allocation
>>
>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699>
>> Partially push down disjunctive predicates in Parquet/ORC
>>
>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port
>> test cases from PostgreSQL to Spark SQL
>>
>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884>
>> Deprecate Python 2 support
>>
>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert
>> applicable *.sql tests into UDF integrated test base
>>
>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
>> dynamic allocation without an external shuffle service
>>
>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust
>> post shuffle partition number in adaptive execution
>>
>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
>> Trigger implementations to Triggers.scala and avoid exposing these to the
>> end users
>>
>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document
>> Spark WEB UI
>>
>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
>> RobustScaler feature transformer
>>
>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata
>> Handling in Thrift Server
>>
>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a
>> SQL reference doc
>>
>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve
>> test coverage of ThriftServer
>>
>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
>> Dynamically reuse subqueries in AQE
>>
>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove
>> outdated Experimental, Evolving annotations
>>
>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove
>> deprecated items since <= 2.2.0
>>
>> Cheers,
>>
>> Xingbo
>>
>> On Mon, Oct 7, 2019 at 9:29 PM, Hyukjin Kwon <[email protected]> wrote:
>>
>>> Cogroup Pandas UDF is missing:
>>>
>>> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support
>>> Dataframe Cogroup via Pandas UDFs
>>>
>>> Vectorized R execution:
>>>
>>> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
>>> optimization in SparkR's interoperability
>>>
>>>
>>> On Tue, Oct 8, 2019 at 7:50 AM, Jungtaek Lim <[email protected]> wrote:
>>>
>>>> Thanks for putting together this nice summary of Spark 3.0 improvements!
>>>>
>>>> I'd like to add some items from the Structured Streaming side:
>>>>
>>>> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
>>>> Trigger implementations to Triggers.scala and avoid exposing these to the
>>>> end users (removal of deprecated)
>>>> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
>>>> support for Kafka headers in Structured Streaming
>>>> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add
>>>> kafka delegation token support (there were follow-up issues to add
>>>> functionality such as support for multiple clusters, etc.)
>>>> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848>
>>>> Introduce new option to Kafka source: offset by timestamp (starting/ending)
>>>> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log
>>>> warn message on possible correctness issue for multiple stateful operations
>>>> in single query
>>>>
>>>> and from the core side:
>>>>
>>>> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New
>>>> feature: apply custom log URL pattern for executor log URLs in SHS
>>>> (follow-up issue expanded the functionality to Spark UI as well)
>>>>
>>>> FYI, if we also count current work in progress, there's an ongoing
>>>> umbrella issue on rolling event logs & snapshots (SPARK-28594
>>>> <https://issues.apache.org/jira/browse/SPARK-28594>), which we are
>>>> struggling to get done in Spark 3.0.
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>>
>>>> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I went over all the finished JIRA tickets targeted to Spark 3.0.0.
>>>>> Here I'm listing all the notable features and major changes that are ready
>>>>> to test/deliver. Please don't hesitate to add more to the list:
>>>>>
>>>>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215>
>>>>> Multiple columns support added to various Transformers: StringIndexer
>>>>>
>>>>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>>>>> Implement Dynamic Partition Pruning
>>>>>
>>>>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677>
>>>>> Support Tree-Based Feature Transformation
>>>>>
>>>>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>>>>> MultilabelClassificationEvaluator
>>>>>
>>>>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>>>>> sample weights to decision trees
>>>>>
>>>>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712>
>>>>> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window,
>>>>> Union etc.
>>>>>
>>>>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API
>>>>> for Power Iteration Clustering
>>>>>
>>>>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286>
>>>>> Improve logic for timing out executors in dynamic allocation
>>>>>
>>>>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>>>>> Eliminate unnecessary shuffle with adjacent Window expressions
>>>>>
>>>>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148>
>>>>> Acquire new executors to avoid hang because of blacklisting
>>>>>
>>>>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796>
>>>>> Multiple columns support added to various Transformers: PySpark
>>>>> QuantileDiscretizer
>>>>>
>>>>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
>>>>> approach to do adaptive execution in Spark SQL
>>>>>
>>>>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>>>>> Spark ML Listener for Tracking ML Pipeline Status
>>>>>
>>>>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710>
>>>>> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>>>>
>>>>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add
>>>>> fit with validation set to Gradient Boosted Trees: Python API
>>>>>
>>>>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build
>>>>> and Run Spark on JDK11
>>>>>
>>>>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>>>>> Accelerator-aware task scheduling for Spark
>>>>>
>>>>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
>>>>> sharing Netty's memory pool allocators
>>>>>
>>>>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix
>>>>> race condition with tasks running when new attempt for same stage is
>>>>> created leads to other task in the next attempt running on the same
>>>>> partition id retry multiple times
>>>>>
>>>>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341>
>>>>> Support rolling back a shuffle map stage and re-generate the shuffle files
>>>>>
>>>>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>>>>> source for binary files
>>>>>
>>>>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>>>>> Generalize Nested Column Pruning
>>>>>
>>>>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132>
>>>>> Remove support for Scala 2.11 in Spark 3.0.0
>>>>>
>>>>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215>
>>>>> define reserved keywords after SQL standard
>>>>>
>>>>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow
>>>>> Pandas UDF to take an iterator of pd.DataFrames
>>>>>
>>>>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
>>>>> source v2 API refactor: streaming write
>>>>>
>>>>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956>
>>>>> remove streaming output mode from data source v2 APIs
>>>>>
>>>>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064>
>>>>> create StreamingWrite at the beginning of streaming execution
>>>>>
>>>>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do
>>>>> not infer schema when reading Hive serde table with native data source
>>>>>
>>>>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225>
>>>>> Implement join strategy hints
>>>>>
>>>>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
>>>>> pandas DataFrame for struct type argument in Scalar Pandas UDF
>>>>>
>>>>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
>>>>> deadlock between TaskMemoryManager and
>>>>> UnsafeExternalSorter$SpillableIterator
>>>>>
>>>>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396>
>>>>> Public APIs for extended Columnar Processing Support
>>>>>
>>>>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
>>>>> Re-implement file sources with data source V2 API
>>>>>
>>>>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
>>>>> Disk-persisted RDD blocks served by shuffle service, and ignored for
>>>>> Dynamic Allocation
>>>>>
>>>>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699>
>>>>> Partially push down disjunctive predicates in Parquet/ORC
>>>>>
>>>>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port
>>>>> test cases from PostgreSQL to Spark SQL (ongoing)
>>>>>
>>>>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884>
>>>>> Deprecate Python 2 support
>>>>>
>>>>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921>
>>>>> Convert applicable *.sql tests into UDF integrated test base
>>>>>
>>>>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
>>>>> dynamic allocation without an external shuffle service
>>>>>
>>>>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177>
>>>>> Adjust post shuffle partition number in adaptive execution
>>>>>
>>>>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372>
>>>>> Document Spark WEB UI
>>>>>
>>>>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
>>>>> RobustScaler feature transformer
>>>>>
>>>>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426>
>>>>> Metadata Handling in Thrift Server
>>>>>
>>>>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build
>>>>> a SQL reference doc (ongoing)
>>>>>
>>>>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608>
>>>>> Improve test coverage of ThriftServer
>>>>>
>>>>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
>>>>> Dynamically reuse subqueries in AQE
>>>>>
>>>>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855>
>>>>> Remove outdated Experimental, Evolving annotations
>>>>>
>>>>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>>>>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980>
>>>>> Remove deprecated items since <= 2.2.0
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Xingbo
>>>>>
>>>>
