Re: Spark 3.0 preview release feature list and major changes

Hyukjin Kwon Mon, 07 Oct 2019 21:40:28 -0700

The Cogroup Pandas UDF is missing from the list:

SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support
Dataframe Cogroup via Pandas UDFs

Vectorized R execution is missing as well:

SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow
optimization in SparkR's interoperability


On Tue, Oct 8, 2019 at 7:50 AM, Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:

> Thanks for bringing the nice summary of Spark 3.0 improvements!
>
> I'd like to add some items from the Structured Streaming side:
>
> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move
> Trigger implementations to Triggers.scala and avoid exposing these to the
> end users (removal of deprecated)
> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add
> support for Kafka headers in Structured Streaming
> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add kafka
> delegation token support (there were follow-up issues to add
> functionalities like support multi clusters, etc.)
> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce
> new option to Kafka source: offset by timestamp (starting/ending)
> SPARK-28074 <https://issues.apache.org/jira/browse/SPARK-28074> Log warn
> message on possible correctness issue for multiple stateful operations in
> single query
>
> and from the core side:
>
> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> New
> feature: apply custom log URL pattern for executor log URLs in SHS
> (follow-up issue expanded the functionality to Spark UI as well)
>
> FYI, if we also count work in progress, there's an ongoing umbrella
> issue on rolling event logs and snapshots (SPARK-28594
> <https://issues.apache.org/jira/browse/SPARK-28594>) that we are
> struggling to finish for Spark 3.0.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Tue, Oct 8, 2019 at 7:02 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:
>
>> Hi all,
>>
>> I went over all the finished JIRA tickets targeted to Spark 3.0.0. Here
>> I'm listing all the notable features and major changes that are ready to
>> test/deliver; please don't hesitate to add more to the list:
>>
>> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple
>> columns support added to various Transformers: StringIndexer
>>
>> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150>
>> Implement Dynamic Partition Pruning
>>
>> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support
>> Tree-Based Feature Transformation
>>
>> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add
>> MultilabelClassificationEvaluator
>>
>> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add
>> sample weights to decision trees
>>
>> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing
>> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>>
>> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API
>> for Power Iteration Clustering
>>
>> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve
>> logic for timing out executors in dynamic allocation
>>
>> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636>
>> Eliminate unnecessary shuffle with adjacent Window expressions
>>
>> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire
>> new executors to avoid hang because of blacklisting
>>
>> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple
>> columns support added to various Transformers: PySpark QuantileDiscretizer
>>
>> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new
>> approach to do adaptive execution in Spark SQL
>>
>> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add
>> Spark ML Listener for Tracking ML Pipeline Status
>>
>> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade
>> the built-in Hive to 2.3.5 for hadoop-3.2
>>
>> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit
>> with validation set to Gradient Boosted Trees: Python API
>>
>> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build
>> and Run Spark on JDK11
>>
>> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615>
>> Accelerator-aware task scheduling for Spark
>>
>> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow
>> sharing Netty's memory pool allocators
>>
>> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race
>> condition with tasks running when new attempt for same stage is created
>> leads to other task in the next attempt running on the same partition id
>> retry multiple times
>>
>> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support
>> rolling back a shuffle map stage and re-generate the shuffle files
>>
>> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data
>> source for binary files
>>
>> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603>
>> Generalize Nested Column Pruning
>>
>> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove
>> support for Scala 2.11 in Spark 3.0.0
>>
>> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> define
>> reserved keywords after SQL standard
>>
>> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow
>> Pandas UDF to take an iterator of pd.DataFrames
>>
>> SPARK-26785 <https://issues.apache.org/jira/browse/SPARK-26785> data
>> source v2 API refactor: streaming write
>>
>> SPARK-26956 <https://issues.apache.org/jira/browse/SPARK-26956> remove
>> streaming output mode from data source v2 APIs
>>
>> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> create
>> StreamingWrite at the beginning of streaming execution
>>
>> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not
>> infer schema when reading Hive serde table with native data source
>>
>> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225>
>> Implement join strategy hints
>>
>> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use
>> pandas DataFrame for struct type argument in Scalar Pandas UDF
>>
>> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix
>> deadlock between TaskMemoryManager and
>> UnsafeExternalSorter$SpillableIterator
>>
>> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public
>> APIs for extended Columnar Processing Support
>>
>> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589>
>> Re-implement file sources with data source V2 API
>>
>> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677>
>> Disk-persisted RDD blocks served by shuffle service, and ignored for
>> Dynamic Allocation
>>
>> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699>
>> Partially push down disjunctive predicates in Parquet/ORC
>>
>> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port
>> test cases from PostgreSQL to Spark SQL (ongoing)
>>
>> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884>
>> Deprecate Python 2 support
>>
>> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert
>> applicable *.sql tests into UDF integrated test base
>>
>> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow
>> dynamic allocation without an external shuffle service
>>
>> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust
>> post shuffle partition number in adaptive execution
>>
>> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document
>> Spark WEB UI
>>
>> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399>
>> RobustScaler feature transformer
>>
>> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata
>> Handling in Thrift Server
>>
>> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a
>> SQL reference doc (ongoing)
>>
>> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve
>> test coverage of ThriftServer
>>
>> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753>
>> Dynamically reuse subqueries in AQE
>>
>> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove
>> outdated Experimental, Evolving annotations
>>
>> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
>> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove
>> deprecated items since <= 2.2.0
>>
>> Cheers,
>>
>> Xingbo
>>
>
