Hi all,

Here is the updated feature list:
SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition where tasks from an earlier attempt of a stage cause tasks for the same partition id in the next attempt to be retried multiple times
SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
SPARK-25390 <https://issues.apache.org/jira/browse/SPARK-25390> Data source V2 API refactoring
SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add Kafka delegation token support
SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> Define reserved keywords after SQL standard
SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26651 <https://issues.apache.org/jira/browse/SPARK-26651> Use Proleptic Gregorian calendar
SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> Create StreamingWrite at the beginning of streaming execution
SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
SPARK-29345 <https://issues.apache.org/jira/browse/SPARK-29345> Add an API that allows a user to define and observe arbitrary metrics on streaming queries
SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0

Cheers,

Xingbo

Sean Owen <sro...@gmail.com> wrote on Thu, Oct 10, 2019 at 12:50 PM:

> See the JIRA - this is too open-ended and not obviously just due to
> choices in data representation, what you're trying to do, etc. It's
> correctly closed IMHO.
> However, identifying the issue more narrowly, and something that looks
> ripe for optimization, would be useful.
>
> On Thu, Oct 10, 2019 at 12:30 PM antonkulaga <antonkul...@gmail.com> wrote:
> >
> > I think for sure SPARK-28547
> > <https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28547>
> > At the moment there are some flaws in Spark's architecture, and it
> > performs miserably or even freezes wherever the column count exceeds
> > 10-15K (even a simple describe call takes ages, while the same operation
> > with pandas and no Spark takes seconds). In many fields (like
> > bioinformatics), wide datasets with large numbers of both rows and
> > columns are very common (gene expression data is a good example here),
> > and Spark is totally useless there.
> >
> > --
> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org