Wait... I have some supplement:

*New API:*
SPARK-25097 Support prediction on single instance in KMeans/BiKMeans/GMM
SPARK-28045 Add missing RankingEvaluator
SPARK-29121 Support Dot Product for Vectors
*Behavior change or new API with behavior change:*
SPARK-23265 Update multi-column error handling logic in QuantileDiscretizer
SPARK-22798 Add multiple column support to PySpark StringIndexer
SPARK-11215 Add multiple columns support to StringIndexer
SPARK-24102 RegressionEvaluator should use sample weight data
SPARK-24101 MulticlassClassificationEvaluator should use sample weight data
SPARK-24103 BinaryClassificationEvaluator should use sample weight data
SPARK-23469 HashingTF should use corrected MurmurHash3 implementation

*Deprecated API removal:*
SPARK-25382 Remove ImageSchema.readImages in 3.0
SPARK-26133 Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder
SPARK-25867 Remove KMeans computeCost
SPARK-28243 Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams

Thanks!
Weichen

On Fri, Oct 11, 2019 at 6:11 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> Hi all,
>
> Here is the updated feature list:
>
> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
> SPARK-25390 <https://issues.apache.org/jira/browse/SPARK-25390> Data source V2 API refactoring
> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add Kafka delegation token support
> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> Define reserved keywords after SQL standard
> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
> SPARK-26651 <https://issues.apache.org/jira/browse/SPARK-26651> Use Proleptic Gregorian calendar
> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> Create StreamingWrite at the beginning of streaming execution
> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
> SPARK-29345 <https://issues.apache.org/jira/browse/SPARK-29345> Add an API that allows a user to define and observe arbitrary metrics on streaming queries
> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0
>
> Cheers,
>
> Xingbo
>
> Sean Owen <sro...@gmail.com> wrote on Thu, Oct 10, 2019 at 12:50 PM:
>
>> See the JIRA - this is too open-ended and not obviously just due to choices in data representation, what you're trying to do, etc. It's correctly closed IMHO.
>> However, identifying the issue more narrowly, and something that looks ripe for optimization, would be useful.
>>
>> On Thu, Oct 10, 2019 at 12:30 PM antonkulaga <antonkul...@gmail.com> wrote:
>> >
>> > I think for sure SPARK-28547 <https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28547>
>> > At the moment there are some flaws in Spark's architecture, and it performs miserably or even freezes wherever the column count exceeds 10-15K (even a simple describe call takes ages, while the same operations in pandas without Spark take seconds). In many fields (like bioinformatics), wide datasets with large numbers of both rows and columns are very common (gene expression data is a good example), and Spark is totally useless there.
>> >
>> > --
>> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org