Wait... I have some supplement:

*New API:*
SPARK-25097 Support prediction on single instance in KMeans/BiKMeans/GMM
SPARK-28045 Add missing RankingEvaluator
SPARK-29121 Support Dot Product for Vectors
*Behavior change or new API with behavior change:*
SPARK-23265 Update multi-column error handling logic in QuantileDiscretizer
SPARK-22798 Add multiple column support to PySpark StringIndexer
SPARK-11215 Add multiple columns support to StringIndexer
SPARK-24102 RegressionEvaluator should use sample weight data
SPARK-24101 MulticlassClassificationEvaluator should use sample weight data
SPARK-24103 BinaryClassificationEvaluator should use sample weight data
SPARK-23469 HashingTF should use corrected MurmurHash3 implementation

*Deprecated API removal:*
SPARK-25382 Remove ImageSchema.readImages in 3.0
SPARK-26133 Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder
SPARK-25867 Remove KMeans computeCost
SPARK-28243 Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams

Thanks!
Weichen

On Fri, Oct 11, 2019 at 6:11 AM Xingbo Jiang <jiangxb1...@gmail.com> wrote:

> Hi all,
>
> Here is the updated feature list:
>
> SPARK-11215 <https://issues.apache.org/jira/browse/SPARK-11215> Multiple columns support added to various Transformers: StringIndexer
> SPARK-11150 <https://issues.apache.org/jira/browse/SPARK-11150> Implement Dynamic Partition Pruning
> SPARK-13677 <https://issues.apache.org/jira/browse/SPARK-13677> Support Tree-Based Feature Transformation
> SPARK-16692 <https://issues.apache.org/jira/browse/SPARK-16692> Add MultilabelClassificationEvaluator
> SPARK-19591 <https://issues.apache.org/jira/browse/SPARK-19591> Add sample weights to decision trees
> SPARK-19712 <https://issues.apache.org/jira/browse/SPARK-19712> Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
> SPARK-19827 <https://issues.apache.org/jira/browse/SPARK-19827> R API for Power Iteration Clustering
> SPARK-20286 <https://issues.apache.org/jira/browse/SPARK-20286> Improve logic for timing out executors in dynamic allocation
> SPARK-20636 <https://issues.apache.org/jira/browse/SPARK-20636> Eliminate unnecessary shuffle with adjacent Window expressions
> SPARK-22148 <https://issues.apache.org/jira/browse/SPARK-22148> Acquire new executors to avoid hang because of blacklisting
> SPARK-22796 <https://issues.apache.org/jira/browse/SPARK-22796> Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
> SPARK-23128 <https://issues.apache.org/jira/browse/SPARK-23128> A new approach to do adaptive execution in Spark SQL
> SPARK-23155 <https://issues.apache.org/jira/browse/SPARK-23155> Apply custom log URL pattern for executor log URLs in SHS
> SPARK-23539 <https://issues.apache.org/jira/browse/SPARK-23539> Add support for Kafka headers
> SPARK-23674 <https://issues.apache.org/jira/browse/SPARK-23674> Add Spark ML Listener for Tracking ML Pipeline Status
> SPARK-23710 <https://issues.apache.org/jira/browse/SPARK-23710> Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
> SPARK-24333 <https://issues.apache.org/jira/browse/SPARK-24333> Add fit with validation set to Gradient Boosted Trees: Python API
> SPARK-24417 <https://issues.apache.org/jira/browse/SPARK-24417> Build and Run Spark on JDK11
> SPARK-24615 <https://issues.apache.org/jira/browse/SPARK-24615> Accelerator-aware task scheduling for Spark
> SPARK-24920 <https://issues.apache.org/jira/browse/SPARK-24920> Allow sharing Netty's memory pool allocators
> SPARK-25250 <https://issues.apache.org/jira/browse/SPARK-25250> Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
> SPARK-25341 <https://issues.apache.org/jira/browse/SPARK-25341> Support rolling back a shuffle map stage and re-generate the shuffle files
> SPARK-25348 <https://issues.apache.org/jira/browse/SPARK-25348> Data source for binary files
> SPARK-25390 <https://issues.apache.org/jira/browse/SPARK-25390> Data source V2 API refactoring
> SPARK-25501 <https://issues.apache.org/jira/browse/SPARK-25501> Add Kafka delegation token support
> SPARK-25603 <https://issues.apache.org/jira/browse/SPARK-25603> Generalize Nested Column Pruning
> SPARK-26132 <https://issues.apache.org/jira/browse/SPARK-26132> Remove support for Scala 2.11 in Spark 3.0.0
> SPARK-26215 <https://issues.apache.org/jira/browse/SPARK-26215> Define reserved keywords after SQL standard
> SPARK-26412 <https://issues.apache.org/jira/browse/SPARK-26412> Allow Pandas UDF to take an iterator of pd.DataFrames
> SPARK-26651 <https://issues.apache.org/jira/browse/SPARK-26651> Use Proleptic Gregorian calendar
> SPARK-26759 <https://issues.apache.org/jira/browse/SPARK-26759> Arrow optimization in SparkR's interoperability
> SPARK-26848 <https://issues.apache.org/jira/browse/SPARK-26848> Introduce new option to Kafka source: offset by timestamp (starting/ending)
> SPARK-27064 <https://issues.apache.org/jira/browse/SPARK-27064> Create StreamingWrite at the beginning of streaming execution
> SPARK-27119 <https://issues.apache.org/jira/browse/SPARK-27119> Do not infer schema when reading Hive serde table with native data source
> SPARK-27225 <https://issues.apache.org/jira/browse/SPARK-27225> Implement join strategy hints
> SPARK-27240 <https://issues.apache.org/jira/browse/SPARK-27240> Use pandas DataFrame for struct type argument in Scalar Pandas UDF
> SPARK-27338 <https://issues.apache.org/jira/browse/SPARK-27338> Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
> SPARK-27396 <https://issues.apache.org/jira/browse/SPARK-27396> Public APIs for extended Columnar Processing Support
> SPARK-27463 <https://issues.apache.org/jira/browse/SPARK-27463> Support Dataframe Cogroup via Pandas UDFs
> SPARK-27589 <https://issues.apache.org/jira/browse/SPARK-27589> Re-implement file sources with data source V2 API
> SPARK-27677 <https://issues.apache.org/jira/browse/SPARK-27677> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
> SPARK-27699 <https://issues.apache.org/jira/browse/SPARK-27699> Partially push down disjunctive predicates in Parquet/ORC
> SPARK-27763 <https://issues.apache.org/jira/browse/SPARK-27763> Port test cases from PostgreSQL to Spark SQL
> SPARK-27884 <https://issues.apache.org/jira/browse/SPARK-27884> Deprecate Python 2 support
> SPARK-27921 <https://issues.apache.org/jira/browse/SPARK-27921> Convert applicable *.sql tests into UDF integrated test base
> SPARK-27963 <https://issues.apache.org/jira/browse/SPARK-27963> Allow dynamic allocation without an external shuffle service
> SPARK-28177 <https://issues.apache.org/jira/browse/SPARK-28177> Adjust post shuffle partition number in adaptive execution
> SPARK-28199 <https://issues.apache.org/jira/browse/SPARK-28199> Move Trigger implementations to Triggers.scala and avoid exposing these to the end users
> SPARK-28372 <https://issues.apache.org/jira/browse/SPARK-28372> Document Spark WEB UI
> SPARK-28399 <https://issues.apache.org/jira/browse/SPARK-28399> RobustScaler feature transformer
> SPARK-28426 <https://issues.apache.org/jira/browse/SPARK-28426> Metadata Handling in Thrift Server
> SPARK-28588 <https://issues.apache.org/jira/browse/SPARK-28588> Build a SQL reference doc
> SPARK-28608 <https://issues.apache.org/jira/browse/SPARK-28608> Improve test coverage of ThriftServer
> SPARK-28753 <https://issues.apache.org/jira/browse/SPARK-28753> Dynamically reuse subqueries in AQE
> SPARK-28855 <https://issues.apache.org/jira/browse/SPARK-28855> Remove outdated Experimental, Evolving annotations
> SPARK-29345 <https://issues.apache.org/jira/browse/SPARK-29345> Add an API that allows a user to define and observe arbitrary metrics on streaming queries
> SPARK-25908 <https://issues.apache.org/jira/browse/SPARK-25908>
> SPARK-28980 <https://issues.apache.org/jira/browse/SPARK-28980> Remove deprecated items since <= 2.2.0
>
> Cheers,
>
> Xingbo
>
> Sean Owen <sro...@gmail.com> wrote on Thu, Oct 10, 2019 at 12:50 PM:
>
>> See the JIRA - this is too open-ended and not obviously just due to choices in data representation, what you're trying to do, etc. It's correctly closed IMHO.
>> However, identifying the issue more narrowly, and something that looks ripe for optimization, would be useful.
>>
>> On Thu, Oct 10, 2019 at 12:30 PM antonkulaga <antonkul...@gmail.com> wrote:
>> >
>> > I think for sure SPARK-28547 <https://issues.apache.org/jira/projects/SPARK/issues/SPARK-28547>
>> > At the moment there are some flaws in Spark's architecture, and it performs miserably or even freezes wherever the column count exceeds 10-15K (even a simple describe call takes ages, while the same operations in pandas without Spark take seconds). In many fields (like bioinformatics), wide datasets with large numbers of both rows and columns are very common (gene expression data is a good example), and Spark is totally useless there.
>> >
>> > --
>> > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org