huaxingao commented on a change in pull request #27785: [SPARK-30934][ML][DOCS] Update ml-guide and ml-migration-guide for 3.0 release
URL: https://github.com/apache/spark/pull/27785#discussion_r388093510
##########
File path: docs/ml-guide.md
##########
@@ -87,31 +85,41 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
 [^1]: To learn more about the benefits and background of system optimised natives, you may wish to watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
-# Highlights in 2.3
+# Highlights in 3.0
-The list below highlights some of the new features and enhancements added to MLlib in the `2.3`
+The list below highlights some of the new features and enhancements added to MLlib in the `3.0`
 release of Spark:
-* Built-in support for reading images into a `DataFrame` was added
-([SPARK-21866](https://issues.apache.org/jira/browse/SPARK-21866)).
-* [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator) was added, and should be
-used instead of the existing `OneHotEncoder` transformer. The new estimator supports
-transforming multiple columns.
-* Multiple column support was also added to `QuantileDiscretizer` and `Bucketizer`
-([SPARK-22397](https://issues.apache.org/jira/browse/SPARK-22397) and
-[SPARK-20542](https://issues.apache.org/jira/browse/SPARK-20542))
-* A new [`FeatureHasher`](ml-features.html#featurehasher) transformer was added
- ([SPARK-13969](https://issues.apache.org/jira/browse/SPARK-13969)).
-* Added support for evaluating multiple models in parallel when performing cross-validation using
-[`TrainValidationSplit` or `CrossValidator`](ml-tuning.html)
-([SPARK-19357](https://issues.apache.org/jira/browse/SPARK-19357)).
-* Improved support for custom pipeline components in Python (see
-[SPARK-21633](https://issues.apache.org/jira/browse/SPARK-21633) and
-[SPARK-21542](https://issues.apache.org/jira/browse/SPARK-21542)).
-* `DataFrame` functions for descriptive summary statistics over vector columns
-([SPARK-19634](https://issues.apache.org/jira/browse/SPARK-19634)).
-* Robust linear regression with Huber loss
-([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)).
+* Multiple columns support was added to `Binarizer`, `StringIndexer`, `StopWordsRemover` and PySpark `QuantileDiscretizer`
+([SPARK-23578](https://issues.apache.org/jira/browse/SPARK-23578)),
+([SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215)),
+([SPARK-29808](https://issues.apache.org/jira/browse/SPARK-29808)),
+([SPARK-22796](https://issues.apache.org/jira/browse/SPARK-22796)).
+* Support Tree-Based Feature Transformation was added
+([SPARK-13677](https://issues.apache.org/jira/browse/SPARK-13677)).
+* Two new evaluators `MultilabelClassificationEvaluator` and `RankingEvaluator` were added
+([SPARK-16692](https://issues.apache.org/jira/browse/SPARK-16692)),
+([SPARK-28045](https://issues.apache.org/jira/browse/SPARK-28045)).
+* Sample weights support was added in `DecisionTreeClassifier/Regressor`, `RandomForestClassifier/Regressor`, `BisectingKMeans`, `KMeans` and `GaussianMixture`
+([SPARK-19591](https://issues.apache.org/jira/browse/SPARK-19591)),
+([SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478)),
+([SPARK-30351](https://issues.apache.org/jira/browse/SPARK-30351)),
+([SPARK-29967](https://issues.apache.org/jira/browse/SPARK-29967)),
+([SPARK-30102](https://issues.apache.org/jira/browse/SPARK-30102)).
+* R API for `PowerIterationClustering` was added
+([SPARK-19827](https://issues.apache.org/jira/browse/SPARK-19827)).
+* Added Spark ML listener for tracking ML pipeline status
+([SPARK-23674](https://issues.apache.org/jira/browse/SPARK-23674)).
+* Fit with validation set was added to Gradient Boosted Trees in Python
+([SPARK-24333](https://issues.apache.org/jira/browse/SPARK-24333)).
+* [`RobustScaler`](ml-features.html#robustscaler) transformer was added
+([SPARK-28399](https://issues.apache.org/jira/browse/SPARK-28399)).
+* [`Factorization Machines`](ml-classification-regression.html#factorization-machines) classifier and regressor were added
+([SPARK-29224](https://issues.apache.org/jira/browse/SPARK-29224)).
+* Complement Naive Bayes Classifier was added
+([SPARK-29942](https://issues.apache.org/jira/browse/SPARK-29942)).
+* ML function parity between Scala and Python
+([SPARK-28958](https://issues.apache.org/jira/browse/SPARK-28958)).

Review comment:
Will add all these you have mentioned.
