Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/13378#discussion_r65410927
--- Diff: docs/mllib-guide.md ---
@@ -102,32 +102,53 @@ MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.
-## From 1.5 to 1.6
+## From 1.6 to 2.0
-There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
-deprecations and changes of behavior.
+The deprecations and changes of behavior in the `spark.mllib` or `spark.ml` packages include:
Deprecations:
-* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
-  In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
-* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
-  In `spark.ml.classification.LogisticRegressionModel` and `spark.ml.regression.LinearRegressionModel`,
-  the `weights` field has been deprecated in favor of the new name `coefficients`.
-  This helps disambiguate from instance (row) "weights" given to algorithms.
+* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
+  In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated.
+* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
+  In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`,
+  the `numTrees` parameter has been deprecated in favor of the `getNumTrees` method.
+* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
+  In `spark.ml.param.Params`, the `validateParams` method has been deprecated.
+  All functionality in overridden methods has been moved to the corresponding `transformSchema`.
+* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
+  In the `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
+  We encourage users to use `spark.ml.regression.LinearRegression` and `spark.ml.classification.LogisticRegression` instead (see the sketch after this list).
+* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
+  In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`.
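
As a rough illustration only (not part of this diff), here is a minimal, untested sketch of migrating off two of the deprecated APIs listed above; `trainingDF` is an assumed `DataFrame` with the usual `label`/`features` columns:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.regression.LinearRegression

// Read the number of trees via the getNumTrees method rather than the
// deprecated numTrees Param (SPARK-13784).
val rf = new RandomForestClassifier().setNumTrees(50)
val rfModel = rf.fit(trainingDF)
val nTrees = rfModel.getNumTrees

// Use spark.ml's LinearRegression instead of the deprecated
// spark.mllib LinearRegressionWithSGD (SPARK-14829).
val lr = new LinearRegression().setMaxIter(100).setRegParam(0.01)
val lrModel = lr.fit(trainingDF)
```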
Changes of behavior:
-* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
-  `spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
-  Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
-  `GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
-  previous error); for small errors (`< 0.01`), it uses absolute error.
-* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
-  `spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
-  tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
-  behavior of the simpler `Tokenizer` transformer.
+* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
+  `spark.mllib.classification.LogisticRegressionWithLBFGS` now directly calls `spark.ml.classification.LogisticRegression` for binary classification.
+  This introduces the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`:
+    * The intercept will not be regularized when training a binary classification model with an L1/L2 `Updater`.
+    * If no regularization is used, training with or without feature scaling will return the same solution at the same convergence rate.
+* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
+  In order to provide better and more consistent results with `spark.ml.classification.LogisticRegression`,
+  the default value of `convergenceTol` in `spark.mllib.classification.LogisticRegressionWithLBFGS` has been changed from 1E-4 to 1E-6 (see the sketch after this list).
+* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
+  Fixed a bug in `PowerIterationClustering` which will likely change its result.
+* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
+  `LDA` using the `EM` optimizer will keep the last checkpoint by default, if checkpointing is being used.
+* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
+  `Word2Vec` now respects sentence boundaries. Previously, it did not handle them correctly.
+* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
+  `HashingTF` uses `MurmurHash3` as the default hash algorithm in both `spark.ml` and `spark.mllib`.
+* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
+  The `expectedType` argument for PySpark `Param` was removed.
+* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
+  Some default `Param` values, which were mismatched between pipelines in Scala and Python, have been changed.
+* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
+  `QuantileDiscretizer` now uses `spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously it used custom sampling logic).
+  The output buckets will differ even for the same input data and params.
+* [SPARK-14814](https://issues.apache.org/jira/browse/SPARK-14814):
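
Likewise, a minimal sketch (not part of this diff) of explicitly restoring the old tolerance if the SPARK-13429 change matters for a given workload; `trainingRDD` is an assumed `RDD[LabeledPoint]`:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Restore the pre-2.0 convergence tolerance of 1E-4; the new default is 1E-6.
val lbfgs = new LogisticRegressionWithLBFGS()
lbfgs.optimizer.setConvergenceTol(1e-4)
val model = lbfgs.run(trainingRDD)
```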
--- End diff --
I've added it to the list in
[SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810). We can either
remove it from this PR and I will include it when I do the PR for
breaking changes, or add it to a breaking changes section in this PR, which I
will update with the others later.