[GitHub] spark pull request #16076: [SPARK-18324][ML][DOC] Update ML programming and ...

jkbradley Thu, 01 Dec 2016 13:37:16 -0800

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16076#discussion_r90542680
  
    --- Diff: docs/ml-guide.md ---
    @@ -60,152 +60,37 @@ MLlib is under active development.
     The APIs marked `Experimental`/`DeveloperApi` may change in future 
releases,
     and the migration guide below will explain all changes between releases.
     
    -## From 1.6 to 2.0
    +## From 2.0 to 2.1
     
     ### Breaking changes
     
    -There were several breaking changes in Spark 2.0, which are outlined below.
    -
    -**Linear algebra classes for DataFrame-based APIs**
    -
    -Spark's linear algebra dependencies were moved to a new project, 
`mllib-local` 
    -(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). 
    -As part of this change, the linear algebra classes were copied to a new 
package, `spark.ml.linalg`. 
    -The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` 
classes, 
    -leading to a few breaking changes, predominantly in various model classes 
    -(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for 
a full list).
    -
    -**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the 
previous package `spark.mllib.linalg`.
    -
    -_Converting vectors and matrices_
    -
    -While most pipeline components support backward compatibility for loading, 
    -some existing `DataFrames` and pipelines in Spark versions prior to 2.0, 
that contain vector or matrix 
    -columns, may need to be migrated to the new `spark.ml` vector and matrix 
types. 
    -Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to 
`spark.ml.linalg` types
    -(and vice versa) can be found in `spark.mllib.util.MLUtils`.
    -
    -There are also utility methods available for converting single instances 
of 
    -vectors and matrices. Use the `asML` method on a `mllib.linalg.Vector` / 
`mllib.linalg.Matrix`
    -for converting to `ml.linalg` types, and 
    -`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML` 
    -for converting to `mllib.linalg` types.
    -
    -<div class="codetabs">
    -<div data-lang="scala"  markdown="1">
    -
    -{% highlight scala %}
    -import org.apache.spark.mllib.util.MLUtils
    -
    -// convert DataFrame columns
    -val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
    -val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
    -// convert a single vector or matrix
    -val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
    -val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
    -{% endhighlight %}
    -
    -Refer to the [`MLUtils` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further 
detail.
    -</div>
    -
    -<div data-lang="java" markdown="1">
    -
    -{% highlight java %}
    -import org.apache.spark.mllib.util.MLUtils;
    -import org.apache.spark.sql.Dataset;
    -
    -// convert DataFrame columns
    -Dataset<Row> convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF);
    -Dataset<Row> convertedMatrixDF = 
MLUtils.convertMatrixColumnsToML(matrixDF);
    -// convert a single vector or matrix
    -org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
    -org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
    -{% endhighlight %}
    -
    -Refer to the [`MLUtils` Java 
docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.
    -</div>
    -
    -<div data-lang="python"  markdown="1">
    -
    -{% highlight python %}
    -from pyspark.mllib.util import MLUtils
    -
    -# convert DataFrame columns
    -convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
    -convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
    -# convert a single vector or matrix
    -mlVec = mllibVec.asML()
    -mlMat = mllibMat.asML()
    -{% endhighlight %}
    -
    -Refer to the [`MLUtils` Python 
docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further 
detail.
    -</div>
    -</div>
    -
     **Deprecated methods removed**
     
    -Several deprecated methods were removed in the `spark.mllib` and 
`spark.ml` packages:
    -
    -* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator`
    -* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml`
    -* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as 
`DeveloperApi`)
    -* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these 
functions are available on `RDD`s directly, and were marked as `DeveloperApi`)
    -* `defaultStategy` in `mllib.tree.configuration.Strategy`
    -* `build` in `mllib.tree.Node`
    -* libsvm loaders for multiclass and load/save labeledData methods in 
`mllib.util.MLUtils`
    -
    -A full list of breaking changes can be found at 
[SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).
    +* `setLabelCol` in `feature.ChiSqSelectorModel`
    +* `numTrees` in `classification.RandomForestClassificationModel` (This now 
refers to the Param called `numTrees`)
    +* `numTrees` in `regression.RandomForestRegressionModel` (This now refers 
to the Param called `numTrees`)
    +* `model` in `regression.LinearRegressionSummary`
    +* `validateParams` in `PipelineStage`
     
     ### Deprecations and changes of behavior
     
     **Deprecations**
     
    -Deprecations in the `spark.mllib` and `spark.ml` packages include:
    -
    -* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984):
    - In `spark.ml.regression.LinearRegressionSummary`, the `model` field has 
been deprecated.
    -* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784):
    - In `spark.ml.regression.RandomForestRegressionModel` and 
`spark.ml.classification.RandomForestClassificationModel`,
    - the `numTrees` parameter has been deprecated in favor of `getNumTrees` 
method.
    -* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761):
    - In `spark.ml.param.Params`, the `validateParams` method has been 
deprecated.
    - We move all functionality in overridden methods to the corresponding 
`transformSchema`.
    -* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829):
    - In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, 
`RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated.
    - We encourage users to use `spark.ml.regression.LinearRegresson` and 
`spark.ml.classification.LogisticRegresson`.
    -* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900):
    - In `spark.mllib.evaluation.MulticlassMetrics`, the parameters 
`precision`, `recall` and `fMeasure` have been deprecated in favor of 
`accuracy`.
    -* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644):
    - In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` 
method has been deprecated in favor of `session`.
    -* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has 
been deprecated since it was not used by `ChiSqSelectorModel`.
    +* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
    +  Deprecate all setter methods for `DecisionTreeClassificationModel`, 
`GBTClassificationModel`, `RandomForestClassificationModel`, 
`DecisionTreeRegressionModel`, `GBTRegressionModel` and 
`RandomForestRegressionModel`
     
     **Changes of behavior**
     
    -Changes of behavior in the `spark.mllib` and `spark.ml` packages include:
    -
    -* [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780):
    - `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls 
`spark.ml.classification.LogisticRegresson` for binary classification now.
    - This will introduce the following behavior changes for 
`spark.mllib.classification.LogisticRegressionWithLBFGS`:
    -    * The intercept will not be regularized when training binary 
classification model with L1/L2 Updater.
    -    * If users set without regularization, training with or without 
feature scaling will return the same solution by the same convergence rate.
    -* [SPARK-13429](https://issues.apache.org/jira/browse/SPARK-13429):
    - In order to provide better and consistent result with 
`spark.ml.classification.LogisticRegresson`,
    - the default value of 
`spark.mllib.classification.LogisticRegressionWithLBFGS`: `convergenceTol` has 
been changed from 1E-4 to 1E-6.
    -* [SPARK-12363](https://issues.apache.org/jira/browse/SPARK-12363):
    - Fix a bug of `PowerIterationClustering` which will likely change its 
result.
    -* [SPARK-13048](https://issues.apache.org/jira/browse/SPARK-13048):
    - `LDA` using the `EM` optimizer will keep the last checkpoint by default, 
if checkpointing is being used.
    -* [SPARK-12153](https://issues.apache.org/jira/browse/SPARK-12153):
    - `Word2Vec` now respects sentence boundaries. Previously, it did not 
handle them correctly.
    -* [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574):
    - `HashingTF` uses `MurmurHash3` as default hash algorithm in both 
`spark.ml` and `spark.mllib`.
    -* [SPARK-14768](https://issues.apache.org/jira/browse/SPARK-14768):
    - The `expectedType` argument for PySpark `Param` was removed.
    -* [SPARK-14931](https://issues.apache.org/jira/browse/SPARK-14931):
    - Some default `Param` values, which were mismatched between pipelines in 
Scala and Python, have been changed.
    -* [SPARK-13600](https://issues.apache.org/jira/browse/SPARK-13600):
    - `QuantileDiscretizer` now uses 
`spark.sql.DataFrameStatFunctions.approxQuantile` to find splits (previously 
used custom sampling logic).
    - The output buckets will differ for same input data and params.
    +* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
    + Fix a bug of `ChiSqSelector` which will likely change its result.
    +* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
    + `KMeans` returns potentially fewer than k cluster centers in cases where 
k distinct centroids aren't available or aren't selected.
    +* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
    + `KMeans` reduces the default number of steps from 5 to 2 for the 
k-means|| initialization mode.
    +* [SPARK-18481](https://issues.apache.org/jira/browse/SPARK-18481):
    --- End diff --
    
    Ah, you're right.  I was thinking the "final" change was breaking for 
Scala, but it's not since the model constructor is private.  I'm OK with moving 
those back to changes of behavior, or just removing these items completely.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #16076: [SPARK-18324][ML][DOC] Update ML programming and ...

Reply via email to