Repository: spark
Updated Branches:
  refs/heads/master 83cdfd84f -> a1894422a


[SPARK-7715] [MLLIB] [ML] [DOC] Updated MLlib programming guide for release 1.4

Reorganized docs a bit.  Added migration guides.

**Q**: Do we want to say more for the 1.3 -> 1.4 migration guide for 
```spark.ml```?  It would be a lot.

CC: mengxr

Author: Joseph K. Bradley <[email protected]>

Closes #6897 from jkbradley/ml-guide-1.4 and squashes the following commits:

4bf26d6 [Joseph K. Bradley] tiny fix
8085067 [Joseph K. Bradley] fixed spacing/layout issues in ml guide from 
previous commit in this PR
6cd5c78 [Joseph K. Bradley] Updated MLlib programming guide for release 1.4


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a1894422
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a1894422
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a1894422

Branch: refs/heads/master
Commit: a1894422ad6b3335c84c73ba9466da6677d893cb
Parents: 83cdfd8
Author: Joseph K. Bradley <[email protected]>
Authored: Sun Jun 21 16:25:25 2015 -0700
Committer: Xiangrui Meng <[email protected]>
Committed: Sun Jun 21 16:25:25 2015 -0700

----------------------------------------------------------------------
 docs/ml-guide.md                 | 32 ++++++++++++++----------
 docs/mllib-feature-extraction.md |  3 ++-
 docs/mllib-guide.md              | 47 +++++++++++++++++++++--------------
 docs/mllib-migration-guides.md   | 16 ++++++++++++
 4 files changed, 65 insertions(+), 33 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/a1894422/docs/ml-guide.md
----------------------------------------------------------------------
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 4eb622d..c74cb1f 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -3,10 +3,10 @@ layout: global
 title: Spark ML Programming Guide
 ---
 
-`spark.ml` is a new package introduced in Spark 1.2, which aims to provide a 
uniform set of
+Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a 
uniform set of
 high-level APIs that help users create and tune practical machine learning 
pipelines.
-It is currently an alpha component, and we would like to hear back from the 
community about
-how it fits real-world use cases and how it could be improved.
+
+*Graduated from Alpha!*  The Pipelines API is no longer an alpha component, 
although many elements of it are still `Experimental` or `DeveloperApi`.
 
 Note that we will keep supporting and adding features to `spark.mllib` along 
with the
 development of `spark.ml`.
@@ -14,6 +14,12 @@ Users should be comfortable using `spark.mllib` features and 
expect more feature
 Developers should contribute new algorithms to `spark.mllib` and can 
optionally contribute
 to `spark.ml`.
 
+Guides for sub-packages of `spark.ml` include:
+
+* [Feature Extraction, Transformation, and Selection](ml-features.html): 
Details on transformers supported in the Pipelines API, including a few not in 
the lower-level `spark.mllib` API
+* [Ensembles](ml-ensembles.html): Details on ensemble learning methods in the 
Pipelines API
+
+
 **Table of Contents**
 
 * This will become a table of contents (this text will be scraped).
@@ -148,16 +154,6 @@ Parameters belong to specific instances of `Estimator`s 
and `Transformer`s.
 For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, 
then we can build a `ParamMap` with both `maxIter` parameters specified: 
`ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`.
 This is useful if there are two algorithms with the `maxIter` parameter in a 
`Pipeline`.
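The instance-scoped parameters described above can be sketched in plain Python (this is an illustration of the idea, not the actual Scala `ParamMap` API; the class and helper names here are made up):

```python
# Sketch: per-instance parameter maps, as in
# ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20).

class Param:
    """A parameter owned by one specific estimator instance."""
    def __init__(self, parent, name):
        self.parent, self.name = parent, name

class LogisticRegression:
    def __init__(self, uid):
        self.uid = uid
        self.maxIter = Param(self, "maxIter")

def param_map(*pairs):
    # Keyed by (owner uid, param name), so two instances'
    # maxIter parameters never collide in one map.
    return {(p.parent.uid, p.name): v for p, v in pairs}

lr1, lr2 = LogisticRegression("lr1"), LogisticRegression("lr2")
pm = param_map((lr1.maxIter, 10), (lr2.maxIter, 20))
print(pm[("lr1", "maxIter")], pm[("lr2", "maxIter")])  # 10 20
```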
 
-# Algorithm Guides
-
-There are now several algorithms in the Pipelines API which are not in the 
lower-level MLlib API, so we link to documentation for them here.  These 
algorithms are mostly feature transformers, which fit naturally into the 
`Transformer` abstraction in Pipelines, and ensembles, which fit naturally into 
the `Estimator` abstraction in the Pipelines.
-
-**Pipelines API Algorithm Guides**
-
-* [Feature Extraction, Transformation, and Selection](ml-features.html)
-* [Ensembles](ml-ensembles.html)
-
-
 # Code Examples
 
 This section gives code examples illustrating the functionality discussed 
above.
@@ -783,6 +779,16 @@ Spark ML also depends upon Spark SQL, but the relevant 
parts of Spark SQL do not
 
 # Migration Guide
 
+## From 1.3 to 1.4
+
+Several major API changes occurred, including:
+* `Param` and other APIs for specifying parameters
+* `uid` unique IDs for Pipeline components
+* Reorganization of certain classes
+Since the `spark.ml` API was an Alpha Component in Spark 1.3, we do not list 
all changes here.
+
+However, now that `spark.ml` is no longer an Alpha Component, we will provide 
details on any API changes for future releases.
+
 ## From 1.2 to 1.3
 
 The main API changes are from Spark SQL.  We list the most important changes 
here:

http://git-wip-us.apache.org/repos/asf/spark/blob/a1894422/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 1197dbb..83e9376 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -576,8 +576,9 @@ parsedData = data.map(lambda x: [float(t) for t in 
x.split(" ")])
 transformingVector = Vectors.dense([0.0, 1.0, 2.0])
 transformer = ElementwiseProduct(transformingVector)
 
-# Batch transform.
+# Batch transform
 transformedData = transformer.transform(parsedData)
+# Single-row transform
 transformedData2 = transformer.transform(parsedData.first())
 
 {% endhighlight %}
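The batch vs. single-row distinction the new comments draw can be sketched in plain Python (no Spark required; the data values are made up for illustration):

```python
# Sketch of what ElementwiseProduct computes: each input vector is
# multiplied component-wise by the transforming vector.

transforming_vector = [0.0, 1.0, 2.0]

def elementwise_product(v, scaling=transforming_vector):
    return [a * b for a, b in zip(scaling, v)]

parsed_data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Batch transform: apply to every row.
transformed = [elementwise_product(row) for row in parsed_data]
# Single-row transform: apply to one vector.
transformed_first = elementwise_product(parsed_data[0])

print(transformed)        # [[0.0, 2.0, 6.0], [0.0, 5.0, 12.0]]
print(transformed_first)  # [0.0, 2.0, 6.0]
```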

http://git-wip-us.apache.org/repos/asf/spark/blob/a1894422/docs/mllib-guide.md
----------------------------------------------------------------------
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index de7d66f..d2d1cc9 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -7,7 +7,19 @@ description: MLlib machine learning library overview for Spark 
SPARK_VERSION_SHO
 
 MLlib is Spark's scalable machine learning library consisting of common 
learning algorithms and utilities,
 including classification, regression, clustering, collaborative
-filtering, dimensionality reduction, as well as underlying optimization 
primitives, as outlined below:
+filtering, dimensionality reduction, as well as underlying optimization 
primitives.
+Guides for individual algorithms are listed below.
+
+The API is divided into two parts:
+
+* [The original `spark.mllib` 
API](mllib-guide.html#mllib-types-algorithms-and-utilities) is the primary API.
+* [The "Pipelines" `spark.ml` 
API](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) is a 
higher-level API for constructing ML workflows.
+
+We list major functionality from both below, with links to detailed guides.
+
+# MLlib types, algorithms and utilities
+
+This lists functionality included in `spark.mllib`, the main MLlib API.
 
 * [Data types](mllib-data-types.html)
 * [Basic statistics](mllib-statistics.html)
@@ -49,8 +61,8 @@ and the migration guide below will explain all changes 
between releases.
 
 Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a 
uniform set of
 high-level APIs that help users create and tune practical machine learning 
pipelines.
-It is currently an alpha component, and we would like to hear back from the 
community about
-how it fits real-world use cases and how it could be improved.
+
+*Graduated from Alpha!*  The Pipelines API is no longer an alpha component, 
although many elements of it are still `Experimental` or `DeveloperApi`.
 
 Note that we will keep supporting and adding features to `spark.mllib` along 
with the
 development of `spark.ml`.
@@ -58,7 +70,11 @@ Users should be comfortable using `spark.mllib` features and 
expect more feature
 Developers should contribute new algorithms to `spark.mllib` and can 
optionally contribute
 to `spark.ml`.
 
-See the **[spark.ml programming guide](ml-guide.html)** for more information 
on this package.
+More detailed guides for `spark.ml` include:
+
+* **[spark.ml programming guide](ml-guide.html)**: overview of the Pipelines 
API and major concepts
+* [Feature transformers](ml-features.html): Details on transformers supported 
in the Pipelines API, including a few not in the lower-level `spark.mllib` API
+* [Ensembles](ml-ensembles.html): Details on ensemble learning methods in the 
Pipelines API
 
 # Dependencies
 
@@ -90,21 +106,14 @@ version 1.4 or newer.
 
 For the `spark.ml` package, please see the [spark.ml Migration 
Guide](ml-guide.html#migration-guide).
 
-## From 1.2 to 1.3
-
-In the `spark.mllib` package, there were several breaking changes.  The first 
change (in `ALS`) is the only one in a component not marked as Alpha or 
Experimental.
-
-* *(Breaking change)* In 
[`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the 
extraneous method `solveLeastSquares` has been removed.  The `DeveloperApi` 
method `analyzeBlocks` was also removed.
-* *(Breaking change)* 
[`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel)
 remains an Alpha component. In it, the `variance` method has been replaced 
with the `std` method.  To compute the column variance values returned by the 
original `variance` method, simply square the standard deviation values 
returned by `std`.
-* *(Breaking change)* 
[`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD)
 remains an Experimental component.  In it, there were two changes:
-    * The constructor taking arguments was removed in favor of a builder 
patten using the default constructor plus parameter setter methods.
-    * Variable `model` is no longer public.
-* *(Breaking change)* 
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) 
remains an Experimental component.  In it and its associated classes, there 
were several changes:
-    * In `DecisionTree`, the deprecated class method `train` has been removed. 
 (The object/static `train` methods remain.)
-    * In `Strategy`, the `checkpointDir` parameter has been removed.  
Checkpointing is still supported, but the checkpoint directory must be set 
before calling tree and tree ensemble training.
-* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was 
a public API but is now private, declared `private[python]`.  This was never 
meant for external use.
-* In linear regression (including Lasso and ridge regression), the squared 
loss is now divided by 2.
-  So in order to produce the same result as in 1.2, the regularization 
parameter needs to be divided by 2 and the step size needs to be multiplied by 
2.
+## From 1.3 to 1.4
+
+In the `spark.mllib` package, there were several breaking changes, but all in 
`DeveloperApi` or `Experimental` APIs:
+
+* Gradient-Boosted Trees
+    * *(Breaking change)* The signature of the 
[`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) 
method was changed.  This is only an issue for users who wrote their own 
losses for GBTs.
+    * *(Breaking change)* The `apply` and `copy` methods for the case class 
[`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy)
 have been changed because of a modification to the case class fields.  This 
could be an issue for users who use `BoostingStrategy` to set GBT parameters.
+* *(Breaking change)* The return value of 
[`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has 
changed.  It now returns an abstract class `LDAModel` instead of the concrete 
class `DistributedLDAModel`.  The object of type `LDAModel` can still be cast 
to the appropriate concrete type, which depends on the optimization algorithm.
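The pattern behind the `LDA.run` change can be sketched in plain Python (not the Spark API; class and method names are illustrative): a method returns an abstract model type, and callers downcast when they need the concrete one.

```python
# Sketch: abstract return type plus downcast, as in the LDA.run change.

class LDAModel:                       # abstract result type
    pass

class DistributedLDAModel(LDAModel):  # concrete type for the EM optimizer
    def topic_distributions(self):
        return "per-document topic mixtures"

def run_lda(optimizer="em"):
    # Which concrete subclass comes back depends on the optimizer.
    return DistributedLDAModel() if optimizer == "em" else LDAModel()

model = run_lda()
# Callers now hold an LDAModel; check/cast to reach methods that
# only the concrete subclass provides.
if isinstance(model, DistributedLDAModel):
    print(model.topic_distributions())
```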
 
 ## Previous Spark Versions
 

http://git-wip-us.apache.org/repos/asf/spark/blob/a1894422/docs/mllib-migration-guides.md
----------------------------------------------------------------------
diff --git a/docs/mllib-migration-guides.md b/docs/mllib-migration-guides.md
index 4de2d94..8df68d8 100644
--- a/docs/mllib-migration-guides.md
+++ b/docs/mllib-migration-guides.md
@@ -7,6 +7,22 @@ description: MLlib migration guides from before Spark 
SPARK_VERSION_SHORT
 
 The migration guide for the current Spark version is kept on the [MLlib 
Programming Guide main page](mllib-guide.html#migration-guide).
 
+## From 1.2 to 1.3
+
+In the `spark.mllib` package, there were several breaking changes.  The first 
change (in `ALS`) is the only one in a component not marked as Alpha or 
Experimental.
+
+* *(Breaking change)* In 
[`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the 
extraneous method `solveLeastSquares` has been removed.  The `DeveloperApi` 
method `analyzeBlocks` was also removed.
+* *(Breaking change)* 
[`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel)
 remains an Alpha component. In it, the `variance` method has been replaced 
with the `std` method.  To compute the column variance values returned by the 
original `variance` method, simply square the standard deviation values 
returned by `std`.
+* *(Breaking change)* 
[`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD)
 remains an Experimental component.  In it, there were two changes:
+    * The constructor taking arguments was removed in favor of a builder 
pattern using the default constructor plus parameter setter methods.
+    * Variable `model` is no longer public.
+* *(Breaking change)* 
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) 
remains an Experimental component.  In it and its associated classes, there 
were several changes:
+    * In `DecisionTree`, the deprecated class method `train` has been removed. 
 (The object/static `train` methods remain.)
+    * In `Strategy`, the `checkpointDir` parameter has been removed.  
Checkpointing is still supported, but the checkpoint directory must be set 
before calling tree and tree ensemble training.
+* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was 
a public API but is now private, declared `private[python]`.  This was never 
meant for external use.
+* In linear regression (including Lasso and ridge regression), the squared 
loss is now divided by 2.
+  So in order to produce the same result as in 1.2, the regularization 
parameter needs to be divided by 2 and the step size needs to be multiplied by 
2.
+
 ## From 1.1 to 1.2
 
 The only API changes in MLlib v1.2 are in

