spark git commit: [SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents

srowen Sat, 11 Jun 2016 04:56:35 -0700

Repository: spark
Updated Branches:
  refs/heads/master 3761330dd -> ad102af16



[SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents

## What changes were proposed in this pull request?

This PR fixes all broken links in the Spark 2.0 preview MLlib documents. It also
includes some editorial changes.

**Fix broken links**
  * mllib-data-types.md
  * mllib-decision-tree.md
  * mllib-ensembles.md
  * mllib-feature-extraction.md
  * mllib-pmml-model-export.md
  * mllib-statistics.md
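
Most of the link fixes in the diff append `$` to a Scaladoc anchor, because the linked names (`Vectors`, `Matrices`, `MLUtils`, `RandomForest`, `Statistics`, `RandomRDDs`) point at Scala singleton objects. The following is only a rough sketch of that rewrite, not the procedure used for this commit; the helper `add_object_suffix` and the `OBJECT_NAMES` set are hypothetical and the set is taken from the links changed here.

```python
import re

# Object names taken from the links this commit changes (illustrative, not exhaustive).
OBJECT_NAMES = {"Vectors", "Matrices", "MLUtils", "RandomForest", "Statistics", "RandomRDDs"}

# Matches anchors such as api/scala/index.html#org.apache.spark.mllib.linalg.Vectors
# and captures the last path segment plus an optional trailing "$".
ANCHOR = re.compile(r"(api/scala/index\.html#[\w.]+\.(\w+))(\$?)")


def add_object_suffix(markdown: str) -> str:
    """Append "$" to Scaladoc anchors whose last segment is a known object name."""
    def repl(match: re.Match) -> str:
        anchor, name, suffix = match.group(1), match.group(2), match.group(3)
        if name in OBJECT_NAMES and not suffix:
            return anchor + "$"
        return match.group(0)  # already suffixed, or not a known object

    return ANCHOR.sub(repl, markdown)
```

Links to classes such as `RandomForestModel` or `KMeans` are left alone, which matches what the diff does.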

**Fix malformed section header and Scala coding style**
  * mllib-linear-methods.md
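
The malformed header in question is `###Streaming linear regression`, which lacks the space after the `#` markers. As a hedged illustration only (the commit fixed this by hand), a hypothetical helper `fix_headers` could normalize such lines. It assumes the input contains no fenced code whose lines start with `#` followed directly by text (for example a shebang), since it does not skip code blocks.

```python
import re

# One to six "#" at the start of a line, immediately followed by a character
# that is neither "#" nor whitespace, i.e. a header missing its space.
HEADER = re.compile(r"^(#{1,6})(?=[^#\s])", re.MULTILINE)


def fix_headers(markdown: str) -> str:
    """Insert the missing space between the header markers and the title."""
    return HEADER.sub(r"\1 ", markdown)
```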

**Replace indirect forward links with direct ones**
  * ml-classification-regression.md

## How was this patch tested?

Manual tests (with `cd docs; jekyll build`).
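
The patch was tested manually. As a possible complement, and purely as an assumption about how one might automate part of the check, the sketch below scans markdown text for links whose target is wrapped in an extra pair of parentheses, such as `[Linear Regression]((mllib-linear-methods.html))`. The function name `find_double_paren_links` is hypothetical.

```python
import re

# A markdown link whose target carries an extra pair of parentheses,
# e.g. [text]((target.html)).
DOUBLE_PAREN_LINK = re.compile(r"\[([^\]]+)\]\(\(([^()\s]+)\)\)")


def find_double_paren_links(markdown: str):
    """Return (line_number, link_text, target) for each malformed link."""
    hits = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        for match in DOUBLE_PAREN_LINK.finditer(line):
            hits.append((lineno, match.group(1), match.group(2)))
    return hits
```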

Author: Dongjoon Hyun <[email protected]>

Closes #13608 from dongjoon-hyun/SPARK-15883.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ad102af1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ad102af1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ad102af1

Branch: refs/heads/master
Commit: ad102af169c7344b30d3b84aa16452fcdc22542c
Parents: 3761330
Author: Dongjoon Hyun <[email protected]>
Authored: Sat Jun 11 12:55:38 2016 +0100
Committer: Sean Owen <[email protected]>
Committed: Sat Jun 11 12:55:38 2016 +0100

----------------------------------------------------------------------
 docs/ml-classification-regression.md |  4 ++--
 docs/mllib-data-types.md             | 16 ++++++----------
 docs/mllib-decision-tree.md          |  6 +++---
 docs/mllib-ensembles.md              |  6 +++---
 docs/mllib-feature-extraction.md     |  2 +-
 docs/mllib-linear-methods.md         | 10 +++++-----
 docs/mllib-pmml-model-export.md      |  2 +-
 docs/mllib-statistics.md             |  8 ++++----
 8 files changed, 25 insertions(+), 29 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/ml-classification-regression.md
----------------------------------------------------------------------
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index 88457d4..d7e5521 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -815,7 +815,7 @@ The main differences between this API and the [original 
MLlib ensembles API](mll
 ## Random Forests
 
 [Random forests](http://en.wikipedia.org/wiki/Random_forest)
-are ensembles of [decision trees](ml-decision-tree.html).
+are ensembles of [decision trees](ml-classification-regression.html#decision-trees).
 Random forests combine many decision trees in order to reduce the risk of 
overfitting.
 The `spark.ml` implementation supports random forests for binary and 
multiclass classification and for regression,
 using both continuous and categorical features.
@@ -896,7 +896,7 @@ All output columns are optional; to exclude an output 
column, set its correspond
 ## Gradient-Boosted Trees (GBTs)
 
 [Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
-are ensembles of [decision trees](ml-decision-tree.html).
+are ensembles of [decision trees](ml-classification-regression.html#decision-trees).
 GBTs iteratively train decision trees in order to minimize a loss function.
 The `spark.ml` implementation supports GBTs for binary classification and for 
regression,
 using both continuous and categorical features.

http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/mllib-data-types.md
----------------------------------------------------------------------
diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md
index 2ffe0f1..ef56aeb 100644
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -33,7 +33,7 @@ implementations: 
[`DenseVector`](api/scala/index.html#org.apache.spark.mllib.lin
 using the factory methods implemented in
 [`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to 
create local vectors.
 
-Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API.
+Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}
@@ -199,7 +199,7 @@ After loading, the feature indices are converted to 
zero-based.
 
[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$)
 reads training
 examples stored in LIBSVM format.
 
-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils) for details on the API.
+Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.mllib.regression.LabeledPoint
@@ -264,7 +264,7 @@ We recommend using the factory methods implemented
 in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) 
to create local
 matrices. Remember, local matrices in MLlib are stored in column-major order.
 
-Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) for details on the API.
+Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.mllib.linalg.{Matrix, Matrices}
@@ -331,7 +331,7 @@ sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
 A distributed matrix has long-typed row and column indices and double-typed 
values, stored
 distributively in one or more RDDs.  It is very important to choose the right 
format to store large
 and distributed matrices.  Converting a distributed matrix to a different 
format may require a
-global shuffle, which is quite expensive.  Three types of distributed matrices have been implemented
+global shuffle, which is quite expensive. Four types of distributed matrices have been implemented
 so far.
 
 The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented 
distributed
@@ -344,6 +344,8 @@ An `IndexedRowMatrix` is similar to a `RowMatrix` but with 
row indices,
 which can be used for identifying rows and executing joins.
 A `CoordinateMatrix` is a distributed matrix stored in [coordinate list 
(COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) 
format,
 backed by an RDD of its entries.
+A `BlockMatrix` is a distributed matrix backed by an RDD of `MatrixBlock`
+which is a tuple of `(Int, Int, Matrix)`.
 
 ***Note***
 
@@ -535,12 +537,6 @@ rowsRDD = mat.rows
 
 # Convert to a RowMatrix by dropping the row indices.
 rowMat = mat.toRowMatrix()
-
-# Convert to a CoordinateMatrix.
-coordinateMat = mat.toCoordinateMatrix()
-
-# Convert to a BlockMatrix.
-blockMat = mat.toBlockMatrix()
 {% endhighlight %}
 </div>
 

http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/mllib-decision-tree.md
----------------------------------------------------------------------
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 9af4835..11f5de1 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -136,7 +136,7 @@ When tuning these parameters, be careful to validate on 
held-out test data to av
 
 * **`maxDepth`**: Maximum depth of a tree.  Deeper trees are more expressive 
(potentially allowing higher accuracy), but they are also more costly to train 
and are more likely to overfit.
 
-* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances.  This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) since those are often trained deeper than individual trees.
+* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances.  This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) since those are often trained deeper than individual trees.
 
 * **`minInfoGain`**: For a node to be split further, the split must improve at 
least this much (in terms of information gain).
 
@@ -152,13 +152,13 @@ These parameters may be tuned.  Be careful to validate on 
held-out test data whe
   * The default value is conservatively chosen to be 256 MB to allow the 
decision algorithm to work in most scenarios.  Increasing `maxMemoryInMB` can 
lead to faster training (if the memory is available) by allowing fewer passes 
over the data.  However, there may be decreasing returns as `maxMemoryInMB` 
grows since the amount of communication on each iteration can be proportional 
to `maxMemoryInMB`.
   * *Implementation details*: For faster processing, the decision tree 
algorithm collects statistics about groups of nodes to split (rather than 1 
node at a time).  The number of nodes which can be handled in one group is 
determined by the memory requirements (which vary per features).  The 
`maxMemoryInMB` parameter specifies the memory limit in terms of megabytes 
which each worker can use for these statistics.
 
-* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree.  This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data.  For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
+* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree.  This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data.  For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
 
 * **`impurity`**: Impurity measure (discussed above) used to choose between 
candidate splits.  This measure must match the `algo` parameter.
 
 ### Caching and checkpointing
 
-MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles.  When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing.  These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) when `numTrees` is set to be large.
+MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles.  When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing.  These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) when `numTrees` is set to be large.
 
 * **`useNodeIdCache`**: If this is set to true, the algorithm will avoid 
passing the current model (tree or trees) to executors on each iteration.
   * This can be useful with deep trees (speeding up computation on workers) 
and for large Random Forests (reducing communication on each iteration).

http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/mllib-ensembles.md
----------------------------------------------------------------------
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index 2416b6f..5543262 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -9,7 +9,7 @@ displayTitle: Ensembles - spark.mllib
 
 An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning)
 is a learning algorithm which creates a model composed of a set of other base 
models.
-`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest).
+`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$).
 Both use [decision trees](mllib-decision-tree.html) as their base models.
 
 ## Gradient-Boosted Trees vs. Random Forests
@@ -96,7 +96,7 @@ The test error is calculated to measure the algorithm 
accuracy.
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
 
 {% include_example 
scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala %}
 </div>
@@ -127,7 +127,7 @@ The Mean Squared Error (MSE) is computed at the end to 
evaluate
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
 
 {% include_example 
scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala %}
 </div>

http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 4c027c8..67c033e 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -333,7 +333,7 @@ Details you can read at [dimensionality 
reduction](mllib-dimensionality-reductio
 
 The following code demonstrates how to compute principal components on a 
`Vector`
 and use them to project the vectors into a low-dimensional space while keeping 
associated labels
-for calculation a [Linear Regression]((mllib-linear-methods.html))
+for calculation a [Linear Regression](mllib-linear-methods.html)
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/mllib-linear-methods.md
----------------------------------------------------------------------
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 63665c4..17d781a 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -185,10 +185,10 @@ algorithm for 200 iterations.
 import org.apache.spark.mllib.optimization.L1Updater
 
 val svmAlg = new SVMWithSGD()
-svmAlg.optimizer.
-  setNumIterations(200).
-  setRegParam(0.1).
-  setUpdater(new L1Updater)
+svmAlg.optimizer
+  .setNumIterations(200)
+  .setRegParam(0.1)
+  .setUpdater(new L1Updater)
 val modelL1 = svmAlg.run(training)
 {% endhighlight %}
 
@@ -395,7 +395,7 @@ section of the Spark
 quick-start guide. Be sure to also include *spark-mllib* to your build file as
 a dependency.
 
-###Streaming linear regression
+### Streaming linear regression
 
 When data arrive in a streaming fashion, it is useful to fit regression models 
online,
 updating the parameters of the model as new data arrives. `spark.mllib` 
currently supports

http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/mllib-pmml-model-export.md
----------------------------------------------------------------------
diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md
index 58ed5a0..7f2347d 100644
--- a/docs/mllib-pmml-model-export.md
+++ b/docs/mllib-pmml-model-export.md
@@ -47,7 +47,7 @@ To export a supported `model` (see table above) to PMML, 
simply call `model.toPM
 
 As well as exporting the PMML model to a String (`model.toPMML` as in the 
example above), you can export the PMML model to other formats.
 
-Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API.
+Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
 
 Here a complete example of building a KMeansModel and print it out in PMML 
format:
 {% include_example 
scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}

http://git-wip-us.apache.org/repos/asf/spark/blob/ad102af1/docs/mllib-statistics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index 02b81f1..329855e 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -80,7 +80,7 @@ correlation methods are currently Pearson's and Spearman's 
correlation.
 calculate correlations between series. Depending on the type of input, two 
`RDD[Double]`s or
 an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` 
respectively.
 
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
 
 {% include_example 
scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
 </div>
@@ -210,7 +210,7 @@ message.
 run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example 
demonstrates how to run
 and interpret the hypothesis tests.
 
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
 
 {% include_example 
scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala
 %}
 </div>
@@ -277,12 +277,12 @@ uniform, standard normal, or Poisson.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
+[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) provides factory
 methods to generate random double RDDs or vector RDDs.
 The following example generates a random double RDD, whose values follows the 
standard normal
 distribution `N(0, 1)`, and then map it to `N(1, 4)`.
 
-Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) for details on the API.
+Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.SparkContext

