[1/2] spark git commit: [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation.

jkbradley Thu, 10 Dec 2015 12:51:13 -0800

Repository: spark
Updated Branches:
  refs/heads/master ec5f9ed5d -> 2ecbe02d5



http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-clustering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 8fbced6..48d64cd 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Clustering - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
+title: Clustering - spark.mllib
+displayTitle: Clustering - spark.mllib
 ---
 
 [Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is an unsupervised learning problem whereby we aim to group subsets
@@ -10,19 +10,19 @@ often used for exploratory analysis and/or as a component of a hierarchical
 [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) pipeline (in which distinct classifiers or regression
 models are trained for each cluster).
 
-MLlib supports the following models:
+The `spark.mllib` package supports the following models:
 
 * Table of contents
 {:toc}
 
 ## K-means
 
-[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
+[K-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
 most commonly used clustering algorithms that clusters the data points into a
-predefined number of clusters. The MLlib implementation includes a parallelized
+predefined number of clusters. The `spark.mllib` implementation includes a parallelized
 variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
 called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
-The implementation in MLlib has the following parameters:
+The implementation in `spark.mllib` has the following parameters:
 
 * *k* is the number of desired clusters.
 * *maxIterations* is the maximum number of iterations to run.
@@ -171,7 +171,7 @@ sameModel = KMeansModel.load(sc, "myModelPath")
 
 A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
 represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
-each with its own probability.  The MLlib implementation uses the
+each with its own probability.  The `spark.mllib` implementation uses the
 [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
  algorithm to induce the maximum-likelihood model given a set of samples.  The implementation
 has the following parameters:
@@ -308,13 +308,13 @@ graph given pairwise similarties as edge properties,
 described in [Lin and Cohen, Power Iteration Clustering](http://www.icml2010.org/papers/387.pdf).
 It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via
 [power iteration](http://en.wikipedia.org/wiki/Power_iteration)  and uses it to cluster vertices.
-MLlib includes an implementation of PIC using GraphX as its backend.
+`spark.mllib` includes an implementation of PIC using GraphX as its backend.
 It takes an `RDD` of `(srcId, dstId, similarity)` tuples and outputs a model with the clustering assignments.
 The similarities must be nonnegative.
 PIC assumes that the similarity measure is symmetric.
 A pair `(srcId, dstId)` regardless of the ordering should appear at most once in the input data.
 If a pair is missing from input, their similarity is treated as zero.
-MLlib's PIC implementation takes the following (hyper-)parameters:
+`spark.mllib`'s PIC implementation takes the following (hyper-)parameters:
 
 * `k`: number of clusters
 * `maxIterations`: maximum number of power iterations
@@ -323,7 +323,7 @@ MLlib's PIC implementation takes the following (hyper-)parameters:
 
 **Examples**
 
-In the following, we show code snippets to demonstrate how to use PIC in MLlib.
+In the following, we show code snippets to demonstrate how to use PIC in `spark.mllib`.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -493,7 +493,7 @@ checkpointing can help reduce shuffle file sizes on disk and help with
 failure recovery.
 
 
-All of MLlib's LDA models support:
+All of `spark.mllib`'s LDA models support:
 
 * `describeTopics`: Returns topics as arrays of most important terms and
 term weights
@@ -721,7 +721,7 @@ sameModel = LDAModel.load(sc, "myModelPath")
 ## Streaming k-means
 
 When data arrive in a stream, we may want to estimate clusters dynamically,
-updating them as new data arrive. MLlib provides support for streaming k-means clustering,
+updating them as new data arrive. `spark.mllib` provides support for streaming k-means clustering,
 with parameters to control the decay (or "forgetfulness") of the estimates. The algorithm
 uses a generalization of the mini-batch k-means update rule. For each batch of data, we assign
 all points to their nearest cluster, compute new cluster centers, then update each cluster using:

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-collaborative-filtering.md
----------------------------------------------------------------------
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index 7cd1b89..1ebb465 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Collaborative Filtering - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Collaborative Filtering 
+title: Collaborative Filtering - spark.mllib
+displayTitle: Collaborative Filtering - spark.mllib
 ---
 
 * Table of contents
@@ -11,12 +11,12 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Collaborative Filtering
 
 [Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
 is commonly used for recommender systems.  These techniques aim to fill in the
-missing entries of a user-item association matrix.  MLlib currently supports
+missing entries of a user-item association matrix.  `spark.mllib` currently supports
 model-based collaborative filtering, in which users and products are described
 by a small set of latent factors that can be used to predict missing entries.
-MLlib uses the [alternating least squares
+`spark.mllib` uses the [alternating least squares
 (ALS)](http://dl.acm.org/citation.cfm?id=1608614)
-algorithm to learn these latent factors. The implementation in MLlib has the
+algorithm to learn these latent factors. The implementation in `spark.mllib` has the
 following parameters:
 
 * *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
@@ -34,7 +34,7 @@ The standard approach to matrix factorization based collaborative filtering trea
 the entries in the user-item matrix as *explicit* preferences given by the user to the item.
 
 It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
-clicks, purchases, likes, shares etc.). The approach used in MLlib to deal with such data is taken
+clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken
 from
 [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
 Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
@@ -119,4 +119,4 @@ a dependency.
 ## Tutorial
 
 The [training exercises](https://databricks-training.s3.amazonaws.com/index.html) from the Spark Summit 2014 include a hands-on tutorial for
-[personalized movie recommendation with MLlib](https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html).
+[personalized movie recommendation with `spark.mllib`](https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html).

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-data-types.md
----------------------------------------------------------------------
diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md
index 3c0c047..363dc7c 100644
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -1,7 +1,7 @@
 ---
 layout: global
 title: Data Types - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Data Types
+displayTitle: Data Types - MLlib
 ---
 
 * Table of contents

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-decision-tree.md
----------------------------------------------------------------------
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 77ce34e..a8612b6 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Decision Trees - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Decision Trees
+title: Decision Trees - spark.mllib
+displayTitle: Decision Trees - spark.mllib
 ---
 
 * Table of contents
@@ -15,7 +15,7 @@ feature scaling, and are able to capture non-linearities and feature interaction
 algorithms such as random forests and boosting are among the top performers for classification and
 regression tasks.
 
-MLlib supports decision trees for binary and multiclass classification and for regression,
+`spark.mllib` supports decision trees for binary and multiclass classification and for regression,
 using both continuous and categorical features. The implementation partitions data by rows,
 allowing distributed training with millions of instances.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-dimensionality-reduction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
index ac35269..11d8e0b 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Dimensionality Reduction - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
+title: Dimensionality Reduction - spark.mllib
+displayTitle: Dimensionality Reduction - spark.mllib
 ---
 
 * Table of contents
@@ -11,7 +11,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
 of reducing the number of variables under consideration.
 It can be used to extract latent features from raw and noisy features
 or compress data while maintaining the structure.
-MLlib provides support for dimensionality reduction on the <a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
+`spark.mllib` provides support for dimensionality reduction on the <a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
 
 ## Singular value decomposition (SVD)
 
@@ -57,7 +57,7 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.
 
 ### SVD Example
  
-MLlib provides SVD functionality to row-oriented matrices, provided in the
+`spark.mllib` provides SVD functionality to row-oriented matrices, provided in the
 <a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class. 
 
 <div class="codetabs">
@@ -141,7 +141,7 @@ statistical method to find a rotation such that the first coordinate has the lar
 possible, and each succeeding coordinate in turn has the largest variance possible. The columns of
 the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.
 
-MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
+`spark.mllib` supports PCA for tall-and-skinny matrices stored in row-oriented format and any Vectors.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-ensembles.md
----------------------------------------------------------------------
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index 50450e0..2416b6f 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Ensembles - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Ensembles
+title: Ensembles - spark.mllib
+displayTitle: Ensembles - spark.mllib
 ---
 
 * Table of contents
@@ -9,7 +9,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Ensembles
 
 An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning)
 is a learning algorithm which creates a model composed of a set of other base models.
-MLlib supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest).
+`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest).
 Both use [decision trees](mllib-decision-tree.html) as their base models.
 
 ## Gradient-Boosted Trees vs. Random Forests
@@ -33,9 +33,9 @@ Like decision trees, random forests handle categorical features,
 extend to the multiclass classification setting, do not require
 feature scaling, and are able to capture non-linearities and feature interactions.
 
-MLlib supports random forests for binary and multiclass classification and for regression,
+`spark.mllib` supports random forests for binary and multiclass classification and for regression,
 using both continuous and categorical features.
-MLlib implements random forests using the existing [decision tree](mllib-decision-tree.html)
+`spark.mllib` implements random forests using the existing [decision tree](mllib-decision-tree.html)
 implementation.  Please see the decision tree guide for more information on trees.
 
 ### Basic algorithm
@@ -155,9 +155,9 @@ Like decision trees, GBTs handle categorical features,
 extend to the multiclass classification setting, do not require
 feature scaling, and are able to capture non-linearities and feature interactions.
 
-MLlib supports GBTs for binary classification and for regression,
+`spark.mllib` supports GBTs for binary classification and for regression,
 using both continuous and categorical features.
-MLlib implements GBTs using the existing [decision tree](mllib-decision-tree.html) implementation.  Please see the decision tree guide for more information on trees.
+`spark.mllib` implements GBTs using the existing [decision tree](mllib-decision-tree.html) implementation.  Please see the decision tree guide for more information on trees.
 
 *Note*: GBTs do not yet support multiclass classification.  For multiclass problems, please use
 [decision trees](mllib-decision-tree.html) or [Random Forests](mllib-ensembles.html#Random-Forest).
@@ -171,7 +171,7 @@ The specific mechanism for re-labeling instances is defined by a loss function (
 
 #### Losses
 
-The table below lists the losses currently supported by GBTs in MLlib.
+The table below lists the losses currently supported by GBTs in `spark.mllib`.
 Note that each loss is applicable to one of classification or regression, not both.
 
 Notation: $N$ = number of instances. $y_i$ = label of instance $i$.  $x_i$ = features of instance $i$.  $F(x_i)$ = model's predicted label for instance $i$.

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-evaluation-metrics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md
index 6924037..774826c 100644
--- a/docs/mllib-evaluation-metrics.md
+++ b/docs/mllib-evaluation-metrics.md
@@ -1,20 +1,20 @@
 ---
 layout: global
-title: Evaluation Metrics - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Evaluation Metrics
+title: Evaluation Metrics - spark.mllib
+displayTitle: Evaluation Metrics - spark.mllib
 ---
 
 * Table of contents
 {:toc}
 
-Spark's MLlib comes with a number of machine learning algorithms that can be used to learn from and make predictions
+`spark.mllib` comes with a number of machine learning algorithms that can be used to learn from and make predictions
 on data. When these algorithms are applied to build machine learning models, there is a need to evaluate the performance
-of the model on some criteria, which depends on the application and its requirements. Spark's MLlib also provides a
+of the model on some criteria, which depends on the application and its requirements. `spark.mllib` also provides a
 suite of metrics for the purpose of evaluating the performance of machine learning models.
 
 Specific machine learning algorithms fall under broader types of machine learning applications like classification,
 regression, clustering, etc. Each of these types have well established metrics for performance evaluation and those
-metrics that are currently available in Spark's MLlib are detailed in this section.
+metrics that are currently available in `spark.mllib` are detailed in this section.
 
 ## Classification model evaluation
 

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 5bee170..7796bac 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Feature Extraction and Transformation - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Feature Extraction and Transformation
+title: Feature Extraction and Transformation - spark.mllib
+displayTitle: Feature Extraction and Transformation - spark.mllib
 ---
 
 * Table of contents
@@ -31,7 +31,7 @@ The TF-IDF measure is simply the product of TF and IDF:
 TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
 \]`
 There are several variants on the definition of term frequency and document frequency.
-In MLlib, we separate TF and IDF to make them flexible.
+In `spark.mllib`, we separate TF and IDF to make them flexible.
 
 Our implementation of term frequency utilizes the
 [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
@@ -44,7 +44,7 @@ To reduce the chance of collision, we can increase the target feature dimension,
 the number of buckets of the hash table.
 The default feature dimension is `$2^{20} = 1,048,576$`.
 
-**Note:** MLlib doesn't provide tools for text segmentation.
+**Note:** `spark.mllib` doesn't provide tools for text segmentation.
 We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and 
 [scalanlp/chalk](https://github.com/scalanlp/chalk).
 
@@ -86,7 +86,7 @@ val idf = new IDF().fit(tf)
 val tfidf: RDD[Vector] = idf.transform(tf)
 {% endhighlight %}
 
-MLlib's IDF implementation provides an option for ignoring terms which occur in less than a
+`spark.mllib`'s IDF implementation provides an option for ignoring terms which occur in less than a
 minimum number of documents.  In such cases, the IDF for these terms is set to 0.  This feature
 can be used by passing the `minDocFreq` value to the IDF constructor.
 
@@ -134,7 +134,7 @@ idf = IDF().fit(tf)
 tfidf = idf.transform(tf)
 {% endhighlight %}
 
-MLLib's IDF implementation provides an option for ignoring terms which occur in less than a
+`spark.mllib`'s IDF implementation provides an option for ignoring terms which occur in less than a
 minimum number of documents.  In such cases, the IDF for these terms is set to 0.  This feature
 can be used by passing the `minDocFreq` value to the IDF constructor.
 

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index fe42896..2c8a8f2 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Frequent Pattern Mining - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Frequent Pattern Mining
+title: Frequent Pattern Mining - spark.mllib
+displayTitle: Frequent Pattern Mining - spark.mllib
 ---
 
 Mining frequent items, itemsets, subsequences, or other substructures is usually among the
@@ -9,7 +9,7 @@ first steps to analyze a large-scale dataset, which has been an active research
 data mining for years.
 We refer users to Wikipedia's [association rule learning](http://en.wikipedia.org/wiki/Association_rule_learning)
 for more information.
-MLlib provides a parallel implementation of FP-growth,
+`spark.mllib` provides a parallel implementation of FP-growth,
 a popular algorithm to mining frequent itemsets.
 
 ## FP-growth
@@ -22,13 +22,13 @@ Different from [Apriori-like](http://en.wikipedia.org/wiki/Apriori_algorithm) al
 the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without generating candidate sets
 explicitly, which are usually expensive to generate.
 After the second step, the frequent itemsets can be extracted from the FP-tree.
-In MLlib, we implemented a parallel version of FP-growth called PFP,
+In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
 as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
 PFP distributes the work of growing FP-trees based on the suffices of transactions,
 and hence more scalable than a single-machine implementation.
 We refer users to the papers for more details.
 
-MLlib's FP-growth implementation takes the following (hyper-)parameters:
+`spark.mllib`'s FP-growth implementation takes the following (hyper-)parameters:
 
 * `minSupport`: the minimum support for an itemset to be identified as frequent.
   For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
@@ -126,7 +126,7 @@ PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer
 the reader to the referenced paper for formalizing the sequential
 pattern mining problem.
 
-MLlib's PrefixSpan implementation takes the following parameters:
+`spark.mllib`'s PrefixSpan implementation takes the following parameters:
 
 * `minSupport`: the minimum support required to be considered a frequent
   sequential pattern.

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-guide.md
----------------------------------------------------------------------
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 3bc2b78..7fef6b5 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -66,7 +66,7 @@ We list major functionality from both below, with links to detailed guides.
 
 # spark.ml: high-level APIs for ML pipelines
 
-* [Overview: estimators, transformers and pipelines](ml-intro.html)
+* [Overview: estimators, transformers and pipelines](ml-guide.html)
 * [Extracting, transforming and selecting features](ml-features.html)
 * [Classification and regression](ml-classification-regression.html)
 * [Clustering](ml-clustering.html)

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-isotonic-regression.md
----------------------------------------------------------------------
diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md
index 85f9226..8ede440 100644
--- a/docs/mllib-isotonic-regression.md
+++ b/docs/mllib-isotonic-regression.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Isotonic regression - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Regression
+title: Isotonic regression - spark.mllib
+displayTitle: Regression - spark.mllib
 ---
 
 ## Isotonic regression
@@ -23,7 +23,7 @@ Essentially isotonic regression is a
 [monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
 best fitting the original data points.
 
-MLlib supports a
+`spark.mllib` supports a
 [pool adjacent violators algorithm](http://doi.org/10.1198/TECH.2010.10111)
 which uses an approach to
 [parallelizing isotonic regression](http://doi.org/10.1007/978-3-642-99789-1_10).

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-linear-methods.md
----------------------------------------------------------------------
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 132f8c3..20b3561 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Linear Methods - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Linear Methods
+title: Linear Methods - spark.mllib
+displayTitle: Linear Methods - spark.mllib
 ---
 
 * Table of contents
@@ -41,7 +41,7 @@ the objective function is of the form
 Here the vectors `$\x_i\in\R^d$` are the training data examples, for `$1\le i\le n$`, and
 `$y_i\in\R$` are their corresponding labels, which we want to predict.
 We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T x$ and $y$.
-Several of MLlib's classification and regression algorithms fall into this category,
+Several of `spark.mllib`'s classification and regression algorithms fall into this category,
 and are discussed here.
 
 The objective function `$f$` has two parts:
@@ -55,7 +55,7 @@ training error) and minimizing model complexity (i.e., to avoid overfitting).
 ### Loss functions
 
 The following table summarizes the loss functions and their gradients or sub-gradients for the
-methods MLlib supports:
+methods `spark.mllib` supports:
 
 <table class="table">
   <thead>
@@ -83,7 +83,7 @@ methods MLlib supports:
 The purpose of the
 [regularizer](http://en.wikipedia.org/wiki/Regularization_(mathematics)) is to
 encourage simple models and avoid overfitting.  We support the following
-regularizers in MLlib:
+regularizers in `spark.mllib`:
 
 <table class="table">
   <thead>
@@ -115,7 +115,10 @@ especially when the number of training examples is small.
 
 ### Optimization
 
-Under the hood, linear methods use convex optimization methods to optimize the objective functions.  MLlib uses two methods, SGD and L-BFGS, described in the [optimization section](mllib-optimization.html).  Currently, most algorithm APIs support Stochastic Gradient Descent (SGD), and a few support L-BFGS. Refer to [this optimization section](mllib-optimization.html#Choosing-an-Optimization-Method) for guidelines on choosing between optimization methods.
+Under the hood, linear methods use convex optimization methods to optimize the objective functions.
+`spark.mllib` uses two methods, SGD and L-BFGS, described in the [optimization section](mllib-optimization.html).
+Currently, most algorithm APIs support Stochastic Gradient Descent (SGD), and a few support L-BFGS.
+Refer to [this optimization section](mllib-optimization.html#Choosing-an-Optimization-Method) for guidelines on choosing between optimization methods.
 
 ## Classification
 
@@ -126,16 +129,16 @@ The most common classification type is
 categories, usually named positive and negative.
 If there are more than two categories, it is called
 [multiclass classification](http://en.wikipedia.org/wiki/Multiclass_classification).
-MLlib supports two linear methods for classification: linear Support Vector Machines (SVMs)
+`spark.mllib` supports two linear methods for classification: linear Support Vector Machines (SVMs)
 and logistic regression.
 Linear SVMs supports only binary classification, while logistic regression supports both binary and
 multiclass classification problems.
-For both methods, MLlib supports L1 and L2 regularized variants.
+For both methods, `spark.mllib` supports L1 and L2 regularized variants.
 The training data set is represented by an RDD of [LabeledPoint](mllib-data-types.html) in MLlib,
 where labels are class indices starting from zero: $0, 1, 2, \ldots$.
 Note that, in the mathematical formulation in this guide, a binary label $y$ is denoted as either
 $+1$ (positive) or $-1$ (negative), which is convenient for the formulation.
-*However*, the negative label is represented by $0$ in MLlib instead of $-1$, to be consistent with
+*However*, the negative label is represented by $0$ in `spark.mllib` instead of $-1$, to be consistent with
 multiclass labeling.
 
 ### Linear Support Vector Machines (SVMs)
@@ -207,7 +210,7 @@ val sameModel = SVMModel.load(sc, "myModelPath")
 The `SVMWithSGD.train()` method by default performs L2 regularization with the
 regularization parameter set to 1.0. If we want to configure this algorithm, we
 can customize `SVMWithSGD` further by creating a new object directly and
-calling setter methods. All other MLlib algorithms support customization in
+calling setter methods. All other `spark.mllib` algorithms support customization in
 this way as well. For example, the following code produces an L1 regularized
 variant of SVMs with regularization parameter set to 0.1, and runs the training
 algorithm for 200 iterations.
@@ -293,7 +296,7 @@ public class SVMClassifier {
 The `SVMWithSGD.train()` method by default performs L2 regularization with the
 regularization parameter set to 1.0. If we want to configure this algorithm, we
 can customize `SVMWithSGD` further by creating a new object directly and
-calling setter methods. All other MLlib algorithms support customization in
+calling setter methods. All other `spark.mllib` algorithms support customization in
 this way as well. For example, the following code produces an L1 regularized
 variant of SVMs with regularization parameter set to 0.1, and runs the training
 algorithm for 200 iterations.
@@ -375,7 +378,7 @@ Binary logistic regression can be generalized into
 train and predict multiclass classification problems.
 For example, for $K$ possible outcomes, one of the outcomes can be chosen as a "pivot", and the
 other $K - 1$ outcomes can be separately regressed against the pivot outcome.
-In MLlib, the first class $0$ is chosen as the "pivot" class.
+In `spark.mllib`, the first class $0$ is chosen as the "pivot" class.
 See Section 4.4 of
 [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
 references.
@@ -726,7 +729,7 @@ a dependency.
 ###Streaming linear regression
 
 When data arrive in a streaming fashion, it is useful to fit regression models online,
-updating the parameters of the model as new data arrives. MLlib currently supports
+updating the parameters of the model as new data arrives. `spark.mllib` currently supports
 streaming linear regression using ordinary least squares. The fitting is similar
 to that performed offline, except fitting occurs on each batch of data, so that
 the model continually updates to reflect the data from the stream.
@@ -852,7 +855,7 @@ will get better!
 
 # Implementation (developer)
 
-Behind the scene, MLlib implements a simple distributed version of stochastic gradient descent
+Behind the scene, `spark.mllib` implements a simple distributed version of stochastic gradient descent
 (SGD), building on the underlying gradient descent primitive (as described in the <a
 href="mllib-optimization.html">optimization</a> section).  All provided algorithms take as input a
 regularization parameter (`regParam`) along with various parameters associated with stochastic

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-migration-guides.md
----------------------------------------------------------------------
diff --git a/docs/mllib-migration-guides.md b/docs/mllib-migration-guides.md
index 774b85d..73e4fdd 100644
--- a/docs/mllib-migration-guides.md
+++ b/docs/mllib-migration-guides.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Old Migration Guides - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Old Migration Guides
+title: Old Migration Guides - spark.mllib
+displayTitle: Old Migration Guides - spark.mllib
 description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
 ---
 

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-naive-bayes.md
----------------------------------------------------------------------
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index 60ac6c7..d0d594a 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Naive Bayes - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Naive Bayes
+title: Naive Bayes - spark.mllib
+displayTitle: Naive Bayes - spark.mllib
 ---
 
 [Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a simple
@@ -12,7 +12,7 @@ distribution of each feature given label, and then it applies Bayes' theorem to
 compute the conditional probability distribution of label given an observation
 and use it for prediction.
 
-MLlib supports [multinomial naive
+`spark.mllib` supports [multinomial naive
 Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
 and [Bernoulli naive Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
 These models are typically used for [document classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html).

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-optimization.md
----------------------------------------------------------------------
diff --git a/docs/mllib-optimization.md b/docs/mllib-optimization.md
index ad7bcd9..f90b66f 100644
--- a/docs/mllib-optimization.md
+++ b/docs/mllib-optimization.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Optimization - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Optimization
+title: Optimization - spark.mllib
+displayTitle: Optimization - spark.mllib
 ---
 
 * Table of contents
@@ -87,7 +87,7 @@ in the `$t$`-th iteration, with the input parameter `$s=$ stepSize`. Note that s
 step-size for SGD methods can often be delicate in practice and is a topic of active research.
 
 **Gradients.**
-A table of (sub)gradients of the machine learning methods implemented in MLlib, is available in
+A table of (sub)gradients of the machine learning methods implemented in `spark.mllib`, is available in
 the <a href="mllib-classification-regression.html">classification and regression</a> section.
 
 
@@ -140,7 +140,7 @@ other first-order optimization.
 
 ### Choosing an Optimization Method
 
-[Linear methods](mllib-linear-methods.html) use optimization internally, and some linear methods in MLlib support both SGD and L-BFGS.
+[Linear methods](mllib-linear-methods.html) use optimization internally, and some linear methods in `spark.mllib` support both SGD and L-BFGS.
 Different optimization methods can have different convergence guarantees depending on the properties of the objective function, and we cannot cover the literature here.
 In general, when L-BFGS is available, we recommend using it instead of SGD since L-BFGS tends to converge faster (in fewer iterations).
 

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-pmml-model-export.md
----------------------------------------------------------------------
diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md
index 6152871..b532ad9 100644
--- a/docs/mllib-pmml-model-export.md
+++ b/docs/mllib-pmml-model-export.md
@@ -1,21 +1,21 @@
 ---
 layout: global
-title: PMML model export - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - PMML model export
+title: PMML model export - spark.mllib
+displayTitle: PMML model export - spark.mllib
 ---
 
 * Table of contents
 {:toc}
 
-## MLlib supported models
+## `spark.mllib` supported models
 
-MLlib supports model export to Predictive Model Markup Language ([PMML](http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language)).
+`spark.mllib` supports model export to Predictive Model Markup Language ([PMML](http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language)).
 
-The table below outlines the MLlib models that can be exported to PMML and their equivalent PMML model.
+The table below outlines the `spark.mllib` models that can be exported to PMML and their equivalent PMML model.
 
 <table class="table">
   <thead>
-    <tr><th>MLlib model</th><th>PMML model</th></tr>
+    <tr><th>`spark.mllib` model</th><th>PMML model</th></tr>
   </thead>
   <tbody>
     <tr>

http://git-wip-us.apache.org/repos/asf/spark/blob/2ecbe02d/docs/mllib-statistics.md
----------------------------------------------------------------------
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index de209f6..652d215 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Basic Statistics - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Basic Statistics 
+title: Basic Statistics - spark.mllib
+displayTitle: Basic Statistics - spark.mllib
 ---
 
 * Table of contents
@@ -112,7 +112,7 @@ print(summary.numNonzeros())
 
 ## Correlations
 
-Calculating the correlation between two series of data is a common operation in Statistics. In MLlib
+Calculating the correlation between two series of data is a common operation in Statistics. In `spark.mllib`
 we provide the flexibility to calculate pairwise correlations among many series. The supported 
 correlation methods are currently Pearson's and Spearman's correlation.
  
@@ -209,7 +209,7 @@ print(Statistics.corr(data, method="pearson"))
 
 ## Stratified sampling
 
-Unlike the other statistics functions, which reside in MLlib, stratified sampling methods,
+Unlike the other statistics functions, which reside in `spark.mllib`, stratified sampling methods,
 `sampleByKey` and `sampleByKeyExact`, can be performed on RDD's of key-value pairs. For stratified
 sampling, the keys can be thought of as a label and the value as a specific attribute. For example 
 the key can be man or woman, or document ids, and the respective values can be the list of ages 
@@ -294,12 +294,12 @@ approxSample = data.sampleByKey(False, fractions);
 ## Hypothesis testing
 
 Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically 
-significant, whether this result occurred by chance or not. MLlib currently supports Pearson's 
+significant, whether this result occurred by chance or not. `spark.mllib` currently supports Pearson's 
 chi-squared ( $\chi^2$) tests for goodness of fit and independence. The input data types determine
 whether the goodness of fit or the independence test is conducted. The goodness of fit test requires 
 an input type of `Vector`, whereas the independence test requires a `Matrix` as input.
 
-MLlib also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared 
+`spark.mllib` also supports the input type `RDD[LabeledPoint]` to enable feature selection via chi-squared 
 independence tests.
 
 <div class="codetabs">
@@ -438,7 +438,7 @@ for i, result in enumerate(featureTestResults):
 
 </div>
 
-Additionally, MLlib provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test
+Additionally, `spark.mllib` provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test
 for equality of probability distributions. By providing the name of a theoretical distribution
 (currently solely supported for the normal distribution) and its parameters, or a function to 
 calculate the cumulative distribution according to a given theoretical distribution, the user can
@@ -522,7 +522,7 @@ print(testResult) # summary of the test including the p-value, test statistic,
 </div>
 
 ### Streaming Significance Testing
-MLlib provides online implementations of some tests to support use cases
+`spark.mllib` provides online implementations of some tests to support use cases
 like A/B testing. These tests may be performed on a Spark Streaming
 `DStream[(Boolean,Double)]` where the first element of each tuple
 indicates control group (`false`) or treatment group (`true`) and the
@@ -550,7 +550,7 @@ provides streaming hypothesis testing.
 ## Random data generation
 
 Random data generation is useful for randomized algorithms, prototyping, and performance testing.
-MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
+`spark.mllib` supports generating random RDDs with i.i.d. values drawn from a given distribution:
 uniform, standard normal, or Poisson.
 
 <div class="codetabs">

