Repository: spark
Updated Branches:
  refs/heads/master 47a2940da -> 5e203505f


[SPARK-15394][ML][DOCS] User guide typos and grammar audit

## What changes were proposed in this pull request?

Correct some typos and incorrectly worded sentences.

## How was this patch tested?

Doc changes only.

Note that many of these changes were identified by whomfire01.

Author: sethah <seth.hendrickso...@gmail.com>

Closes #13180 from sethah/ml_guide_audit.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5e203505
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5e203505
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5e203505

Branch: refs/heads/master
Commit: 5e203505f1a092e5849ebd01d9ff9e4fc6cdc34a
Parents: 47a2940
Author: sethah <seth.hendrickso...@gmail.com>
Authored: Thu May 19 23:29:37 2016 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Thu May 19 23:29:37 2016 -0700

----------------------------------------------------------------------
 docs/ml-classification-regression.md | 28 +++++++++---------
 docs/ml-clustering.md                |  2 +-
 docs/ml-collaborative-filtering.md   |  6 ++--
 docs/ml-features.md                  | 47 +++++++++++++++----------------
 docs/ml-guide.md                     |  8 +++---
 5 files changed, 45 insertions(+), 46 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/5e203505/docs/ml-classification-regression.md
----------------------------------------------------------------------
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index f6a6937..f1a21f4 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -236,9 +236,9 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 Multilayer perceptron classifier (MLPC) is a classifier based on the 
[feedforward artificial neural 
network](https://en.wikipedia.org/wiki/Feedforward_neural_network). 
 MLPC consists of multiple layers of nodes. 
-Each layer is fully connected to the next layer in the network. Nodes in the 
input layer represent the input data. All other nodes maps inputs to the 
outputs 
-by performing linear combination of the inputs with the node's weights `$\wv$` 
and bias `$\bv$` and applying an activation function. 
-It can be written in matrix form for MLPC with `$K+1$` layers as follows:
+Each layer is fully connected to the next layer in the network. Nodes in the 
input layer represent the input data. All other nodes map inputs to outputs 
+by a linear combination of the inputs with the node's weights `$\wv$` and bias 
`$\bv$` and applying an activation function. 
+This can be written in matrix form for MLPC with `$K+1$` layers as follows:
 `\[
 \mathrm{y}(\x) = \mathrm{f_K}(...\mathrm{f_2}(\wv_2^T\mathrm{f_1}(\wv_1^T 
\x+b_1)+b_2)...+b_K)
 \]`
@@ -252,7 +252,7 @@ Nodes in the output layer use softmax function:
 \]`
 The number of nodes `$N$` in the output layer corresponds to the number of 
classes. 
 
-MLPC employs backpropagation for learning the model. We use logistic loss 
function for optimization and L-BFGS as optimization routine.
+MLPC employs backpropagation for learning the model. We use the logistic loss 
function for optimization and L-BFGS as an optimization routine.
 
 **Example**
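
(The guide's bundled example files are outside this diff; the following is a minimal, hedged Scala sketch of the MLPC API described above, assuming a Spark 2.x `SparkSession` and a placeholder libsvm-format multiclass dataset with 4 features and 3 classes.)

~~~
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MLPCSketch").getOrCreate()
// Placeholder path: any libsvm-format multiclass dataset with 4 features and 3 classes.
val data = spark.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")
val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 1234L)

// Layer sizes: 4 inputs, two hidden layers, 3 output nodes (one per class, softmax).
val layers = Array[Int](4, 5, 4, 3)
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(train)
model.transform(test).select("prediction", "label").show(5)
~~~
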
 
@@ -311,9 +311,9 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 
 ## Naive Bayes
 
-[Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) are a 
family of simple 
+[Naive Bayes classifiers](http://en.wikipedia.org/wiki/Naive_Bayes_classifier) 
are a family of simple 
 probabilistic classifiers based on applying Bayes' theorem with strong (naive) 
independence 
-assumptions between the features. The spark.ml implementation currently 
supports both [multinomial
+assumptions between the features. The `spark.ml` implementation currently 
supports both [multinomial
 naive 
Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
 and [Bernoulli naive 
Bayes](http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html).
 More information can be found in the section on [Naive Bayes in 
MLlib](mllib-naive-bayes.html#naive-bayes-sparkmllib).
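
As an aside for readers of this section, here is a minimal sketch of the `spark.ml` Naive Bayes API described above (it assumes a Spark 2.x `SparkSession` and a placeholder libsvm-format dataset; `modelType` switches between the multinomial and Bernoulli variants):

~~~
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NaiveBayesSketch").getOrCreate()
// Placeholder path: any libsvm-format dataset with non-negative features.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

// modelType is "multinomial" (default) or "bernoulli"; smoothing is the Laplace parameter.
val nb = new NaiveBayes()
  .setModelType("multinomial")
  .setSmoothing(1.0)

val model = nb.fit(train)
model.transform(test).select("label", "probability", "prediction").show(5)
~~~
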
@@ -482,11 +482,11 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.
 
 In `spark.ml`, we implement the [Accelerated failure time 
(AFT)](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) 
 model which is a parametric survival regression model for censored data. 
-It describes a model for the log of survival time, so it's often called 
-log-linear model for survival analysis. Different from 
+It describes a model for the log of survival time, so it's often called a 
+log-linear model for survival analysis. Different from a
 [Proportional 
hazards](https://en.wikipedia.org/wiki/Proportional_hazards_model) model
-designed for the same purpose, the AFT model is more easily to parallelize 
-because each instance contribute to the objective function independently.
+designed for the same purpose, the AFT model is easier to parallelize 
+because each instance contributes to the objective function independently.
 
 Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of 
 subjects i = 1, ..., n, with possible right-censoring, 
@@ -501,10 +501,10 @@ assumes the form:
 
\iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+\delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
 \]`
 Where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
-and $f_{0}(\epsilon_{i})$ is corresponding density function.
+and $f_{0}(\epsilon_{i})$ is the corresponding density function.
 
 The most commonly used AFT model is based on the Weibull distribution of the 
survival time. 
-The Weibull distribution for lifetime corresponding to extreme value 
distribution for 
+The Weibull distribution for lifetime corresponds to the extreme value 
distribution for the 
 log of the lifetime, and the $S_{0}(\epsilon)$ function is:
 `\[   
 S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
@@ -513,7 +513,7 @@ the $f_{0}(\epsilon_{i})$ function is:
 `\[
 f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
 \]`
-The log-likelihood function for AFT model with Weibull distribution of 
lifetime is:
+The log-likelihood function for AFT model with a Weibull distribution of 
lifetime is:
 `\[
 \iota(\beta,\sigma)= 
-\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
 \]`
@@ -529,7 +529,7 @@ The gradient functions for $\beta$ and $\log\sigma$ respectively are:
 
 The AFT model can be formulated as a convex optimization problem, 
 i.e. the task of finding a minimizer of a convex function 
$-\iota(\beta,\sigma)$ 
-that depends coefficients vector $\beta$ and the log of scale parameter 
$\log\sigma$.
+that depends on the coefficients vector $\beta$ and the log of scale parameter 
$\log\sigma$.
 The optimization algorithm underlying the implementation is L-BFGS.
 The implementation matches the result from R's survival function 
 
[survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html)
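
For illustration, a small sketch of `AFTSurvivalRegression` on a hand-made censored dataset (it assumes the Spark 2.x `org.apache.spark.ml.linalg` vector package; `censor == 1.0` marks an observed event, `0.0` a right-censored one):

~~~
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AFTSketch").getOrCreate()
// label is the survival time; censor == 1.0 means the event was observed.
val training = spark.createDataFrame(Seq(
  (1.218, 1.0, Vectors.dense(1.560, -0.605)),
  (2.949, 0.0, Vectors.dense(0.346, 2.158)),
  (3.627, 0.0, Vectors.dense(1.380, 0.231)),
  (0.273, 1.0, Vectors.dense(0.520, 1.151)),
  (4.199, 0.0, Vectors.dense(0.795, -0.226))
)).toDF("label", "censor", "features")

val aft = new AFTSurvivalRegression()
  .setQuantileProbabilities(Array(0.3, 0.6))
  .setQuantilesCol("quantiles")

val model = aft.fit(training)
// Coefficients beta, intercept, and the scale parameter sigma.
println(s"coefficients: ${model.coefficients} intercept: ${model.intercept} scale: ${model.scale}")
model.transform(training).show(false)
~~~
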

http://git-wip-us.apache.org/repos/asf/spark/blob/5e203505/docs/ml-clustering.md
----------------------------------------------------------------------
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index 33e4b7b..8656eb4 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -89,7 +89,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.
 ## Latent Dirichlet allocation (LDA)
 
 `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and 
`OnlineLDAOptimizer`,
-and generates a `LDAModel` as the base models. Expert users may cast a 
`LDAModel` generated by
+and generates a `LDAModel` as the base model. Expert users may cast a 
`LDAModel` generated by
 `EMLDAOptimizer` to a `DistributedLDAModel` if needed.
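
A brief sketch of the `LDA` estimator and the expert-level cast mentioned above (assuming a Spark 2.x `SparkSession` and a placeholder libsvm-format file of token-count vectors):

~~~
import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LDASketch").getOrCreate()
// Placeholder path: libsvm-format rows of token-count vectors.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

// "em" selects EMLDAOptimizer, "online" selects OnlineLDAOptimizer.
val lda = new LDA().setK(10).setMaxIter(10).setOptimizer("em")
val model = lda.fit(dataset)
model.describeTopics(3).show(false)

// With the EM optimizer, the returned LDAModel can be cast to DistributedLDAModel.
val distributedModel = model.asInstanceOf[DistributedLDAModel]
~~~
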
 
 <div class="codetabs">

http://git-wip-us.apache.org/repos/asf/spark/blob/5e203505/docs/ml-collaborative-filtering.md
----------------------------------------------------------------------
diff --git a/docs/ml-collaborative-filtering.md b/docs/ml-collaborative-filtering.md
index 4514a35..bd3d527 100644
--- a/docs/ml-collaborative-filtering.md
+++ b/docs/ml-collaborative-filtering.md
@@ -60,7 +60,7 @@ best parameter learned from a sampled subset to the full dataset and expect simi
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-In the following example, we load rating data from the
+In the following example, we load ratings data from the
 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
 consisting of a user, a movie, a rating and a timestamp.
 We then train an ALS model which assumes, by default, that the ratings are
@@ -91,7 +91,7 @@ val als = new ALS()
 
 <div data-lang="java" markdown="1">
 
-In the following example, we load rating data from the
+In the following example, we load ratings data from the
 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
 consisting of a user, a movie, a rating and a timestamp.
 We then train an ALS model which assumes, by default, that the ratings are
@@ -122,7 +122,7 @@ ALS als = new ALS()
 
 <div data-lang="python" markdown="1">
 
-In the following example, we load rating data from the
+In the following example, we load ratings data from the
 [MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
 consisting of a user, a movie, a rating and a timestamp.
 We then train an ALS model which assumes, by default, that the ratings are
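
For readers who want a self-contained starting point, here is a rough Scala sketch of the pattern these examples follow (the MovieLens-style path and `::`-delimited format are placeholders; it assumes a Spark 2.x `SparkSession`):

~~~
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

val spark = SparkSession.builder().appName("ALSSketch").getOrCreate()
import spark.implicits._

// Placeholder path and format: "user::movie::rating::timestamp" lines, MovieLens style.
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt")
  .map { line =>
    val fields = line.split("::")
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
  }.toDF()

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))

// Ratings are treated as explicit by default; call setImplicitPrefs(true) for implicit feedback.
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")

val model = als.fit(training)
val predictions = model.transform(test)
~~~
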

http://git-wip-us.apache.org/repos/asf/spark/blob/5e203505/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index c44ace9..3db24a3 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -26,7 +26,7 @@ to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and th
 Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in 
document `$d$`, while 
 document frequency `$DF(t, D)$` is the number of documents that contains term 
`$t$`. If we only use 
 term frequency to measure the importance, it is very easy to over-emphasize 
terms that appear very 
-often but carry little information about the document, e.g., "a", "the", and 
"of". If a term appears 
+often but carry little information about the document, e.g. "a", "the", and 
"of". If a term appears 
 very often across the corpus, it means it doesn't carry special information 
about a particular document.
 Inverse document frequency is a numerical measure of how much information a 
term provides:
 `\[
@@ -50,7 +50,7 @@ A raw feature is mapped into an index (term) by applying a hash function. Then t
 are calculated based on the mapped indices. This approach avoids the need to 
compute a global 
 term-to-index map, which can be expensive for a large corpus, but it suffers 
from potential hash 
 collisions, where different raw features may become the same term after 
hashing. To reduce the 
-chance of collision, we can increase the target feature dimension, i.e., the 
number of buckets 
+chance of collision, we can increase the target feature dimension, i.e. the 
number of buckets 
 of the hash table. Since a simple modulo is used to transform the hash 
function to a column index, 
 it is advisable to use a power of two as the feature dimension, otherwise the 
features will 
 not be mapped evenly to the columns. The default feature dimension is `$2^{18} 
= 262,144$`. 
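
To make the TF-IDF flow above concrete, a small sketch combining `Tokenizer`, `HashingTF`, and `IDF` (it assumes a Spark 2.x `SparkSession`; the toy sentences are made up):

~~~
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TfIdfSketch").getOrCreate()
val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Keep numFeatures a power of two so hashed terms map evenly onto columns.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 14)
val featurizedData = hashingTF.transform(wordsData)

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
idfModel.transform(featurizedData).select("label", "features").show(false)
~~~
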
@@ -104,7 +104,7 @@ the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for mor
 `Word2Vec` is an `Estimator` which takes sequences of words representing 
documents and trains a
 `Word2VecModel`. The model maps each word to a unique fixed-size vector. The 
`Word2VecModel`
 transforms each document into a vector using the average of all words in the 
document; this vector
-can then be used for as features for prediction, document similarity 
calculations, etc.
+can then be used as features for prediction, document similarity calculations, 
etc.
 Please refer to the [MLlib user guide on 
Word2Vec](mllib-feature-extraction.html#word2vec) for more
 details.
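
A minimal `Word2Vec` sketch of the document-averaging behavior described above (assuming a Spark 2.x `SparkSession`; the toy documents are made up):

~~~
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Word2VecSketch").getOrCreate()
// Each row is one document, already split into words.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)

val model = word2Vec.fit(documentDF)
// Each document vector is the average of its word vectors.
model.transform(documentDF).show(false)
~~~
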
 
@@ -140,12 +140,12 @@ for more details on the API.
 
 `CountVectorizer` and `CountVectorizerModel` aim to help convert a collection 
of text documents
  to vectors of token counts. When an a-priori dictionary is not available, 
`CountVectorizer` can
- be used as an `Estimator` to extract the vocabulary and generates a 
`CountVectorizerModel`. The
+ be used as an `Estimator` to extract the vocabulary, and generates a 
`CountVectorizerModel`. The
  model produces sparse representations for the documents over the vocabulary, 
which can then be
  passed to other algorithms like LDA.
 
  During the fitting process, `CountVectorizer` will select the top `vocabSize` 
words ordered by
- term frequency across the corpus. An optional parameter "minDF" also affect 
the fitting process
+ term frequency across the corpus. An optional parameter "minDF" also affects 
the fitting process
  by specifying the minimum number (or fraction if < 1.0) of documents a term 
must appear in to be
  included in the vocabulary.
 
@@ -161,8 +161,8 @@ Assume that we have the following DataFrame with columns `id` and `texts`:
 ~~~~
 
 each row in `texts` is a document of type Array[String].
-Invoking fit of `CountVectorizer` produces a `CountVectorizerModel` with 
vocabulary (a, b, c),
-then the output column "vector" after transformation contains:
+Invoking fit of `CountVectorizer` produces a `CountVectorizerModel` with 
vocabulary (a, b, c).
+Then the output column "vector" after transformation contains:
 
 ~~~~
  id | texts                           | vector
@@ -171,7 +171,7 @@ then the output column "vector" after transformation contains:
  1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])
 ~~~~
 
-each vector represents the token counts of the document over the vocabulary.
+Each vector represents the token counts of the document over the vocabulary.
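
A short sketch reproducing this behavior with the `CountVectorizer` API (assuming a Spark 2.x `SparkSession`; the parameter values are illustrative):

~~~
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CountVectorizerSketch").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "texts")

// Fitting selects at most vocabSize terms, each appearing in at least minDF documents.
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("texts")
  .setOutputCol("vector")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)
cvModel.transform(df).show(false)

// With an a-priori vocabulary, a CountVectorizerModel can be built directly instead.
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("texts")
  .setOutputCol("vector")
~~~
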
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -477,8 +477,7 @@ for more details on the API.
 ## StringIndexer
 
 `StringIndexer` encodes a string column of labels to a column of label indices.
-The indices are in `[0, numLabels)`, ordered by label frequencies.
-So the most frequent label gets index `0`.
+The indices are in `[0, numLabels)`, ordered by label frequencies, so the most 
frequent label gets index `0`.
 If the input column is numeric, we cast it to string and index the string
 values. When downstream pipeline components such as `Estimator` or
 `Transformer` make use of this string-indexed label, you must set the input
@@ -585,7 +584,7 @@ for more details on the API.
 ## IndexToString
 
 Symmetrically to `StringIndexer`, `IndexToString` maps a column of label 
indices
-back to a column containing the original labels as strings. The common use case
+back to a column containing the original labels as strings. A common use case
 is to produce indices from labels with `StringIndexer`, train a model with 
those
 indices and retrieve the original labels from the column of predicted indices
 with `IndexToString`. However, you are free to supply your own labels.
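
A compact sketch of the round trip described here, `StringIndexer` followed by `IndexToString` (assuming a Spark 2.x `SparkSession`; the toy `category` column is made up):

~~~
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("IndexerSketch").getOrCreate()
val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

// "a" is the most frequent label, so it receives index 0.0.
val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

// IndexToString reads the labels from the column metadata written by StringIndexer.
val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
converter.transform(indexed).select("id", "categoryIndex", "originalCategory").show()
~~~
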
@@ -652,7 +651,7 @@ for more details on the API.
 
 ## OneHotEncoder
 
-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of 
label indices to a column of binary vectors, with at most a single one-value. 
This encoding allows algorithms which expect continuous features, such as 
Logistic Regression, to use categorical features
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of 
label indices to a column of binary vectors, with at most a single one-value. 
This encoding allows algorithms which expect continuous features, such as 
Logistic Regression, to use categorical features.
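
A minimal sketch of one-hot encoding indexed labels (assuming a Spark 2.x `SparkSession`; the `StringIndexer` step is included only to produce label indices):

~~~
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OneHotSketch").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")
)).toDF("id", "category")

// Index the strings first, then one-hot encode the resulting label indices.
val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
  .transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
encoder.transform(indexed).show()
~~~
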
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
@@ -888,7 +887,7 @@ for more details on the API.
 
 * `splits`: Parameter for mapping continuous features into buckets. With n+1 
splits, there are n buckets. A bucket defined by splits x,y holds values in the 
range [x,y) except the last bucket, which also includes y. Splits should be 
strictly increasing. Values at -inf, inf must be explicitly provided to cover 
all Double values; Otherwise, values outside the splits specified will be 
treated as errors. Two examples of `splits` are `Array(Double.NegativeInfinity, 
0.0, 1.0, Double.PositiveInfinity)` and `Array(0.0, 1.0, 2.0)`.
 
-Note that if you have no idea of the upper bound and lower bound of the 
targeted column, you would better add the `Double.NegativeInfinity` and 
`Double.PositiveInfinity` as the bounds of your splits to prevent a potential 
out of Bucketizer bounds exception.
+Note that if you have no idea of the upper and lower bounds of the targeted 
column, you should add `Double.NegativeInfinity` and `Double.PositiveInfinity` 
as the bounds of your splits to prevent a potential out of Bucketizer bounds 
exception.
 
 Note also that the splits that you provided have to be in strictly increasing 
order, i.e. `s0 < s1 < s2 < ... < sn`.
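
A small sketch of `Bucketizer` with open-ended outer splits, as recommended above (assuming a Spark 2.x `SparkSession`; the sample values are made up):

~~~
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BucketizerSketch").getOrCreate()

// Open-ended outer splits cover values outside the known range of the column.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val data = Array(-999.9, -0.5, -0.3, 0.0, 0.2, 999.9)
val dataFrame = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)
bucketizer.transform(dataFrame).show()
~~~
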
 
@@ -976,7 +975,7 @@ for more details on the API.
 Currently we only support SQL syntax like `"SELECT ... FROM __THIS__ ..."`
 where `"__THIS__"` represents the underlying table of the input dataset.
 The select clause specifies the fields, constants, and expressions to display 
in
-the output, it can be any select clause that Spark SQL supports. Users can also
+the output, and can be any select clause that Spark SQL supports. Users can 
also
 use Spark SQL built-in function and UDFs to operate on these selected columns.
 For example, `SQLTransformer` supports statements like:
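
(The guide's statement list itself falls outside this hunk; as a hedged, self-contained sketch of the same idea, assuming a Spark 2.x `SparkSession` and made-up columns:)

~~~
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SQLTransformerSketch").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, 1.0, 3.0),
  (2, 2.0, 5.0)
)).toDF("id", "v1", "v2")

// "__THIS__" refers to the underlying table of the input DataFrame.
val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
~~~
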
 
@@ -1121,7 +1120,7 @@ Assume that we have a DataFrame with the columns `id`, `hour`:
 ~~~
 
 `hour` is a continuous feature with `Double` type. We want to turn the 
continuous feature into
-categorical one. Given `numBuckets = 3`, we should get the following DataFrame:
+a categorical one. Given `numBuckets = 3`, we should get the following 
DataFrame:
 
 ~~~
  id | hour | result
@@ -1171,19 +1170,19 @@ for more details on the API.
 `VectorSlicer` is a transformer that takes a feature vector and outputs a new 
feature vector with a
 sub-array of the original features. It is useful for extracting features from 
a vector column.
 
-`VectorSlicer` accepts a vector column with a specified indices, then outputs 
a new vector column
+`VectorSlicer` accepts a vector column with specified indices, then outputs a 
new vector column
 whose values are selected via those indices. There are two types of indices,
 
- 1. Integer indices that represents the indices into the vector, 
`setIndices()`;
+ 1. Integer indices that represent the indices into the vector, `setIndices()`.
 
- 2. String indices that represents the names of features into the vector, 
`setNames()`.
+ 2. String indices that represent the names of features into the vector, 
`setNames()`.
  *This requires the vector column to have an `AttributeGroup` since the 
implementation matches on
  the name field of an `Attribute`.*
 
 Specification by integer and string are both acceptable. Moreover, you can use 
integer index and
 string name simultaneously. At least one feature must be selected. Duplicate 
features are not
 allowed, so there can be no overlap between selected indices and names. Note 
that if names of
-features are selected, an exception will be threw out when encountering with 
empty input attributes.
+features are selected, an exception will be thrown if empty input attributes 
are encountered.
 
 The output vector will order features with the selected indices first (in the 
order given),
 followed by the selected names (in the order given).
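
A sketch of slicing by index and by name together (assuming a Spark 2.x `SparkSession`; the `AttributeGroup` supplies the `f1`/`f2`/`f3` names used for name-based selection):

~~~
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().appName("VectorSlicerSketch").getOrCreate()

// Attach attribute names ("f1", "f2", "f3") so selection by name is possible.
val attrs = Array("f1", "f2", "f3").map(NumericAttribute.defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val rows = Seq(Row(Vectors.dense(-2.0, 2.3, 0.0)), Row(Vectors.dense(0.0, 10.0, 0.5)))
val dataset = spark.createDataFrame(
  spark.sparkContext.parallelize(rows), StructType(Array(attrGroup.toStructField())))

// Indices and names may be mixed; selected indices come first, then selected names.
val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))
slicer.transform(dataset).show(false)
~~~
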
@@ -1198,8 +1197,8 @@ Suppose that we have a DataFrame with the column `userFeatures`:
  [0.0, 10.0, 0.5]
 ~~~
 
-`userFeatures` is a vector column that contains three user features. Assuming 
that the first column
-of `userFeatures` are all zeros, so we want to remove it and only the last two 
columns are selected.
+`userFeatures` is a vector column that contains three user features. Assume 
that the first column
+of `userFeatures` are all zeros, so we want to remove it and select only the 
last two columns.
 The `VectorSlicer` selects the last two elements with `setIndices(1, 2)` then 
produces a new vector
 column named `features`:
 
@@ -1209,7 +1208,7 @@ column named `features`:
  [0.0, 10.0, 0.5] | [10.0, 0.5]
 ~~~
 
-Suppose also that we have a potential input attributes for the `userFeatures`, 
i.e.
+Suppose also that we have potential input attributes for the `userFeatures`, 
i.e.
 `["f1", "f2", "f3"]`, then we can use `setNames("f2", "f3")` to select them.
 
 ~~~
@@ -1337,8 +1336,8 @@ id | features              | clicked
  9 | [1.0, 0.0, 15.0, 0.1] | 0.0
 ~~~
 
-If we use `ChiSqSelector` with a `numTopFeatures = 1`, then according to our 
label `clicked` the
-last column in our `features` chosen as the most useful feature:
+If we use `ChiSqSelector` with `numTopFeatures = 1`, then according to our 
label `clicked` the
+last column in our `features` is chosen as the most useful feature:
 
 ~~~
 id | features              | clicked | selectedFeatures

http://git-wip-us.apache.org/repos/asf/spark/blob/5e203505/docs/ml-guide.md
----------------------------------------------------------------------
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index cc353df..dae86d8 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -47,7 +47,7 @@ mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
   E.g., a `DataFrame` could have different columns storing text, feature 
vectors, true labels, and predictions.
 
 * **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an 
algorithm which can transform one `DataFrame` into another `DataFrame`.
-E.g., an ML model is a `Transformer` which transforms `DataFrame` with 
features into a `DataFrame` with predictions.
+E.g., an ML model is a `Transformer` which transforms a `DataFrame` with 
features into a `DataFrame` with predictions.
 
 * **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm 
which can be fit on a `DataFrame` to produce a `Transformer`.
 E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and 
produces a model.
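
A tiny sketch of the `Estimator`/`Transformer` contract using `LogisticRegression` (assuming a Spark 2.x `SparkSession`; the training rows are made up):

~~~
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EstimatorTransformerSketch").getOrCreate()
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
)).toDF("label", "features")

// LogisticRegression is an Estimator: fit() learns from a DataFrame...
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val model = lr.fit(training)

// ...and the fitted model is a Transformer: transform() appends a prediction column.
model.transform(training).select("features", "label", "prediction").show()
~~~
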
@@ -292,13 +292,13 @@ However, it is also a well-established method for choosing parameters which is m
 
 ## Example: model selection via train validation split
 In addition to  `CrossValidator` Spark also offers `TrainValidationSplit` for 
hyper-parameter tuning.
-`TrainValidationSplit` only evaluates each combination of parameters once as 
opposed to k times in
- case of `CrossValidator`. It is therefore less expensive,
+`TrainValidationSplit` only evaluates each combination of parameters once, as 
opposed to k times in
+ the case of `CrossValidator`. It is therefore less expensive,
  but will not produce as reliable results when the training dataset is not 
sufficiently large.
 
 `TrainValidationSplit` takes an `Estimator`, a set of `ParamMap`s provided in 
the `estimatorParamMaps` parameter,
 and an `Evaluator`.
-It begins by splitting the dataset into two parts using `trainRatio` parameter
+It begins by splitting the dataset into two parts using the `trainRatio` 
parameter
 which are used as separate training and test datasets. For example with 
`$trainRatio=0.75$` (default),
 `TrainValidationSplit` will generate a training and test dataset pair where 
75% of the data is used for training and 25% for validation.
 Similar to `CrossValidator`, `TrainValidationSplit` also iterates through the 
set of `ParamMap`s.
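
A condensed sketch of `TrainValidationSplit` over a small parameter grid (assuming a Spark 2.x `SparkSession` and a placeholder libsvm-format regression dataset):

~~~
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TrainValidationSplitSketch").getOrCreate()
// Placeholder path: any libsvm-format regression dataset.
val data = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
val Array(training, test) = data.randomSplit(Array(0.9, 0.1), seed = 12345L)

val lr = new LinearRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// Each ParamMap is evaluated once on a 75%/25% train/validation split.
val trainValidationSplit = new TrainValidationSplit()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setTrainRatio(0.75)

val model = trainValidationSplit.fit(training)
model.transform(test).select("features", "label", "prediction").show()
~~~
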

