Repository: spark
Updated Branches:
  refs/heads/branch-2.0 af37bdd3a -> 025b3e9f1


[SPARK-15182][ML] Copy MLlib doc to ML: ml.feature.tf, idf

## What changes were proposed in this pull request?

We should now begin copying algorithm details from the spark.mllib guide to 
spark.ml as needed, rather than just linking back to the corresponding 
algorithms in the spark.mllib user guide.

## How was this patch tested?

Manual review of the docs.

Author: Yuhao Yang <hhb...@gmail.com>
Author: Yuhao Yang <yuhao.y...@intel.com>

Closes #12957 from hhbyyh/tfidfdoc.

(cherry picked from commit 3308a862ba0983268c9d5acf9e2a7d2b62d3ec27)
Signed-off-by: Nick Pentreath <ni...@za.ibm.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/025b3e9f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/025b3e9f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/025b3e9f

Branch: refs/heads/branch-2.0
Commit: 025b3e9f17d511b1768282d9635145fa87378b5b
Parents: af37bdd
Author: Yuhao Yang <hhb...@gmail.com>
Authored: Tue May 17 20:44:19 2016 +0200
Committer: Nick Pentreath <ni...@za.ibm.com>
Committed: Tue May 17 20:44:34 2016 +0200

----------------------------------------------------------------------
 docs/ml-features.md              | 51 ++++++++++++++++++++++++++++-------
 docs/mllib-feature-extraction.md |  3 +++
 2 files changed, 45 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/025b3e9f/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index c79bcac..c44ace9 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -18,27 +18,60 @@ This section covers algorithms for working with features, roughly divided into t
 
 # Feature Extractors
 
-## TF-IDF (HashingTF and IDF)
-
-[Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step.  In Spark ML, TF-IDF is separate into two parts: TF (+hashing) and IDF.
+## TF-IDF
+
+[Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf)
+is a feature vectorization method widely used in text mining to reflect the importance of a term
+to a document in the corpus. Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
+Term frequency `$TF(t, d)$` is the number of times that term `$t$` appears in document `$d$`, while
+document frequency `$DF(t, D)$` is the number of documents that contain term `$t$`. If we only use
+term frequency to measure the importance, it is very easy to over-emphasize terms that appear very
+often but carry little information about the document, e.g., "a", "the", and "of". If a term appears
+very often across the corpus, it doesn't carry special information about a particular document.
+Inverse document frequency is a numerical measure of how much information a term provides:
+`\[
+IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
+\]`
+where `$|D|$` is the total number of documents in the corpus. Since the logarithm is used, if a term
+appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid
+dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
+`\[
+TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
+\]`
+There are several variants of the definition of term frequency and document frequency.
+In MLlib, we separate TF and IDF to make them flexible.
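For concreteness, a quick worked instance of these formulas (corpus size and counts chosen arbitrarily): in a corpus of `$|D| = 4$` documents, a term appearing in exactly one document has
`\[
IDF(t, D) = \log \frac{4 + 1}{1 + 1} = \log 2.5 \approx 0.916,
\]`
so a document where it occurs 3 times scores `$TFIDF(t, d, D) = 3 \cdot 0.916 \approx 2.75$`, while a term appearing in all 4 documents has `$IDF(t, D) = \log(5/5) = 0$` and is discounted entirely.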
 
 **TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.
 
 `HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
 fixed-length feature vectors.  In text processing, a "set of terms" might be a bag of words.
-The algorithm combines Term Frequency (TF) counts with the
-[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
+`HashingTF` utilizes the [hashing trick](http://en.wikipedia.org/wiki/Feature_hashing).
+A raw feature is mapped into an index (term) by applying a hash function. Then term frequencies
+are calculated based on the mapped indices. This approach avoids the need to compute a global
+term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
+collisions, where different raw features may become the same term after hashing. To reduce the
+chance of collision, we can increase the target feature dimension, i.e., the number of buckets
+of the hash table. Since a simple modulo of the hash value is used to compute the column index,
+it is advisable to use a power of two as the feature dimension; otherwise the features will
+not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
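As a minimal Scala sketch of the point about dimensions (the data, column names, and the `spark` session are illustrative assumptions, not part of this patch):

```scala
import org.apache.spark.ml.feature.HashingTF

// Illustrative toy data; assumes an active SparkSession named `spark`.
val df = spark.createDataFrame(Seq(
  (0, Seq("spark", "is", "fast")),
  (1, Seq("the", "the", "hashing", "trick"))
)).toDF("id", "words")

// Pick a power of two for the feature dimension so the modulo
// spreads hashed terms evenly across columns.
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 10)  // 1024 buckets rather than the default 2^18

hashingTF.transform(df).show(truncate = false)
```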
 
 `CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer
 ](ml-features.html#countvectorizer) for more details.
 
 **IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`.  The
-`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
-Intuitively, it down-weights columns which appear frequently in a corpus.
+`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and
+scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
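A minimal sketch of that `Estimator`/`Model` pattern, assuming `featurized` is a DataFrame holding the raw term-frequency vectors from the previous step (names are illustrative):

```scala
import org.apache.spark.ml.feature.IDF

// fit() passes over the corpus once to compute document frequencies,
// producing an IDFModel; transform() then rescales each column.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurized)
val rescaled = idfModel.transform(featurized)
```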
 
-Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
+**Note:** `spark.ml` doesn't provide tools for text segmentation.
+We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and 
+[scalanlp/chalk](https://github.com/scalanlp/chalk).
+
+**Examples**
 
-In the following code segment, we start with a set of sentences.  We split each sentence into words using `Tokenizer`.  For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector.  We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features.  Our feature vectors could then be passed to a learning algorithm.
+In the following code segment, we start with a set of sentences.  We split each sentence into words
+using `Tokenizer`.  For each sentence (bag of words), we use `HashingTF` to hash the sentence into
+a feature vector.  We use `IDF` to rescale the feature vectors; this generally improves performance
+when using text as features.  Our feature vectors could then be passed to a learning algorithm.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

http://git-wip-us.apache.org/repos/asf/spark/blob/025b3e9f/docs/mllib-feature-extraction.md
----------------------------------------------------------------------
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 7a97285..4c027c8 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -10,6 +10,9 @@ displayTitle: Feature Extraction and Transformation - spark.mllib
 
 ## TF-IDF
 
+**Note:** We recommend using the DataFrame-based API, which is detailed in the [ML user guide on
+TF-IDF](ml-features.html#tf-idf).
+
 [Term frequency-inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a feature
 vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus.
 Denote a term by `$t$`, a document by `$d$`, and the corpus by `$D$`.
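For comparison with the DataFrame-based API recommended above, a minimal sketch of this RDD-based API (the input path and whitespace tokenization are illustrative; `sc` is an active SparkContext):

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Each document is a Seq of terms (assumed already tokenized).
val documents: RDD[Seq[String]] =
  sc.textFile("data/documents.txt").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)

// IDF needs two passes over the data: one to compute document
// frequencies, one to rescale, so cache the TF vectors.
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
```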

