Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/6181#discussion_r30457642
--- Diff: docs/ml-features.md ---
@@ -106,6 +106,84 @@ for features_label in featurized.select("features",
"label").take(3):
</div>
</div>
+## Word2Vec
+
+`Word2Vec` is an `Estimator` which takes sequences of words that
represent documents and trains a `Word2VecModel`. The model is essentially a
`Map(String, Vector)`, which maps each word to a unique fixed-size vector. The
`Word2VecModel` transforms each document into a vector using the average of
the vectors of all words in the document; this vector can then be used as
features for downstream computations such as document similarity calculation.
Please refer to the [MLlib user guide
on Word2Vec](mllib-feature-extraction.html#Word2Vec) for more details on
Word2Vec.
+
+Word2Vec is implemented in
[Word2Vec](api/scala/index.html#org.apache.spark.ml.feature.Word2Vec). In the
following code segment, we start with a set of documents, each of which is
represented as a sequence of words. We transform each document into a
feature vector, which can then be passed to a learning
algorithm.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.ml.feature.Word2Vec
+
+val documentDF = sqlContext.createDataFrame(Seq(
--- End diff --
Add comment in line above:
```
Input data: Each row is a bag of words from a sentence or document.
```
(Please add to other examples too.)
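
To make the averaging step described in the quoted text concrete, here is a minimal plain-Scala sketch that does not use Spark. Everything in it is an assumption made for illustration: the `AverageWordVectors` object, the toy two-dimensional vectors, and the choice to skip words that have no vector are hypothetical and are not taken from the PR or from the `Word2VecModel` implementation. It also shows where the suggested input-data comment would sit.

```scala
object AverageWordVectors {
  // Hypothetical toy lookup table; a trained model would learn these values.
  val wordVectors: Map[String, Array[Double]] = Map(
    "spark" -> Array(1.0, 0.0),
    "is"    -> Array(0.0, 1.0),
    "fast"  -> Array(1.0, 1.0)
  )

  // Average the vectors of all words that appear in the table.
  // Assumption: words without a vector are skipped, and a document with
  // no known words maps to the zero vector of length `dim`.
  def documentVector(words: Seq[String], dim: Int): Array[Double] = {
    val known = words.flatMap(w => wordVectors.get(w))
    if (known.isEmpty) Array.fill(dim)(0.0)
    else {
      val sum = known.foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
        acc.zip(v).map { case (a, b) => a + b }
      }
      sum.map(_ / known.size)
    }
  }

  // Input data: Each row is a bag of words from a sentence or document.
  val documents: Seq[Seq[String]] = Seq(
    "spark is fast".split(" ").toSeq,
    "spark is".split(" ").toSeq
  )

  // One averaged feature vector per document.
  val features: Seq[Array[Double]] = documents.map(d => documentVector(d, 2))
}
```

Each entry of `AverageWordVectors.features` is a fixed-size vector for one document, which is the shape of output the doc text describes for the transformed column.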