GitHub user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/8254#discussion_r37897963
--- Diff: docs/mllib-clustering.md ---
@@ -438,28 +438,125 @@ sameModel = PowerIterationClusteringModel.load(sc, "myModelPath")
is a topic model which infers topics from a collection of text documents.
LDA can be thought of as a clustering algorithm as follows:
-* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
-* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
-* Rather than estimating a clustering using a traditional distance, LDA uses a function based
-  on a statistical model of how text documents are generated.
-
-LDA takes in a collection of documents as vectors of word counts.
-It supports different inference algorithms via `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
-on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides:
-
-* Topics: Inferred topics, each of which is a probability distribution over terms (words).
-* Topic distributions for documents: For each non empty document in the training set, LDA gives a probability distribution over topics. (EM only). Note that for empty documents, we don't create the topic distributions. (EM only)
+* Topics correspond to cluster centers, and documents correspond to
+examples (rows) in a dataset.
+* Topics and documents both exist in a feature space, where feature
+vectors are vectors of word counts (bag of words).
+* Rather than estimating a clustering using a traditional distance, LDA
+uses a function based on a statistical model of how text documents are
+generated.
+
+LDA supports different inference algorithms via the `setOptimizer` function.
+`EMLDAOptimizer` learns clustering using
+[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
+on the likelihood function and yields comprehensive results, while
+`OnlineLDAOptimizer` uses iterative mini-batch sampling for
+[online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)
+and is generally memory friendly.
-LDA takes the following parameters:
+LDA takes in a collection of documents as vectors of word counts and the
+following parameters, set using the builder pattern (see the example
+sketch after this list):
* `k`: Number of topics (i.e., cluster centers)
-* `maxIterations`: Limit on the number of iterations of EM used for learning
-* `docConcentration`: Hyperparameter for prior over documents' distributions over topics. Currently must be > 1, where larger values encourage smoother inferred distributions.
-* `topicConcentration`: Hyperparameter for prior over topics' distributions over terms (words). Currently must be > 1, where larger values encourage smoother inferred distributions.
-* `checkpointInterval`: If using checkpointing (set in the Spark configuration), this parameter specifies the frequency with which checkpoints will be created. If `maxIterations` is large, using checkpointing can help reduce shuffle file sizes on disk and help with failure recovery.
-
-*Note*: LDA is a new feature with some missing functionality. In particular, it does not yet support prediction on new documents, and it does not have a Python API. These will be added in the future.
+* `optimizer`: Optimizer to use for learning the LDA model, either
+`EMLDAOptimizer` or `OnlineLDAOptimizer`
+* `docConcentration`: Dirichlet parameter for prior over documents'
+distributions over topics. Larger values encourage smoother inferred
+distributions.
+* `topicConcentration`: Dirichlet parameter for prior over topics'
+distributions over terms (words). Larger values encourage smoother
+inferred distributions.
+* `maxIterations`: Limit on the number of iterations.
+* `checkpointInterval`: If using checkpointing (set in the Spark
+configuration), this parameter specifies the frequency with which
+checkpoints will be created. If `maxIterations` is large, using
+checkpointing can help reduce shuffle file sizes on disk and help with
+failure recovery.
+
+
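+For example, a model might be configured and fit as follows. This is a
+minimal sketch: the toy corpus, vocabulary size, and parameter values are
+illustrative only, and `sc` is assumed to be an existing `SparkContext`.
+
+{% highlight scala %}
+import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.Vectors
+
+// Toy corpus: each document is an (ID, word-count vector) pair over a
+// hypothetical vocabulary of 5 terms.
+val corpus = sc.parallelize(Seq(
+  (0L, Vectors.dense(1.0, 2.0, 0.0, 0.0, 1.0)),
+  (1L, Vectors.dense(0.0, 1.0, 3.0, 1.0, 0.0)),
+  (2L, Vectors.dense(2.0, 0.0, 0.0, 1.0, 2.0))))
+
+// Configure LDA with the builder pattern, then fit the model.
+val lda = new LDA()
+  .setK(2)               // number of topics
+  .setMaxIterations(20)  // limit on the number of iterations
+  .setOptimizer(new OnlineLDAOptimizer())  // or omit to keep the EM default
+val model = lda.run(corpus)
+{% endhighlight %}
+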
+All of MLlib's LDA models support:
+
+* `describeTopics`: Returns the top terms and their weights for each topic
+* `topicsMatrix`: Returns a `vocabSize` by `k` matrix where each column
+is a topic
+
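+For example, the fitted model above might be inspected as follows (again a
+sketch; the term indices refer to positions in the word-count vectors):
+
+{% highlight scala %}
+// Top 3 terms per topic, as parallel arrays of term indices and weights.
+model.describeTopics(maxTermsPerTopic = 3).zipWithIndex.foreach {
+  case ((terms, weights), topic) =>
+    println(s"Topic $topic: " + terms.zip(weights).mkString(", "))
+}
+
+// Full topics matrix: vocabSize x k, with one topic per column.
+val topicsMat = model.topicsMatrix
+{% endhighlight %}
+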
+*Note*: LDA is still an experimental feature under active development.
+As a result, certain features are only available in one of the two
+optimizers / models generated by the optimizer. Currently, a distributed
+model can be converted into a local model (during which we assume a
--- End diff --
This isn't really a new assumption; it's what was specified by the EM
parameters. I'd remove the parenthesized comment.