[GitHub] spark pull request: [SPARK-9888][MLlib]User guide for new LDA feat...

feynmanliang Tue, 25 Aug 2015 15:55:29 -0700

Github user feynmanliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8254#discussion_r37929604
  
    --- Diff: docs/mllib-clustering.md ---
    @@ -438,28 +438,125 @@ sameModel = PowerIterationClusteringModel.load(sc, "myModelPath")
     is a topic model which infers topics from a collection of text documents.
     LDA can be thought of as a clustering algorithm as follows:
     
    -* Topics correspond to cluster centers, and documents correspond to examples (rows) in a dataset.
    -* Topics and documents both exist in a feature space, where feature vectors are vectors of word counts.
    -* Rather than estimating a clustering using a traditional distance, LDA uses a function based
    - on a statistical model of how text documents are generated.
    -
    -LDA takes in a collection of documents as vectors of word counts.
    -It supports different inference algorithms via `setOptimizer` function. EMLDAOptimizer learns clustering using [expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
    -on the likelihood function and yields comprehensive results, while OnlineLDAOptimizer uses iterative mini-batch sampling for [online variational inference](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) and is generally memory friendly. After fitting on the documents, LDA provides:
    -
    -* Topics: Inferred topics, each of which is a probability distribution over terms (words).
    -* Topic distributions for documents: For each non empty document in the training set, LDA gives a probability distribution over topics. (EM only). Note that for empty documents, we don't create the topic distributions. (EM only)
    +* Topics correspond to cluster centers, and documents correspond to
    --- End diff --
    
    OK



