[ 
https://issues.apache.org/jira/browse/MAHOUT-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasil Vasilev updated MAHOUT-683:
---------------------------------

    Attachment:     (was: MAHOUT-683.patch)

> LDA Vectorization
> -----------------
>
>                 Key: MAHOUT-683
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-683
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Vasil Vasilev
>            Priority: Minor
>              Labels: LDA., Vectorization
>         Attachments: MAHOUT-683.patch
>
>
> Currently the result of the LDA clustering algorithm is a state that 
> describes the probability that each word in a corpus of documents belongs 
> to a given topic. This probability is calculated for the whole corpus.
> It is also interesting, however, to know the average number of words in a 
> given document that come from a given topic. This information comes from 
> the gamma vector in the LDA inference process. This vector can be used as a 
> representation of the given document for further clustering purposes (using 
> algorithms like KMeans, Dirichlet, etc.). In this manner the dimensionality 
> of a document is reduced to the number of topics specified to the LDA 
> clustering algorithm.
> The proposed implementation takes a corpus of documents described as 
> vectors and the last state of the LDA inference process, and produces a set 
> of vectors with reduced dimensions (one vector per document) that 
> represents the set of documents.
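
For illustration only, the idea in the description above can be sketched as follows. This is not the code in the attached patch and does not use the Mahout API. The function name `topic_proportions` and the gamma values are hypothetical. The sketch assumes each document already has a gamma vector with one entry per topic, and normalizes it so that a document is represented by K numbers (its topic proportions) instead of one entry per vocabulary word.

```python
def topic_proportions(gamma):
    """Normalize a per-document gamma vector so that its entries sum to 1.

    gamma: list of K positive floats, one per topic (assumed to come from
    LDA inference for a single document).
    """
    total = sum(gamma)
    if total <= 0:
        raise ValueError("gamma must have a positive sum")
    return [g / total for g in gamma]


# Hypothetical gamma vectors for three documents and K = 3 topics.
gammas = {
    "doc1": [8.0, 1.0, 1.0],
    "doc2": [1.0, 8.0, 1.0],
    "doc3": [2.0, 2.0, 6.0],
}

# Each document is now a K-dimensional vector, whatever the vocabulary size.
reduced = {doc: topic_proportions(g) for doc, g in gammas.items()}

for doc, vec in reduced.items():
    print(doc, vec)
```

The resulting K-dimensional vectors could then be passed to a clustering algorithm such as KMeans in place of the original term vectors.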

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
