LDA Vectorization
-----------------

                 Key: MAHOUT-683
                 URL: https://issues.apache.org/jira/browse/MAHOUT-683
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
            Reporter: Vasil Vasilev
            Priority: Minor


Currently, the result of the LDA clustering algorithm is a state that describes 
the probability that each word in a corpus of documents belongs to a given 
topic. This probability is calculated for the whole corpus.
It is also interesting, however, to know the average number of words in a given 
document that come from a given topic. This information is provided by the gamma 
vector in the LDA inference process. This vector can be used as a representation 
of the document for further clustering (using algorithms like KMeans, 
Dirichlet, etc.). In this manner the dimensionality of a document is reduced to 
the number of topics specified to the LDA clustering algorithm.
With the proposed implementation, a corpus of documents described as vectors 
and the last state of the LDA inference process are used to produce a set of 
reduced-dimension vectors (one vector per document) that represent the set of 
documents.
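
To illustrate the idea, here is a minimal, self-contained sketch of how a 
per-document gamma vector could be computed from a fixed topic-word matrix. It 
is not the patch attached to this issue and does not use Mahout's API: it 
follows the standard variational update from the LDA literature, and the class 
name LdaGammaSketch, the method inferGamma, the alpha value and the toy numbers 
are all assumptions made for illustration. The topic-word probabilities are 
assumed to be strictly positive.

```java
import java.util.Arrays;

// Hypothetical illustration only; not part of Mahout.
class LdaGammaSketch {

    // Digamma function for x > 0: shift x upward with the recurrence
    // psi(x) = psi(x + 1) - 1/x, then apply the asymptotic expansion.
    static double digamma(double x) {
        double result = 0.0;
        while (x < 6.0) {
            result -= 1.0 / x;
            x += 1.0;
        }
        double inv = 1.0 / x;
        double inv2 = inv * inv;
        result += Math.log(x) - 0.5 * inv
                - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0));
        return result;
    }

    // topicWord[k][w]: probability of word w under topic k (assumed > 0).
    // counts[w]: how often word w occurs in the document.
    // Returns gamma, one entry per topic; gamma[k] - alpha is the expected
    // number of words in the document that come from topic k.
    static double[] inferGamma(double[][] topicWord, double[] counts,
                               double alpha, int iterations) {
        int numTopics = topicWord.length;
        double totalWords = 0.0;
        for (double c : counts) {
            totalWords += c;
        }
        double[] gamma = new double[numTopics];
        Arrays.fill(gamma, alpha + totalWords / numTopics);

        for (int it = 0; it < iterations; it++) {
            double[] expPsi = new double[numTopics];
            for (int k = 0; k < numTopics; k++) {
                expPsi[k] = Math.exp(digamma(gamma[k]));
            }
            double[] next = new double[numTopics];
            Arrays.fill(next, alpha);
            for (int w = 0; w < counts.length; w++) {
                if (counts[w] == 0.0) {
                    continue;
                }
                // phi[k]: responsibility of topic k for word w, normalized over topics.
                double[] phi = new double[numTopics];
                double norm = 0.0;
                for (int k = 0; k < numTopics; k++) {
                    phi[k] = topicWord[k][w] * expPsi[k];
                    norm += phi[k];
                }
                for (int k = 0; k < numTopics; k++) {
                    next[k] += counts[w] * phi[k] / norm;
                }
            }
            gamma = next;
        }
        return gamma;
    }

    public static void main(String[] args) {
        // Toy model: 2 topics over a 3-word vocabulary (made-up numbers).
        double[][] topicWord = {
            {0.7, 0.2, 0.1},
            {0.1, 0.2, 0.7}
        };
        // One document as a word-count vector of length 3.
        double[] document = {5.0, 1.0, 0.0};
        double[] gamma = inferGamma(topicWord, document, 0.5, 50);
        // The 3-dimensional document is now a 2-dimensional topic vector.
        System.out.println(Arrays.toString(gamma));
    }
}
```

In this sketch the entries of gamma always sum to (number of topics * alpha) 
plus the number of words in the document, so each document of vocabulary size 
V is mapped to a vector whose length equals the number of topics. Such vectors 
could then be fed to KMeans or Dirichlet clustering in place of the original 
term vectors.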

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
