LDA Vectorization
-----------------
Key: MAHOUT-683
URL: https://issues.apache.org/jira/browse/MAHOUT-683
Project: Mahout
Issue Type: Improvement
Components: Clustering
Reporter: Vasil Vasilev
Priority: Minor
Currently, the result of the LDA clustering algorithm is a state that describes
the probability that each word in a corpus of documents belongs to a given
topic. This probability is calculated over the whole corpus.
It is also interesting, however, to know the average number of words in a given
document that come from a given topic. This information is available from the
gamma vector computed during the LDA inference process. This vector can be used
as a representation of the document for further clustering (using algorithms
such as KMeans, Dirichlet, etc.). In this way the dimensionality of a document
is reduced to the number of topics specified for the LDA clustering
algorithm.
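
As a rough illustration of the idea above (not the Mahout implementation), the
sketch below turns a per-document gamma vector into a reduced representation.
It assumes the standard variational LDA formulation with a symmetric prior, in
which each gamma entry equals the prior alpha plus the expected number of the
document's words assigned to that topic. The function names and the sample
values are hypothetical.

```python
def expected_topic_counts(gamma, alpha):
    """Expected number of words per topic, assuming gamma_k = alpha + count_k.

    Assumption: symmetric Dirichlet prior alpha; negative results are
    clamped to zero to guard against rounding noise.
    """
    return [max(g - alpha, 0.0) for g in gamma]


def topic_proportions(gamma):
    """Normalize gamma so that the entries sum to 1 (a topic mixture)."""
    total = sum(gamma)
    if total <= 0:
        raise ValueError("gamma must have a positive sum")
    return [g / total for g in gamma]


# Hypothetical gamma vector for one document with 3 topics and alpha = 0.5.
gamma = [4.5, 0.5, 1.5]
counts = expected_topic_counts(gamma, alpha=0.5)
props = topic_proportions(gamma)
```

Either `counts` or `props` has one entry per topic, so a document that started
with one dimension per vocabulary term ends up with only as many dimensions as
there are topics. Which of the two is the better input for KMeans or Dirichlet
clustering is left open here.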
The proposed implementation takes a corpus of documents described as vectors,
together with the last state of the LDA inference process, and produces a set
of vectors with reduced dimensions (one vector per document) that represents
the set of documents.