[
https://issues.apache.org/jira/browse/MAHOUT-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896645#action_12896645
]
Himanshu Gahlot commented on MAHOUT-458:
----------------------------------------
Rough patch for the improvement MAHOUT-458
This is a rough patch that outputs the gamma values per document (i.e. p(z|d))
in the LDA implementation. The current LDA implementation outputs only the
topics and their words, but LDA is also often used to reduce the
dimensionality of documents. That requires representing each document by a
vector whose length equals the number of topics and whose values are the
probabilities of each topic given the document.
Implementation:
A key called GAMMA_KEY is set in LDADriver with value -3. In LDAMapper, once
the inferred document is obtained, its gamma values are normalized and written
to the output stream using the context object and GAMMA_KEY. In LDAReducer
these values are written to the output sequence file. Finally, LDAPrintTopics
prints these values; it accepts an additional command line option, '-dtpo',
naming the file to which the document-topic probabilities should be printed.
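The normalization step can be illustrated with a minimal, standalone sketch.
It is not the code from the patch: the class name GammaNormalizer and the
method normalize are hypothetical, and it assumes the gamma values are plain
non-negative numbers (not log-space values) that are simply scaled to sum to 1.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Mahout or of the attached patch.
final class GammaNormalizer {
    private GammaNormalizer() {}

    // Scales the per-topic gamma values of one document so they sum to 1.
    // Assumption: values are non-negative and not stored in log space.
    static double[] normalize(double[] gamma) {
        double sum = 0.0;
        for (double g : gamma) {
            if (Double.isNaN(g) || g < 0.0) {
                throw new IllegalArgumentException("gamma must be non-negative: " + g);
            }
            sum += g;
        }
        double[] p = new double[gamma.length];
        if (sum == 0.0) {
            // No mass at all: fall back to a uniform distribution over topics.
            Arrays.fill(p, 1.0 / gamma.length);
            return p;
        }
        for (int k = 0; k < gamma.length; k++) {
            p[k] = gamma[k] / sum;
        }
        return p;
    }

    public static void main(String[] args) {
        double[] p = normalize(new double[] {2.0, 1.0, 1.0});
        System.out.println(Arrays.toString(p)); // prints [0.5, 0.25, 0.25]
    }
}
```

Each resulting vector has one entry per topic, which is the reduced
representation of the document described above.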
The command to run LDA remains the same:
$ ./bin/mahout lda -i examples/bin/sparse-format/tf-vectors \
    -o examples/bin/lda-out -v 10000 -k 20 -ow -x 5
but the LDAPrintTopics command changes slightly. The new command is:
$ ./bin/mahout ldatopics -i examples/bin/lda-out/state-5 \
    -d examples/bin/sparse-format/dictionary.file-0 -dt sequencefile \
    -dtpo examples/bin/dtpo.txt -two examples/bin/topics
where 'lda-out' is the LDA output directory, 'sparse-format' is the directory
containing the sparse-format files (as produced by the seq2sparse command),
the '-dtpo' option names the file to which the gamma values are printed, and
the '-two' option names the directory to which the topic-word probabilities
are written.
Notes:
Adding the new key changed the number of keys written by the mapper, which
made the org.apache.mahout.clustering.lda.TestMapReduce test fail. In
addition, this implementation assumes that the document IDs are integers. I
therefore made the following changes in this test:
1. In line 101,
original code:
EasyMock.expectLastCall().times(myNumWords * NUM_TOPICS + NUM_TOPICS + 1);
changed code:
EasyMock.expectLastCall().times(myNumWords * NUM_TOPICS + 2*NUM_TOPICS + 1);
2. In line 104,
original code:
mapper.map(new Text("tstMapper"), vw, mock);
changed code:
mapper.map(new Text("/23456.txt"), vw, mock);
The second change reflects my assumption that the filenames of the input
documents have the format: <integer>.<extension>
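To make the filename assumption concrete, here is a small sketch of how an
integer document ID could be extracted from a key such as "/23456.txt". The
class DocIdParser and the method parseDocId are hypothetical names used only
for illustration; the patch may do this differently.

```java
// Hypothetical helper, not part of Mahout or of the attached patch.
final class DocIdParser {
    private DocIdParser() {}

    // Extracts the integer ID from a key of the form [path/]<integer>.<extension>.
    // Throws NumberFormatException if the file name stem is not an integer.
    static int parseDocId(String key) {
        // Drop any leading path components.
        String name = key.substring(key.lastIndexOf('/') + 1);
        // Drop the extension, if there is one.
        int dot = name.lastIndexOf('.');
        String stem = dot >= 0 ? name.substring(0, dot) : name;
        return Integer.parseInt(stem);
    }

    public static void main(String[] args) {
        System.out.println(parseDocId("/23456.txt")); // prints 23456
    }
}
```

Under this assumption a key like "tstMapper" cannot be parsed as an integer
(Integer.parseInt throws NumberFormatException), which is why the test key had
to be changed.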
This is my first submission to open source, so kindly let me know if this
patch is of any use. If it is, please let me know what improvements should be
made.
Thanks
> The LDA output does not include the topic-probability distribution per
> document (p(z|d)). It outputs only the topics and corresponding words.
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-458
> URL: https://issues.apache.org/jira/browse/MAHOUT-458
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Affects Versions: 0.4
> Reporter: Himanshu Gahlot
> Fix For: 0.4
>
>
> The current implementation of LDA outputs only topics and their words. Many
> applications need the p(z|d) values of a document to use this vector as a
> reduced representation of the document (dimensionality reduction of
> document). We need to introduce a new key that keeps track of the gamma
> values for each document (as obtained from the document.infer() method) and
> writes them to the output stream; finally, LDAPrintTopics should output
> these values per document ID. Outputting the probabilities of words in
> a topic would also make the output more meaningful.