[ 
https://issues.apache.org/jira/browse/MAHOUT-458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896645#action_12896645
 ] 

Himanshu Gahlot commented on MAHOUT-458:
----------------------------------------

Rough patch for the improvement -> Mahout-458

This is a rough patch that outputs the gamma values per document (i.e. p(z|d)) 
in the LDA implementation. The current LDA implementation outputs only the 
topics and their words, but LDA is also often used to reduce the 
dimensionality of documents. That requires representing each document as a 
vector whose length equals the number of topics and whose values are the 
probabilities of each topic given the document.

Implementation:

A key called GAMMA_KEY is set in LDADriver with value = -3. In LDAMapper, once 
the inferred document is obtained, its gamma values are normalized and written 
to the output stream using the context object and the GAMMA_KEY. In LDAReducer 
these values are written to the output sequence file. Finally, LDAPrintTopics 
prints these values; it accepts an additional command line option, '-dtpo', 
naming the file to which the document-topic probabilities should be written.
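
The normalization step described above can be sketched in isolation. This is 
only an illustration, not the code from the patch: the class GammaNormalizer 
and its normalize method are hypothetical names, and the sketch assumes the 
gamma values are non-negative and on a linear (not log) scale.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the patch or of Mahout. */
public class GammaNormalizer {

    /**
     * Scales the gamma values so that they sum to 1, giving one
     * probability per topic for the document.
     * Assumes gamma is non-negative and on a linear scale.
     */
    public static double[] normalize(double[] gamma) {
        double sum = 0.0;
        for (double g : gamma) {
            sum += g;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("gamma must have a positive sum");
        }
        double[] p = new double[gamma.length];
        for (int k = 0; k < gamma.length; k++) {
            p[k] = gamma[k] / sum;
        }
        return p;
    }

    public static void main(String[] args) {
        // Made-up gamma values for a document with three topics.
        double[] gamma = {2.0, 6.0, 12.0};
        System.out.println(Arrays.toString(normalize(gamma)));
    }
}
```

The resulting array has one entry per topic, which is the reduced 
representation of the document mentioned earlier.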

The command to run LDA remains the same:
$ ./bin/mahout lda -i examples/bin/sparse-format/tf-vectors \
    -o examples/bin/lda-out -v 10000 -k 20 -ow -x 5

but the LDAPrintTopics command changes slightly. The new command is:

$ ./bin/mahout ldatopics -i examples/bin/lda-out/state-5 \
    -d examples/bin/sparse-format/dictionary.file-0 -dt sequencefile \
    -dtpo examples/bin/dtpo.txt -two examples/bin/topics

where 'lda-out' is the output directory produced by LDA, 'sparse-format' is 
the directory containing the sparse-format files (as produced by the 
seq2sparse command), the '-dtpo' option names the file to which the gamma 
values should be written, and the '-two' option names the directory to which 
the topic-word probabilities should be written.

Notes:
Adding the new key changed the number of keys written, which caused the 
org.apache.mahout.clustering.lda.TestMapReduce test to fail. This 
implementation also assumes that the document IDs are integers. I therefore 
made the following changes to this test:

1. In line 101,
original code:
EasyMock.expectLastCall().times(myNumWords * NUM_TOPICS + NUM_TOPICS + 1);

changed code:
EasyMock.expectLastCall().times(myNumWords * NUM_TOPICS + 2*NUM_TOPICS + 1);

2. In line 104,
original code:
mapper.map(new Text("tstMapper"), vw, mock);

changed code:
mapper.map(new Text("/23456.txt"), vw, mock);

The second change reflects my assumption that the filenames of the input 
documents have the format: <integer>.<extension>
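
To make the filename assumption concrete, here is a minimal sketch of how an 
integer document ID could be extracted from a key such as "/23456.txt". The 
class DocIdParser and the method parseDocId are hypothetical names used for 
illustration; the patch itself may extract the ID differently.

```java
/** Hypothetical helper for illustration; not part of the patch or of Mahout. */
public class DocIdParser {

    /**
     * Extracts the integer document ID from a key of the form
     * [path/]<integer>.<extension>.
     * Throws NumberFormatException if the file name stem is not an integer.
     */
    public static int parseDocId(String key) {
        // Drop any leading directory components.
        String name = key.substring(key.lastIndexOf('/') + 1);
        // Drop the extension, if there is one.
        int dot = name.indexOf('.');
        String stem = dot >= 0 ? name.substring(0, dot) : name;
        return Integer.parseInt(stem);
    }

    public static void main(String[] args) {
        System.out.println(parseDocId("/23456.txt"));
    }
}
```

Under this sketch a key like the original "tstMapper" fails with a 
NumberFormatException, which is consistent with the test key having to change.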

This is my first submission to open source, so kindly let me know whether this 
patch is of any worth. If it is, please let me know what improvements I should 
make.

Thanks

> The LDA output does not include the topic-probability distribution per 
> document (p(z|d)). It outputs only the topics and corresponding words.
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-458
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-458
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Himanshu Gahlot
>             Fix For: 0.4
>
>
> The current implementation of LDA outputs only topics and their words. Many 
> applications need the p(z|d) values of a document to use this vector as a 
> reduced representation of the document (dimensionality reduction of 
> document). We need to introduce a new key which would keep track of the gamma 
> values for each document (as obtained from the document.infer() method) and 
> writes these to the output stream and finally, PrintLDATopics should output 
> these values per document id. Also, outputting the probabilities of words in 
> a topic would also provide a more meaningful output.

