[ 
https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasil Vasilev updated MAHOUT-684:
---------------------------------

    Attachment: MAHOUT-684.patch

Implementation of topics regularization.
The results of the implementation, when tested on the Reuters data set with 20
iterations, are as follows:
1. Execution of the old LDA algorithm (currently implemented in Mahout)
 
Initial LL: -9562019.417431084
LL on 19-th iteration: -7308477.774407879
LL on 20-th iteration: -7304552.275676554
Rel Change: 5.3711577875640971379064282317523e-4
Execution time: 1h 3m

2. Execution of the new LDA algorithm (with alpha estimation) with fixed 
initial alpha:

Initial LL: -9563341.06579829
LL on 19-th iteration: -7259696.726658782
LL on 20-th iteration: -7206975.427212661
Rel Change: 0.0072621903408884631309388965350663
Execution time: 57m

3. Execution of the new LDA algorithm (with alpha estimation) with randomized 
initial alpha:

Initial LL: -9591528.727727547
LL on 19-th iteration: -6982928.298954172
LL on 20-th iteration: -6975124.693072233
Rel Change: 0.0011175262794990663549897755026221
Execution time: 1h 13m

The difference between 2 and 3 is that in 3 the alpha for each topic is taken as
a random number between 0 and the topic smoothing parameter.
What stands out is that 2 and 3 reach lower LL levels than 1 and, due to the
bigger relative change at the end (compared with 1), have the potential to reach
even lower LL values.
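
For illustration, a minimal sketch of the randomized initialization used in variant 3
could look like the following (the names initializeAlphas, numTopics and topicSmoothing
are placeholders for illustration, not necessarily the ones used in the patch):

{code:java}
import java.util.Random;

public final class AlphaInit {

  private AlphaInit() {
  }

  // Hypothetical sketch: one alpha per topic, drawn uniformly from
  // [0, topicSmoothing), as described for variant 3 above.
  public static double[] initializeAlphas(int numTopics, double topicSmoothing, long seed) {
    Random random = new Random(seed);
    double[] alphas = new double[numTopics];
    for (int k = 0; k < numTopics; k++) {
      // nextDouble() is in [0, 1); scale it by the topic smoothing parameter
      alphas[k] = random.nextDouble() * topicSmoothing;
    }
    return alphas;
  }
}
{code}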

Selecting randomized alphas shows better results, probably because randomized
alphas account for the fact that not all topics generate the same number of
words in the corpus. As variant 3 shows better results, it is preferred in the
final implementation.

The following was also observed about the alphas:
- if alpha is selected to be small (50/numOfTopics), it decreases during
execution
- if alpha is selected to be high (100.0, for example), it increases during
execution. The algorithm converges faster but at a much higher LL, and the
resulting topics are not very meaningful
- for randomly selected alphas, even though they may differ a lot in the initial
set, after the 20th iteration they are very close (range between 0.04 and 1.1).

One more thing needed to implement the algorithm was the introduction of the
LDAInference.digammaPrim function. It was derived from the implementation of the
LDAInference.digamma function by differentiating it.
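
For reference, a possible shape of such a function is sketched below: it differentiates
the usual shift-plus-asymptotic-series digamma approximation term by term, which yields
the trigamma function. The constants and the cut-off are illustrative and may differ
from the ones in the patch.

{code:java}
// Sketch of digammaPrim(x) = d/dx digamma(x), i.e. the trigamma function.
// Uses the recurrence psi'(x) = psi'(x + 1) + 1 / x^2 to shift x upwards,
// then the asymptotic series
// psi'(x) ~ 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7).
public static double digammaPrim(double x) {
  double result = 0.0;
  while (x < 6.0) {
    result += 1.0 / (x * x);
    x += 1.0;
  }
  double invX = 1.0 / x;
  double invX2 = invX * invX;
  result += invX + 0.5 * invX2
      + invX * invX2 * (1.0 / 6.0 - invX2 * (1.0 / 30.0 - invX2 / 42.0));
  return result;
}
{code}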

> Topics regularization for LDA
> -----------------------------
>
>                 Key: MAHOUT-684
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-684
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Vasil Vasilev
>            Priority: Minor
>              Labels: LDA.
>         Attachments: MAHOUT-684.patch
>
>
> Implementation provided for alpha parameter estimation as described in 
> the paper by Blei, Ng and Jordan 
> (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).
> Remark: there is a mistake in the last formula in A.4.2 (the signs are 
> wrong). The correct version is described here: 
> http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).
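
As an illustration of how the corrected formula from Minka's note (page 6) can be
applied, here is a rough sketch of one Newton-Raphson step over a per-topic alpha
vector. It assumes digamma and digammaPrim are the LDAInference helpers mentioned
above, that sufficientStats[k] holds the sum over documents d of
(Psi(gamma_dk) - Psi(sum_j gamma_dj)) collected from the variational parameters, and
that numDocs is the number of documents; the actual patch may organize this differently.

{code:java}
// Sketch only: one Newton-Raphson update alpha <- alpha - H^-1 g, where the
// Hessian H = diag(q) + z * 1 1^T is inverted in linear time as described in
// Minka, "Estimating a Dirichlet distribution", page 6.
public static void newtonStep(double[] alpha, double[] sufficientStats, int numDocs) {
  double alphaSum = 0.0;
  for (double a : alpha) {
    alphaSum += a;
  }
  double z = numDocs * digammaPrim(alphaSum);
  double[] gradient = new double[alpha.length];
  double[] q = new double[alpha.length];
  double sumGradOverQ = 0.0;
  double sumInvQ = 0.0;
  for (int k = 0; k < alpha.length; k++) {
    gradient[k] = numDocs * (digamma(alphaSum) - digamma(alpha[k])) + sufficientStats[k];
    q[k] = -numDocs * digammaPrim(alpha[k]);
    sumGradOverQ += gradient[k] / q[k];
    sumInvQ += 1.0 / q[k];
  }
  double b = sumGradOverQ / (1.0 / z + sumInvQ);
  for (int k = 0; k < alpha.length; k++) {
    // (H^-1 g)_k = (g_k - b) / q_k
    alpha[k] -= (gradient[k] - b) / q[k];
  }
}
{code}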

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
