[
https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vasil Vasilev updated MAHOUT-684:
---------------------------------
Attachment: MAHOUT-684.patch
Implementation of topics regularization.
The results of the implementation, when tested on the Reuters data set with 20
iterations, are as follows:
1. Execution of the old LDA algorithm (currently implemented in Mahout)
Initial LL: -9562019.417431084
LL on 19th iteration: -7308477.774407879
LL on 20th iteration: -7304552.275676554
Rel Change: 5.3711577875640971379064282317523e-4
Execution time: 1h 3m
2. Execution of the new LDA algorithm (with alpha estimation) with fixed
initial alpha:
Initial LL: -9563341.06579829
LL on 19th iteration: -7259696.726658782
LL on 20th iteration: -7206975.427212661
Rel Change: 0.0072621903408884631309388965350663
Execution time: 57m
3. Execution of the new LDA algorithm (with alpha estimation) with randomized
initial alpha:
Initial LL: -9591528.727727547
LL on 19th iteration: -6982928.298954172
LL on 20th iteration: -6975124.693072233
Rel Change: 0.0011175262794990663549897755026221
Execution time: 1h 13m
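For reference, the Rel Change figures above are consistent with the relative
drop in LL between the last two iterations:

    Rel Change = (LL_19 - LL_20) / LL_19

e.g. for run 1, (-7308477.77 - (-7304552.28)) / -7308477.77 ~= 5.37e-4,
matching the reported value.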
The difference between 2 and 3 is that in 3 the alpha for each topic is drawn
as a random number between 0 and the topic smoothing parameter.
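Illustratively, that initialization could look like the sketch below; the
names randomAlpha, numTopics and topicSmoothing are placeholders of mine, not
necessarily the identifiers used in the patch:

    // Hypothetical sketch of the variant-3 initialization.
    static double[] randomAlpha(int numTopics, double topicSmoothing) {
      java.util.Random random = new java.util.Random();
      double[] alpha = new double[numTopics];
      for (int k = 0; k < numTopics; k++) {
        // Draw each topic's alpha uniformly from [0, topicSmoothing).
        alpha[k] = random.nextDouble() * topicSmoothing;
      }
      return alpha;
    }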
What stands out is that 2 and 3 reach lower levels of LL than 1 and, due to
the bigger relative change at the end (compared with 1), have the potential to
reach even lower LL values.
Selecting randomized alphas shows better results, probably because randomized
alphas account for the fact that not all topics generate the same number of
words in the corpus. As variant 3 shows better results, it is the one
preferred in the final implementation.
The following was also observed about the alphas (see the Newton-step sketch
after this list):
- if alpha is initialized small (50/numOfTopics), then it decreases during
execution
- if alpha is initialized high (100.0, for example), then it increases during
execution. The algorithm converges faster, but to a much higher LL, and the
resulting topics are not very meaningful
- for randomly selected alphas, even though the initial values may differ a
lot, after the 20th iteration they end up very close (ranging between 0.04
and 1.1).
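This behavior is what one would expect from the Newton-Raphson update of the
alpha vector described in appendix A.4.2 of the Blei/Ng/Jordan paper (with the
sign correction noted in the issue description below). The following is only a
minimal sketch of one such update under those formulas, not the patch's actual
code: newtonStepAlpha, gamma and numDocs are names of mine, and the
digamma/digammaPrim helpers are assumed (one possible implementation is
sketched after the next paragraph). gamma[d][k] is the variational Dirichlet
parameter of topic k in document d:

    static void newtonStepAlpha(double[] alpha, double[][] gamma, int numDocs) {
      int numTopics = alpha.length;
      double alphaSum = 0.0;
      for (double a : alpha) {
        alphaSum += a;
      }
      // Sufficient statistics: per topic, the sum over documents of
      // digamma(gamma[d][k]) - digamma(sum_j gamma[d][j]).
      double[] suffStats = new double[numTopics];
      for (int d = 0; d < numDocs; d++) {
        double gammaSum = 0.0;
        for (int j = 0; j < numTopics; j++) {
          gammaSum += gamma[d][j];
        }
        double digammaGammaSum = digamma(gammaSum);
        for (int k = 0; k < numTopics; k++) {
          suffStats[k] += digamma(gamma[d][k]) - digammaGammaSum;
        }
      }
      // Gradient g and Hessian H = diag(h) + z * 1 * 1^T.
      double[] g = new double[numTopics];
      double[] h = new double[numTopics];
      double z = numDocs * digammaPrim(alphaSum);
      double digammaAlphaSum = digamma(alphaSum);
      for (int k = 0; k < numTopics; k++) {
        g[k] = numDocs * (digammaAlphaSum - digamma(alpha[k])) + suffStats[k];
        h[k] = -numDocs * digammaPrim(alpha[k]);
      }
      // The special structure of H makes H^-1 * g an O(numTopics) computation
      // (Sherman-Morrison), instead of a full matrix inversion.
      double sumGOverH = 0.0;
      double sumInvH = 0.0;
      for (int k = 0; k < numTopics; k++) {
        sumGOverH += g[k] / h[k];
        sumInvH += 1.0 / h[k];
      }
      double c = sumGOverH / (1.0 / z + sumInvH);
      for (int k = 0; k < numTopics; k++) {
        // A real implementation must keep alpha[k] positive, e.g. by step
        // halving or by running the Newton iteration in log space.
        alpha[k] -= (g[k] - c) / h[k];
      }
    }

Iterating this step moves each alpha[k] toward its maximum-likelihood value,
which is consistent with the drift observed above.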
One more thing needed to implement the algorithm was the introduction of the
LDAInference.digammaPrim function (the first derivative of digamma, i.e. the
trigamma function). It was derived from the implementation of the
LDAInference.digamma function by differentiating it.
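As a reference point, here is a minimal sketch of what such a pair could look
like, assuming digamma follows the usual shift-plus-asymptotic-series scheme
(as in Blei's lda-c); Mahout's actual implementation may differ in its
constants and shift:

    static double digamma(double x) {
      // Shift x up by 6 so the asymptotic series is accurate, then undo the
      // shift with the recurrence digamma(x + 1) = digamma(x) + 1/x.
      x += 6.0;
      double p = 1.0 / (x * x);
      p = (((0.004166666666667 * p - 0.003968253986254) * p
          + 0.008333333333333) * p - 0.083333333333333) * p;
      return p + Math.log(x) - 0.5 / x
          - 1.0 / (x - 1) - 1.0 / (x - 2) - 1.0 / (x - 3)
          - 1.0 / (x - 4) - 1.0 / (x - 5) - 1.0 / (x - 6);
    }

    // digammaPrim is the trigamma function: each term of digamma above is
    // differentiated, using d/dx log(x) = 1/x, d/dx (-0.5/x) = 0.5/x^2,
    // d/dx (-1/(x-i)) = 1/(x-i)^2, and dp/dx = -2p/x for p = 1/x^2.
    static double digammaPrim(double x) {
      x += 6.0;
      double p = 1.0 / (x * x);
      double series = (((4 * 0.004166666666667 * p - 3 * 0.003968253986254) * p
          + 2 * 0.008333333333333) * p - 0.083333333333333) * p * (-2.0 / x);
      return series + 1.0 / x + 0.5 / (x * x)
          + 1.0 / ((x - 1) * (x - 1)) + 1.0 / ((x - 2) * (x - 2))
          + 1.0 / ((x - 3) * (x - 3)) + 1.0 / ((x - 4) * (x - 4))
          + 1.0 / ((x - 5) * (x - 5)) + 1.0 / ((x - 6) * (x - 6));
    }

As a quick sanity check, digammaPrim(1.0) evaluates to about 1.644934, which
matches trigamma(1) = pi^2/6.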
> Topics regularization for LDA
> -----------------------------
>
> Key: MAHOUT-684
> URL: https://issues.apache.org/jira/browse/MAHOUT-684
> Project: Mahout
> Issue Type: Improvement
> Components: Clustering
> Reporter: Vasil Vasilev
> Priority: Minor
> Labels: LDA.
> Attachments: MAHOUT-684.patch
>
>
> Implementation provided for alpha parameter estimation as described in the
> paper by Blei, Ng and Jordan
> (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).
> Remark: there is a mistake in the last formula in A.4.2 (the signs are
> wrong). The correct version is described here:
> http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira