[
https://issues.apache.org/jira/browse/MAHOUT-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179670#comment-13179670
]
Jake Mannix commented on MAHOUT-399:
------------------------------------
I haven't looked at this closely. I'd say that the original Mahout LDA (David
Hall's version) has corner cases where it doesn't converge properly, even on a
small, clearly defined, topic-derived corpus. This test passes for the new LDA
implementation (CVB0). We can close this one as "fixed in one impl, won't fix
in the other" and open another JIRA ticket for 0.7, "remove old LDA", to act on
once we verify that users have tried the new one on a variety of data sets and
like it better. Right now we're going on the fact that my coworkers and I have
used it successfully in-house. That's not a lot of verification to go on, but
I'd feel comfortable removing the old LDA in 0.7 even if we don't get much test
feedback from other people. I'm open to discussion on that.
> LDA on Mahout 0.3 does not converge to correct solution for overlapping
> pyramids toy problem.
> ---------------------------------------------------------------------------------------------
>
> Key: MAHOUT-399
> URL: https://issues.apache.org/jira/browse/MAHOUT-399
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Affects Versions: 0.3, 0.4, 0.5
> Environment: Mac OS X 10.6.2, Hadoop 0.20.2, Mahout 0.3.
> Reporter: Michael Lazarus
> Assignee: Jake Mannix
> Labels: lda, mahout
> Fix For: 0.6
>
> Attachments: 1000docs_26terms_5topics.jpg, MAHOUT-399.diff,
> Overlapping Pyramids Toy Dataset.pdf, olt.tar
>
>
> Hello,
> Apologies if I have not labeled this correctly.
> I ran LDA in Mahout 0.3 (locally) on a toy problem that I previously used to
> test the C version of LDA that Blei posts on his site. The problem has an
> exact solution that LDA should converge to. Please see the attached PDF,
> which describes the intended output.
> Is LDA working? The following output indicates some sort of collapsing
> behavior to me.
> T0 T1 T2 T3 T4
> x w x u x
> u u g j n
> l r i m l
> j q h h p
> v p e i q
> e t f g v
> d s d f o
> b c b n k
> y f c l m
> w v u v u
> c d p y t
> k o l r r
> i b j k j
> f e k e f
> g x y s y
> t y w b w
> h i s p s
> o l v x d
> q j t d i
> n k o t b
> The intended output is (again, please see attached):
> D I N S X
> d i n s x
> c h m t y
> e j o r w
> b k l u v
> f g p q a
> a f k p b
> g l q v u
> h m j w t
> y u r o c
> n s d d i
> s e x f f
> r q i i n
> m v w c o
> o w u a h
> q n s h g
> p t c x d
> t x f e l
> x d e j s
> w y g b j
> i r y n r
> u o h y m
> k b t l e
> v c a m k
> j a b g p
> l p v k q
> What tests do you run to make sure the output is correct?
> Thank you,
> Mike.
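
One way to turn the comparison above into an automated check is sketched below. It is not the test Mahout runs; it is a minimal illustration that assumes the rows in both listings are ordered from highest to lowest term weight, and that takes only the top five terms of each column. Because LDA topic indices are arbitrary, the sketch tries every one-to-one matching between intended and observed topics and reports the best mean overlap. The function name best_match_overlap and the choice of k=5 are hypothetical.

```python
from itertools import permutations


def best_match_overlap(expected, observed, k=5):
    """Mean top-k term overlap under the best one-to-one topic matching.

    Topic order is arbitrary in LDA output, so every permutation of the
    observed topics is tried and the best-scoring matching is kept.
    Brute force is fine for a handful of topics (5 topics = 120 matchings).
    """
    if len(expected) != len(observed):
        raise ValueError("topic counts differ")
    exp_sets = [set(topic[:k]) for topic in expected]
    obs_sets = [set(topic[:k]) for topic in observed]
    n = len(exp_sets)
    best_shared = 0
    for perm in permutations(range(n)):
        # perm[i] is the observed topic paired with expected topic i.
        shared = sum(len(exp_sets[i] & obs_sets[j]) for i, j in enumerate(perm))
        best_shared = max(best_shared, shared)
    return best_shared / (n * k)


# Top five terms per topic, copied column by column from the two listings
# in the report (assumed to be sorted by descending weight).
EXPECTED = [list("dcebf"), list("ihjkg"), list("nmolp"), list("struq"), list("xywva")]
OBSERVED = [list("xuljv"), list("wurqp"), list("xgihe"), list("ujmhi"), list("xnlpq")]

score = best_match_overlap(EXPECTED, OBSERVED)
print(f"best-match top-5 overlap: {score:.2f}")
```

A score of 1.0 means every intended topic's top terms were recovered by some observed topic. A regression test could assert that the score stays above a chosen threshold. The threshold itself would be a judgment call and is not specified anywhere in this issue.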