[ 
https://issues.apache.org/jira/browse/MAHOUT-399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jake Mannix updated MAHOUT-399:
-------------------------------

    Attachment: MAHOUT-399.diff

This adds a full end-to-end "unit" test that verifies the correctness of the 
current LDA code: on this kind of synthetic data set, the (self-reported) 
perplexity is lowest when the number of topics equals the number of 
generating topics.
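
For readers unfamiliar with the criterion, here is a minimal Python sketch of 
perplexity-based selection. It is not the code in the patch. It assumes the 
usual definition, exp(-log-likelihood / token count), which may differ in 
normalization from what Mahout reports, and the names perplexity and 
best_num_topics are hypothetical.

```python
import math


def perplexity(corpus, doc_term_probs):
    """Perplexity of a corpus under per-document term distributions.

    corpus:         list of term-count vectors, one per document
    doc_term_probs: list of p(term | document) vectors, same shape
    Assumes the textbook definition exp(-log_likelihood / tokens).
    """
    log_lik = 0.0
    tokens = 0
    for counts, probs in zip(corpus, doc_term_probs):
        for count, prob in zip(counts, probs):
            if count:  # skip unseen terms so log(0) is never evaluated
                log_lik += count * math.log(prob)
                tokens += count
    return math.exp(-log_lik / tokens)


def best_num_topics(perplexity_by_k):
    """Return the topic count whose model reported the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)
```

A uniform model over 26 terms scores a perplexity of 26 on any corpus, so a 
fitted model is only informative when it scores below that. The test then 
asserts that best_num_topics, applied to the perplexities of models trained 
with different topic counts, returns the generating count.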

The test is highly parameterizable: choose the number of terms, the number of 
generating topics, the number of documents in the test corpus, the number of 
topics per document, and the size of each document. There is also a hook for 
the functional form of the "decay" in the generating model, depending on how 
you want the test model to look.
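
To make those knobs concrete, here is a rough Python sketch of such a 
generator. It is an illustration only, not the Java code in the attached 
diff. The function names, the evenly spaced topic centers, the wrap-around 
distance, the uniform choice among a document's topics, and the 1/(1+d) decay 
are all assumptions of mine.

```python
import random


def make_topics(num_terms, num_topics, decay):
    """Build 'pyramid' topics: each peaks at its own center term and falls
    off with distance according to decay(distance), then is normalized."""
    topics = []
    for k in range(num_topics):
        center = (k * num_terms) // num_topics  # assumed: evenly spaced peaks
        weights = []
        for term in range(num_terms):
            dist = abs(term - center)
            dist = min(dist, num_terms - dist)  # assumed: vocabulary wraps
            weights.append(decay(dist))
        total = sum(weights)
        topics.append([w / total for w in weights])
    return topics


def make_corpus(topics, num_docs, topics_per_doc, doc_length, seed=0):
    """Sample documents as term-count vectors. Each document draws its words
    from topics_per_doc randomly chosen topics, picked uniformly per word."""
    rng = random.Random(seed)
    num_terms = len(topics[0])
    corpus = []
    for _ in range(num_docs):
        chosen = rng.sample(range(len(topics)), topics_per_doc)
        counts = [0] * num_terms
        for _ in range(doc_length):
            topic = topics[rng.choice(chosen)]
            term = rng.choices(range(num_terms), weights=topic)[0]
            counts[term] += 1
        corpus.append(counts)
    return corpus


# 26 terms, 5 generating topics, 500 documents, one topic per document.
# The document length of 100 and the decay function are placeholders.
topics = make_topics(26, 5, decay=lambda d: 1.0 / (1 + d))
corpus = make_corpus(topics, num_docs=500, topics_per_doc=1, doc_length=100)
```

Swapping the decay argument (for example a linear or exponential falloff) 
changes how strongly neighboring topics overlap, which is the point of the 
hook.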

The new test currently passes with 26 terms, 5 topics, and 500 documents, with 
one topic per document.
                
> LDA on Mahout 0.3 does not converge to correct solution for overlapping 
> pyramids toy problem.
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-399
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-399
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.3, 0.4, 0.5
>         Environment: Mac OS X 10.6.2, Hadoop 0.20.2, Mahout 0.3.
>            Reporter: Michael Lazarus
>            Assignee: Grant Ingersoll
>              Labels: lda, mahout
>             Fix For: 0.6
>
>         Attachments: 1000docs_26terms_5topics.jpg, MAHOUT-399.diff, 
> Overlapping Pyramids Toy Dataset.pdf, olt.tar
>
>
> Hello,
> Apologies if I have not labeled this correctly.
> I have run a toy problem on Mahout 0.3 (locally) for LDA that I used to test 
> Blei's C version of LDA that he posts on his site. It has an exact solution 
> that the LDA should converge to.  Please see the attached PDF that describes 
> the intended output.
> Is LDA working?  The following output indicates some sort of collapsing 
> behavior to me.
> T0    T1      T2      T3      T4
> x     w       x       u       x
> u     u       g       j       n
> l     r       i       m       l
> j     q       h       h       p
> v     p       e       i       q
> e     t       f       g       v
> d     s       d       f       o
> b     c       b       n       k
> y     f       c       l       m
> w     v       u       v       u
> c     d       p       y       t
> k     o       l       r       r
> i     b       j       k       j
> f     e       k       e       f
> g     x       y       s       y
> t     y       w       b       w
> h     i       s       p       s
> o     l       v       x       d
> q     j       t       d       i
> n     k       o       t       b
> The intended output is (again, please see attached):
> D     I       N       S       X
> d     i       n       s       x
> c     h       m       t       y
> e     j       o       r       w
> b     k       l       u       v
> f     g       p       q       a
> a     f       k       p       b
> g     l       q       v       u
> h     m       j       w       t
> y     u       r       o       c
> n     s       d       d       i
> s     e       x       f       f
> r     q       i       i       n
> m     v       w       c       o
> o     w       u       a       h
> q     n       s       h       g
> p     t       c       x       d
> t     x       f       e       l
> x     d       e       j       s
> w     y       g       b       j
> i     r       y       n       r
> u     o       h       y       m
> k     b       t       l       e
> v     c       a       m       k
> j     a       b       g       p
> l     p       v       k       q
> What tests do you run to make sure the output is correct?
> Thank you,
> Mike.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
