[ 
https://issues.apache.org/jira/browse/MAHOUT-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894640#action_12894640
 ] 

Ted Dunning commented on MAHOUT-399:
------------------------------------

I think that this needs more study.  I got an email from Mike, and it does seem 
that there is a reasonable likelihood that a serious problem remains.  
The difficulty is that I respect both Mike's and David's opinions highly, and 
they draw incompatible conclusions.  That still leaves me feeling that a 
problem is reasonably likely (at least a 10% chance).

{quote}
Hi Ted,

I have implemented a parallel version of LDA in C# that separates the 
processing, but not the data.  It is based on collapsed Gibbs sampling, and it 
converges to the correct solution on the overlapping pyramids dataset.

The last e-mail from David Hall indicated to me that he did not think the 
result for the dataset was conclusive evidence that there is a bug.  I disagree.  
The statistics of the dataset are overwhelming, and the computed likelihood of 
the corpus typically reaches its maximum at 5 topics.

It took me a while to get hadoop up and running on ec2 and then to get the 
Mahout examples running.  After David's e-mail indicating he did not think the 
result was conclusive, I decided to implement something for the environment I 
am working in.

I did not see much in the way of documentation for the Mahout implementation, 
but my guess at the algorithm was that it was using a variational method.  
Since I have not implemented that approach, I do not have an idea where the bug 
is yet.

Blei's C implementation converges as well.  On rare occasions it does not 
converge, but rerunning it almost always yields convergence.

I have run David Hall's implementation for different numbers of topics and 
repeatedly for each number of topics.  It has never converged.

I did send a document along describing the dataset and providing a sample so 
that someone else could corroborate the result.  I may have made a procedural 
error in running LDA even though I think I ran everything correctly.  

I would be interested in looking at the variational approach and then trying to 
debug the current algorithm, but I do not have time to do that at the moment.  
Another option would be to convince David Hall to take a second look.

I hope that helps a little.  I would be happy to talk to anyone in more detail.

Thanks,
Mike.
{quote}
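
For anyone who wants a small reference point without Hadoop, here is a minimal 
sketch of collapsed Gibbs sampling for LDA in Python.  It is not Mike's C# code 
and not the Mahout implementation.  The dataset generator is an assumption: the 
PDF is not reproduced here, so `pyramid_weights` and `make_pyramid_docs` are 
hypothetical helpers that guess at triangular word weights over the 25 letters 
a..y, peaked at d, i, n, s and x, with each document mixing two topics.  The 
hyperparameters `alpha` and `beta` are also illustrative values.

```python
import random


def pyramid_weights(center, vocab_size=25, height=5):
    """Triangular ("pyramid") word weights around a center, wrapping at the ends."""
    weights = []
    for w in range(vocab_size):
        dist = min(abs(w - center), vocab_size - abs(w - center))
        weights.append(max(0, height - dist))
    return weights


def make_pyramid_docs(n_docs=100, doc_len=50, centers=(3, 8, 13, 18, 23), seed=0):
    """Hypothetical generator: each document mixes two randomly chosen topics."""
    rng = random.Random(seed)
    topics = [pyramid_weights(c) for c in centers]
    vocab = range(len(topics[0]))
    docs = []
    for _ in range(n_docs):
        pair = rng.sample(range(len(centers)), 2)
        docs.append([
            rng.choices(vocab, weights=topics[rng.choice(pair)])[0]
            for _ in range(doc_len)
        ])
    return docs


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids.

    Returns the topic-by-word count matrix after the final sweep.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n_dk
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n_kw
    topic_total = [0] * n_topics                              # n_k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word
```

To inspect a run, call `gibbs_lda(make_pyramid_docs(), n_topics=5, vocab_size=25)` 
and sort each row of the returned counts to list the top letters per topic.  
Whether the peaks are recovered depends on the assumed generator and settings, so 
this only gives a baseline to compare against, not a verdict on the Mahout code.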

> LDA on Mahout 0.3 does not converge to correct solution for overlapping 
> pyramids toy problem.
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-399
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-399
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.3
>         Environment: Mac OS X 10.6.2, Hadoop 0.20.2, Mahout 0.3.
>            Reporter: Michael Lazarus
>            Priority: Critical
>         Attachments: olt.tar, Overlapping Pyramids Toy Dataset.pdf
>
>
> Hello,
> Apologies if I have not labeled this correctly.
> I have run a toy problem on Mahout 0.3 (locally) for LDA that I used to test 
> Blei's C version of LDA, which he posts on his site. It has an exact solution 
> that the LDA should converge to.  Please see the attached PDF, which describes 
> the intended output.
> Is LDA working?  The following output indicates some sort of collapsing 
> behavior to me.
> T0    T1      T2      T3      T4
> x     w       x       u       x
> u     u       g       j       n
> l     r       i       m       l
> j     q       h       h       p
> v     p       e       i       q
> e     t       f       g       v
> d     s       d       f       o
> b     c       b       n       k
> y     f       c       l       m
> w     v       u       v       u
> c     d       p       y       t
> k     o       l       r       r
> i     b       j       k       j
> f     e       k       e       f
> g     x       y       s       y
> t     y       w       b       w
> h     i       s       p       s
> o     l       v       x       d
> q     j       t       d       i
> n     k       o       t       b
> The intended output is (again, please see attached):
> D     I       N       S       X
> d     i       n       s       x
> c     h       m       t       y
> e     j       o       r       w
> b     k       l       u       v
> f     g       p       q       a
> a     f       k       p       b
> g     l       q       v       u
> h     m       j       w       t
> y     u       r       o       c
> n     s       d       d       i
> s     e       x       f       f
> r     q       i       i       n
> m     v       w       c       o
> o     w       u       a       h
> q     n       s       h       g
> p     t       c       x       d
> t     x       f       e       l
> x     d       e       j       s
> w     y       g       b       j
> i     r       y       n       r
> u     o       h       y       m
> k     b       t       l       e
> v     c       a       m       k
> j     a       b       g       p
> l     p       v       k       q
> What tests do you run to make sure the output is correct?
> Thank you,
> Mike.
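
On the question of tests, one simple automated check is to compare the 
top-ranked word of each topic against the expected peaks.  The sketch below 
does that for the two tables quoted above.  The helper name `top_word_report` 
is hypothetical, and the expected peaks (d, i, n, s, x) are read off the header 
row of the intended output rather than taken from the PDF.

```python
from collections import Counter


def top_word_report(top_words, expected_peaks):
    """Compare each topic's top-ranked word against the expected peak words."""
    counts = Counter(top_words)
    return {
        # Expected peaks that no topic ranks first.
        "missing": sorted(set(expected_peaks) - set(top_words)),
        # Words ranked first by more than one topic (a sign of collapse).
        "duplicated": sorted(w for w, c in counts.items() if c > 1),
        # True only if the top words match the peaks, in any topic order.
        "ok": sorted(top_words) == sorted(expected_peaks),
    }


expected = ["d", "i", "n", "s", "x"]
# First row of the Mahout 0.3 output table (T0..T4).
mahout = top_word_report(["x", "w", "x", "u", "x"], expected)
# First row of the intended output table.
intended = top_word_report(["d", "i", "n", "s", "x"], expected)
```

Here `mahout["ok"]` is False, with x duplicated and d, i, n and s missing, 
while `intended["ok"]` is True.  A fuller test would also compare the ranking 
below the top word, but that needs the exact dataset definition.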

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
