Github user jegonzal commented on the pull request:

    https://github.com/apache/spark/pull/476#issuecomment-41755910
  
    I would be happy to talk more about this after the OSDI deadline.  As far 
as storing the model (or, more precisely, the counts and samples) as an RDD, I 
think this really is necessary.  The model in this case should be on the order 
of the size of the data.  
    
    Essentially, what you want is the ability to join the term-topic counts with 
the document-topic counts for each token in a given document.  Given these two 
count tables (along with the background distribution of topics in the entire 
corpus), you can compute the new topic assignment for that token.  
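    To make the update concrete, here is a minimal sketch in plain Python of the 
standard collapsed Gibbs conditional for LDA.  The function name, parameters, and 
count arrays are illustrative only (not the API of this PR or of the GraphX 
implementation); it just shows how the two count tables plus the global topic 
totals determine the new assignment:

    ```python
    import random

    def sample_topic(term_topic, doc_topic, topic_totals,
                     alpha, beta, vocab_size, rng=random.random):
        """Collapsed Gibbs update for one token.

        term_topic[k]   -- count of this term assigned to topic k
        doc_topic[k]    -- count of tokens in this document assigned to topic k
        topic_totals[k] -- corpus-wide count of tokens assigned to topic k
        """
        num_topics = len(topic_totals)
        # p(z = k | rest) is proportional to
        #   (n_{w,k} + beta) * (n_{d,k} + alpha) / (n_k + V * beta)
        weights = [
            (term_topic[k] + beta) * (doc_topic[k] + alpha)
            / (topic_totals[k] + vocab_size * beta)
            for k in range(num_topics)
        ]
        # Sample a topic index proportionally to the unnormalized weights.
        u = rng() * sum(weights)
        for k, w in enumerate(weights):
            u -= w
            if u <= 0:
                return k
        return num_topics - 1
    ```

    In the distributed setting, the join described above supplies `term_topic` 
and `doc_topic` for each token, while `topic_totals` is a small broadcast table.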
    
    Here is an implementation of the collapsed Gibbs sampler for LDA using 
GraphX:  https://github.com/amplab/graphx/pull/113
    



