[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

jkbradley Thu, 02 Apr 2015 10:57:43 -0700

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4807#issuecomment-88991763
  
    @EntilZha I made a quick pass to get a sense of the structure.  My main 
comment is that these changes seem to mix the concepts of algorithm and model 
together.  I think that's why @witgo was confused about putting describeTopics 
within the learning state; that confused me a lot too.  You are right that we 
need some changes to the code to support other algorithms, but I'd like to keep 
the separation between algorithms and models.  Here's what I'd propose:
    
    Models:
    * abstract class LDAModel
    * class LocalLDAModel extends LDAModel
      * *This should be the de facto local LDAModel.  If algorithms store extra 
info with a local model, they can extend this class.*
    * trait WithTrainingResults
      * *This trait indicates that a model stores results for the whole 
training corpus.  This is important to store of course since recomputing it 
will not necessarily get the same results.  But it will also be useful to allow 
users to thrown out this extra info if they don't want to store it.*
    * class EMDistributedLDAModel extends LDAModel with WithTrainingResults
      * This (and other extensions of LDAModel) can be converted to a 
LocalLDAModel.
    * class GibbsDistributedLDAModel extends LDAModel with WithTrainingResults
    * *Notes*
      * I'm not listing a DistributedLDAModel trait since I don't think it 
would share anything across different algorithms' models.
    
    For algorithms, I like keeping everything in class LDA.  That can make 
calls out to Optimizers (1 for each algorithm), and the Optimizers can return 
an instance of LDAModel.  I'd eliminate LearningState and instead put 
everything in Optimizer, including initialization.  I don't see a need for 
multiple classes since they belong under the same concept.
    
    I'm sorry I took so long to respond, but I'll keep up-to-date from here 
out.  I hadn't realized quite so much was blocking on this PR.
    
    Also, I know I'm suggesting significant changes to the PR, but it should 
actually require fewer changes to master and still allow multiple algorithms.
    
    CC: @hhbyyh since you have investment in this via 
[https://github.com/apache/spark/pull/4419].  I believe OnlineLDA could fit 
under these abstractions.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor L...

Reply via email to