GitHub user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/4419#issuecomment-87821996
  
    @hhbyyh 
    
    **API**
    
I agree it will be important to let users run Online LDA in an online mini-batch setting, but I think the appropriate API for that is Spark Streaming.  I'd prefer to provide 2 public APIs: batch (as you have) and online (but using Streaming).  The method you wrote that takes a mini-batch can be kept private.
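
    To make that concrete, here's a rough sketch of how a Streaming API could wrap the private mini-batch method.  All of the names here (`OnlineLDAUpdater`, `StreamingLDA`, `update`) are placeholders for illustration, not the API in this PR:
    
    ```scala
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream
    
    // Placeholder for the PR's private mini-batch update step.
    trait OnlineLDAUpdater extends Serializable {
      def update(batch: RDD[(Long, Vector)]): OnlineLDAUpdater
    }
    
    // Hypothetical streaming wrapper in the spirit of StreamingKMeans:
    // each streaming mini-batch is fed into the private online update.
    class StreamingLDA(private var updater: OnlineLDAUpdater) extends Serializable {
      def trainOn(docs: DStream[(Long, Vector)]): Unit = {
        docs.foreachRDD { batch =>
          if (batch.count() > 0) {
            updater = updater.update(batch)
          }
        }
      }
    }
    ```
    
    StreamingKMeans already exposes this trainOn(DStream) pattern, so users would get a familiar entry point.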
    
    If users have a model and want to update it with new data, we can provide 
an initialModel parameter they can use to warm-start training with their old 
model.
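
    Roughly, warm-starting could look like this; `setInitialModel` is an assumed method name (it doesn't exist in MLlib yet), just to show the shape of the parameter:
    
    ```scala
    import org.apache.spark.mllib.clustering.LDA
    
    // setInitialModel is an assumed name, shown only to illustrate the
    // proposed warm-start parameter.
    val updatedModel = new LDA()
      .setK(20)
      .setInitialModel(oldModel)   // resume from the previously trained model
      .run(newCorpus)              // newCorpus: RDD[(Long, Vector)]
    ```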
    
    **Builder Pattern and Parameter parity**
    
Sounds good.  You're right that we should figure out the general API first.
    
    **Testing**
    
For correctness, a nice dataset is the 20 newsgroups dataset.  The Stanford NLP group provides a great tutorial on using it: http://nlp.stanford.edu/wiki/Software/Classifier/20_Newsgroups
    * I recommend we follow the steps in the first section ("Getting set up with the data") to pre-process it; a sketch of loading the result follows below.
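
    A rough sketch of loading the pre-processed output into the `RDD[(Long, Vector)]` corpus that LDA.run expects.  The path, vocabSize, and hashing trick are stand-ins; a real test would build an explicit vocabulary instead:
    
    ```scala
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD
    
    // Assumes `sc` is an existing SparkContext and that pre-processing
    // has left one tokenized document per line.
    val vocabSize = 10000
    val corpus: RDD[(Long, Vector)] = sc.textFile("20news-processed/")
      .map { line =>
        val counts = scala.collection.mutable.Map.empty[Int, Double]
        line.split("\\s+").foreach { term =>
          // Map each term into a fixed-size vocabulary via hashing.
          val idx = ((term.hashCode % vocabSize) + vocabSize) % vocabSize
          counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
        }
        Vectors.sparse(vocabSize, counts.toSeq)
      }
      .zipWithIndex()
      .map { case (vec, docId) => (docId, vec) }
    ```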
    
For scaling, you could just duplicate 20 newsgroups (a snippet for that is below), or you could use a bigger dataset.
    * Evan Sparks provided a Wikipedia dump here: s3://files.sparks.requester.pays/enwiki_category_text/ ("All in all there are 181 ~50MB files (actually closer to 10GB).")
    * The ACL machine translation workshops provide some big datasets, e.g.: http://www.statmt.org/wmt14/translation-task.html
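
    The duplication route could be as simple as this sketch, building on the `corpus` above (the number of copies is arbitrary):
    
    ```scala
    // Scale-test sketch: union several copies of the corpus and
    // re-assign document IDs so they stay unique.
    val copies = 10
    val scaledCorpus: RDD[(Long, Vector)] = sc
      .union(Seq.fill(copies)(corpus.map(_._2)))
      .zipWithIndex()
      .map { case (vec, docId) => (docId, vec) }
    ```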


