Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4419#issuecomment-87821996
@hhbyyh
**API**
I agree it will be important to let users run Online LDA in an online
mini-batch setting, but I think the appropriate API for that is Spark
Streaming. I'd prefer to provide two public APIs: batch (as you have) and
online (via Streaming). The method you wrote that takes a mini-batch can
stay private.
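To make that concrete, here is a rough sketch of how the private mini-batch method could back a Streaming-based public API. `StreamingLDA`, `submitMiniBatch`, and the exact types are illustrative assumptions, not existing MLlib classes:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Illustrative names only; nothing here is part of the current MLlib API.
class StreamingLDA(val numTopics: Int) {

  // The mini-batch update from this PR, kept private as suggested above.
  private def submitMiniBatch(batch: RDD[(Long, Vector)]): this.type = {
    // ... online variational update over one mini-batch ...
    this
  }

  // Public online API: each streaming mini-batch feeds the private update.
  def trainOn(docs: DStream[(Long, Vector)]): Unit = {
    docs.foreachRDD { batch =>
      if (!batch.isEmpty()) submitMiniBatch(batch)
    }
  }
}
```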
If users have a model and want to update it with new data, we can provide
an initialModel parameter they can use to warm-start training with their old
model.
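As a usage sketch (`setInitialModel` is just an assumed name for that parameter, following the builder-style setters `LDA` already has; `setK` and `run` are existing methods):

```scala
import org.apache.spark.mllib.clustering.LDA

// Warm-start: resume training from a previously fitted model on new data.
// setInitialModel is hypothetical; oldModel and newCorpus are the user's.
val updatedModel = new LDA()
  .setK(20)
  .setInitialModel(oldModel)   // previously trained model
  .run(newCorpus)              // RDD[(Long, Vector)] of the new documents
```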
**Builder Pattern and Parameter parity**
Sounds good: You're right that we should figure out the general API first.
**Testing**
For correctness, a good dataset is 20 newsgroups. The Stanford NLP group
provides a helpful tutorial on using it:
http://nlp.stanford.edu/wiki/Software/Classifier/20_Newsgroups
* I recommend we follow the steps in the first section ("Getting set up
with the data") to pre-process it.
For scaling, you could just duplicate 20 newsgroups (see the sketch after
this list), or you could use a bigger dataset.
* Evan Sparks provided a Wikipedia dump here:
s3://files.sparks.requester.pays/enwiki_category_text/ ("All in all there
are 181 ~50mb files (actually closer to 10GB).")
* The ACL machine translation workshops provide some big datasets, e.g.:
http://www.statmt.org/wmt14/translation-task.html
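For the duplication route, a minimal sketch (reusing the `corpus` from the preprocessing sketch above; duplicated documents need fresh ids, hence the re-indexing):

```scala
// Blow the corpus up n-fold for scaling runs; re-index so doc ids stay unique.
val n = 20
val scaledCorpus = sc.union(Seq.fill(n)(corpus.values))
  .zipWithIndex()
  .map { case (termCounts, docId) => (docId, termCounts) }
```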