Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4807#issuecomment-88991763
@EntilZha I made a quick pass to get a sense of the structure. My main
comment is that these changes seem to mix the concepts of algorithm and model
together. I think that's why @witgo was confused about putting describeTopics
within the learning state; that confused me a lot too. You are right that we
need some changes to the code to support other algorithms, but I'd like to keep
the separation between algorithms and models. Here's what I'd propose:
Models:
* abstract class LDAModel
* class LocalLDAModel extends LDAModel
* *This should be the de facto local LDAModel. If algorithms store extra
info with a local model, they can extend this class.*
* trait WithTrainingResults
* *This trait indicates that a model stores results for the whole
training corpus. This is important to store of course since recomputing it
will not necessarily get the same results. But it will also be useful to allow
users to thrown out this extra info if they don't want to store it.*
* class EMDistributedLDAModel extends LDAModel with WithTrainingResults
* This (and other extensions of LDAModel) can be converted to a
LocalLDAModel.
* class GibbsDistributedLDAModel extends LDAModel with WithTrainingResults
* *Notes*
* I'm not listing a DistributedLDAModel trait since I don't think it
would share anything across different algorithms' models.
For algorithms, I like keeping everything in class LDA. That can make
calls out to Optimizers (1 for each algorithm), and the Optimizers can return
an instance of LDAModel. I'd eliminate LearningState and instead put
everything in Optimizer, including initialization. I don't see a need for
multiple classes since they belong under the same concept.
I'm sorry I took so long to respond, but I'll keep up-to-date from here
out. I hadn't realized quite so much was blocking on this PR.
Also, I know I'm suggesting significant changes to the PR, but it should
actually require fewer changes to master and still allow multiple algorithms.
CC: @hhbyyh since you have investment in this via
[https://github.com/apache/spark/pull/4419]. I believe OnlineLDA could fit
under these abstractions.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]