Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4807#issuecomment-89374472
Here's a proposal. Let me know what you think!
@hhbyyh
> 1. Should different algorithms have different entrance in LDA, like
> runGibbs, runOnline, runEM? I kinda like it as the separation looks simple and
> clear.
Multiple run methods do make that separation clearer, but they also force
beginner users (who don't know what these algorithms are) to choose an
algorithm before they can try LDA. I'd prefer to keep a single run() method
and specify the algorithm as a String parameter.
One con of a single run() method is that users will get back an LDAModel
which they will need to cast to a more specific type (if they want to use the
specialization's extra functionality). I think we could eliminate this issue
later on by opening up each algorithm as its own Estimator (so that LDA would
become a meta-Estimator, if you will).
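To make the casting issue concrete, here is a minimal sketch. Every name in it (`EMLDAModel`, `OnlineLDAModel`, `logLikelihood`, `SingleRunSketch`) is a hypothetical stand-in for illustration, not the actual MLlib classes:

```scala
// Hypothetical stand-ins for illustration only; not the real MLlib types.
trait LDAModel { def k: Int }

class EMLDAModel(val k: Int) extends LDAModel {
  // Extra functionality offered only by the EM-specific model (illustrative).
  def logLikelihood: Double = -1.0
}

class OnlineLDAModel(val k: Int) extends LDAModel

object SingleRunSketch {
  // Single entry point: the algorithm is picked by name, and the result is
  // typed as the base LDAModel regardless of which algorithm produced it.
  def run(algorithm: String, k: Int): LDAModel = algorithm.toLowerCase match {
    case "em"     => new EMLDAModel(k)
    case "online" => new OnlineLDAModel(k)
    case other    => throw new IllegalArgumentException(s"Unknown algorithm: $other")
  }

  // Callers who want the specialization must match (or cast) on the runtime type.
  def emLogLikelihood(model: LDAModel): Option[Double] = model match {
    case em: EMLDAModel => Some(em.logLikelihood)
    case _              => None
  }
}
```

The pattern match in `emLogLikelihood` is the step that users would have to write themselves with a single run() method.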
> 2. Online LDA have several specific arguments. What's the recommended
> place to put them and their getter/setter, in LDA or optimizer ?
That is an issue, for sure. I'd propose:
```
trait Optimizer  // no public API

class EMOptimizer extends Optimizer {
  // public API: getters/setters for EM-specific parameters
  // private[mllib] API: methods for learning
}

class LDA {
  // takes "EM" / "Gibbs" / "online"
  def setOptimizer(optimizer: String)
  // takes Optimizer instance which user can configure beforehand
  def setOptimizer(optimizer: Optimizer)
  def getOptimizer: Optimizer
}
```
For users, Optimizer classes simply store algorithm-specific parameters.
Users can use the default Optimizer, or they can specify the optimizer via
String (with default algorithm parameters) or via Optimizer (with configured
algorithm parameters).
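As a rough, compilable sketch of how the proposal above might look from the user's side: the parameter `miniBatchFraction`, its default value, the `OnlineOptimizer` and `GibbsOptimizer` class names, and the choice of EM as the default are all assumptions made for illustration, not a committed API.

```scala
// Hypothetical sketch of the proposal; not the actual MLlib code.
trait Optimizer

class EMOptimizer extends Optimizer
class GibbsOptimizer extends Optimizer

class OnlineOptimizer extends Optimizer {
  // Example of an algorithm-specific parameter (name and default are assumed).
  private var miniBatchFraction: Double = 0.05

  def getMiniBatchFraction: Double = miniBatchFraction

  def setMiniBatchFraction(value: Double): this.type = {
    require(value > 0.0 && value <= 1.0,
      s"miniBatchFraction must be in (0, 1], got $value")
    miniBatchFraction = value
    this
  }
}

class LDA {
  // Assumed default optimizer for users who never call setOptimizer.
  private var optimizer: Optimizer = new EMOptimizer

  def getOptimizer: Optimizer = optimizer

  // Takes an Optimizer instance that the user has configured beforehand.
  def setOptimizer(value: Optimizer): this.type = {
    optimizer = value
    this
  }

  // Takes an algorithm name and uses that algorithm's default parameters.
  def setOptimizer(name: String): this.type = {
    optimizer = name.toLowerCase match {
      case "em"     => new EMOptimizer
      case "gibbs"  => new GibbsOptimizer
      case "online" => new OnlineOptimizer
      case other    => throw new IllegalArgumentException(s"Unknown optimizer: $other")
    }
    this
  }
}
```

With this shape, a beginner can write `new LDA().setOptimizer("online")`, while an advanced user can write `new LDA().setOptimizer(new OnlineOptimizer().setMiniBatchFraction(0.1))`.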
@EntilZha It might be easiest to revert to master (to make diffs easier),
but you can decide. It would be great if you have time to work on it in the
next couple of days, thanks. Unfortunately I'll be out of town (but online) on
Wednesday, but I hope it goes well!