GitHub user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/4419#issuecomment-94041947
Responding to comments in [https://github.com/apache/spark/pull/4807]:
{quote}
Another question is about existing parameters in LDA:
Except K, all other parameters (Alpha, Beta, Maxiteration, seed,
checkPointInterval) are useless or have different default values for Online
LDA. I'm not sure if we should move all those parameters to EM optimizer.
{quote}
--> I disagree. OnlineLDA could take most of these parameters, with
caveats:
* alpha, beta: These are hyperparameters of LDA. EM does not estimate
  these, but it could be modified to estimate them. The Online LDA algorithm
  you are following estimates these. I'd recommend:
  * LDA takes these parameters as fixed values.
  * Online LDA takes a special parameter ```estimateAlphaBeta: Boolean```
    which indicates whether or not it should estimate these hyperparameters.
    In the implementation, it should be easy to update or not update these
    values.
* maxIteration:
  * As I suggested before, I'd recommend that OnlineLDA take
    ```numIterations``` and ```miniBatchFraction``` instead of
    ```batchNumber``` (to mimic GradientDescent). ```numIterations``` will be
    shared by all LDA algorithms, but ```miniBatchFraction``` will be specific
    to OnlineLDA.
* seed: OnlineLDA uses randomness in sampling and should use a random seed.
I agree that ```checkpointInterval``` is not applicable to Online LDA.
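
To make the split above concrete, here is a rough sketch of how the parameters could be divided. It is only an illustration, not the code in this PR: the names ```LDAParams```, ```EMParams```, ```OnlineParams``` and ```MiniBatch.sample``` are hypothetical, and the Bernoulli sampling is just one assumed way to interpret ```miniBatchFraction```.

```scala
import scala.util.Random

// Hypothetical names throughout; this is a sketch, not the actual Spark API.

// Parameters shared by every LDA algorithm.
trait LDAParams {
  def k: Int
  def alpha: Double
  def beta: Double
  def numIterations: Int
  def seed: Long
}

// EM treats alpha and beta as fixed values and is the only
// algorithm that needs checkpointInterval.
case class EMParams(
    k: Int,
    alpha: Double,
    beta: Double,
    numIterations: Int,
    seed: Long,
    checkpointInterval: Int) extends LDAParams

// Online LDA adds miniBatchFraction and the estimateAlphaBeta switch.
case class OnlineParams(
    k: Int,
    alpha: Double,
    beta: Double,
    numIterations: Int,
    seed: Long,
    miniBatchFraction: Double,
    estimateAlphaBeta: Boolean) extends LDAParams {
  require(miniBatchFraction > 0.0 && miniBatchFraction <= 1.0,
    "miniBatchFraction must be in (0, 1]")
}

object MiniBatch {
  // Picks the document indices for one iteration by keeping each document
  // with probability miniBatchFraction. Seeding with seed + iteration makes
  // the choice reproducible across runs (an assumption of this sketch).
  def sample(numDocs: Int, params: OnlineParams, iteration: Int): Seq[Int] = {
    val rng = new Random(params.seed + iteration)
    (0 until numDocs).filter(_ => rng.nextDouble() < params.miniBatchFraction)
  }
}
```

With this layout, ```numIterations``` and ```seed``` live in the shared trait, while ```checkpointInterval``` and ```miniBatchFraction``` each appear only in the algorithm that uses them.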
{quote}
Actually I find LDA and OnlineLDA share quite few things and it's kind of
difficult to merge them together. Maybe for OnlineLDA, separating it into
another file is a better choice. (Later I'll provide an interface / example
for stream.)
{quote}
I agree that having the interface and the different algorithms in separate
files is probably best.