Github user avulanov commented on the pull request:
https://github.com/apache/spark/pull/1290#issuecomment-101762356
@wangzk The optimizer is set on the trainer: `trainer.LBFGSOptimizer` or
`trainer.SGDOptimizer`.
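For example (a minimal sketch; `trainer` stands for a trainer instance built from this branch, and the setter names mirror MLlib's `LBFGS`/`GradientDescent` optimizers, so treat them as assumptions):
```scala
// Select and configure LBFGS (quasi-Newton, batch):
trainer.LBFGSOptimizer
  .setNumIterations(100)
  .setConvergenceTol(1e-4)

// ...or select and configure minibatch gradient descent instead:
trainer.SGDOptimizer
  .setNumIterations(1000)
  .setStepSize(0.03)
  .setMiniBatchFraction(0.01)
```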
Below are my suggestions regarding the parameters, most of which are based
on my experience rather than on strong theoretical grounds.
LBFGS usually converges faster (i.e. needs far fewer iterations) than batch
gradient descent because it is a quasi-Newton method. Also, when the time
needed to make one pass over the whole data (i.e. an epoch) is small, LBFGS
usually converges faster (fewer iterations and less time) than SGD. I would
suggest using LBFGS for smaller data or simpler models.
However, it has been shown that for larger data SGD is superior, because the
time needed for its convergence does not depend on the size of the data, as
opposed to batch methods such as LBFGS
(http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf).
With regard to the SGD parameters, the number of iterations is hard to pick.
Usually one uses a validation set instead: training is stopped when the
accuracy of the model on that set reaches a satisfactory value or starts
decreasing. Another rule of thumb is that the number of iterations should not
be smaller than the size of the data, otherwise there is a risk of skipping
many samples during training.
A rule-of-thumb step size for SGD to start with is `0.03`; it too can be
chosen with a validation set. There are also more interesting strategies in
which the step size decreases at each iteration.
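A minimal sketch of such validation-based tuning, assuming the trainer API from this branch; `data` is the training RDD and `evaluateAccuracy` is a hypothetical helper that computes the accuracy of a model on a dataset:
```scala
// Hold out a validation set.
val Array(train, validation) = data.randomSplit(Array(0.8, 0.2), seed = 11L)

trainer.SGDOptimizer.setStepSize(0.03)   // rule-of-thumb starting value

var bestAccuracy = 0.0
var numIterations = 100
var stop = false
while (!stop) {
  trainer.SGDOptimizer.setNumIterations(numIterations)
  val model = trainer.train(train)
  val accuracy = evaluateAccuracy(model, validation)  // hypothetical helper
  if (accuracy > bestAccuracy) {
    bestAccuracy = accuracy
    numIterations += 100   // keep training longer while validation accuracy improves
  } else {
    stop = true            // accuracy started decreasing on the validation set
  }
}
```
For simplicity this retrains from scratch with an increasing iteration budget; warm-starting from the previous weights would be cheaper if the trainer supports it.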
There might be some confusion between `SGDOptimizer.setMiniBatchFraction`
and `trainer.batchSize`. The former is the minibatch size for SGD, i.e. the
fraction of data samples used in one iteration. The latter is the batch size
for data processing, i.e. how many data samples are stacked into a matrix to
take advantage of faster matrix-matrix operations in BLAS. You might want to
set these parameters so that they produce equally sized data batches.
However, it was shown that increasing the minibatch size leads to slower
convergence (http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf). At the same
time, batch processing makes it possible to process more samples per second,
so it is worth finding a balance between the two, for example with a
validation set. A good minibatch size to start with is between 100 and 1000,
keeping in mind that `miniBatchFraction` is minibatch/data.size.
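For example (a sketch of reconciling the two parameters; the `setBatchSize` call is just shorthand for setting `trainer.batchSize`, so the exact setter name is an assumption):
```scala
val n = data.count()                                  // e.g. 100000 samples
val miniBatchSize = 500                               // start in the 100-1000 range
val miniBatchFraction = miniBatchSize.toDouble / n    // 500 / 100000 = 0.005

trainer.SGDOptimizer.setMiniBatchFraction(miniBatchFraction)
trainer.setBatchSize(miniBatchSize)   // stack the same number of samples per BLAS batch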
Also, I would like to mention that the "stochastic" part of SGD in MLlib is
implemented with `RDD.sample`, which might be too expensive to perform on each
iteration, especially for larger data.
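Roughly, each iteration of MLlib's minibatch gradient descent does something like the following (a simplified sketch; the real loop also aggregates the gradients and updates the weights):
```scala
for (i <- 1 to numIterations) {
  // A fresh sample of the RDD is drawn on every iteration.
  val minibatch = data.sample(withReplacement = false, fraction = miniBatchFraction, seed = 42L + i)
  // ...compute the gradient on `minibatch` and update the weights...
}
```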