Github user avulanov commented on the pull request:

    https://github.com/apache/spark/pull/1290#issuecomment-101762356
  
    @wangzk The optimizer is set on the trainer: `trainer.LBFGSOptimizer` or
`trainer.SGDOptimizer`.
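
    For example (just a sketch: I assume here that `trainer` is the trainer from
this PR and that `LBFGSOptimizer`/`SGDOptimizer` expose the usual MLlib
`LBFGS`/`GradientDescent` setters):

    ```scala
    // Sketch only: `trainer` is assumed to be the trainer from this PR.
    // Quasi-Newton LBFGS, usually a good choice for smaller data or simpler models:
    trainer.LBFGSOptimizer
      .setNumIterations(100)
      .setConvergenceTol(1e-4)

    // Or SGD, for larger data:
    trainer.SGDOptimizer
      .setNumIterations(2000)
      .setStepSize(0.03)
      .setMiniBatchFraction(0.001)
    ```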
    
    Below are my suggestions with regard to the parameters; most of them are
based on my experience rather than on strong theoretical grounds.
    
    LBFGS usually converges faster (i.e. needs far fewer iterations) than batch
gradient descent because the former is a quasi-Newton method. Also, when the
time needed to make an iteration over the whole data (i.e. an epoch) is small,
LBFGS usually converges faster (fewer iterations and less time) than SGD. I
would suggest using LBFGS for smaller data or simpler models.
    
    However, it has been shown that for larger data SGD is superior, because the
time needed for its convergence does not depend on the size of the data, as
opposed to batch methods such as LBFGS
(http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scale-learning.pdf).
    
    With regard to the SGD parameters, the number of iterations is hard to pick.
Usually one uses a validation set instead: training is stopped when the accuracy
of the model on this set reaches a satisfying value or starts decreasing.
Another rule of thumb is that the number of iterations should not be smaller
than the size of the data, otherwise there is a risk of skipping a lot of
samples during training.
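
    A minimal sketch of that kind of validation-driven selection (the
`trainer.train` call and the `accuracyOn` helper are placeholders for whatever
the current PR code provides; retraining from scratch for each value is a
simplification of proper early stopping):

    ```scala
    // Pick the number of SGD iterations on a validation set (sketch).
    // `accuracyOn` is a hypothetical helper that scores a trained model.
    var best = (0, 0.0) // (numIterations, validation accuracy)
    for (numIter <- Seq(100, 200, 500, 1000, 2000)) {
      trainer.SGDOptimizer.setNumIterations(numIter)
      val model = trainer.train(trainingData)     // assumed trainer API
      val acc = accuracyOn(model, validationData) // hypothetical helper
      if (acc > best._2) best = (numIter, acc)
    }
    println(s"best numIterations = ${best._1}, accuracy = ${best._2}")
    ```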
    
    For the SGD step size, `0.03` is a rule-of-thumb value to start with; it can
also be chosen with the use of a validation set. There are also more interesting
strategies in which the step size decreases on each iteration.
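
    For example, a common decreasing schedule divides the initial step by the
square root of the iteration number (which is, as far as I remember, close to
what the MLlib updaters do internally):

    ```scala
    // Decreasing step size sketch: eta_t = eta_0 / sqrt(t), starting from 0.03.
    val eta0 = 0.03
    def stepSizeAt(t: Int): Double = eta0 / math.sqrt(t)
    ```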
    
    There might be some confusion between `SGDOptimizer.setMiniBatchFraction`
and `trainer.batchSize`. The former is the minibatch size for SGD, i.e. the
fraction of data samples used in one iteration. The latter is the size of the
batch for data processing, where data samples are stacked into a matrix to take
advantage of faster matrix-matrix operations in BLAS. You might want these
parameters to produce equally sized data batches.
    
    However, it has been shown that increasing the minibatch size leads to
slower convergence (http://www.cs.cmu.edu/~muli/file/minibatch_sgd.pdf). At the
same time, batch processing makes it possible to process more samples per
second. So it is worth finding a balance between the two, for example with a
validation set. A good minibatch value to start with is between 100 and 1000,
keeping in mind that `miniBatchFraction` is minibatch/data.size.
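
    For instance (a sketch; `setMiniBatchFraction` is the MLlib
`GradientDescent` setter, while the setter behind `trainer.batchSize` may be
named differently in the current code):

    ```scala
    // Choose a minibatch of ~500 samples and keep the BLAS stacking batch aligned.
    val n = trainingData.count()   // total number of samples
    val miniBatch = 500            // 100-1000 is a reasonable start
    trainer.SGDOptimizer.setMiniBatchFraction(miniBatch.toDouble / n)
    trainer.setBatchSize(miniBatch) // setter name assumed, see `trainer.batchSize` above
    ```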
    
    Also, I would like to mention that the "stochastic" part of SGD in MLlib is
implemented with `RDD.sample`, which might be too expensive to perform on each
iteration, especially for larger data.
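
    Roughly, each iteration does something like the following (simplified from
what I recall of MLlib's `GradientDescent.runMiniBatchSGD`):

    ```scala
    // Each SGD iteration draws a fresh sample of the RDD; the sampling itself
    // scans the data, which can dominate the cost on large inputs.
    for (i <- 1 to numIterations) {
      val batch = data.sample(withReplacement = false, miniBatchFraction, 42 + i)
      // compute the gradient on `batch` and update the weights ...
    }
    ```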

