[ https://issues.apache.org/jira/browse/MADLIB-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15732994#comment-15732994 ]

Xiaocheng Tang commented on MADLIB-1049:
----------------------------------------

- stepsize; also called the learning rate. If not chosen properly, it can 
cause the learning to diverge. It is often hard to tell which value is best, 
but at least a warning should be issued when divergence is detected. For 
example, one could monitor the loss value and make sure it is not growing to 
an abnormally large value (a small sketch of such a check follows this list).
- squared hinge loss; it behaves quite differently from the standard hinge 
loss due to the squared smoothing effect, so a stepsize that works well for 
SVM or logistic regression might not work well here (the two losses and their 
gradients are compared in a sketch after this list).
- nEpoch; a proper value depends on the buffer size, i.e., how many training 
examples are in one learning buffer. The larger the buffer, the larger a 
value nEpoch can take before overfitting sets in. In my experience a value 
below 10 is a safe choice; more experiments would be helpful before concrete 
suggestions can be given.
- intercept; because of regularization, the intercept term needs to be handled 
explicitly so that it is not regularized the way the weights are (see the 
update sketch after this list).
- trans(x); the buffer is transposed and copied before being fed into the 
training algorithm. The layout of the model (along with the implementations 
of lossAndGradient) needs to be changed accordingly if the transpose and copy 
are to be avoided, i.e., if a `MappedMatrix` is used instead (see the 
mapped-view sketch after this list).
- labels; assumed to be consecutive nonnegative integers starting from 0. 
This assumption should be verified before calling the UDA training function 
(a sketch of such a check follows this list).
- batch_size; a larger batch gives a more accurate gradient estimate but also 
takes more time to compute. When you put m examples in a minibatch, you do 
O(m) computation and use O(m) memory, but you reduce the uncertainty in the 
gradient by a factor of only O(sqrt(m)). In other words, there are 
diminishing marginal returns to putting more examples in the minibatch (the 
standard-error argument after this list spells this out). The theoretical 
reason why mini-batches help is still an active research topic; it appears 
related to large-batch methods often converging to sharp minima that lead to 
[poor generalization](https://arxiv.org/abs/1609.04836).
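
A minimal sketch of the divergence check mentioned for stepsize. The function 
name, the threshold factor, and throwing rather than merely warning are all 
illustrative choices, not MADlib code.

```cpp
#include <algorithm>
#include <cmath>
#include <stdexcept>
#include <string>

// Hypothetical divergence check: call once per iteration with the current
// loss and the loss observed at the first iteration. The factor 1e6 and the
// finiteness test are arbitrary, illustrative choices.
inline void checkDivergence(double currentLoss, double initialLoss,
                            double factor = 1e6) {
    if (!std::isfinite(currentLoss) ||
        currentLoss > factor * std::max(initialLoss, 1.0)) {
        throw std::runtime_error(
            "loss = " + std::to_string(currentLoss) +
            "; the stepsize is probably too large and learning has diverged");
    }
}
```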
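
For reference, the standard and squared hinge losses and their derivatives 
with respect to the score, written for the binary case with labels in 
{-1, +1}. This is a sketch of the textbook definitions, not the module's 
lossAndGradient.

```cpp
#include <algorithm>

// Label y in {-1, +1}, score s = w.dot(x).

// Standard hinge: max(0, 1 - y*s); non-smooth at the kink, and its
// subgradient is bounded by |y| = 1.
double hingeLoss(double s, double y) { return std::max(0.0, 1.0 - y * s); }
double hingeGrad(double s, double y) { return (y * s < 1.0) ? -y : 0.0; }

// Squared hinge: max(0, 1 - y*s)^2; smooth, but its gradient grows linearly
// with the margin violation, which is why a stepsize tuned for the standard
// hinge (or for log loss) can be too aggressive here.
double squaredHingeLoss(double s, double y) {
    double m = std::max(0.0, 1.0 - y * s);
    return m * m;
}
double squaredHingeGrad(double s, double y) {
    return -2.0 * y * std::max(0.0, 1.0 - y * s);
}
```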
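
A sketch of the intercept handling, assuming (purely for illustration) that 
the intercept is stored as the last coefficient of the model vector and that 
L2 regularization is used; the names are not MADlib's.

```cpp
#include <Eigen/Dense>

// One SGD step with L2 regularization applied to the weights only; the
// intercept (last coefficient here, by assumption) is updated with the plain
// loss gradient and is never shrunk toward zero.
void sgdStep(Eigen::VectorXd &model, const Eigen::VectorXd &grad,
             double lambda, double stepsize) {
    const Eigen::Index n = model.size();
    model.head(n - 1) -=
        stepsize * (grad.head(n - 1) + lambda * model.head(n - 1));
    model(n - 1) -= stepsize * grad(n - 1);
}
```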
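
To illustrate what avoiding the transpose and copy could look like: an 
Eigen::Map (MADlib's `MappedMatrix` is a similar wrapper) can view the buffer 
memory in place. The row-major layout below is an assumption and would have 
to match how the buffer is actually written.

```cpp
#include <Eigen/Dense>

typedef Eigen::Map<const Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
                                        Eigen::RowMajor> > RowMajorView;

// View the raw buffer without transposing or copying it; lossAndGradient
// would then index X(row, col) directly, and any code that relied on the
// transposed copy would have to be adapted.
double sumOfScores(const double *buffer, Eigen::Index nRows,
                   Eigen::Index nCols, const Eigen::VectorXd &model) {
    RowMajorView X(buffer, nRows, nCols);
    return (X * model).sum();  // one score per buffered row, no copy of X
}
```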
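
A sketch of the label check mentioned above; it only verifies the stated 
assumption (the distinct labels are exactly 0, 1, ..., k-1) and is not part 
of the existing code.

```cpp
#include <set>
#include <stdexcept>
#include <vector>

// Verify that the distinct labels are exactly {0, 1, ..., k-1} before the
// training UDA is invoked.
void verifyLabels(const std::vector<int> &labels) {
    std::set<int> distinct(labels.begin(), labels.end());
    int expected = 0;
    for (int label : distinct) {
        if (label != expected++) {
            throw std::invalid_argument(
                "labels must be consecutive nonnegative integers "
                "starting from 0");
        }
    }
}
```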
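
The O(m) versus O(sqrt(m)) trade-off above is just the standard-error 
argument for an average of m i.i.d. per-example gradients (treating the 
per-component variance informally as sigma^2):

```latex
\hat{g}_m = \frac{1}{m} \sum_{i=1}^{m} g_i,
\qquad
\operatorname{Var}\!\left(\hat{g}_m\right) = \frac{\sigma^2}{m},
\qquad
\operatorname{sd}\!\left(\hat{g}_m\right) = \frac{\sigma}{\sqrt{m}},
```

so the compute cost grows linearly in m while the gradient noise shrinks only 
like 1/sqrt(m), which is the diminishing return described above.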



> Create generic multi-class classifier
> -------------------------------------
>
>                 Key: MADLIB-1049
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1049
>             Project: Apache MADlib
>          Issue Type: New Feature
>          Components: Module:  Multiclass Classifier
>            Reporter: Frank McQuillan
>             Fix For: v1.10
>
>
> C++ part
> Single model that supports loss function as a parameter.  
> Loss functions to support: squared hinge loss (SVM) and cross entropy 
> (multinomial logistic regression).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
