GitHub user dongwang218 commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-46527471
  
    @mengxr, per your comments on stochastic update performance:
    1. Training data are reordered. The effect on runtime is quite small.
    2. Regularizing the weights after each update is indeed expensive. To avoid 
this, LazySquaredL2Updater and LazyL1Updater are added. During regularization, 
both lazy updaters accumulate weightShrinkage and weightTruncation. These two 
are applied to the sparse data when gradients are computed, which is 
implemented in computeDotProduct.
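
To illustrate the lazy idea in item 2, here is a minimal sketch for the squared-L2 case only, written in Python rather than Scala and not taken from the PR. It assumes logistic loss and plain SGD; the names `LazyL2Weights`, `shrinkage`, `dot`, and `eager_l2_step` are hypothetical stand-ins for the PR's weightShrinkage and computeDotProduct. The weights are stored as `shrinkage * v`, so regularization costs one scalar multiply per update instead of a pass over every weight. L1 truncation is not covered here.

```python
import math


class LazyL2Weights:
    """Weights stored as w = shrinkage * v (hypothetical sketch, not the PR code)."""

    def __init__(self, dim):
        self.v = [0.0] * dim
        self.shrinkage = 1.0  # accumulated L2 shrinkage factor

    def dot(self, indices, values):
        # Apply the accumulated shrinkage while taking the sparse dot product.
        return self.shrinkage * sum(self.v[i] * x for i, x in zip(indices, values))

    def update(self, indices, values, label, step_size, reg_param):
        # Logistic-loss gradient coefficient for one example (label in {0, 1}).
        margin = self.dot(indices, values)
        coeff = 1.0 / (1.0 + math.exp(-margin)) - label
        # Regularize lazily: one scalar multiply instead of touching all weights.
        # Assumes step_size * reg_param < 1 so the factor stays positive.
        self.shrinkage *= 1.0 - step_size * reg_param
        # Only the nonzero features of this example are touched.
        for i, x in zip(indices, values):
            self.v[i] -= step_size * coeff * x / self.shrinkage

    def to_dense(self):
        return [self.shrinkage * vi for vi in self.v]


def eager_l2_step(w, indices, values, label, step_size, reg_param):
    """Reference update that shrinks every weight on every step."""
    margin = sum(w[i] * x for i, x in zip(indices, values))
    coeff = 1.0 / (1.0 + math.exp(-margin)) - label
    w = [(1.0 - step_size * reg_param) * wi for wi in w]
    for i, x in zip(indices, values):
        w[i] -= step_size * coeff * x
    return w
```

Both variants compute the same update, w <- (1 - stepSize * regParam) * w - stepSize * gradient. The lazy one defers the dense scaling until the weights are read out with `to_dense`.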
    
    I measured runtime and quality on both the training and testing data of 
rcv1.binary. I report results for training on the testing set, as it is much 
bigger. --miniBatchFraction selects between batch, minibatch and stochastic 
updates. All runtimes were measured with local[5]; miniBatch uses 10% of the 
data.
    
    For L2 = 0.01 regularization
    | Method | numIterations | stepSize | AUC | PR-AUC | Real Time |
    | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
    | batch | 16 | 6.4 | 0.974 | 0.9771 | 1m 9s |
    | miniBatch | 16 | 6.4 | 0.974 | 0.977 | 1m 2s |
    | stochastic | 1 | 0.2 | 0.974 | 0.9773 | 1m 6s |
    
    For L1 = 0.001 regularization
    | Method | numIterations | stepSize | AUC | PR-AUC | Real Time | Nonzero Features |
    | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
    | batch | 16 | 6.4 | 0.944 | 0.950 | 53s | 201 |
    | miniBatch | 16 | 6.4 | 0.944 | 0.950 | 39s | 202 |
    | stochastic | 1 | 0.2 | 0.942 | 0.949 | 46s | 383 |
    
    The per-instance stochastic update has quality and runtime similar to 
batch and minibatch. Note that batch and minibatch used a much larger stepSize 
(6.4 versus 0.2) in order to converge in a small number of iterations and stay 
competitive. The stochastic update is also applicable to training on streaming 
data.

