Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/643#issuecomment-42691596
  
    @dongwang218 There are two issues with stochastic updates:
    
    1. It depends on the ordering of the training examples, and users are not 
instructed to randomize the training data. In many cases, positives and 
negatives are generated in different ways and the training dataset is a simple 
union of the two, so all examples of one class arrive first. Could you try 
sorting the examples by label before training and see how it affects the 
performance?
    2. We use sparse vectors to take advantage of savings in both storage and 
computation. If we apply the updater after every example, we lose the 
computational savings unless no regularization is used, because the 
regularization update touches every coordinate rather than just the nonzeros. 
Could you try training on `rcv1.binary` and see how it affects the running 
time?
    
    Thanks!

