Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/643#issuecomment-42691596
@dongwang218 There are two issues with stochastic updates:
1. It depends on the ordering of the training examples, and users are not
instructed to randomize the training data. In many cases, positives and
negatives are generated in different ways, and the training dataset is a simple
union of the two. Could you try sorting the examples by label before training
and see how it affects the performance?
2. We use sparse vectors to take advantage of both storage and computation.
If we apply the updater after every example, we lose the computational
advantage unless there is no regularization: an L2 (or L1) update shrinks
every coordinate, so each step costs O(d) instead of O(nnz). Could you try
training on `rcv1.binary` and see how it affects the running time?
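To make the ordering issue concrete, here is a toy sketch (plain NumPy, not Spark code, and not part of this PR) of one SGD pass of logistic regression fitting only an intercept: when the data are sorted by label, the final model is dominated by whichever class comes last, while a shuffled pass stays close to neutral.

```python
import numpy as np

def sgd_intercept(labels, lr=0.5):
    # One SGD pass of logistic regression fitting only an intercept b:
    # each example moves b toward its own label.
    b = 0.0
    for y in labels:
        p = 1.0 / (1.0 + np.exp(-b))
        b += lr * (y - p)          # per-example gradient step
    return b

neg, pos = [0] * 50, [1] * 50
b_pos_last = sgd_intercept(neg + pos)   # all negatives first, positives last
b_neg_last = sgd_intercept(pos + neg)   # all positives first, negatives last
rng = np.random.default_rng(0)
b_shuffled = sgd_intercept(rng.permutation(neg + pos).tolist())
```

The two sorted orderings produce intercepts of opposite sign even though the data are identical, which is exactly the order dependence described above.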
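On the second point, one common workaround (a general technique, not something this PR implements) is to apply the regularization lazily: keep a per-coordinate timestamp and fold in the accumulated shrinkage only when a coordinate is next touched, so each step costs O(nnz) rather than O(d). A minimal NumPy sketch, assuming L2 decay and dict-based sparse examples:

```python
import numpy as np

def naive_sgd_l2(examples, d, lr=0.1, lam=0.01):
    # Per-example L2 update: shrink EVERY coordinate, then take a sparse
    # gradient-like step. The dense shrinkage alone costs O(d) per example.
    w = np.zeros(d)
    touched = 0
    for x in examples:                  # x is a sparse example: {index: value}
        w *= (1.0 - lr * lam)           # dense shrinkage
        touched += d
        for i, v in x.items():
            w[i] += lr * v
            touched += 1
    return w, touched

def lazy_sgd_l2(examples, d, lr=0.1, lam=0.01):
    # Lazy variant: record when each coordinate was last brought current and
    # apply the accumulated shrinkage only when it is next touched.
    w = np.zeros(d)
    last = np.zeros(d, dtype=int)
    decay = 1.0 - lr * lam
    touched = 0
    for t, x in enumerate(examples, start=1):
        for i, v in x.items():
            w[i] *= decay ** (t - last[i])   # catch up the skipped decays
            last[i] = t
            w[i] += lr * v
            touched += 1
    w *= decay ** (len(examples) - last)     # bring every coordinate current
    return w, touched

# Synthetic sparse data: 100 examples, 5 nonzeros each, dimension 1000.
rng = np.random.default_rng(0)
d = 1000
examples = [{int(i): float(rng.normal())
             for i in rng.choice(d, 5, replace=False)}
            for _ in range(100)]
w_naive, touched_naive = naive_sgd_l2(examples, d)
w_lazy, touched_lazy = lazy_sgd_l2(examples, d)
```

Both variants reach the same weights (up to floating-point error), but the lazy one touches only the nonzero coordinates, which is the computational advantage at stake.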
Thanks!