Following up on Neal's question and Olivier's response:
I have a regression problem myself, and a variant of it is a
classification problem (same regressors, with the outcome coarsened to
binary); both are suitable for Stochastic Gradient Descent. The data is
ready to be imported from CSV files into numpy arrays, potentially line by
line.
What I am confused about is how the necessary scaling and (for
classification) shuffling can happen when iterating over minibatches. I
could probably apply some scaler, but I cannot follow the SGD
recommendation to fit a StandardScaler on the training data if that
training data is read in line by line, right? Or can I train the
StandardScaler online as well? (Probably not within the same loop, but in
a separate loop that learns how to scale each line/minibatch before the
SGD loop.)
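[For what it's worth, here is a minimal sketch of the two-pass idea, assuming a reasonably recent scikit-learn where StandardScaler supports partial_fit. The CSV path, the batch size, and the "last column is the target" layout are all assumptions for illustration (synthetic data stands in for the real file); within-batch shuffling is done with a permutation before each update:]

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real CSV file (assumed layout: feature
# columns first, target in the last column).
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=3.0, size=(2000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=2000)
path = os.path.join(tempfile.mkdtemp(), "train.csv")
np.savetxt(path, np.column_stack([X, y]), delimiter=",")

def iter_minibatches(path, batch_size=500):
    """Yield (X, y) minibatches, reading the CSV line by line."""
    with open(path) as f:
        rows = []
        for line in f:
            rows.append([float(v) for v in line.split(",")])
            if len(rows) == batch_size:
                arr = np.asarray(rows)
                yield arr[:, :-1], arr[:, -1]
                rows = []
        if rows:
            arr = np.asarray(rows)
            yield arr[:, :-1], arr[:, -1]

# Pass 1: learn the scaling statistics (mean/variance) incrementally.
scaler = StandardScaler()
for X_batch, _ in iter_minibatches(path):
    scaler.partial_fit(X_batch)

# Pass 2: scale each minibatch, shuffle it, then update the SGD model.
sgd = SGDRegressor(penalty="elasticnet", alpha=1e-4, l1_ratio=0.15,
                   random_state=0)
shuffle_rng = np.random.RandomState(1)
for X_batch, y_batch in iter_minibatches(path):
    perm = shuffle_rng.permutation(len(y_batch))
    sgd.partial_fit(scaler.transform(X_batch)[perm], y_batch[perm])
```

[For classification the same structure should work with SGDClassifier, whose partial_fit additionally needs the full list of classes up front via its classes argument.]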
And I am not even sure how cross-validation would work online, although I
think it is strongly recommended for SGD as well (for the regularization
term alpha, and also for l1_ratio if I hope to use elasticnet over my
multicollinear regressors).
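[One common out-of-core substitute for cross-validation is to hold out one batch as a fixed validation set and train a grid of candidate models side by side with partial_fit, keeping whichever scores best. A minimal sketch of that idea, with synthetic data standing in for the real stream and an arbitrary candidate grid; progressive (test-then-train) validation would be another option:]

```python
from itertools import product

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(3000, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=3000)

# Hold out the first batch as a fixed validation set (an assumption;
# any held-out slice of the stream would do).
X_val, y_val = X[:500], y[:500]

# One candidate model per (alpha, l1_ratio) combination.
candidates = {
    (alpha, l1): SGDRegressor(penalty="elasticnet", alpha=alpha,
                              l1_ratio=l1, random_state=0)
    for alpha, l1 in product([1e-5, 1e-4, 1e-3], [0.15, 0.5, 0.9])
}

# Stream the remaining data in minibatches, updating every candidate.
for start in range(500, 3000, 500):
    X_batch, y_batch = X[start:start + 500], y[start:start + 500]
    for model in candidates.values():
        model.partial_fit(X_batch, y_batch)

# Score each candidate on the held-out batch and pick the best.
scores = {params: model.score(X_val, y_val)
          for params, model in candidates.items()}
best = max(scores, key=scores.get)
```

[The memory cost is one model per hyperparameter combination, which is usually tiny for linear models, so the whole grid fits in RAM even when the data does not.]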
Of course, even if I get some help on this, writing the minibatch
iterators and the loop might still be too much for me at this stage.
Thanks a lot!
Laszlo
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general