2014-07-04 11:21 GMT+02:00 László Sándor <[email protected]>:
> Following up on Neal's question and Olivier's response:
>
> I myself have a regression problem too, though a version is a classification
> problem (same regressors, outcome coarsened to binary), both suitable for
> Stochastic Gradient Descent. The data is ready to be imported from CSV files
> to numpy arrays, potentially line by line.

Line by line is too granular: your computational cost will be dominated by the
overhead of Python boilerplate instead of the efficient inner Cython code that
actually does the learning in SGDClassifier. Load large chunks at a time, e.g.
100MB or more.

> What I am confused about is whether the necessary scaling and (for
> classification) shuffling could occur if you iterate over minibatches. I
> mean, probably I could apply any scaler, but I could not follow the SGD
> recommendations about fitting a StandardScaler on the training data if the
> training data is read in line by line, right? Or can I train StandardScaler
> online as well? (Probably not within the same loop, but a separate loop to
> learn how to scale each line/minibatch for the SGD loop.)

Either you do a first pass over your full data to fit an online scaler with a
partial_fit method (not in scikit-learn right now, but it can be implemented by
maintaining streaming estimates of the per-feature means and variances), or you
just load the first couple of GB of your data (as much as can fit in memory),
call StandardScaler().fit() on that subsample and ignore the rest. It's likely
that the mean and std estimates of that subsample are good enough to summarize
your whole dataset.

> And I am not even sure how cross-validation would work online

You have to implement your own cross-validation tool. The sklearn CV tools only
work for batch, in-memory learning.

-- 
Olivier
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
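
To make the "streaming estimate of the per-feature means and variances" idea
concrete, here is a minimal sketch using only numpy. The class name
StreamingScaler and its attributes are hypothetical (this is not a scikit-learn
API); it merges the statistics of each chunk into running totals with the
standard pairwise update for mean and sum of squared deviations. The synthetic
array stands in for chunks read from your CSV files.

```python
import numpy as np


class StreamingScaler:
    """Hypothetical helper: running per-feature mean and variance."""

    def __init__(self):
        self.n_ = 0        # number of rows seen so far
        self.mean_ = None  # running per-feature mean
        self.m2_ = None    # running sum of squared deviations from the mean

    def partial_fit(self, X):
        X = np.asarray(X, dtype=np.float64)
        n_b = X.shape[0]
        if n_b == 0:
            return self
        mean_b = X.mean(axis=0)
        m2_b = ((X - mean_b) ** 2).sum(axis=0)
        if self.n_ == 0:
            self.n_, self.mean_, self.m2_ = n_b, mean_b, m2_b
            return self
        # Merge the chunk statistics into the running statistics.
        n = self.n_ + n_b
        delta = mean_b - self.mean_
        self.m2_ = self.m2_ + m2_b + delta ** 2 * self.n_ * n_b / n
        self.mean_ = self.mean_ + delta * n_b / n
        self.n_ = n
        return self

    @property
    def scale_(self):
        # Population standard deviation (ddof=0), as StandardScaler uses.
        std = np.sqrt(self.m2_ / self.n_)
        std[std == 0.0] = 1.0  # leave constant features unscaled
        return std

    def transform(self, X):
        return (np.asarray(X, dtype=np.float64) - self.mean_) / self.scale_


# Synthetic stand-in for data that would normally be read chunk by chunk.
rng = np.random.RandomState(0)
data = rng.normal(loc=5.0, scale=3.0, size=(10000, 4))

scaler = StreamingScaler()
for start in range(0, data.shape[0], 1000):
    scaler.partial_fit(data[start:start + 1000])

scaled_chunk = scaler.transform(data[:1000])
```

The first pass over the data only calls partial_fit; a second pass can then
call transform on each chunk before handing it to the SGD model.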

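
For the validation part, one simple home-grown option is to hold out some
chunks instead of training on them. The sketch below is a single hold-out
split, not full k-fold cross-validation, and rests on a few assumptions:
iter_chunks is a hypothetical generator that fabricates already-scaled
synthetic chunks in place of a real CSV reader, and every fifth chunk is kept
aside for scoring. It uses SGDClassifier.partial_fit, which needs the full
list of classes on the first call.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_chunks(n_chunks=20, chunk_size=500, n_features=5, seed=0):
    """Hypothetical stand-in for a reader that yields large CSV chunks."""
    rng = np.random.RandomState(seed)
    w = rng.normal(size=n_features)
    for _ in range(n_chunks):
        X = rng.normal(size=(chunk_size, n_features))
        y = (X.dot(w) > 0).astype(int)  # linearly separable binary outcome
        yield X, y


clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared up front
held_out = []

for i, (X, y) in enumerate(iter_chunks()):
    if i % 5 == 4:
        # Keep every fifth chunk for validation; never train on it.
        held_out.append((X, y))
    else:
        clf.partial_fit(X, y, classes=classes)

# Score the model on the held-out chunks only.
n_correct = sum(int((clf.predict(X) == y).sum()) for X, y in held_out)
n_total = sum(len(y) for _, y in held_out)
accuracy = n_correct / float(n_total)
```

To get something closer to k-fold, you could keep k models and let model j
skip the chunks with i % k == j, at the cost of k times the training work. If
the held-out chunks are too big to keep in memory, score them as they stream
by with a model trained in an earlier pass.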