2014-07-04 11:21 GMT+02:00 László Sándor <[email protected]>:
> Following up on Neal's question and Olivier's response:
>
> I myself have a regression problem too, though a version is a classification
> problem (same regressors, outcome coarsened to binary), both suitable for
> Stochastic Gradient Descent. The data is ready to be imported from CSV files
> to numpy arrays, potentially line by line.

Line by line is too granular: your computational cost will be dominated
by the overhead of Python boilerplate rather than by the efficient inner
Cython code that actually does the learning in SGDClassifier. Load large
chunks instead, e.g. 100MB or more at a time.
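
A minimal sketch of chunked loading, assuming a headerless, purely
numeric CSV file with the target in the last column. The name
iter_chunks and its parameters are made up for illustration; the chunk
size is given in rows, so pick it so that one chunk is on the order of
100MB for your number of columns.

```python
import csv
import itertools

import numpy as np


def iter_chunks(path, chunk_rows=100_000, target_col=-1):
    """Yield (X, y) float arrays of at most chunk_rows rows each.

    Hypothetical helper: assumes a headerless CSV that contains only
    numeric fields, with the target stored in column target_col.
    """
    with open(path, newline="") as f:
        # Skip empty lines so every parsed row has the same width.
        reader = (row for row in csv.reader(f) if row)
        while True:
            rows = list(itertools.islice(reader, chunk_rows))
            if not rows:
                break
            data = np.asarray(rows, dtype=np.float64)
            y = data[:, target_col]
            X = np.delete(data, target_col, axis=1)
            yield X, y
```

Each (X, y) pair can then be handed to SGDClassifier.partial_fit, which
needs the full list of class labels on its first call (the classes
argument), since a single chunk may not contain every class.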

> What I am confused about is whether the necessary scaling and (for
> classification) shuffling could occur if you iterate over minibatches. I
> mean, probably I could apply any scaler, but I could not follow the SGD
> recommendations about fitting a StandardScaler on the training data if the
> training data is read in line by line, right? Or can I train StandardScaler
> online as well? (Probably not within the same loop, but a separate loop to
> learn how to scale each line/minibatch for the SGD loop.)

Either you do a first pass over your full data to fit an online scaler
with a partial_fit method (not in scikit-learn right now, but it can be
implemented by maintaining streaming estimates of the per-feature means
and variances).
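
One possible way to maintain those streaming estimates is sketched
below. StreamingScaler is a hypothetical helper written for this
example, not a scikit-learn class. It merges per-chunk means and sums
of squared deviations with the standard pairwise update for combining
the statistics of two samples, and it uses the population variance
(dividing by n).

```python
import numpy as np


class StreamingScaler:
    """Hypothetical online scaler: running per-feature mean and variance."""

    def __init__(self):
        self.n_ = 0        # number of rows seen so far
        self.mean_ = None  # running per-feature mean
        self.m2_ = None    # running sum of squared deviations from the mean

    def partial_fit(self, X):
        X = np.asarray(X, dtype=np.float64)
        n_b = X.shape[0]
        if n_b == 0:
            return self
        mean_b = X.mean(axis=0)
        m2_b = ((X - mean_b) ** 2).sum(axis=0)
        if self.n_ == 0:
            self.n_, self.mean_, self.m2_ = n_b, mean_b, m2_b
        else:
            # Merge the statistics of the new chunk into the running ones.
            n = self.n_ + n_b
            delta = mean_b - self.mean_
            self.mean_ = self.mean_ + delta * n_b / n
            self.m2_ = self.m2_ + m2_b + delta ** 2 * self.n_ * n_b / n
            self.n_ = n
        return self

    def transform(self, X):
        std = np.sqrt(self.m2_ / self.n_)
        std[std == 0.0] = 1.0  # leave constant features unscaled
        return (np.asarray(X, dtype=np.float64) - self.mean_) / std
```

The first loop over the chunks calls partial_fit only; the second loop
calls transform on each chunk before passing it to the SGD model.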

Or you just load the first couple of GB of your data (as much as can
fit in memory), call StandardScaler().fit() on that subsample and
ignore the rest. It's likely that the mean and std estimates from that
subsample are good enough to summarize your whole dataset.
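
A small sketch of the subsample approach. The random array below only
stands in for the real data, and the sizes are arbitrary. Note the
assumption that the head of the file is representative: if the rows
are sorted in some way, the estimates can be off.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a dataset that would not fit in memory.
rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(10_000, 3))

# Fit the scaler once, on the part that fits in memory.
X_head = X_all[:2_000]
scaler = StandardScaler().fit(X_head)

# Reuse the same fitted scaler for every chunk that is streamed later.
scaled_chunks = []
for start in range(0, X_all.shape[0], 1_000):
    chunk = X_all[start:start + 1_000]
    scaled_chunks.append(scaler.transform(chunk))
```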

> And I am not even sure how cross-validation would work online

You have to implement your own cross-validation tool. The sklearn CV
tools only work for batch, in-memory learning.
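
One way such a tool could look, as a rough sketch rather than a
recommended design: assign whole chunks to folds round-robin, keep one
SGDClassifier per fold, train each model on the chunks outside its
fold, then stream the data a second time to score each model on its
held-out chunks. The function streaming_kfold_scores and the
make_chunks argument (a callable returning a fresh iterator of (X, y)
chunks) are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def streaming_kfold_scores(make_chunks, classes, n_folds=3, seed=0):
    """Hypothetical chunk-wise k-fold; returns one accuracy per fold.

    Assumes the stream yields at least n_folds chunks, so that every
    fold has some held-out data.
    """
    models = [SGDClassifier(random_state=seed) for _ in range(n_folds)]

    # Pass 1: chunk i is held out from model i % n_folds and used to
    # train all the other models.
    for i, (X, y) in enumerate(make_chunks()):
        for k, model in enumerate(models):
            if k != i % n_folds:
                model.partial_fit(X, y, classes=classes)

    # Pass 2: score each model on the chunks it never saw.
    correct = np.zeros(n_folds)
    total = np.zeros(n_folds)
    for i, (X, y) in enumerate(make_chunks()):
        k = i % n_folds
        correct[k] += (models[k].predict(X) == y).sum()
        total[k] += len(y)
    return correct / total


# Usage on synthetic, already standardized data split into 6 chunks.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1_200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)


def demo_chunks():
    for start in range(0, 1_200, 200):
        yield X_demo[start:start + 200], y_demo[start:start + 200]


scores = streaming_kfold_scores(demo_chunks, classes=np.array([0, 1]))
```

This costs two passes over the data and n_folds models in memory, and
each model only sees its training chunks once, so the scores are for
single-epoch SGD.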

-- 
Olivier

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
