2012/1/23 Mathieu Blondel <[email protected]>:
> On Mon, Jan 23, 2012 at 7:24 PM, Olivier Grisel
> <[email protected]> wrote:
>> Have a look at `sklearn.linear_model.SGDClassifier` that supports a
>> partial_fit method in master that you can call several times with
>> slices of data.
>>
>> BTW: what is the structure of your data in PyTables? Is it mapped to a
>> scipy.sparse Compressed Sparse Row data structure? How many features do
>> you have in your dataset?
>
> Olivier, it would be nice if you could create a large scale sparse
> example using Numpy's memory mapped arrays. If I remember correctly,
> you mentioned those in the past but I never saw them actually used in
> combination with scikit-learn.

Hehe, that would be nice but I am afraid Gael won't let me do this as
part of the main scikit repository: large scale examples mean large
scale datasets ;)

Would be great to have a multi-label RCV1 loader though:

  http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html

Anyway, to work with memmapped CSR data, assuming you can load it in
memory at least once, you can use joblib to serialize it in a
memmappable format:

>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> twenty = fetch_20newsgroups_vectorized()
>>> twenty.data
<11314x107428 sparse matrix of type '<type 'numpy.float64'>'
        with 1727676 stored elements in Compressed Sparse Row format>

# create a workspace on the filesystem
>>> import os
>>> os.makedirs('/tmp/joblib')

# serialize the CSR structure into the FS workspace
>>> from sklearn.externals import joblib
>>> joblib.dump(twenty.data, '/tmp/joblib/20newsgroups_csr')
['/tmp/joblib/20newsgroups_csr',
 '/tmp/joblib/20newsgroups_csr_01.npy',
 '/tmp/joblib/20newsgroups_csr_02.npy',
 '/tmp/joblib/20newsgroups_csr_03.npy']

# mmap view on the serialized data wrapped into a CSR.
>>> twenty_mmap = joblib.load('/tmp/joblib/20newsgroups_csr', mmap_mode='r')
>>> twenty_mmap
<11314x107428 sparse matrix of type '<type 'numpy.float64'>'
        with 1727676 stored elements in Compressed Sparse Row format>
>>> twenty_mmap.data
memmap([ 0.05360563,  0.05360563,  0.16081688, ...,  0.07256885,
         0.02418962,  0.04837923])
>>> twenty_mmap.indices
memmap([   176,   1408,   1590, ..., 104327, 107241, 107259], dtype=int32)

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
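
For readers who want to combine the two suggestions in this thread
(calling partial_fit on slices of data, and keeping the CSR matrix in a
memory mapped file), a minimal sketch might look like the following. It
is not part of the original message. The synthetic data, the labels, the
batch size and the file name are all assumptions made up for
illustration, and the sketch imports the standalone joblib package
because later scikit-learn releases removed sklearn.externals.joblib.
Recent joblib versions also write a single file rather than the several
.npy files listed above.

```python
import os
import tempfile

import joblib
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import SGDClassifier

# Hypothetical synthetic data standing in for a real sparse dataset.
rng = np.random.RandomState(0)
n_samples, n_features = 1000, 50
X = sp.random(n_samples, n_features, density=0.1, format="csr",
              random_state=rng)

# Hypothetical binary labels: the sign of a random linear function of X.
w_true = rng.randn(n_features)
y = (X.dot(w_true) > 0).astype(int)

# Serialize the CSR matrix without compression so it can be memory mapped.
# The temporary directory is left in place because the mapping keeps the
# file open for as long as X_mmap is alive.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "X_csr.joblib")
joblib.dump(X, path)

# Reload it in read-only mmap mode.
X_mmap = joblib.load(path, mmap_mode="r")

# Feed the classifier one slice of rows at a time.
clf = SGDClassifier(random_state=0)
classes = np.unique(y)  # partial_fit needs the full label set up front
batch_size = 100        # assumption: tune to the memory you have
for start in range(0, n_samples, batch_size):
    stop = start + batch_size
    clf.partial_fit(X_mmap[start:stop], y[start:stop], classes=classes)

accuracy = clf.score(X_mmap, y)
print("training accuracy after one pass: %.3f" % accuracy)
```

Each X_mmap[start:stop] slice is copied into a small in-memory CSR
matrix, so only the current batch needs to be resident while the model
is updated. A single pass is shown here; several passes over the
batches, ideally in shuffled order, would usually be needed in practice.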
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
