Re: [Scikit-learn-general] : FIT() using PyTables with very hight scalable data

Olivier Grisel Mon, 23 Jan 2012 05:18:01 -0800

2012/1/23 Mathieu Blondel <[email protected]>:
> On Mon, Jan 23, 2012 at 7:24 PM, Olivier Grisel
> <[email protected]> wrote:
>> Have a look at `sklearn.linear_model.SGDClassifier` that supports a
>> partial_fit method in master that you can call several times with
>> slices of data.
>>
>> BTW: what is the structure of your data in PyTables? Is it mapped to a
>> scipy.sparse Compressed Sparse Row data structure? How many features do
>> you have in your dataset?
>
> Olivier, it would be nice if you could create a large scale sparse
> example using Numpy's memory mapped arrays. If I remember correctly,
> you mentioned those in the past but I never saw them actually used in
> combination with scikit-learn.


Hehe, that would be nice, but I am afraid Gael won't let me do this as
part of the main scikit repository: large-scale examples mean
large-scale datasets ;)

Would be great to have a multi-label RCV1 loader though:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html

Anyway, to work with memmapped CSR data, assuming you can load it in
memory at least once, you can use joblib to serialize it in a
memmappable format:

>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> twenty = fetch_20newsgroups_vectorized()
>>> twenty.data
<11314x107428 sparse matrix of type '<type 'numpy.float64'>'
        with 1727676 stored elements in Compressed Sparse Row format>

# create a workspace on the filesystem

>>> import os
>>> os.makedirs('/tmp/joblib')

# serialize the CSR structure into the FS workspace

>>> from sklearn.externals import joblib
>>> joblib.dump(twenty.data, '/tmp/joblib/20newsgroups_csr')
['/tmp/joblib/20newsgroups_csr',
'/tmp/joblib/20newsgroups_csr_01.npy',
'/tmp/joblib/20newsgroups_csr_02.npy',
'/tmp/joblib/20newsgroups_csr_03.npy']

# mmap view on the serialized data wrapped into a CSR.

>>> twenty_mmap = joblib.load('/tmp/joblib/20newsgroups_csr', mmap_mode='r')
>>> twenty_mmap
<11314x107428 sparse matrix of type '<type 'numpy.float64'>'
        with 1727676 stored elements in Compressed Sparse Row format>
>>> twenty_mmap.data
memmap([ 0.05360563,  0.05360563,  0.16081688, ...,  0.07256885,
        0.02418962,  0.04837923])
>>> twenty_mmap.indices
memmap([   176,   1408,   1590, ..., 104327, 107241, 107259], dtype=int32)


-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
