On 11 April 2012 at 10:55, Jean-Louis Durrieu <[email protected]> wrote:
> Hi all,
>
> On Feb 7, 2012, at 8:47 AM, Olivier Grisel wrote:
>
>> 2012/2/6 Shishir Pandey <[email protected]>:
>>>
>>> I am working with a dataset which is too big to fit in memory. Is there a
>>> way in scikits-learn to subsample the existing dataset while maintaining
>>> its properties, so that I can load it into my RAM?
>>
>> We don't have any "smart" subsampler in scikit-learn (like a GMM core
>> set extractor, for instance). Do you have any specific algorithm in
>> mind?
>
> I was thinking it would be a good idea to include such a mechanism in gmm.py.
> One solution would be to load files (with features stored in npz files, for
> instance) and "accumulate" the sufficient statistics. As a matter of fact,
> hmm.py includes code that would make this very easy to implement (instead of
> a loop over the sequences in obs, one could loop over the files in a
> directory).
>
> A further improvement would be to include some supervision and train
> specific components by loading only the data with the correct label (in an
> HTK fashion).
>
> Not sure when I can find time to do anything like that, though... That also
> means quite some refactoring for gmm.py, but I think that's worth it!
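[A minimal sketch of the sufficient-statistic accumulation described above, assuming a diagonal-covariance GMM; the function name, the dict layout, and the idea that the responsibilities come from an E-step under the current parameters are all illustrative assumptions, not code from gmm.py:]

```python
import numpy as np

def accumulate_stats(X, resp, stats=None):
    """Accumulate diagonal-covariance GMM sufficient statistics for one chunk.

    X    : (n_samples, n_features) chunk, e.g. loaded from one npz file
    resp : (n_samples, n_components) E-step responsibilities for this chunk
    """
    if stats is None:
        n_components, n_features = resp.shape[1], X.shape[1]
        stats = {
            "n": np.zeros(n_components),                 # sum_i r_ik
            "x": np.zeros((n_components, n_features)),   # sum_i r_ik * x_i
            "xx": np.zeros((n_components, n_features)),  # sum_i r_ik * x_i**2
        }
    stats["n"] += resp.sum(axis=0)
    stats["x"] += np.dot(resp.T, X)
    stats["xx"] += np.dot(resp.T, X ** 2)
    return stats
```

[The M-step would then recover weights, means and variances from these running sums alone, e.g. mean_k = x_k / n_k and var_k = xx_k / n_k - mean_k**2, so only one chunk ever needs to be in memory.]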
The fit method should not be changed. HMM is a very special case in
scikit-learn, and the fact that its samples are variable length should not be
abused to implement out-of-core tricks. Instead, out-of-core support should be
implemented using a new `partial_fit` method that accumulates statistics over
a small chunk of data represented as an in-memory numpy array.

I personally don't want to have file IO logic inside the estimator class
itself, since data stream (file, pipe, database or network) reading,
buffering, parsing and all the rest is application specific.

We could provide some generic utilities in scikit-learn to loop over a bunch
of sorted files on a filesystem, load them using some parser (e.g. our own
svmlight parser, the scipy.io utils, pytables or HDF5), and then iterate over
the chunks of data, calling partial_fit until some convergence criterion is
met or some core set that fits in memory is built.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
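[A minimal sketch of the chunk-iteration pattern described above, using MiniBatchKMeans as a stand-in estimator that does expose partial_fit (the GMM class does not); the features/*.npz layout and the "X" key are illustrative assumptions:]

```python
import glob
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Assumed layout: one chunk of samples per npz file, stored under key "X".
estimator = MiniBatchKMeans(n_clusters=16)
for path in sorted(glob.glob("features/*.npz")):
    X = np.load(path)["X"]    # load one chunk that fits in memory
    estimator.partial_fit(X)  # accumulate statistics incrementally
```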
