On 11 April 2012 at 10:55, Jean-Louis Durrieu <[email protected]> wrote:
> Hi all,
>
> On Feb 7, 2012, at 8:47 AM, Olivier Grisel wrote:
>
>> 2012/2/6 Shishir Pandey <[email protected]>:
>>>
>>> I am working with a dataset which is too big to fit in memory. Is there a
>>> way in scikits-learn to subsample the existing dataset while maintaining
>>> its properties, so that I can load it into my RAM?
>>
>> We don't have any "smart" subsampler in scikit-learn (like a GMM core
>> set extractor, for instance). Do you have any specific algorithm in
>> mind?
>
> I was thinking it would be a good idea to include such a mechanism in gmm.py.
> One solution would be to load files (with features stored in npz files, for
> instance) and "accumulate" the sufficient statistics. As a matter of fact,
> hmm.py includes code that would make this very easy to implement (instead of
> a loop over the sequences in obs, one could loop over the files in a
> directory).
>
> A further improvement would be to include some supervision and train
> specific components by loading only the data with the correct label (in an
> HTK fashion).
>
> Not sure when I can find time to do anything like that, though... That also
> means quite some refactoring for gmm.py, but I think that's worth it!
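[A minimal sketch of the sufficient-statistic accumulation described above, assuming a diagonal-covariance GMM; the function name, the dict layout, and the idea that the responsibilities come from an E-step under the current parameters are all illustrative assumptions, not code from gmm.py:]

```python
import numpy as np

def accumulate_stats(X, resp, stats=None):
    """Accumulate diagonal-covariance GMM sufficient statistics for one chunk.

    X    : (n_samples, n_features) chunk, e.g. loaded from one npz file
    resp : (n_samples, n_components) E-step responsibilities for this chunk
    """
    if stats is None:
        n_components, n_features = resp.shape[1], X.shape[1]
        stats = {
            "n": np.zeros(n_components),                 # sum_i r_ik
            "x": np.zeros((n_components, n_features)),   # sum_i r_ik * x_i
            "xx": np.zeros((n_components, n_features)),  # sum_i r_ik * x_i**2
        }
    stats["n"] += resp.sum(axis=0)
    stats["x"] += np.dot(resp.T, X)
    stats["xx"] += np.dot(resp.T, X ** 2)
    return stats
```

[The M-step would then recover weights, means and variances from these running sums alone, e.g. mean_k = x_k / n_k and var_k = xx_k / n_k - mean_k**2, so only one chunk ever needs to be in memory.]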
The fit method should not be changed. HMM is a very special case in
scikit-learn, and the fact that its samples are variable length should not be
abused to implement out-of-core tricks. Instead, out-of-core support should be
implemented using a new `partial_fit` method that accumulates statistics over
a small chunk of data represented as an in-memory numpy array.

I personally don't want to have file IO logic inside the estimator class
itself, since data stream (file, pipe, database or network) reading,
buffering, parsing and all the rest is application specific.

We could provide some generic utilities in scikit-learn to loop over a bunch
of sorted files on a filesystem, load them using some parser (e.g. our own
svmlight parser, the scipy.io utils, pytables or HDF5), and then iterate over
the chunks of data, calling partial_fit until some convergence criterion is
met or some core set that fits in memory is built.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
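[A minimal sketch of the chunk-iteration pattern described above, using MiniBatchKMeans as a stand-in estimator that does expose partial_fit (the GMM class does not); the features/*.npz layout and the "X" key are illustrative assumptions:]

```python
import glob
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Assumed layout: one chunk of samples per npz file, stored under key "X".
estimator = MiniBatchKMeans(n_clusters=16)
for path in sorted(glob.glob("features/*.npz")):
    X = np.load(path)["X"]    # load one chunk that fits in memory
    estimator.partial_fit(X)  # accumulate statistics incrementally
```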
