Hi Vinay.

There are some out-of-core algorithms in sklearn, but they are unfortunately not very easy to find. SGDClassifier and MiniBatchKMeans both support a "partial_fit" method, which lets you train on one chunk of data at a time and so makes out-of-core and online learning possible.
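
Roughly, the partial_fit pattern could look like the sketch below. This is only an illustration, not an official example: the toy documents, the labels, and the iter_minibatches helper are all made up, and I'm assuming HashingVectorizer for the text features because it is stateless and never needs to see the whole corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_minibatches(docs, labels, batch_size):
    """Hypothetical helper: yield (docs, labels) chunks.

    Here it slices in-memory lists; in a real out-of-core setup it would
    read the next chunk from disk or a database instead.
    """
    for start in range(0, len(docs), batch_size):
        stop = start + batch_size
        yield docs[start:stop], labels[start:stop]


# Made-up toy corpus, purely for illustration.
docs = [
    "cheap pills buy now",
    "meeting agenda attached",
    "buy cheap watches today",
    "notes from the project meeting",
]
labels = [1, 0, 1, 0]

# Stateless vectorizer: transform() works without a prior fit().
vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(random_state=0)

# partial_fit must be told every possible class on the first call,
# because a single chunk may not contain all of them.
all_classes = np.array([0, 1])

for batch_docs, batch_labels in iter_minibatches(docs, labels, batch_size=2):
    X_batch = vectorizer.transform(batch_docs)
    clf.partial_fit(X_batch, batch_labels, classes=all_classes)

predictions = clf.predict(vectorizer.transform(["buy cheap pills"]))
```

Only one chunk is vectorized and held in memory at a time. MiniBatchKMeans can be fed chunks the same way through its own partial_fit, just without labels.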
I'm not sure there are any examples of that in the docs. The naive Bayes classifiers will probably also get this interface soon. I think we really need some better docs / examples for that :-/

Cheers,
Andy

On 02/05/2013 02:48 PM, Vinay B wrote:
> Hi,
> From my newbie experiments last week, it appears that scikit loads all
> documents into memory (classification (training & testing) and
> clustering). This approach might not scale to the millions of (text)
> docs that I want to process.
>
> 1. Is there a recommended way to deal with large datasets? Examples?
> 2. I've also been looking at gensim, which offers a memory-efficient
> way to ingest large datasets. Is there a way to
> a. use the same approach with scikit AND / OR
> b. use the gensim models with scikit's clustering and classification
> capabilities?
>
> Thanks for all the help so far. It has been very useful.
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general