Hi Vinay.
There are some out-of-core algorithms in sklearn, but they are unfortunately
not very easy to find.
SGDClassifier and MiniBatchKMeans both support a "partial_fit"
method, which makes out-of-core and online learning possible.

I'm not sure there are any examples for that.
The naive Bayes classifiers will probably also get this interface soon.
I think we really need some better docs / examples for that :-/
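
In the meantime, here is a rough sketch of how partial_fit might be used
on text that arrives in chunks. The toy corpus, the iter_minibatches helper,
the batch size and n_features are all made up for illustration, and I'm
assuming a stateless HashingVectorizer so that no pass over the full corpus
is needed to build a vocabulary. Treat it as a starting point, not as the
recommended recipe.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_minibatches(docs, labels, batch_size):
    """Hypothetical helper: yield chunks, standing in for reading from disk."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size], labels[start:start + batch_size]


# Toy corpus, invented for this sketch; real data would be streamed.
docs = ["cheap pills buy now", "meeting at noon tomorrow",
        "buy cheap watches now", "lunch meeting moved to friday"] * 25
labels = [1, 0, 1, 0] * 25

# Stateless vectorizer: transform() works without a prior fit over all docs.
vectorizer = HashingVectorizer(n_features=2 ** 12)  # size chosen arbitrarily

clf = SGDClassifier(random_state=0)
km = MiniBatchKMeans(n_clusters=2, random_state=0)

# partial_fit cannot infer the label set from one chunk, so pass it explicitly.
all_classes = np.array([0, 1])

for batch_docs, batch_labels in iter_minibatches(docs, labels, batch_size=20):
    X_batch = vectorizer.transform(batch_docs)
    clf.partial_fit(X_batch, batch_labels, classes=all_classes)  # supervised
    km.partial_fit(X_batch)                                      # unsupervised

X_new = vectorizer.transform(["buy cheap pills", "meeting on friday"])
predictions = clf.predict(X_new)
clusters = km.predict(X_new)
```

Only one chunk is held in memory at a time, so the same loop should carry
over to a generator that reads documents from files or a database.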

Cheers,
Andy

On 02/05/2013 02:48 PM, Vinay B, wrote:
> Hi,
> From my newbie experiments last week, it appears that scikit loads all
> documents into memory (for classification, both training and testing, and
> for clustering). This approach might not scale to the millions of (text)
> docs that I want to process.
>
> 1. Is there a recommended way to deal with large datasets? Examples?
> 2. I've also been looking at gensim, which offers a memory-efficient
> way to ingest large datasets. Is there a way to
> a. use the same approach with scikit AND / OR
> b. Use the gensim models with scikit's clustering and classification
> capabilities?
>
>
> Thanks for all the help so far. It has been very useful.
>
> ------------------------------------------------------------------------------
> Free Next-Gen Firewall Hardware Offer
> Buy your Sophos next-gen firewall before the end March 2013
> and get the hardware for free! Learn more.
> http://p.sf.net/sfu/sophos-d2d-feb
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general


