Hi,
From my newbie experiments last week, it appears that scikit loads all
documents into memory, both for classification (training and testing)
and for clustering. This approach might not scale to the millions of
(text) docs that I want to process.
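
For concreteness, here is a minimal sketch of the kind of pipeline I have
been running (the documents and labels are just placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # The whole corpus sits in a Python list, and fit_transform builds
    # the full term-document matrix in RAM before anything is trained.
    docs = ["first document text", "second document text"]
    labels = [0, 1]
    X = CountVectorizer().fit_transform(docs)
    clf = MultinomialNB().fit(X, labels)

With millions of documents, both the list and the vectorization step are
where I expect this to stop scaling.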

1. Is there a recommended way to deal with large datasets? Examples?
2. I've also been looking at gensim, which offers a memory-efficient
way to ingest large datasets (see the sketches below). Is there a way to
   a. use the same approach with scikit, AND / OR
   b. use the gensim models with scikit's clustering and classification
   capabilities?
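
Regarding 2, this is the gensim idiom I had in mind: the corpus is an
iterator that reads one document at a time from disk, so the full set is
never held in memory at once ('corpus.txt', one document per line, is
just a placeholder name):

    from gensim import corpora

    class StreamedCorpus(object):
        """Yield one bag-of-words vector per document, read lazily."""
        def __init__(self, path, dictionary):
            self.path = path
            self.dictionary = dictionary
        def __iter__(self):
            for line in open(self.path):
                yield self.dictionary.doc2bow(line.lower().split())

    # The dictionary is also built from a stream, not an in-memory list.
    dictionary = corpora.Dictionary(
        line.lower().split() for line in open('corpus.txt'))
    corpus = StreamedCorpus('corpus.txt', dictionary)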
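
And for 2b, this is the kind of bridge I am imagining, though I have not
verified it end to end: gensim's matutils.corpus2csc converts a (possibly
streamed) corpus into a scipy sparse matrix, which scikit's estimators
should accept. Note that this does materialize the matrix, so it only
helps if the vectorized form fits in memory (reusing the corpus and
dictionary from the sketch above):

    from gensim import matutils
    from sklearn.cluster import MiniBatchKMeans

    # corpus2csc returns a terms x docs matrix; transpose so that
    # documents are rows, as scikit expects.
    X = matutils.corpus2csc(corpus, num_terms=len(dictionary)).T
    km = MiniBatchKMeans(n_clusters=10).fit(X)

Is this the intended route, or is there something more direct?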


Thanks for all the help so far. It has been very useful.
