Hi,

From my newbie experiments last week, it appears that scikit-learn loads all documents into memory for both classification (training and testing) and clustering. This approach might not scale to the millions of text documents that I want to process.
1. Is there a recommended way to deal with large datasets? Are there examples?

2. I've also been looking at gensim, which offers a memory-efficient way to ingest large datasets. Is there a way to:
   a. use the same approach with scikit-learn, and/or
   b. use gensim models with scikit-learn's clustering and classification capabilities?

Thanks for all the help so far. It has been very useful.
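For concreteness, here is a minimal sketch of the kind of streaming setup I have in mind, using HashingVectorizer (which is stateless, so no vocabulary needs to be held in memory) together with SGDClassifier.partial_fit. The toy batches() generator is just a stand-in for reading millions of documents from disk in chunks; I'm not sure this is the recommended approach, which is why I'm asking:

```python
# Hypothetical sketch: out-of-core text classification.
# batches() is a placeholder for streaming documents from disk chunk by chunk.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)  # stateless: no fit() pass over the corpus
clf = SGDClassifier(random_state=0)

def batches():
    # stand-in for reading millions of docs from disk, one chunk at a time
    yield (["good movie", "great film"], [1, 1])
    yield (["bad movie", "awful film"], [0, 0])

classes = np.array([0, 1])  # all classes must be declared up front for partial_fit
for docs, labels in batches():
    X = vectorizer.transform(docs)  # only the current batch is in memory
    clf.partial_fit(X, labels, classes=classes)

pred = clf.predict(vectorizer.transform(["great movie"]))
print(pred)
```

Is this roughly the intended pattern, and does something similar exist for clustering (e.g. MiniBatchKMeans)?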