2013/2/6 Vinay B, <vybe3...@gmail.com>: > Hi > Almost there (I hope) , but not quite: > I put my code up at https://gist.github.com/balamuru/4726232 for > readability. Reading a directory of text files in chunks of 5, and returning > them in a dictionary (key= filename, value= text contents) > > I wanted to perform a clustering operation (haven't witten that part yet) > but from my output, it looks like I'm not incrementing the vectorizer. From > your previous response, were you thinking I was trying to classify the > output into predetermined categories? > > See my 2 questions in the code i.e. > #Question 1: I don't know class information because this is an > unsupervised learning (clustering) operation. Hence I can't perform a > partial_fit
MiniBatchKMeans is unsupervised (clustering) and supports incremental out of core learning with partial_fit. > #Question2 : WRT Question 1, What should I be passing into the > clustering algorithm. I would first have to incrementally accumulate data in > the vectorizer The HashingVectorizer does not accumulate anything: that's the all point of streaming data through it! If you really are in a large scale situation then you should not make the assumption that the whole dataset (vectorized or not) can fit in memory. -- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel ------------------------------------------------------------------------------ Free Next-Gen Firewall Hardware Offer Buy your Sophos next-gen firewall before the end March 2013 and get the hardware for free! Learn more. http://p.sf.net/sfu/sophos-d2d-feb _______________________________________________ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general