2013/2/6 Vinay B, <vybe3...@gmail.com>:
> Hi
> Almost there (I hope) , but not quite:
> I put my code up at https://gist.github.com/balamuru/4726232 for
> readability. Reading a directory of text files in chunks of 5, and returning
> them in a dictionary (key= filename, value= text contents)
>
> I wanted to perform a clustering operation (haven't witten that part yet)
> but from my output, it looks like I'm not incrementing the vectorizer. From
> your previous response, were you thinking I was trying to classify the
> output into predetermined categories?
>
> See my 2 questions in the code i.e.
>     #Question 1: I don't know class information because this is an
> unsupervised learning (clustering) operation. Hence I can't perform a
> partial_fit

MiniBatchKMeans is unsupervised (clustering) and supports incremental
out of core learning with partial_fit.

>     #Question2 : WRT Question 1, What should I be passing into the
> clustering algorithm. I would first have to incrementally accumulate data in
> the vectorizer

The HashingVectorizer does not accumulate anything: that's the all
point of streaming data through it! If you really are in a large scale
situation then you should not make the assumption that the whole
dataset (vectorized or not) can fit in memory.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
Free Next-Gen Firewall Hardware Offer
Buy your Sophos next-gen firewall before the end March 2013 
and get the hardware for free! Learn more.
http://p.sf.net/sfu/sophos-d2d-feb
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to