So I tried your recommendations. partial_fit works for the first few
chunks, then BOOM! It fails with a ValueError (traceback below). My code
looks very similar to the example in
http://scikit-learn.org/dev/auto_examples/document_clustering.html#example-document-clustering-py
I wonder what I'm doing wrong this time?
Relevant code

from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=opts.n_features,
                               stop_words='english',
                               non_negative=False, norm='l2',
                               binary=False)

num_clusters = 5

km = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1,
                     init_size=1000,
                     batch_size=1000, verbose=1)

for doc_dict in iter_documents("/home/vinayb/data/reuters-21578-subset-4315",
                               files_per_chunk):
    # add the docs in chunks of size 'files_per_chunk'
    X_transform_counts = vectorizer.transform(doc_dict.values())
    # NOT NEEDED:
    # X_fit_transform_counts = vectorizer.fit_transform(doc_dict.values())

    # fit this chunk of data
    km.partial_fit(X_transform_counts)  # <================ Error Here

    # I won't know the document class in advance for a clustering operation
    print "## counts: " + str(X_transform_counts.shape) + " "


Output

## counts: (10, 10000)
## counts: (10, 10000)
## counts: (10, 10000)
[_mini_batch_step] Reassigning 3 cluster centers.
Traceback (most recent call last):
  File "/home/vinayb/workspace/LearnSciKitLearn/examples/ScalableClusteringApp.py", line 109, in <module>
    km.partial_fit(X_transform_counts)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 1280, in partial_fit
    verbose=self.verbose)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 888, in _mini_batch_step
    centers[to_reassign] = new_centers
ValueError: setting an array element with a sequence.

On Thu, Feb 7, 2013 at 4:06 AM, Olivier Grisel <olivier.gri...@ensta.org> wrote:

> 2013/2/6 Vinay B, <vybe3...@gmail.com>:
> > Hi
> > Almost there (I hope), but not quite:
> > I put my code up at https://gist.github.com/balamuru/4726232 for
> > readability. It reads a directory of text files in chunks of 5 and
> > returns them in a dictionary (key = filename, value = text contents).
> >
> > I wanted to perform a clustering operation (haven't written that part
> > yet), but from my output, it looks like I'm not incrementing the
> > vectorizer. From your previous response, were you thinking I was trying
> > to classify the output into predetermined categories?
> >
> > See my 2 questions in the code i.e.
> >     #Question 1: I don't know class information because this is an
> > unsupervised learning (clustering) operation. Hence I can't perform a
> > partial_fit
>
> MiniBatchKMeans is unsupervised (clustering) and supports incremental
> out of core learning with partial_fit.
>
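
To illustrate why no class labels are needed, here is a toy, standard-library-only sketch (Python 3) of an online k-means update of the kind partial_fit performs. The class name ToyMiniBatchKMeans, its parameters, and the sample data are hypothetical and made up for this illustration. It is not scikit-learn's implementation, only the general idea of keeping centers plus per-center counts and updating them one chunk at a time.

```python
import random


class ToyMiniBatchKMeans:
    """Toy online k-means: the only state is the centers and per-center counts."""

    def __init__(self, n_clusters, seed=0):
        self.n_clusters = n_clusters
        self.rng = random.Random(seed)
        self.centers = []  # one list of floats per cluster
        self.counts = []   # samples assigned to each center so far

    def _nearest(self, x):
        # Index of the center with the smallest squared Euclidean distance to x.
        def sq_dist(center):
            return sum((a - b) ** 2 for a, b in zip(center, x))
        return min(range(len(self.centers)),
                   key=lambda i: sq_dist(self.centers[i]))

    def partial_fit(self, batch):
        # Only feature vectors are passed in; there is no label argument.
        if not self.centers:
            # Seed the centers from the first chunk
            # (assumes that chunk has at least n_clusters rows).
            for x in self.rng.sample(batch, self.n_clusters):
                self.centers.append(list(x))
                self.counts.append(0)
        for x in batch:
            i = self._nearest(x)
            self.counts[i] += 1
            eta = 1.0 / self.counts[i]  # per-center learning rate
            # Move the center toward x; it becomes the running mean of the
            # samples assigned to it.
            self.centers[i] = [(1 - eta) * c + eta * v
                               for c, v in zip(self.centers[i], x)]
        return self

    def predict(self, batch):
        return [self._nearest(x) for x in batch]


# Stream two small chunks through the model, one partial_fit call per chunk.
km = ToyMiniBatchKMeans(n_clusters=2)
chunks = [
    [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0]],
    [[4.9, 5.0], [0.1, 0.1], [5.2, 4.8]],
]
for chunk in chunks:
    km.partial_fit(chunk)

labels = km.predict([[0.0, 0.0], [5.0, 5.0]])
```

Each chunk can be discarded as soon as partial_fit returns, since the model keeps only the centers and the counts.
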
> >     #Question 2: WRT Question 1, what should I be passing into the
> > clustering algorithm? I would first have to incrementally accumulate
> > data in the vectorizer.
>
> The HashingVectorizer does not accumulate anything: that's the whole
> point of streaming data through it! If you really are in a large-scale
> situation, then you should not assume that the whole
> dataset (vectorized or not) can fit in memory.
>
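
The "does not accumulate anything" point can be shown with a toy hashing-trick vectorizer in plain Python 3. The function hash_vectorize, its n_features default, the use of MD5 as the hash, and the sample documents are all assumptions made for this sketch. It is not how HashingVectorizer is implemented; it only demonstrates that a stateless hash-based transform gives the same rows whether documents arrive all at once or in chunks.

```python
import hashlib
import re


def hash_vectorize(text, n_features=16):
    """Map a document to a fixed-length count vector without any vocabulary."""
    vec = [0] * n_features
    for token in re.findall(r"\w+", text.lower()):
        # Use a stable hash; the built-in hash() is salted per process.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % n_features
        vec[index] += 1
    return vec


docs = ["oil prices rise", "wheat exports fall", "oil exports rise"]

# Vectorize everything in one go.
all_at_once = [hash_vectorize(d) for d in docs]

# Vectorize the same documents in two separate chunks. Nothing is carried
# over between calls, so the rows come out identical.
chunked = []
for chunk in (docs[:1], docs[1:]):
    chunked.extend(hash_vectorize(d) for d in chunk)
```

Because the column index depends only on the token itself, there is nothing to fit and nothing to keep in memory between chunks.
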
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
> ------------------------------------------------------------------------------
> Free Next-Gen Firewall Hardware Offer
> Buy your Sophos next-gen firewall before the end March 2013
> and get the hardware for free! Learn more.
> http://p.sf.net/sfu/sophos-d2d-feb
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>