Please check out the current master; there was a bug in MiniBatchKMeans in the
release.
"Vinay B," <vybe3...@gmail.com> wrote:
>So I tried your recommendations. partial_fit seems to work up to a
>point, then BOOM! My code looks very similar to the example at
>http://scikit-learn.org/dev/auto_examples/document_clustering.html#example-document-clustering-py
>I wonder what I'm doing wrong this time?
>.....
>Relevant code
>
>vectorizer = HashingVectorizer(n_features=opts.n_features,
>                               stop_words='english',
>                               non_negative=False, norm='l2',
>                               binary=False)
>
>num_clusters = 5
>
>km = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1,
>                     init_size=1000, batch_size=1000, verbose=1)
>
>for doc_dict in iter_documents("/home/vinayb/data/reuters-21578-subset-4315",
>                               files_per_chunk):
>    # add the docs in chunks of size 'files_per_chunk'
>    X_transform_counts = vectorizer.transform(doc_dict.values())
>    # X_fit_transform_counts = vectorizer.fit_transform(doc_dict.values())  # NOT NEEDED
>
>    # fit this chunk of data
>    km.partial_fit(X_transform_counts)  # <================ Error Here
>
>    # I won't know the document class in advance for a clustering operation
>    print "## counts: " + str(X_transform_counts.shape) + " "
>
>
>Output
>
>## counts: (10, 10000)
>## counts: (10, 10000)
>## counts: (10, 10000)
>[_mini_batch_step] Reassigning 3 cluster centers.
>Traceback (most recent call last):
>  File "/home/vinayb/workspace/LearnSciKitLearn/examples/ScalableClusteringApp.py", line 109, in <module>
>    km.partial_fit(X_transform_counts)
>  File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 1280, in partial_fit
>    verbose=self.verbose)
>  File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 888, in _mini_batch_step
>    centers[to_reassign] = new_centers
>ValueError: setting an array element with a sequence.
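
A guess at what the traceback points to, not confirmed anywhere in this thread: HashingVectorizer returns a scipy sparse matrix, so the rows picked as replacement centers may still have been sparse when they were assigned into the dense centers array. The sketch below only illustrates that idea with plain NumPy and SciPy. The names centers, to_reassign and new_centers are borrowed from the traceback, everything else is made up for the illustration, and this is not the actual scikit-learn code or fix.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)

# Toy stand-in for one vectorized chunk: 10 "documents", 8 hashed features.
dense = rng.rand(10, 8) * (rng.rand(10, 8) < 0.3)
X = sparse.csr_matrix(dense)

# Dense array of cluster centers, as a k-means implementation would keep it.
centers = np.zeros((5, 8))

# Hypothetical mask of centers chosen for reassignment (3 of 5 here).
to_reassign = np.array([True, False, True, False, True])

# Pick one distinct sample per reassigned center.
new_rows = rng.choice(X.shape[0], size=int(to_reassign.sum()), replace=False)
new_centers = X[new_rows]  # still a sparse matrix at this point

# Densify before writing into the dense centers array.
if sparse.issparse(new_centers):
    new_centers = new_centers.toarray()
centers[to_reassign] = new_centers
```

With the rows converted to a regular array first, the masked assignment is an ordinary NumPy operation.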
>
>On Thu, Feb 7, 2013 at 4:06 AM, Olivier Grisel <olivier.gri...@ensta.org> wrote:
>
>> 2013/2/6 Vinay B, <vybe3...@gmail.com>:
>> > Hi
>> > Almost there (I hope), but not quite:
>> > I put my code up at https://gist.github.com/balamuru/4726232 for
>> > readability. It reads a directory of text files in chunks of 5 and
>> > returns them in a dictionary (key = filename, value = text contents).
>> >
>> > I wanted to perform a clustering operation (haven't written that part
>> > yet), but from my output, it looks like I'm not incrementing the
>> > vectorizer. From your previous response, were you thinking I was trying
>> > to classify the output into predetermined categories?
>> >
>> > See my 2 questions in the code, i.e.
>> > #Question 1: I don't know class information because this is an
>> > unsupervised learning (clustering) operation. Hence I can't perform a
>> > partial_fit
>>
>> MiniBatchKMeans is unsupervised (clustering) and supports incremental
>> out-of-core learning with partial_fit.
>>
>> > #Question 2: WRT Question 1, what should I be passing into the
>> > clustering algorithm? I would first have to incrementally accumulate
>> > data in the vectorizer
>>
>> The HashingVectorizer does not accumulate anything: that's the whole
>> point of streaming data through it! If you really are in a large-scale
>> situation, then you should not assume that the whole dataset
>> (vectorized or not) can fit in memory.
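
To make the point above concrete, here is a minimal sketch of the streaming pattern, assuming a recent scikit-learn. The toy corpus, the iter_chunks helper, the chunk size and n_clusters=2 are all invented for the illustration; in the real setting the chunks would come from something like the iter_documents generator in the gist. The vectorizer is never fitted, because it is stateless, and only one chunk is vectorized at a time.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

# Toy corpus standing in for documents read from disk (hypothetical data).
docs = [
    "oil prices rise crude barrel market",
    "crude oil exports barrel opec production",
    "opec oil output crude prices",
    "barrel prices crude refinery oil",
    "wheat grain harvest crop farmers",
    "corn grain crop exports harvest",
    "farmers wheat crop yield grain",
    "harvest corn wheat grain tonnes",
]


def iter_chunks(items, chunk_size):
    """Yield successive slices of items (hypothetical stand-in for a file reader)."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Stateless: transform() works without any prior fit().
vectorizer = HashingVectorizer(n_features=2 ** 10, stop_words="english", norm="l2")
km = MiniBatchKMeans(n_clusters=2, n_init=1, random_state=0)

for chunk in iter_chunks(docs, 4):
    X_chunk = vectorizer.transform(chunk)  # sparse matrix, shape (4, 1024)
    km.partial_fit(X_chunk)                # updates the centers incrementally

# Cluster assignments can be computed chunk by chunk in the same way.
labels = km.predict(vectorizer.transform(docs))
```

No class labels appear anywhere: partial_fit on MiniBatchKMeans only needs the feature matrix for each chunk.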
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>>
>>
>------------------------------------------------------------------------------
>> Free Next-Gen Firewall Hardware Offer
>> Buy your Sophos next-gen firewall before the end March 2013
>> and get the hardware for free! Learn more.
>> http://p.sf.net/sfu/sophos-d2d-feb
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
--
This message was sent from my Android phone with K-9 Mail.