I updated scikit-learn to the latest version, and the bug I reported
earlier is gone: MiniBatchKMeans now runs to completion. However, I now
get an error when printing out the docs per cluster.
The complete code is at https://gist.github.com/balamuru/4734765
Thanks in advance.
Output
.
.
## counts: (10, 10000)
## counts: (5, 10000)
HashingVectorizer(analyzer=word, binary=False, charset=utf-8,
charset_error=strict, dtype=<type 'numpy.float64'>, input=content,
lowercase=True, n_features=10000, ngram_range=(1, 1),
non_negative=False, norm=l2, preprocessor=None,
stop_words=english, strip_accents=None,
token_pattern=(?u)\b\w\w+\b, tokenizer=None)
Indices (array([], dtype=int64),)
Traceback (most recent call last):
  File "/home/vinayb/workspace/LearnSciKitLearn/examples/ScalableClusteringApp.py", line 129, in <module>
    cluster_doc_filenames = file_names[np.where(km.labels_ == cluster_id)]
TypeError: list indices must be integers, not tuple
Code Segment
for cluster_id in range(0, km.n_clusters):
    indices = np.where(km.labels_ == cluster_id)
    if len(indices) > 0:
        print "Indices " + str(indices)
        cluster_doc_filenames = file_names[np.where(km.labels_ == cluster_id)]  # <=== FAILS HERE
        for cluster_doc_filename in cluster_doc_filenames:
            print str(cluster_id) + " : " + cluster_doc_filename
    else:
        print "empty indices"
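
A possible fix, sketched under assumptions rather than taken from the gist: file_names appears to be a plain Python list, and np.where returns a tuple of index arrays, so indexing the list with that tuple raises the TypeError shown above. Taking element [0] of the tuple and indexing a NumPy array of the names avoids this. Because np.where always returns a one-element tuple here, the len(indices) > 0 check is always true, so the sketch tests the index array itself. The helper name filenames_by_cluster and the sample data are hypothetical, and the sketch uses print() so that it runs on Python 3.

```python
import numpy as np


def filenames_by_cluster(file_names, labels, n_clusters):
    """Group file names by cluster label (hypothetical helper).

    Assumes labels[i] is the cluster assigned to file_names[i].
    """
    names = np.asarray(file_names)
    labels = np.asarray(labels)
    if names.shape[0] != labels.shape[0]:
        raise ValueError("need exactly one label per file name")

    groups = {}
    for cluster_id in range(n_clusters):
        # np.where returns a tuple with one index array per dimension;
        # [0] extracts the array of positions for this 1-D case.
        indices = np.where(labels == cluster_id)[0]
        # A NumPy array accepts an index array, unlike a Python list.
        groups[cluster_id] = names[indices].tolist()
    return groups


# Made-up sample data standing in for file_names and km.labels_.
file_names = ["a.txt", "b.txt", "c.txt", "d.txt"]
labels = np.array([0, 2, 0, 1])

groups = filenames_by_cluster(file_names, labels, n_clusters=4)
for cluster_id, cluster_files in groups.items():
    if cluster_files:
        for name in cluster_files:
            print(str(cluster_id) + " : " + name)
    else:
        print(str(cluster_id) + " : empty cluster")
```

This only works if there is exactly one label per file name. As far as I understand, after partial_fit the labels_ attribute covers only the last batch passed in, which would explain an empty index array for some clusters; collecting km.predict(...) results for each chunk may be the safer route, but treat that as an assumption to check against your scikit-learn version.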
On Thu, Feb 7, 2013 at 3:09 PM, <amuel...@ais.uni-bonn.de> wrote:
> Please check out the current master; there was a bug in MiniBatchKMeans in
> the release.
>
>
>
> "Vinay B," <vybe3...@gmail.com> wrote:
>>
>> So I tried your recommendations. The partial fit seems to work up to a
>> point. Then BOOM! My code looks very similar to the example at
>> http://scikit-learn.org/dev/auto_examples/document_clustering.html#example-document-clustering-py
>> I wonder what I'm doing wrong this time?
>> .....
>> Relevant code
>>
>> vectorizer = HashingVectorizer(n_features=opts.n_features,
>>                                stop_words='english',
>>                                non_negative=False, norm='l2',
>>                                binary=False)
>>
>> num_clusters = 5
>>
>> km = MiniBatchKMeans(n_clusters=num_clusters, init='k-means++', n_init=1,
>>                      init_size=1000,
>>                      batch_size=1000, verbose=1)
>>
>> for doc_dict in iter_documents("/home/vinayb/data/reuters-21578-subset-4315", files_per_chunk):
>>     # add the docs in chunks of size 'files_per_chunk'
>>     X_transform_counts = vectorizer.transform(doc_dict.values())
>>     # X_fit_transform_counts = vectorizer.fit_transform(doc_dict.values())  NOT NEEDED
>>
>>     # fit this chunk of data
>>     km.partial_fit(X_transform_counts)  # <=== Error Here
>>
>>     print "## counts: " + str(X_transform_counts.shape) + " "
>>     # <== I won't know the document class in advance for a clustering operation
>>
>>
>>
>> Output
>>
>> ## counts: (10, 10000)
>> ## counts: (10, 10000)
>> ## counts: (10, 10000)
>> [_mini_batch_step] Reassigning 3 cluster centers.
>> Traceback (most recent call last):
>>   File "/home/vinayb/workspace/LearnSciKitLearn/examples/ScalableClusteringApp.py", line 109, in <module>
>>     km.partial_fit(X_transform_counts)
>>   File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 1280, in partial_fit
>>     verbose=self.verbose)
>>   File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 888, in _mini_batch_step
>>     centers[to_reassign] = new_centers
>> ValueError: setting an array element with a sequence.
>>
>> On Thu, Feb 7, 2013 at 4:06 AM, Olivier Grisel <olivier.gri...@ensta.org> wrote:
>>
>>> 2013/2/6 Vinay B, <vybe3...@gmail.com>:
>>> > Hi,
>>> > Almost there (I hope), but not quite.
>>> > I put my code up at https://gist.github.com/balamuru/4726232 for
>>> > readability. It reads a directory of text files in chunks of 5 and
>>> > returns them in a dictionary (key = filename, value = text contents).
>>> >
>>> > I wanted to perform a clustering operation (haven't written that part
>>> > yet), but from my output it looks like I'm not incrementing the
>>> > vectorizer. From your previous response, were you thinking I was trying
>>> > to classify the output into predetermined categories?
>>> >
>>> > See my 2 questions in the code, i.e.
>>> > #Question 1: I don't know class information because this is an
>>> > unsupervised learning (clustering) operation. Hence I can't perform a
>>> > partial_fit
>>>
>>> MiniBatchKMeans is unsupervised (clustering) and supports incremental
>>> out of core learning with partial_fit.
>>>
>>> > #Question 2: WRT Question 1, what should I be passing into the
>>> > clustering algorithm? I would first have to incrementally accumulate
>>> > data in the vectorizer
>>>
>>> The HashingVectorizer does not accumulate anything: that's the whole
>>> point of streaming data through it! If you really are in a large-scale
>>> situation, then you should not assume that the whole
>>> dataset (vectorized or not) can fit in memory.
>>>
>>> --
>>> Olivier
>>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Free Next-Gen Firewall Hardware Offer
>>> Buy your Sophos next-gen firewall before the end March 2013
>>> and get the hardware for free! Learn more.
>>> http://p.sf.net/sfu/sophos-d2d-feb
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>