From the scikit-learn script at
http://scikit-learn.org/dev/_downloads/document_clustering.py , it
appears that the number of clusters is set to the number of newsgroup
subfolders. I'm guessing that's done mostly out of convenience. On the
other hand, users should be able to set an arbitrary number of
clusters, for better or worse, depending on the desired cluster
granularity.

But if I increase the number of clusters to a moderately large value,
as shown below, I get an error. Code changes and output are below.

Thanks
.
.
.
true_k = 200  # <== EXPLICITLY SET NUMBER OF DESIRED CLUSTERS
# Do the actual clustering
if opts.minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000,
                         batch_size=1000, verbose=1)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=1)
.
.
.

Output:


None
Usage: ClusteringApp.py [options]

Options:
  -h, --help            show this help message and exit
  --no-minibatch        Use ordinary k-means algorithm (in batch mode).
  --no-idf              Disable Inverse Document Frequency feature weighting.
  --use-hashing         Use a hashing feature vectorizer
  --n-features=N_FEATURES
                        Maximum number of features (dimensions) to extract from
                        text.
Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
3387 documents
4 categories

Extracting features from the training dataset using a sparse vectorizer
done in 3.176239s
n_samples: 3387, n_features: 10000

Clustering sparse data with MiniBatchKMeans(batch_size=1000, compute_labels=True,
        init=k-means++, init_size=1000, k=None, max_iter=100,
        max_no_improvement=10, n_clusters=200, n_init=1, random_state=None,
        reassignment_ratio=0.01, tol=0.0, verbose=1)
Init 1/1 with method: k-means++
Inertia for init 1/1: 654.117274
Minibatch iteration 1/400:mean batch inertia: 0.863513, ewa inertia: 0.863513
Minibatch iteration 2/400:mean batch inertia: 0.813080, ewa inertia: 0.833741
Minibatch iteration 3/400:mean batch inertia: 0.815186, ewa inertia: 0.822788
Minibatch iteration 4/400:mean batch inertia: 0.801274, ewa inertia: 0.810088
Minibatch iteration 5/400:mean batch inertia: 0.800503, ewa inertia: 0.804430
Minibatch iteration 6/400:mean batch inertia: 0.802421, ewa inertia: 0.803244
Minibatch iteration 7/400:mean batch inertia: 0.789954, ewa inertia: 0.795398
Minibatch iteration 8/400:mean batch inertia: 0.793326, ewa inertia: 0.794175
Minibatch iteration 9/400:mean batch inertia: 0.792347, ewa inertia: 0.793096
[_mini_batch_step] Reassigning 124 cluster centers.
Traceback (most recent call last):
  File "/home/vinayb/python/HelloPython/examples/ClusteringApp.py", line 114, in <module>
    km.fit(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 1221, in fit
    verbose=self.verbose)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 888, in _mini_batch_step
    centers[to_reassign] = new_centers
ValueError: setting an array element with a sequence.
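
In case it helps, here is a stripped-down sketch that I think exercises the
same code path: sparse input and n_clusters on the same order as init_size, so
many centers end up being reassigned. The data and the parameter values
(shape, density, reassignment_ratio) are just placeholders, not taken from the
actual script, and I haven't verified that it fails on every version:

import numpy as np
import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
# Random sparse "document-term"-like matrix, roughly the shape reported above.
X = sp.rand(3000, 10000, density=0.01, format='csr')

km = MiniBatchKMeans(n_clusters=200, init='k-means++', n_init=1,
                     init_size=1000, batch_size=1000,
                     reassignment_ratio=0.5,  # push it to reassign many centers
                     verbose=1, random_state=rng)
km.fit(X)  # intended to hit the reassignment branch in _mini_batch_step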
