I thought I'd clarify this question in a separate thread.
Each individual cluster is usually associated with a set of
significant terms. For example, running Mahout's k-means clustering on
the Reuters-21578 dataset yields output like this:
:VL-21566{n=2 c=[1,000:2.589, 1.9:2.974, 10:2.289, 14:1.568, 16:2.000,
19:1.526, 1986:2.796, 20:1.450
Top Terms:
smithkline => 19.37364673614502
kline => 14.453418731689453
beckman => 10.067719459533691
smith => 9.676518440246582
pharmaceutical => 9.275004863739014
tianjin => 9.033519744873047
skb => 8.494523048400879
tagamet => 8.30789852142334
laboratories => 7.682400465011597
allergan => 6.986793041229248
antiulcer => 6.986793041229248
venture => 6.036190032958984
french => 5.92253041267395
skin => 5.897533416748047
joint => 5.545732021331787
testing => 5.528974533081055
eye => 5.433120250701904
plant => 5.26419734954834
capsules => 4.940408706665039
521.1 => 4.940408706665039
Weight : [props - optional]: Point:
1.0: [1,000:5.177, 1.9:5.949, 10:4.578, 16:4.001, 1986:5.592,
20:2.900, 24:3.633, 25:3.448, 3:1.119, 3.6:6.127, 373:9.593,
433:9.034, 50:3.677, 52.05:9.593, 521.1:9.881, 6.78:9.370,
about:2.838, achieve:6.127, acquisitions:5.550,
1.0: [14:3.135, 19:3.051, 200:4.899, 3:1.119, 30:3.104,
56.94:8.900, 8.5:6.537, beckman:11.795, billion:2.839,
capability:7.647, capsules:9.881, chemical:5.396, china:7.466,
co:2.865, combines:8.146, company:2.449, corp:2.432, dlr
:VL-21565{n=2 c=[00:2.340, 1:1.459, 1.66:7.721, 10:1.869, 10.20:4.594,
10.6:3.387, 11:1.357, 17:1.526
Top Terms:
gm => 20.328017234802246
h => 12.566249370574951
buyback => 12.333285808563232
repurchase => 11.563349723815918
class => 11.257688760757446
.................. etc
Does scikit-learn have similar functionality? Thanks.
For reference, I'm "hacking" at the de facto clustering sample code at
http://scikit-learn.org/dev/_downloads/document_clustering.py, printing
out data right after the k-means algorithm is invoked:
for cluster_id in range(0, km.n_clusters):
    # TODO: print out significant terms for cluster_id  <== THIS IS WHAT I WANT
    # Filenames of the documents assigned to this cluster
    cluster_doc_filenames = dataset.filenames[np.where(km.labels_ == cluster_id)]
    for cluster_doc_filename in cluster_doc_filenames:
        print(str(cluster_id) + " : " + cluster_doc_filename)
    # Blank line between clusters
    print("")
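To make the TODO concrete, here is a minimal sketch of the kind of output I mean. It assumes that the largest weights in each centroid are a reasonable stand-in for "significant terms", which may not match how Mahout computes its scores. The helper name top_terms_per_cluster and the toy data are hypothetical, made up only for illustration.

```python
def top_terms_per_cluster(centers, terms, n_terms=10):
    """Return, for each cluster, its n_terms highest-weighted (term, weight) pairs.

    centers: sequence of centroid rows (assumed to be in term space)
    terms:   sequence mapping column index -> term string
    """
    result = []
    for center in centers:
        # Column indices sorted by descending centroid weight.
        order = sorted(range(len(center)), key=lambda i: center[i], reverse=True)
        result.append([(terms[i], float(center[i])) for i in order[:n_terms]])
    return result


# Toy data standing in for the fitted centroids and the vocabulary
# (hypothetical values, not real clustering output).
toy_terms = ["beckman", "smithkline", "gm", "buyback"]
toy_centers = [[1.2, 2.5, 0.0, 0.1],
               [0.0, 0.2, 3.0, 0.7]]

top_terms = top_terms_per_cluster(toy_centers, toy_terms, n_terms=2)
for cluster_id, top in enumerate(top_terms):
    print("Cluster %d top terms:" % cluster_id)
    for term, weight in top:
        print("    %s => %s" % (term, weight))
```

In the sample script this would presumably be called with km.cluster_centers_ and the feature names from the fitted vectorizer (get_feature_names() in older releases, get_feature_names_out() in newer ones). That only makes sense if the centroids live in the original term space and the vectorizer keeps a vocabulary, so it would not apply as-is to a hashing vectorizer or after dimensionality reduction.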
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general