I thought I'd clarify this question in a separate thread.

Each individual cluster is usually associated with a set of
significant terms. For example, a Mahout k-means clustering run on
the Reuters-21578 dataset yields output like this:


:VL-21566{n=2 c=[1,000:2.589, 1.9:2.974, 10:2.289, 14:1.568, 16:2.000,
19:1.526, 1986:2.796, 20:1.450
        Top Terms:
                smithkline                              =>   19.37364673614502
                kline                                   =>  14.453418731689453
                beckman                                 =>  10.067719459533691
                smith                                   =>   9.676518440246582
                pharmaceutical                          =>   9.275004863739014
                tianjin                                 =>   9.033519744873047
                skb                                     =>   8.494523048400879
                tagamet                                 =>    8.30789852142334
                laboratories                            =>   7.682400465011597
                allergan                                =>   6.986793041229248
                antiulcer                               =>   6.986793041229248
                venture                                 =>   6.036190032958984
                french                                  =>    5.92253041267395
                skin                                    =>   5.897533416748047
                joint                                   =>   5.545732021331787
                testing                                 =>   5.528974533081055
                eye                                     =>   5.433120250701904
                plant                                   =>    5.26419734954834
                capsules                                =>   4.940408706665039
                521.1                                   =>   4.940408706665039
        Weight : [props - optional]:  Point:
        1.0: [1,000:5.177, 1.9:5.949, 10:4.578, 16:4.001, 1986:5.592,
20:2.900, 24:3.633, 25:3.448, 3:1.119, 3.6:6.127, 373:9.593,
433:9.034, 50:3.677, 52.05:9.593, 521.1:9.881, 6.78:9.370,
about:2.838, achieve:6.127, acquisitions:5.550,
        1.0: [14:3.135, 19:3.051, 200:4.899, 3:1.119, 30:3.104,
56.94:8.900, 8.5:6.537, beckman:11.795, billion:2.839,
capability:7.647, capsules:9.881, chemical:5.396, china:7.466,
co:2.865, combines:8.146, company:2.449, corp:2.432, dlr
:VL-21565{n=2 c=[00:2.340, 1:1.459, 1.66:7.721, 10:1.869, 10.20:4.594,
10.6:3.387, 11:1.357, 17:1.526
        Top Terms:
                gm                                      =>  20.328017234802246
                h                                       =>  12.566249370574951
                buyback                                 =>  12.333285808563232
                repurchase                              =>  11.563349723815918
                class                                   =>  11.257688760757446

.................. etc

Does scikit-learn have similar functionality? Thanks.

For reference, I'm "hacking" on the de facto clustering sample code at
http://scikit-learn.org/dev/_downloads/document_clustering.py, printing
out data right after the k-means algorithm is invoked:

for cluster_id in range(km.n_clusters):
    # TODO: print out significant terms for cluster_id   <== THIS IS WHAT I WANT
    cluster_doc_filenames = dataset.filenames[np.where(km.labels_ == cluster_id)]
    for cluster_doc_filename in cluster_doc_filenames:
        print str(cluster_id) + " : " + cluster_doc_filename
    print
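
To make it concrete, here is a rough, untested sketch of the kind of output
I'm after: it ranks vocabulary terms by their weight in each centroid of
km.cluster_centers_ and maps the indices back to terms. It assumes the
TF-IDF vectorizer object in the sample script is called `vectorizer`, and
that no LSA/dimensionality reduction step sits between vectorization and
k-means, so the centroid columns still line up with vocabulary indices.

import numpy as np

# Sketch only: rank vocabulary terms by their weight in each cluster centroid.
# Assumes `vectorizer` is the fitted TF-IDF vectorizer and `km` the fitted
# KMeans object from document_clustering.py, with no LSA step in between.
terms = vectorizer.get_feature_names()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]  # term indices, highest weight first

for cluster_id in range(km.n_clusters):
    print "Cluster %d top terms:" % cluster_id
    for term_index in order_centroids[cluster_id, :20]:
        weight = km.cluster_centers_[cluster_id, term_index]
        print "    %-30s => %f" % (terms[term_index], weight)
    print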
