Hi,
I am trying to cluster a list of text documents by similarity, first using
PCA to expose the cluster structure and then running k-means on the PCA
output, as shown below. The problem is that k-means does output the 3
clusters, but the plot function fails to display the clustering results:
the plot shows only a single dot, on which all the clusters overlap. What I
want is a simple cluster diagram in which each document is a dot on the
plot, and each dot is labelled with the document name or number.

    vectorizer = TfidfVectorizer(max_features=50, max_df=0.5,
                                 stop_words='english',
                                 charset_error='ignore')
    proptable = vectorizer.fit_transform(
        [eereader.raw(f) for f in eereader.fileids()])
    X = proptable.todense()

    pca = PCA(n_components=2).fit(X)
    X_pca = pca.transform(X)
    print pca.components_, '\nvar=', pca.explained_variance_, \
        '\nratio=\n', pca.explained_variance_ratio_

    kmeans = KMeans(3).fit(X_pca)

    print 'clusters:',kmeans.cluster_centers_

    plot_2D(X_pca, [1,2,3], ['group1','group2','group4'])




def plot_2D(data, target, target_names):
     colors = cycle('rgbcmykw')
     target_ids = range(len(target_names))
     pylab.figure()
     for i, c, label in zip(target_ids, colors, target_names):
         pylab.scatter(data[target == i, 0], data[target == i, 1],
                    c=c, label=label)
     pylab.legend()
     pylab.show()
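
One possible reading of the symptom, offered as an assumption rather than a confirmed diagnosis: plot_2D receives target as the plain list [1,2,3], so `target == i` is a single False instead of a per-row boolean mask, and it does not select the rows belonging to group i. The list also has nothing to do with the k-means result. A minimal sketch of the intended plot follows, assuming the per-document labels come from `kmeans.labels_` and that one name per document is available. The helper names `split_by_label` and `plot_clusters` are hypothetical and not part of scikit-learn.

```python
import numpy as np


def split_by_label(points, labels):
    """Return {label: rows of `points` that carry that label}."""
    points = np.asarray(points)
    # A plain list would make `labels == k` a single bool, not a row mask.
    labels = np.asarray(labels)
    return {int(k): points[labels == k] for k in np.unique(labels)}


def plot_clusters(points, labels, doc_names):
    """Scatter each cluster separately and annotate every dot with its name."""
    # Imported here so that split_by_label needs only NumPy.
    import matplotlib.pyplot as plt

    points = np.asarray(points)
    fig, ax = plt.subplots()
    for k, rows in split_by_label(points, labels).items():
        ax.scatter(rows[:, 0], rows[:, 1], label="cluster %d" % k)
    for (x, y), name in zip(points, doc_names):
        ax.annotate(str(name), (x, y))
    ax.legend()
    return fig
```

With the variables from the post, this would be called as `plot_clusters(X_pca, kmeans.labels_, eereader.fileids())`, followed by `pylab.show()` to display the figure.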


OUTPUT:

Extracting features from the data using a sparse vectorizer
[[ 0.14105305 -0.20034487 -0.1912926   0.02347529  0.00299301  0.0280584
  -0.01589214  0.02919959  0.13153599 -0.02447637 -0.01072041 -0.02426257
  -0.01744275 -0.03273191  0.04737182  0.01406418  0.27986182 -0.03035009
   0.17899916  0.01626151 -0.04246621 -0.01866941 -0.00590641  0.00317066
   0.0418383  -0.03831861 -0.08019927  0.00630325 -0.01176941 -0.14963421
  -0.21225598  0.69031543 -0.1459101   0.03313911  0.021619   -0.02447684
   0.00176668 -0.26605141 -0.04966191 -0.0697525   0.04789344 -0.02619828
  -0.01606536 -0.01425646  0.06923716  0.05731882 -0.03961027 -0.30013799
  -0.04170057  0.06679463]
 [-0.4707685   0.11994748  0.12739998 -0.02955514 -0.10282393 -0.04558902
  -0.00201896 -0.06452393 -0.09678896  0.00859977 -0.0320254  -0.05632518
   0.00499451 -0.03602518 -0.08868867 -0.05774644  0.21703822  0.0240018
  -0.38697589 -0.04245502 -0.00827574  0.02205609 -0.06697286 -0.06791001
   0.01894483  0.02734696 -0.05706086 -0.01495967  0.01067317 -0.01118144
   0.10826709  0.48139829  0.05905815 -0.0242033  -0.01604092 -0.02242377
  -0.02201726  0.21672965 -0.1202854  -0.04660305 -0.15009812 -0.07914119
  -0.04942812 -0.02883664 -0.05863836 -0.06470306  0.01419617  0.29820317
  -0.05949394 -0.21315757]]
var= [ 0.08107589  0.06955495]
ratio=
[ 0.10771234  0.09240633]
clusters: [[-0.21168571  0.11777839]
 [ 0.51002578  0.29041671]
 [ 0.06960558 -0.27884175]]



Best Wishes

-- 
Aliabbas Petiwala | PhD Scholar | Interdisciplinary Program in Education
Technology | IIT Bombay | +919664867707 | http://home.iitb.ac.in/~aliabbas/
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
