Apart from the problem quoted below, can anyone suggest how to extract cluster
information from a dendrogram in scikit? More specifically, I want the
clusters to be returned as lists of the file names of the documents.
Thanks
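
One possible way to get such lists, sketched under assumptions: the thread
does not say which function drew the dendrogram, so this assumes it came from
scipy.cluster.hierarchy, whose fcluster() cuts the same linkage tree into flat
clusters. The helper name clusters_as_file_lists, the toy data, the file names
and the choice of Ward linkage are all made up for illustration.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def clusters_as_file_lists(X, file_names, n_clusters):
    """Group file names by the flat cluster each document falls into."""
    # Build the linkage matrix, i.e. the tree that dendrogram() would draw.
    # Ward linkage is an assumption; use whatever method the dendrogram used.
    Z = linkage(X, method="ward")
    # Cut the tree so that no more than n_clusters flat clusters remain.
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    groups = defaultdict(list)
    for name, label in zip(file_names, labels):
        groups[int(label)].append(name)
    return [groups[k] for k in sorted(groups)]


# Toy rows standing in for the document vectors; the file names are invented.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
file_names = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
clusters = clusters_as_file_lists(X, file_names, n_clusters=3)
print(clusters)
```

With the data from this thread, the call would presumably look like
clusters_as_file_lists(X, eereader.fileids(), 7), since fileids() is in the
same order as the rows passed to the vectorizer.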
On Sun, Sep 9, 2012 at 4:50 PM, Aliabbas Petiwala <[email protected]> wrote:
> Thanks Olivier, that helped me get the output to show. But with the same
> code as before I am not getting proper clusters: as the plot below shows,
> there are no clearly separated clusters and the points seem to overlap.
> Using hierarchical clustering on the same dataset, however, I did find
> about 7 distinct clusters.
>
> Output for kmeans: https://www.sugarsync.com/pf/D6041365_9522679_000759
> Output for dendrogram: https://www.sugarsync.com/pf/D6041365_9522679_002521
>
>
> On Sun, Sep 9, 2012 at 2:48 AM, Olivier Grisel <[email protected]> wrote:
>
>> 2012/9/8 Aliabbas Petiwala <[email protected]>:
>> > Hi,
>> > I am trying to cluster a list of text docs based on similarity by first
>> > identifying the clusters using PCA and then running kmeans on the
>> > results of PCA, as shown below. The problem is that kmeans does output
>> > the 3 clusters, but the plot function fails to display the clustering
>> > results. The plot only shows one dot, on which all the cluster centers
>> > overlap. What I want is a simple cluster diagram that shows each doc as
>> > a dot on the plot, with each dot labelled with the doc name or number.
>> >
>> > vectorizer = TfidfVectorizer(max_features=50, max_df=0.5,
>> >                              stop_words='english', charset_error='ignore')
>> > proptable = vectorizer.fit_transform([eereader.raw(f) for f in
>> >                                       eereader.fileids()])
>> > X = proptable.todense()
>> >
>> > pca = PCA(n_components=2).fit(X)
>> > X_pca = pca.transform(X)
>> > print pca.components_, '\nvar=', pca.explained_variance_, '\nratio=\n', pca.explained_variance_ratio_
>> >
>> > kmeans = KMeans(3).fit(X_pca)
>> >
>> > print 'clusters:', kmeans.cluster_centers_
>> >
>> > plot_2D(X_pca, [1,2,3], ['group1','group2','group4'])
>> >
>> > def plot_2D(data, target, target_names):
>> >     colors = cycle('rgbcmykw')
>> >     target_ids = range(len(target_names))
>> >     pylab.figure()
>> >     for i, c, label in zip(target_ids, colors, target_names):
>> >         pylab.scatter(data[target == i, 0], data[target == i, 1],
>> >                       c=c, label=label)
>> >     pylab.legend()
>> >     pylab.show()
>>
>> You pass `target=[1, 2, 3]` to your plot function instead of
>> `target=kmeans.labels_`; target should have the same shape[0] as data.
>>
>> Furthermore:
>>
>> target_ids = range(len(target_names))
>>
>> is equivalent to:
>>
>> target_ids = [0, 1, 2]
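
A minimal sketch of that fix, assuming the labels are converted to a NumPy
array so that `labels == k` gives one boolean per row of data. The names
group_points_by_label and plot_2d and the sample points are made up for
illustration; matplotlib.pyplot is used in place of pylab.

```python
import numpy as np


def group_points_by_label(data, labels):
    """Return {label: rows of data carrying that label}."""
    data = np.asarray(data)
    # A plain list compared with an int gives a single False, not a mask,
    # so the labels must be an array with one entry per row of data.
    labels = np.asarray(labels)
    if labels.shape[0] != data.shape[0]:
        raise ValueError("labels needs one entry per row of data")
    return {int(k): data[labels == k] for k in np.unique(labels)}


def plot_2d(data, labels, doc_names):
    """Scatter each cluster in its own colour and label each dot."""
    import matplotlib.pyplot as plt  # imported here so the helper above works without it

    data = np.asarray(data)
    for k, points in group_points_by_label(data, labels).items():
        plt.scatter(points[:, 0], points[:, 1], label="cluster %d" % k)
    # Write the document name next to its dot.
    for (x, y), name in zip(data, doc_names):
        plt.annotate(name, (x, y))
    plt.legend()
    plt.show()


# Invented 2-D points standing in for X_pca, with labels shaped like kmeans.labels_.
points = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [3.2, 2.9]])
labels = [0, 0, 1, 1]
groups = group_points_by_label(points, labels)
```

With the variables from the quoted code, the call would presumably be
plot_2d(X_pca, kmeans.labels_, eereader.fileids()).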
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>
>>
>> ------------------------------------------------------------------------------
>> Live Security Virtual Conference
>> Exclusive live event will cover all the ways today's security and
>> threat landscape has changed and how IT managers can respond. Discussions
>> will include endpoint security, mobile security and the latest in malware
>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
--
Aliabbas Petiwala | PhD Scholar | Interdisciplinary Program in Education
Technology | IIT Bombay | +919664867707 | http://home.iitb.ac.in/~aliabbas/