Re: [Scikit-learn-general] Question about naming a clustering algorithm

Robert Layton Mon, 09 Sep 2013 03:24:55 -0700

I haven't yet compared against scipy's implementation. The main reason for
this is that they are different types of clusterers (scipy's builds a
hierarchy, while the MSTCluster here generates flat clusters). That said,
they are easily convertible.


Perhaps we should just drop the separate class altogether, and add an
ability to do a threshold cut in the hcluster PR?
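
To make the idea concrete, here is a minimal sketch of a threshold cut on a
single-linkage tree using scipy. This is not the MSTCluster code and not the
hcluster PR's API; the helper name single_linkage_threshold_cut and the toy
data are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def single_linkage_threshold_cut(X, threshold):
    """Return flat cluster labels by cutting a single-linkage tree.

    Hypothetical helper for illustration only. Merges whose linkage
    distance is above `threshold` are undone, which yields flat clusters.
    """
    # Build the hierarchy (single linkage merges along nearest-neighbour edges).
    Z = linkage(X, method="single")
    # Convert the hierarchy to flat clusters by cutting at a distance threshold.
    return fcluster(Z, t=threshold, criterion="distance")


# Toy data: two tight, well-separated blobs (an assumption for the demo).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 2)),
    rng.normal(5.0, 0.1, size=(20, 2)),
])
labels = single_linkage_threshold_cut(X, threshold=1.0)
```

The same tree can be cut at any other threshold without refitting, which is
the main argument for exposing the cut on the hierarchical estimator rather
than keeping a separate flat clusterer.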


On 9 September 2013 18:31, Andreas Mueller <[email protected]> wrote:

> On 09/08/2013 06:51 PM, Olivier Grisel wrote:
> > I just had a look at the results section and it looks very
> > interesting, in particular in its ability to bring noise robustness to
> > single linkage. Have you tried to compare it with ward?
> FYI the output of "examples.py" for the smaller datasets. You can run it
> for the rest if you want.
>
> Dataset Iris Plants Database samples: 147, features: 4, clusters: 3
> ======================================================================
> ITM             ARI: 0.882, AMI: 0.866, NMI: 0.868 objective: 1.237 time:0.21
> ITM ID          ARI: 0.882, AMI: 0.866, NMI: 0.868 objective: 1.237 time:0.08
> Ward            ARI: 0.737, AMI: 0.762, NMI: 0.774 objective: 1.195 time:0.01
> KMeans          ARI: 0.737, AMI: 0.753, NMI: 0.762 objective: 1.197 time:0.05
> GT objective: 1.178
>
>
> Dataset mldata.org dataset: vehicle samples: 846, features: 18, clusters: 4
> ======================================================================
> ITM             ARI: 0.141, AMI: 0.166, NMI: 0.170 objective: 8.426 time:0.46
> ITM ID          ARI: 0.113, AMI: 0.145, NMI: 0.148 objective: 8.425 time:0.54
> Ward            ARI: 0.098, AMI: 0.122, NMI: 0.128 objective: 8.308 time:0.75
> KMeans          ARI: 0.076, AMI: 0.096, NMI: 0.100 objective: 8.097 time:0.35
> GT objective: 6.924
>
>
> Dataset mldata.org dataset: vowel samples: 990, features: 10, clusters: 11
> ======================================================================
> ITM             ARI: 0.195, AMI: 0.385, NMI: 0.403 objective: 8.512 time:0.72
> ITM ID          ARI: 0.209, AMI: 0.385, NMI: 0.401 objective: 8.510 time:0.80
> Ward            ARI: 0.155, AMI: 0.346, NMI: 0.367 objective: 8.309 time:1.09
> KMeans          ARI: 0.161, AMI: 0.348, NMI: 0.365 objective: 7.947 time:0.39
> GT objective: 7.994
>
>
> Dataset Optical Recognition of Handwritten Digits Data Set samples: 1797, features: 64, clusters: 10
> ======================================================================
> ITM             ARI: 0.838, AMI: 0.883, NMI: 0.886 objective: -186.152 time:2.15
> ITM ID          ARI: 0.674, AMI: 0.785, NMI: 0.793 objective: -186.248 time:3.26
> Ward            ARI: 0.794, AMI: 0.856, NMI: 0.868 objective: -186.240 time:9.22
> KMeans          ARI: 0.667, AMI: 0.739, NMI: 0.746 objective: -187.357 time:1.32
> GT objective: -186.250
>
>
> Dataset Modified Olivetti faces dataset. samples: 400, features: 4096, clusters: 40
> ======================================================================
> /home/local/lamueller/checkout/information_theoretic_mst/itm.py:87:
> UserWarning: Got dataset with n_samples < n_features. Setting intrinsic
> dimensionality to n_samples. This is most likely to high, leading to
> uneven clusters. It is recommendet to set infer_dimensionality=True.
>    warnings.warn("Got dataset with n_samples < n_features. Setting"
> ITM             ARI: 0.162, AMI: 0.475, NMI: 0.719 objective: -6622.173 time:5.50
> ITM ID          ARI: 0.549, AMI: 0.705, NMI: 0.832 objective: -6691.920 time:8.37
> Ward            ARI: 0.491, AMI: 0.670, NMI: 0.813 objective: -6702.053 time:0.78
> KMeans          ARI: 0.458, AMI: 0.620, NMI: 0.780 objective: -6805.311 time:29.97
> GT objective: -6787.981
>
> No parameters were adjusted for any algorithm. By showing both ITM and
> ITM ID I obviously make my life easier by not picking a single setting.
> Still, ITM ID beats Ward on 4 of the 5 datasets (all but digits). The
> disclaimer is that this is an evaluation of clustering algorithms on
> classification datasets, and I leave it to you to decide whether that is
> meaningful ;)
>
>
> andy
>
>
>
>
> ------------------------------------------------------------------------------
> Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
> Discover the easy way to master current and previous Microsoft technologies
> and advance your career. Get an incredible 1,500+ hours of step-by-step
> tutorial videos with LearnDevNow. Subscribe today and save!
> http://pubads.g.doubleclick.net/gampad/clk?id=58041391&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
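
For anyone wanting to reproduce this kind of comparison, a minimal sketch of
the scoring step follows. It assumes the ARI, AMI and NMI columns above are
the standard scores from sklearn.metrics; I have not checked what
"examples.py" actually calls, and score_clustering is a made-up helper name.

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)


def score_clustering(labels_true, labels_pred):
    """Compare predicted cluster labels against ground-truth class labels.

    Hypothetical helper for illustration. All three scores ignore how the
    clusters are named and only look at which samples are grouped together.
    """
    return {
        "ARI": adjusted_rand_score(labels_true, labels_pred),
        "AMI": adjusted_mutual_info_score(labels_true, labels_pred),
        "NMI": normalized_mutual_info_score(labels_true, labels_pred),
    }


# The same partition with the cluster ids swapped still scores perfectly.
truth = [0, 0, 0, 1, 1, 1]
renamed = [1, 1, 1, 0, 0, 0]
scores = score_clustering(truth, renamed)
```

Because the scores are invariant to relabelling, they can be applied to the
output of any flat clusterer without matching cluster ids to class ids first.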



-- 

Public key at: http://pgp.mit.edu/ Search for this email address and select
the key from "2011-08-19" (key id: 54BA8735)

