Hi Alexandre,

I ran silhouette_score_block on my dataset, and here are the results:

dataset size |X| = 260486, dimension 40, RAM 4GB

             Original Ward (whole data) (1)    Original Ward (sub_sample=50K) (2)
    Trial    Silhouette Score   Time (s)       Silhouette Score   Time (s)
    1st      0.19045893         6250.758648    0.189 +/- 0.002    225.4 +/- 0.25
    2nd      0.18690761         6254.378874    0.1859 +/- 0.001   224.37 +/- 0.13

Note (1): silhouette_score_block run on the whole dataset
     (2): silhouette_score_block run on a sub-sample of 50K points, repeated
          over many iterations

Conclusion: when the dataset is large, running silhouette_score_block on a
sub-sample for k iterations (with k x sub_sample > dataset size) is the
better choice. The mean score only approximates the true score, but the time
cost decreases dramatically.
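
For anyone who wants to reproduce the repeated sub-sampling idea, here is a minimal sketch. It is an assumption-laden stand-in, not the code I ran: it uses scikit-learn's own silhouette_score with its sample_size option instead of silhouette_score_block from the gist, the helper name subsampled_silhouette is made up for illustration, and the toy data only mimics the 40 dimensions of my dataset.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def subsampled_silhouette(X, labels, sample_size, n_iter, seed=0):
    """Mean and std of the silhouette score over n_iter random sub-samples.

    Hypothetical helper: each iteration scores a fresh random subset, so
    only a sample_size x sample_size distance matrix is needed at a time.
    """
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_iter):
        # A different random_state per iteration gives a different subset.
        scores.append(silhouette_score(
            X, labels, sample_size=sample_size,
            random_state=rng.randint(2**31 - 1)))
    return float(np.mean(scores)), float(np.std(scores))


if __name__ == "__main__":
    # Toy data standing in for the real dataset: 3 Gaussian blobs, 40 dims.
    rng = np.random.RandomState(42)
    labels = np.repeat(np.arange(3), 1000)
    centers = rng.normal(scale=5.0, size=(3, 40))
    X = centers[labels] + rng.normal(size=(3000, 40))
    # n_iter * sample_size > len(X), following the rule of thumb above.
    mean, std = subsampled_silhouette(X, labels, sample_size=500, n_iter=7)
    print("silhouette: %.4f +/- %.4f" % (mean, std))
```

The spread across iterations gives a rough idea of how far the sub-sampled mean may sit from the full-data score.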

Thank you very much for your help.

Regards,

T.Bao


On Thu, May 9, 2013 at 12:29 PM, Bao Thien <ntba...@gmail.com> wrote:

> Hi Alexandre,
>
> Thank you very much for your help. This is exactly what fits my problem.
> Your help is much appreciated.
> I am also running the sampling method, as Robert suggested.
> I will try the block version and compare the results. Then I will let
> you all know the results as soon as possible.
>
> Regards,
>
>
>
> On Thu, May 9, 2013 at 12:18 PM, Alexandre ABRAHAM <
> abraham.alexan...@gmail.com> wrote:
>
>> Hi Bao,
>>
>> Sorry for the late reply. I put some code together yesterday evening, but
>> my post got blocked because of its size. The code is really simple and I
>> kept the scikit conventions, so if you have looked at the scikit function,
>> this should be familiar to you.
>>
>> Gist : https://gist.github.com/AlexandreAbraham/5544803
>>
>> Methods :
>> - *_slow : these functions implement the "compute distances on the fly"
>> method.
>> - *_block : the smarter method. Basically, distance matrices are computed
>> per cluster.
>>
>> Benches:
>> - small data (look at the main of the gist) :
>>     Scikit silhouette (1s): -0.002484
>>     Slow silhouette (154s): -0.002484
>>     Block silhouette (2s): -0.002484
>> - big data (X = np.random.random((20000, 1000)), y =
>> np.repeat(np.arange(100), 200)):
>>     Scikit silhouette (585.857552s): -0.003101, memory usage: about 4GB
>>     Block silhouette (633.306765s): -0.003101, memory usage: about 200MB
>>
>> Conclusion:
>> - you should *not* use the slow version. It is extremely slow.
>> - the block method is a little slower than the scikit one but uses far
>> less memory. This, obviously, depends on your cluster sizes.
>>
>> I would advise you to try the block version and, if your data do not fit
>> in memory, then try sampling as Robert said (this option is available with
>> the block approach in my code).
>>
>> Alexandre.
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and
>> their applications. This 200-page book is written by three acclaimed
>> leaders in the field. The early access version is available now.
>> Download your free book today! http://p.sf.net/sfu/neotech_d2d_may
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>
> --
> Nguyen Thien Bao
>
> NeuroInformatics Laboratory (NILab),
> Fondazione Bruno Kessler (FBK), Trento, Italy
> Centro Interdipartimentale Mente e Cervello (CIMeC)
> Università degli Studi di Trento, Italy
> Email: ntba...@gmail.com  or  ntbao...@yahoo.com
> Cellphone: +39.345.293.1006 (Italy)
> Cellphone: +84.996.352.452 (VietNam)
>
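
The per-cluster idea behind the *_block functions described in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the gist: the function name silhouette_blockwise is hypothetical, Euclidean distance is assumed, and the comparison against scikit-learn's silhouette_score is only there as a sanity check.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score


def silhouette_blockwise(X, labels):
    """Mean silhouette computed one pair of clusters at a time.

    Hypothetical sketch: only one (cluster_i x cluster_j) distance block is
    held in memory at once, never the full n x n matrix.
    """
    X = np.asarray(X)
    labels = np.asarray(labels)
    groups = [np.flatnonzero(labels == k) for k in np.unique(labels)]
    if len(groups) < 2:
        raise ValueError("need at least 2 clusters")
    n = X.shape[0]
    a = np.zeros(n)          # mean distance to the point's own cluster
    b = np.full(n, np.inf)   # mean distance to the nearest other cluster

    for i, ind_i in enumerate(groups):
        for j in range(i, len(groups)):
            ind_j = groups[j]
            block = pairwise_distances(X[ind_i], X[ind_j])
            if i == j:
                # Each row contains one zero self-distance, hence size - 1.
                if len(ind_i) > 1:
                    a[ind_i] = block.sum(axis=1) / (len(ind_i) - 1)
            else:
                # One block serves both clusters: rows for i, columns for j.
                b[ind_i] = np.minimum(b[ind_i], block.mean(axis=1))
                b[ind_j] = np.minimum(b[ind_j], block.mean(axis=0))

    with np.errstate(invalid="ignore", divide="ignore"):
        s = np.nan_to_num((b - a) / np.maximum(a, b))
    # Same convention as scikit-learn: singleton clusters score 0.
    for ind in groups:
        if len(ind) == 1:
            s[ind] = 0.0
    return float(s.mean())


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.random_sample((300, 20))
    y = np.repeat(np.arange(6), 50)
    print("blockwise:", silhouette_blockwise(X, y))
    print("scikit   :", silhouette_score(X, y))
```

Peak memory in this sketch is set by the largest pair of clusters, which is why the saving depends on the cluster sizes.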



